uglytoad/pdfpig

This project allows users to read and extract text and other content from PDF files

PdfPig

This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents containing text and geometrical shapes.

This project aims to port PDFBox to C#.

Migrating to 0.1.6 from 0.1.x? Use this guide: migration to 0.1.6.

Installation

The package is available via the releases tab or from NuGet.

To install it from the package manager console:

> Install-Package PdfPig

While the version is below 1.0.0 minor versions will change the public API without warning (SemVer will not be followed until 1.0.0 is reached).

Get Started

The simplest usage at this stage is to open a document, reading the words from every page:

using (PdfDocument document = PdfDocument.Open(@"C:\Documents\document.pdf"))
{
    foreach (Page page in document.GetPages())
    {
        string pageText = page.Text;

        foreach (Word word in page.GetWords())
        {
            Console.WriteLine(word.Text);
        }
    }
}

In the example output for the PDF text "Write something in", the 3 words (highlighted in pink) are detected, and each word contains the individual letters with glyph bounding boxes.

To create documents use the class PdfDocumentBuilder. The Standard 14 fonts provide a quick way to get started:

PdfDocumentBuilder builder = new PdfDocumentBuilder();

PdfPageBuilder page = builder.AddPage(PageSize.A4);

// Fonts must be registered with the document builder prior to use to prevent duplication.
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);

page.AddText("Hello World!", 12, new PdfPoint(25, 700), font);

byte[] documentBytes = builder.Build();

File.WriteAllBytes(@"C:\git\newPdf.pdf", documentBytes);

The output is a 1 page PDF document with the text "Hello World!" in Helvetica near the top of the page.

Each font must be registered with the PdfDocumentBuilder prior to use, to enable pages to share the font resources. Only Standard 14 fonts and TrueType fonts (.ttf) are supported.

Usage

The PdfDocument class provides access to the contents of a document loaded either from file or passed in as bytes. To open from a file use the PdfDocument.Open static method:

using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\my-file.pdf"))
{
    int pageCount = document.NumberOfPages;

    // Page number starts from 1, not 0.
    Page page = document.GetPage(1);

    decimal widthInPoints = page.Width;
    decimal heightInPoints = page.Height;

    string text = page.Text;
}

PdfDocument should only be used in a using statement since it implements IDisposable (unless the consumer disposes of it elsewhere).

Encrypted documents can be opened by PdfPig. To supply an owner or user password, pass the optional ParsingOptions when calling Open, with the Password property set. For example:

using (PdfDocument document = PdfDocument.Open(@"C:\my-file.pdf", new ParsingOptions { Password = "password here" }))

You can also provide a list of passwords to try:

using (PdfDocument document = PdfDocument.Open(@"C:\file.pdf", new ParsingOptions
{
    Passwords = new List<string> { "One", "Two" }
}))

The document contains the version of the PDF specification it complies with, accessed by document.Version:

decimal version = document.Version;

Document Creation (0.0.5)

The PdfDocumentBuilder creates a new document with no pages or content.

For text content, a font must be registered with the builder. This library supports Standard 14 fonts provided by Adobe by default and TrueType format fonts.

To add a Standard 14 font use:

public AddedFont AddStandard14Font(Standard14Font type)

Or for a TrueType font use:

AddedFont AddTrueTypeFont(IReadOnlyList<byte> fontFileBytes)

Passing in the bytes of a TrueType file (.ttf). You can check the suitability of a TrueType file for embedding in a PDF document using:

bool CanUseTrueTypeFont(IReadOnlyList<byte> fontFileBytes, out IReadOnlyList<string> reasons)

Which provides a list of reasons why the font cannot be used if the check fails. You should check the license for a TrueType font prior to use, since the compressed font file is embedded in, and distributed with, the resultant document.

The AddedFont class represents a key to the font stored on the document builder. This must be provided when adding text content to pages. To add a page to a document use:

PdfPageBuilder AddPage(PageSize size, bool isPortrait = true)

This creates a new PdfPageBuilder with the specified size. The first added page is page number 1, then 2, then 3, etc. The page builder supports adding text, drawing lines and rectangles and measuring the size of text prior to drawing.

To draw lines and rectangles use the methods:

void DrawLine(PdfPoint from, PdfPoint to, decimal lineWidth = 1)
void DrawRectangle(PdfPoint position, decimal width, decimal height, decimal lineWidth = 1)

The line width can be varied and defaults to 1. Rectangles are unfilled and the fill color cannot be changed at present.

To write text to the page you must have a reference to an AddedFont from the methods on PdfDocumentBuilder as described above. You can then draw the text to the page using:

IReadOnlyList<Letter> AddText(string text, decimal fontSize, PdfPoint position, PdfDocumentBuilder.AddedFont font)

Where position is the baseline of the text to draw. Currently only ASCII text is supported. You can also measure the resulting size of text prior to drawing using the method:

IReadOnlyList<Letter> MeasureText(string text, decimal fontSize, PdfPoint position, PdfDocumentBuilder.AddedFont font)

Which does not change the state of the page, unlike AddText.

Changing the RGB color of text, lines and rectangles is supported using:

void SetStrokeColor(byte r, byte g, byte b)
void SetTextAndFillColor(byte r, byte g, byte b)

Which take RGB values between 0 and 255. The color will remain active for all operations called after these methods until reset is called using:

void ResetColor()

Which resets the color for stroke, fill and text drawing to black.

Document Information

The PdfDocument provides access to the document metadata as DocumentInformation defined in the PDF file. These entries are often not provided, so most of them will be null:

PdfDocument document = PdfDocument.Open(fileName);

// The name of the program used to convert this document to PDF.
string producer = document.Information.Producer;

// The title given to the document
string title = document.Information.Title;
// etc...

Document Structure (0.0.3)

The document now has a Structure member:

UglyToad.PdfPig.Structure structure = document.Structure;

This provides access to tokenized PDF document content:

Catalog catalog = structure.Catalog;
DictionaryToken pagesDictionary = catalog.PagesDictionary;

The pages dictionary is the root of the pages tree within a PDF document. The structure also exposes a GetObject(IndirectReference reference) method which allows random access to any object in the PDF as long as its identifier number is known. This is an identifier of the form 69 0 R where 69 is the object number and 0 is the generation.
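As a rough illustration of the identifier form described above, the sketch below splits a string such as 69 0 R into its object number and generation. ParseReference is a hypothetical helper written for this example and is not part of PdfPig; the snippet assumes a C# 9 or later top-level program.

```csharp
using System;

// Hypothetical helper (not a PdfPig API): parses "<object number> <generation> R".
static (long ObjectNumber, int Generation) ParseReference(string text)
{
    string[] parts = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    if (parts.Length != 3 || parts[2] != "R")
    {
        throw new FormatException($"Not an indirect reference: '{text}'");
    }

    return (long.Parse(parts[0]), int.Parse(parts[1]));
}

var parsed = ParseReference("69 0 R");
Console.WriteLine($"object {parsed.ObjectNumber}, generation {parsed.Generation}");
```

The two numbers are what a lookup such as GetObject needs; how they are wrapped into an IndirectReference is left to the library's own types.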

Page

The Page contains the page width and height in points as well as mapping to the PageSize enum:

PageSize size = page.Size;

bool isA4 = size == PageSize.A4;

Page provides access to the text of the page:

string text = page.Text;

There is a new (0.0.3) method which provides access to the words. This uses basic heuristics and is not reliable or well-tested:

IEnumerable<Word> words = page.GetWords();

You can also (0.0.6) access the raw operations used in the page's content stream for drawing graphics and content on the page:

IReadOnlyList<IGraphicsStateOperation> operations = page.Operations;

Consult the PDF specification for the meaning of individual operators.

There is also an early access (0.0.3) API for retrieving the raw bytes of PDF image objects per page:

IEnumerable<XObjectImage> images = page.ExperimentalAccess.GetRawImages();

This API will be changed in future releases.

Letter

Due to the way a PDF is structured internally the page text may not be a readable representation of the text as it appears in the document. Since PDF is a presentation format, text can be drawn in any order, not necessarily reading order. This means spaces may be missing or words may be in unexpected positions in the text.

To help users resolve actual text order on the page, the Page class provides access to a list of the letters:

IReadOnlyList<Letter> letters = page.Letters;

These letters contain:

  • The text of the letter: letter.Value.
  • The location of the lower left of the letter: letter.Location.
  • The width of the letter: letter.Width.
  • The font size in unscaled relative text units (these sizes are internal to the PDF and do not correspond to sizes in pixels, points or other units): letter.FontSize.
  • The name of the font used to render the letter if available: letter.FontName.
  • A rectangle which is the smallest rectangle that completely contains the visible region of the letter/glyph: letter.GlyphRectangle.
  • The points at the start and end of the baseline StartBaseLine and EndBaseLine which indicate if the letter is rotated. The TextDirection indicates if this is a commonly used rotation or a custom rotation.

Letter position is measured in PDF coordinates where the origin is the lower left corner of the page. Therefore a higher Y value means closer to the top of the page.
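One way to use these coordinates is to rebuild a reading order from letter positions. The sketch below is a simplified illustration only: it uses plain tuples as stand-ins for letters (text, X, Y) rather than PdfPig's Letter type, and InReadingOrder is a hypothetical helper. It assumes letters on the same line share (after rounding) the same Y value, which real documents do not always satisfy.

```csharp
using System;
using System.Linq;

// Stand-ins for letters in PDF coordinates, listed in drawing order.
var letters = new (string Value, double X, double Y)[]
{
    ("b", 20, 100), ("H", 10, 700), ("a", 10, 100), ("i", 20, 700)
};

// Hypothetical helper: orders letters top-to-bottom, then left-to-right.
static string InReadingOrder((string Value, double X, double Y)[] input)
{
    var lines = input
        .GroupBy(l => Math.Round(l.Y))      // letters sharing a Y value form a line
        .OrderByDescending(g => g.Key)      // a higher Y is nearer the top of the page
        .Select(g => string.Concat(g.OrderBy(l => l.X).Select(l => l.Value)));

    return string.Join("\n", lines);
}

Console.WriteLine(InReadingOrder(letters)); // prints "Hi" then "ab"
```

Grouping by exact rounded Y is the weakest part of this sketch; rotated text or slightly offset baselines would need a tolerance or a proper layout analysis step.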

Annotations (0.0.5)

Early support for retrieving annotations on each page is provided using the method:

page.ExperimentalAccess.GetAnnotations()

This call is not cached and the document must not have been disposed prior to use. The annotations API may change in future.

Bookmarks (0.0.10)

The bookmarks (outlines) of a document may be retrieved at the document level:

bool hasBookmarks = document.TryGetBookmarks(out Bookmarks bookmarks);

This will return false if the document does not define any bookmarks.

Forms (0.0.10)

Form fields for interactive forms (AcroForms) can be retrieved using:

bool hasForm = document.TryGetForm(out AcroForm form);

This will return false if the document does not contain a form.

The fields can be accessed using the AcroForm's Fields property. Since the form is defined at the document level this will return fields from all pages in the document. Fields are of the types defined by the enum AcroFieldType, for example PushButton, Checkbox, Text, etc.

Hyperlinks (0.1.0)

A page has a method to extract hyperlinks (annotations of link type):

IReadOnlyList<UglyToad.PdfPig.Content.Hyperlink> hyperlinks = page.GetHyperlinks();

TrueType (0.1.0)

The classes used to work with TrueType fonts in the PDF file are now available for public consumption. Given an input file:

using UglyToad.PdfPig.Fonts.TrueType;
using UglyToad.PdfPig.Fonts.TrueType.Parser;

byte[] fontBytes = System.IO.File.ReadAllBytes(@"C:\font.ttf");
TrueTypeDataBytes input = new TrueTypeDataBytes(fontBytes);
TrueTypeFont font = TrueTypeFontParser.Parse(input);

The parsed font can then be inspected.

Embedded Files (0.1.0)

PDF files may contain other files entirely embedded inside them for document annotations. The list of embedded files and their byte content may be accessed:

if (document.Advanced.TryGetEmbeddedFiles(out IReadOnlyList<EmbeddedFile> files)
    && files.Count > 0)
{
    var firstFile = files[0];
    string name = firstFile.Name;
    IReadOnlyList<byte> bytes = firstFile.Bytes;
}

Merging (0.1.2)

You can merge 2 or more existing PDF files using the PdfMerger class:

var resultFileBytes = PdfMerger.Merge(filePath1, filePath2);
File.WriteAllBytes(@"C:\pdfs\outputfilename.pdf", resultFileBytes);

API Reference

If you wish to generate doxygen documentation, run doxygen doxygen-docs and open docs/doxygen/html/index.html.

Issues

Please do file an issue if you encounter a bug.

However, in order for us to assist you, you must provide the file which causes your issue. Please host this in a publicly available place.

Credit

This project wouldn't be possible without the work done by the PDFBox team and the Apache Foundation.

Issues

Quick list of the latest Issues we found

Gerardo-Sista

Comments: 1

Hello, I need PdfPig to open a pdf file from memorystream. I tried with

using var stream = file.OpenReadStream();
using var ms = new MemoryStream();
await stream.CopyToAsync(ms);
using (PdfDocument document = PdfDocument.Open(ms))
{
    foreach (UglyToad.PdfPig.Content.Page page in document.GetPages())
    {
        string pageText = page.Text;

        foreach (Word word in page.GetWords())
        {
            Console.WriteLine(word.Text);
        }
    }
}

but it always gives this error: UglyToad.PdfPig.Core.PdfDocumentFormatException: 'Could not find the version header comment at the start of the document.'

If I load the same file from path it works good.

Any help?

Thank you

massonib

Comments: 0

In these examples, I am drawing a box around the Letter's glyph rectangle, but it does not line up with the actual text of the PDF. Horizontal Scaling (Tz) is being used a lot within this document, but PDFPig seems to rebuild the document fine as shown in the image. The letter's font says it is "QYVNFR+CourierNewPSMT", but I am pretty sure it is supposed to be Arial MT - so this is likely the issue.

From the pdf stream:

11 0 obj
<</Ascent 1021 /CapHeight 571 /Descent -680 /Flags 34 /FontBBox [ -122 -680 623 1021 ] /FontFamily (Courier New) /FontFile2 10 0 R /FontName /QYVNFR+CourierNewPSMT /FontStretch /Normal /FontWeight 400 /ItalicAngle 0 /StemV 40 /Type /FontDescriptor /XHeight 423 >>
endobj
12 0 obj
<</BaseFont /QYVNFR+CourierNewPSMT /Encoding /WinAnsiEncoding /FirstChar 0 /FontDescriptor 11 0 R /LastChar 255 /Subtype /TrueType /Type /Font /Widths [ ...numbers... ] >>
endobj
13 0 obj
<</Ascent 905 /CapHeight 716 /Descent -212 /Flags 32 /FontBBox [ -665 -325 2000 1040 ] /FontFamily (Arial) /FontName /Arial /FontWeight 400 /ItalicAngle 0 /StemV 0 /Type /FontDescriptor >>
endobj
14 0 obj
<</BaseFont /Arial /Encoding /WinAnsiEncoding /FirstChar 32 /FontDescriptor 13 0 R /LastChar 121 /Subtype /TrueType /Type /Font /Widths [ ...numbers... ] >>
endobj

I cannot share this document unfortunately, but the commands leading up to the first placement of text with the error are:

0.454545 0 0 0.454545 75.272728 36 cm
BT
111.1111 Tz
Neither Tr
/F0 1 Tf
17.688 0 0 17.688 2248.8 6.586 Tm
122.2222 Tz
9.36 0 0 9.36 29.6 110.8 Tm
TEXT WITH ISSUE EXTRA WIDE Tj
88.8889 Tz
-0.14 -1.8 Td (move to next line)
TEXT NOT QUITE AS WIDE, BUT STILL WITH ISSUE Tj
ET

There are no Tc or Tw operations in use.

var page = document.GetPages().First();
PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder newPage = builder.AddPage(document, 1);
newPage.SelectContentStream(0);
newPage.NewContentStreamBefore();
newPage.SetStrokeColor(255, 0, 0);
foreach (var letter in page.Letters)
{
    newPage.DrawRectangle(letter.GlyphRectangle.BottomLeft, (decimal)letter.GlyphRectangle.Width, (decimal)letter.GlyphRectangle.Height, 1, false);
}
File.WriteAllBytes(@"D:\Desktop\Drawings\PdfPigTest.pdf", builder.Build());

mhfarrell

Comments: 2

PdfPig 0.1.6 Nuget Package.

I am following the guide within the github repo but i keep getting an error about the font not containing a character. I have attached a few lines of code. This particular string is throwing the error even though it does not contain the full stop character.

System.InvalidOperationException
HResult=0x80131509
Message=The font does not contain a character: ’.
Source=UglyToad.PdfPig
StackTrace:
   at UglyToad.PdfPig.Writer.PdfPageBuilder.DrawLetters(NameToken name, String text, IWritingFont font, TransformationMatrix fontMatrix, Decimal fontSize, TransformationMatrix textMatrix)
   at UglyToad.PdfPig.Writer.PdfPageBuilder.AddText(String text, Decimal fontSize, PdfPoint position, AddedFont font)
etc (cant share more as it reveals clients)

grinay

Comments: 1

I have one particular document which, when I start reading it with

using var pigPdf = PdfDocument.Open(file);
foreach (var pigPage in pigPdf.GetPages())

reads the last page of the document as the first page. I downgraded to version 0.1.5 and it reads everything correctly. Unfortunately I can't provide that document as it's under NDA. Would it be possible to check what the possible reason for that is?

joanfercho

Comments: 0

Hello, good morning. I like your library, but I don't know how to read a PDF horizontally. Currently it reads vertically, and I have a document that I need to read horizontally, since they are not always in order. If you could help me I would be grateful. documento para mostrar.pdf

tuohaibei

Comments: 1

When I read the following PDF, I get this error message. Could you help to check? System.ArgumentOutOfRangeException: “Index was out of range. Must be non-negative and less than the size of the collection. Arg_ParamName_Name”

I use the NuGet package: https://www.nuget.org/packages/PdfPig/0.1.7-alpha-20220622-fc71a

The PDF is over 25 MB, so I could not upload it. If you want the PDF, could you tell me how to send you the file?

unwork-ag

question
Comments: 12

I have been starting to look for alternatives to iText7 for text extraction since iText has issues with some pdf documents that I need to handle which seem to have an uncommon (but still valid) encoding.

PdfPig can handle these files and overall provides pretty pleasing extraction results. However, when I run a benchmark using some of my example files, I see that there are some significant outliers. The file that I put here (https://drive.google.com/file/d/1-NZfDcUJvbpVUzAb9buCtUs3MWsYUGVT/view?usp=sharing) takes about 4 ms in iText7 but >1 second with PdfPig. I'm using the NearestNeighbourWordExtractor - but the results are pretty similar with the DefaultWordExtractor. For other files I have pretty reasonable results (28ms for 5 pages).

Any idea if I could configure the word extractor somehow to speed up the processing of this and similar files (using a Filter or FilterPivot delegate)?

Zeek2

question
Comments: 4

Hi, I've tried a similar library, PdfSharp, but it displayed each emoji as 2 little square boxes :( A blocker for me :(

I gather PdfJet displays 2 byte emojis ok. So does PdfPig display emojis correctly (4 bytes, as well as 2 bytes ones)??

robtomasi

bug
Comments: 1

I get this error when opening some PDFs, on the statement foreach (var page in document.GetPages()) in the method public static string Run(string filePath) of the class OpenDocumentAndExtractWords. Any ideas?

thanks a lot

pclancysc

bug
Comments: 1

Looks like I may have triggered a bug in the parsing of a specific PDF I need to process.

The exception is UglyToad.PdfPig.Core.PdfDocumentFormatException

The message is this.. (there is no "key", it's an empty string)

Expected name as dictionary key, instead got:

Any help much appreciated, I can provide the PDF on request.

SitAnko

enhancement
Comments: 0

In Parser\PageFactory.cs the media-box gets clipped to int values: mediaBox = new MediaBox(mediaboxArray.ToIntRectangle(pdfScanner)). In real-life files, MediaBox values are not always integers. An easy fix would be to use ToRectangle() instead of ToIntRectangle(): mediaBox = new MediaBox(mediaboxArray.ToRectangle(pdfScanner)).

asbjornu

bug
Comments: 2

As mentioned in https://github.com/UglyToad/PdfPig/issues/441#issuecomment-1108396669, it would make cross-platform PDF generation much easier to verify if PdfPig was able to automatically break extracted text at a configurable number of characters. Something like this:

ContentOrderTextExtractor.Options.BreakLinesAt could be an int? and its default null value would mean line breaks aren't enforced, but rather extracted from the PDF as is.

alexneblett

question
Comments: 2

Hi,

Thank you for this amazing component. I have run into an issue extracting text from the attached pdf. The text from page.text is garbled, but if I open the pdf in Adobe Acrobat Reader, select all, then copy paste into notepad, the pasted text is what you see in the pdf (usually both are garbled with font mapping issues, etc.). This gives me hope (hopefully not false hope) that perhaps there is a way to extract the text. To be fair, I tried a few other components and they extracted the same garbled text.

Cheers,

Alex

fc30326e-64a0-4a6e-895a-c3d4aeae2974.pdf

Pikoh

Comments: 0

Hi!,

I'd like to ask if it would be possible to keep the signature image when merging documents. I know the digital signature would be rendered invalid when merging, as it should be, but I'd like to keep the signature image in the output PDF. I've tried several alternatives, and itextsharp and e.g. ghostscript do this.

Thank you.

asbjornu

enhancement
Comments: 3

As mentioned in https://github.com/UglyToad/PdfPig/issues/439#issuecomment-1100861460, for some PDFs, ContentOrderTextExtractor.GetText(page) extracts spaces as tabs. A PDF file that illustrates this issue has been sent to

When using PdfPig to extract text for Verification testing (as explained in https://github.com/VerifyTests/Verify/issues/507#issuecomment-1091523647), this causes the extracted text to be considered unequal across operating systems.

If the current behavior of extracting tabs is not a bug, I would love to see a feature added to ContentOrderTextExtractor where GetText() takes an option that allows for space normalization or something similar, ensuring the coalescing and conversion of all space and control characters into 0x20 and 0x0A.
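Independently of any library option, the normalization described above can be applied by the caller to the extracted string. A minimal sketch, assuming the goal is to turn every line break into 0x0A and every run of other whitespace or control characters into a single 0x20; NormalizeWhitespace is a hypothetical helper and not part of PdfPig:

```csharp
using System;
using System.Text;

// Hypothetical helper (not a PdfPig API): line breaks become 0x0A and each run of
// other whitespace or control characters becomes a single 0x20.
static string NormalizeWhitespace(string text)
{
    var result = new StringBuilder(text.Length);
    foreach (char c in text.Replace("\r\n", "\n").Replace('\r', '\n'))
    {
        if (c == '\n')
        {
            result.Append('\n');
        }
        else if (char.IsWhiteSpace(c) || char.IsControl(c))
        {
            // Emit at most one space per run.
            if (result.Length == 0 || result[result.Length - 1] != ' ')
            {
                result.Append(' ');
            }
        }
        else
        {
            result.Append(c);
        }
    }

    return result.ToString();
}

Console.WriteLine(NormalizeWhitespace("Total:\t\t42\r\nNext  line"));
```

Running the same function over both the expected and the extracted text before comparing them would make the comparison insensitive to tab-versus-space differences.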

zlangner

Comments: 0

I'm attempting to use PdfPig to open an existing PDF document and embed any fonts that are not currently embedded but can be found among the system's fonts. The algorithm I've used in iTextSharp is as follows:

Open existing pdf file and iterate through the pages. Identify the fonts being used on each page.

  • if the font that was found is embedded add it to the collection of embedded fonts and continue the iteration (moving to the next font on the page)
  • if the font that was found is not embedded attempt to embed it.
    • if successful, add it to the collection of embedded fonts
    • if unsuccessful, add it to the collection of nonembedded fonts

It appears that this kind of font support has not yet been developed in PdfPig but other open issues are asking for similar functionality:

  • #211
  • #281

Please let me know if what I'm asking for is possible today or if I need to wait for further enhancements.

Thank you for the awesome work on this project, it meets all my other needs.

rajasekarshanmugam

enhancement
Comments: 0

When processing a few documents, we noticed "index" exceptions in the below file/line -

\PdfPig\src\UglyToad.PdfPig\Graphics\ReflectionGraphicsStateOperationFactory.cs

adding a simple operands length check avoids the exception.

return !operands.Any() ? null : new AppendRectangle(OperandToDecimal(operands[0]),

Is this check needed for operators? Not too sure.

Please confirm, can generate a PR if needed for the above.

ivan-perezhogin-sto

Comments: 5

Hi. First of all, I want to thank you for such a great library and your hard work, it really helps me a lot. I have a document with a lot of tab characters (\t) inside text with Type1Standard14Font (and the font doesn't contain this character). I don't know why, but you are setting a width of 2.5 for such unknown symbols, and this leads to unreliable bounding boxes for other characters in this text operator. All the renderers and readers I've tested don't show such characters at all (so the width for such characters is 0), so I think it's better to use zero width, or at least to add an option (delegate or just a property) for this case inside ParsingOption. Sorry that I've not attached the file, but it contains some personal info and I can't do it. Here is a screenshot of how such text operators look inside the stream and how the coordinates look on the rendered page: TabulationWidth

famda

question
Comments: 5

Hello,

I'm using this library for a while and let me say: It's awesome.

Is there any way to get image labels from a pdf (maybe using the DocumentLayoutAnalysis)?

Thanks in advance.

rajasekarshanmugam

question
Comments: 2

Is there any way to remove all annotations or a specific annotation?

We could also use a pagebuilder and try to copy the page manually to a copy of the same PDF, post which if annotations are accessible, then it can be removed. Is there any other harder way (lookup any dictionary to process those), please let know.

dyster

bug
Comments: 3

The parameter "characterCode" passed to the function has a value of 1. There are two segments: 0 to 65535, delta 0, offset 4; and 65535 to 65535, delta 1, offset 0. On the first segment it iterates through, it passes the if checks and arrives at an offset of 3. So adding together the offset of 3, the segment.count of 2 and the iterator of 0, the return line is looking inside the GlyphIds at position 1, but this array is empty, thus getting the argument out of range.

In the constructor for Format4CMapTable the glyphids are checked for null, but not checked for being empty. I don't know if it should be checked for emptiness or if it is not meant to be empty.

The pdf that triggered this problem is proprietary so I cannot share it on github or somewhere public, but I could potentially share it with a developer privately.

bunchofcoders

bug
Comments: 0

Looks like the function below returns bytes with value 1 instead of 255 which produces near black png. for all other type of filters it works fine.

Filter: FlateDecode ColorSpace: DeviceGray BitsPerComponent: 1

public static byte[] Convert(ColorSpaceDetails details, IReadOnlyList<byte> decoded, int bitsPerComponent, int imageWidth, int imageHeight);

dehyang

enhancement
Comments: 7

I don't know how to operate

page.GetImages() ... After?

Help me, thank you

cremor

bug
Comments: 2

I found a Jpeg image file that can not be added to a PDF by calling PdfPageBuilder.AddJpeg(). The resulting image in the PDF just shows a mostly black area.

I debugged the problem a bit and it seems like there are at least two problems:

The image has a width of 3024 pixel and a height of 4032 pixel. But the internal PdfPig call to JpegHandler.GetInformation(fileStream) returns a width of 160 and a height of 120. So the values are not only wrong but also represent a wrong aspect ratio.

The wrong aspect ratio seems to be because the image contains an EXIF orientation tag that specifies a 90 degree rotation. (EXIF tag id 0x112 and value 6.) It seems like PdfPig doesn't check this EXIF tag to get the image dimensions.

I've then used the Windows image viewer to resave the image. This seems to remove the EXIF orientation tag and instead save the image already "correctly" orientated. But even after that the image can't be added via PdfPig. JpegHandler.GetInformation(fileStream) then returns a width of 192 and a height of 256. So the aspect ratio (orientation) is now correct, but the image dimensions are still way off.

From what I could figure out with https://cyber.meme.tips/jpdump/ the problem might be that PdfPig uses the wrong bytes to get the dimensions. The bytes that represent the dimension values returned by PdfPig start at offset 0x1b8. But according to jpdump the correct header starts only at offset 0x299b. So it seems like PdfPig finds the "header marker" value of 0xffc0 in an earlier part of the file that isn't actually the header yet. According to jpdump that early part of the file contains various "application segments", and the first of those application segments seems to be quite big and to contain a thumbnail of the image.

Sadly, I don't know if I can provide the problematic image file. It contains personal data so I'd have to check that with the owner of the data. And if I try to modify the image to redact the personal data the problem doesn't happen any more. But I can provide specific bytes of the files if that helps. Or I could provide you with more information from https://cyber.meme.tips/jpdump/
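The diagnosis above suggests that scanning for the first 0xFFC0 byte pair can land inside an application segment. A common alternative is to walk the JPEG segment by segment, using each segment's length field to skip its payload, until a start-of-frame marker is reached. The sketch below illustrates that general technique under simplifying assumptions; ReadJpegSize is a hypothetical helper, not PdfPig's JpegHandler, and the sample bytes are a hand-built fragment that mimics a thumbnail embedded in an APP1 segment.

```csharp
using System;

// Hypothetical helper (not PdfPig code): walks JPEG segments to find the frame size.
static (int Width, int Height)? ReadJpegSize(byte[] data)
{
    // A JPEG starts with the SOI marker 0xFFD8.
    if (data.Length < 4 || data[0] != 0xFF || data[1] != 0xD8) return null;

    int i = 2;
    while (i + 4 <= data.Length)
    {
        if (data[i] != 0xFF) return null;           // not at a marker: give up
        byte marker = data[i + 1];
        if (marker == 0xFF) { i++; continue; }      // fill byte before a marker

        // The big-endian length counts its own two bytes but not the marker.
        int length = (data[i + 2] << 8) | data[i + 3];

        // SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
        bool isStartOfFrame = marker >= 0xC0 && marker <= 0xCF
            && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
        if (isStartOfFrame)
        {
            if (i + 9 > data.Length) return null;
            int height = (data[i + 5] << 8) | data[i + 6];
            int width = (data[i + 7] << 8) | data[i + 8];
            return (width, height);
        }

        if (marker == 0xDA || length < 2) return null; // scan data reached or corrupt length
        i += 2 + length;                                // skip the whole segment
    }

    return null;
}

byte[] sample =
{
    0xFF, 0xD8,                                             // SOI
    0xFF, 0xE1, 0x00, 0x0B,                                 // APP1, length 11
    0xFF, 0xC0, 0x00, 0x11, 0x08, 0x00, 0x78, 0x00, 0xA0,   // payload containing a fake SOF0 (160 x 120)
    0xFF, 0xC0, 0x00, 0x11, 0x08, 0x0F, 0xC0, 0x0B, 0xD0    // real SOF0 (3024 x 4032)
};

var size = ReadJpegSize(sample);
Console.WriteLine(size.HasValue ? $"{size.Value.Width} x {size.Value.Height}" : "no frame header found");
// prints "3024 x 4032"
```

Because the APP1 payload is skipped by length, the embedded 160 x 120 values are never interpreted as the frame header. The sketch does not apply the EXIF orientation tag, which would still need separate handling.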

CourseAve-JF

bug
Comments: 4

.NET 4.6.2 C# console app PdfPig 0.1.5 installed via NuGet

I'm working on an accessibility checker tool, which has to locate information in /StructTreeRoot /ParentTree. I have several 1.6 PDFs where this is working as expected. However, I have a 1.7 PDF that was generated from a Word doc saving to PDF. The /StructTreeRoot element appears to be there, however when I try to resolve the indirect reference, I get an error: 'Could not find the object with reference: 12 0.' Sure enough, when I view the PDF in PDFAnalyzer, it shows a similar thing. Curious thing is ... that in the Cross-Reference table, no object 12 exists.

When I view the PDF in a text editor, it appears that there is an xref table, along with a couple of xref streams:

xref
0 71
0000000012 65535 f
0000000017 00000 n
0000000166 00000 n
0000000222 00000 n
0000000495 00000 n
0000001778 00000 n
0000001939 00000 n
0000002165 00000 n
0000002218 00000 n
0000002271 00000 n
0000002438 00000 n
0000002670 00000 n
0000000013 65535 f
0000000014 65535 f
0000000015 65535 f

0000000063 65535 f
0000000064 65535 f
0000000065 65535 f
0000000000 65535 f
0000003793 00000 n
0000004050 00000 n
0000004265 00000 n
0000007350 00000 n
0000007395 00000 n
trailer
<</Size 71/Root 1 0 R/Info 11 0 R/ID[<197D8B74949BED4F93394F748EB48C61><197D8B74949BED4F93394F748EB48C61>] >>
startxref
7770
%%EOF
xref
0 0
trailer
<</Size 71/Root 1 0 R/Info 11 0 R/ID[<197D8B74949BED4F93394F748EB48C61><197D8B74949BED4F93394F748EB48C61>] /Prev 7770/XRefStm 7395>>
startxref
9346
%%EOF

Adobe Acrobat seems to open this PDF and parse out the structure information just fine. So is this some sort of edge case in the 1.7 specification, or what? Is there a way to make adjustments so that PdfPig can read the structure info? I can provide the whole doc if need be ... nothing special in it.

Versions

Quick list of the latest released versions

v0.1.6 - Apr 25, 2022

Mainly bug fixes. There are some compatibility changes in the document layout analysis API. See here: https://github.com/UglyToad/PdfPig/wiki/Migration-to-0.1.6

  • Fix transparency being applied for PDF/A-1
  • Fixes to string handling
  • .NET 6.0 support
  • Handle null rather than missing encryption data
  • Fixes bug with size of JPG files in documents created by PdfPig
  • Better handling for unusual Type1 fonts
  • Support for invisible/hidden text in document builder
  • Fixes stack overflow when parsing page tree for some documents
  • Fixes bug in some glyph bounding boxes for Type2 fonts
  • Handle non-contiguous xref ranges when building a document
  • Better location of version headers for non-compliant documents

v0.1.5-alpha002 - May 09, 2021

Some more bug-fixes:

  • Fix for object streams in files which require brute force searching.
  • Handle NullToken presence when creating documents.
  • Support for PDFs where the filters are defined as indirect references (against specification).
  • Support for CMYK when generating PNG images from IPdfImage.
  • Support for indexed ColorSpaces where palette is stored in a string.
  • Handle UTF16 strings in encrypted document dictionaries.
  • Handle documents with a XMP metadata stream instead of an information dictionary.
  • CCITTFaxDecode filter support.
  • Tweaks to DefaultWordExtractor to try and detect word gap size based on preceding text instead of a global gap threshold.

Note that changes to DefaultWordExtractor may change the output of calls to Page.GetWords() in this version.

v0.1.5-alpha001 - Feb 28, 2021

First alpha version of 0.1.5

  • Fix glyph bounding boxes and paths for Type1 fonts using flexpoints.
  • Fix stack overflow when merging some documents.
  • Support loading existing documents into PdfDocumentBuilder.
  • Performance improvements for multithreaded scenarios.
  • Fix checked value for AcroForm checkboxes where the checked state is appearance only.
  • Partial support for retrieving optional content via the new page.GetOptionalContents().
  • Partial support for colorspace details on IPdfImages.
  • Multiple bug-fixes for various font related issues.

Breaking changes:

  • PdfDocumentBuilder now implements IDisposable. This disposes the underlying stream by default, but since that stream is normally a MemoryStream there are no serious consequences if it is left undisposed.
  • PdfPageBuilder had the AdvancedEditing property removed. The API is now available in the ContentStream methods / properties (this was from #250).

v0.1.4 - Nov 29, 2020

  • Adds support for filling rectangles when using PdfDocumentBuilder. The DrawRectangle method now takes an optional boolean parameter, fill.
  • Fix bug recognising Standard 14 fonts with Arial MT naming.
  • Handle unusual object streams containing endobj tokens.
  • Support broken Differences arrays for encodings.
  • Support very long xref streams by making infinite loop detection more relaxed.
  • Fix issue with parsing Type0 fonts that are using indirect references.
  • Internal structure changes to support pdf to image work.
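
As a minimal sketch of the new fill parameter, the following draws one outlined and one filled rectangle. It reuses the builder calls from the Get Started example; the coordinates, sizes and output file name are arbitrary, and the namespaces are assumed to match the 0.1.x package layout.

```csharp
using System.IO;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Writer;

PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder page = builder.AddPage(PageSize.A4);

// Outline only: the optional fill parameter is left at its default.
page.DrawRectangle(new PdfPoint(50, 700), 200, 50);

// Same size, filled: the optional boolean parameter is passed by name.
page.DrawRectangle(new PdfPoint(50, 600), 200, 50, fill: true);

// Output path is a placeholder for illustration.
File.WriteAllBytes("rectangles.pdf", builder.Build());
```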

v0.1.3 - Nov 15, 2020

  • Fixes a set of bugs for font handling and PDF parsing.
  • Improves font detection on Linux systems
  • Improves calculation of PointSize for letters accounting for rotation and other transformations
  • Improves document layout analysis results in some cases
  • Fixes writing UTF strings when using document builder
  • Improvements to PDF graphics path API

v0.1.3-alpha001 - Sep 04, 2020

First alpha version of 0.1.3

v0.1.2 - Jul 04, 2020

Some new features, performance tweaks and improved Document Layout Analysis tools:

  • PDF/A compliance for PdfDocumentBuilder, use PdfDocumentBuilder.ArchiveStandard to select a PDF/A compliance level.
  • Performance improvements to parsing.
  • Clipping support for PdfPaths, now PdfSubpath. Use ParsingOptions.ClipPaths to enable clipping.
  • SVG Exporter in Document Layout Analysis
  • Improvements to Recursive XY Cut algorithm in Document Layout Analysis.
  • Fixes to PDF Merging to support more use-cases. Use PdfMerger.Merge to generate merged PDFs.
  • Proper support for letters and paths in rotated PDF documents, previous locations were incorrect when the page dictionary contained a rotation value.
  • Better support for guessing point size for letters.
  • ContentTextOrderExtractor in Document Layout Analysis uses the existing content order of text from the page's content stream to generate text as a string.
  • IPdfImage now supports TryGetBytes() instead of Bytes. TryGetBytes returns false for the JPXDecode and DCTDecode image filters, for which RawBytes already represents a valid JPEG image.
  • Font flags such as bold and italic available on Letter.
  • Bugfix for CID fonts.
  • TextDirection is now TextOrientation, various fixes to the calculations of orientation and bounding box for Words.
  • Most Document Layout Analysis algorithms now take in a DlaOptions parameter to specify behaviour.
  • Bugfix to files with large amounts of trailing data.
  • Support for OpenType in CID fonts.
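
The TryGetBytes behaviour described above might be used as follows. This is a sketch only: it assumes TryGetBytes has a single out parameter and that both it and RawBytes expose a byte collection supporting ToArray(), and the output file names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Path to the PDF is taken from the first command line argument.
using (PdfDocument document = PdfDocument.Open(args[0]))
{
    int index = 0;

    foreach (Page page in document.GetPages())
    {
        foreach (IPdfImage image in page.GetImages())
        {
            if (image.TryGetBytes(out var decoded))
            {
                // Filters were decoded by the library; these are decoded image bytes.
                Console.WriteLine($"Image {index}: {decoded.ToArray().Length} decoded bytes");
            }
            else
            {
                // Per the release note this is the JPXDecode/DCTDecode case,
                // where the raw bytes are already an encoded image.
                File.WriteAllBytes($"image-{index}.raw", image.RawBytes.ToArray());
            }

            index++;
        }
    }
}
```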

0.1.2-alpha003 - Jun 20, 2020

  • Many updates to document layout analysis algorithms
  • Bugfix for files with a large number of non-data trailing bytes
  • Bugfix for OpenType fonts
  • Paths and glyphs are now correctly rotated when the page itself has a rotation value

0.1.2-alpha002 - May 10, 2020

Adds letter font details and a couple of other bugfixes to the alpha version.

0.1.2-alpha001 - Apr 25, 2020

First alpha version of 0.1.2

0.1.1 - Mar 18, 2020

Many bug fixes for a whole range of document types. In addition:

  • Add support for JPG images in PdfDocumentBuilder using page.AddJpeg().
  • Access to marked content using page.GetMarkedContents()
  • Early access to PDF merging using PdfMerger.Merge()
  • Adds Doc-Comments back to the package.
  • Improvements to NearestNeighbourWordExtractor and other Document Layout Analysis classes to support rotated text.
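
A rough sketch of merging two files with PdfMerger.Merge() is shown here. It assumes an overload that accepts file paths and returns the merged document as a byte array; the input paths come from the command line and the output name is a placeholder.

```csharp
using System.IO;
using UglyToad.PdfPig.Writer;

// Assumed overload: file paths in, merged document bytes out.
byte[] merged = PdfMerger.Merge(args[0], args[1]);

File.WriteAllBytes("merged.pdf", merged);
```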

0.1.1-alpha001 - Mar 15, 2020

A whole bunch of bug fixes and other changes.

0.1.0 - Jan 13, 2020

This version focuses on improving performance.

To enable this it replaces decimals with doubles for most of the public API. It also reorganizes the code internally to support access to font related classes.

For this reason consumers will need to update their code, see the migration guide on the wiki.

Other features:

  • Access to hyperlinks provides a convenience wrapper for retrieving annotations of type Link and their text content and destination. Use page.GetHyperlinks().
  • Bug fixes for glyph positions.
  • Access to the embedded files in the document. Use document.Advanced.TryGetEmbeddedFiles(out IReadOnlyList<EmbeddedFile> files).
  • Ability to provide a list of passwords to try when opening encrypted documents. Use ParsingOptions.Passwords to provide the list of passwords. Any password set in ParsingOptions.Password will be included in the list of passwords.
  • Many bug fixes for different documents.
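
The features above could be combined roughly as follows. The method names come from the list; the Text, Uri and Name properties and the sample passwords are assumptions for illustration and should be checked against the installed version.

```csharp
using System;
using System.Collections.Generic;
using UglyToad.PdfPig;

// Candidate passwords are tried in turn when the document is encrypted.
var options = new ParsingOptions
{
    Passwords = new List<string> { "first-guess", "second-guess" }
};

using (PdfDocument document = PdfDocument.Open(args[0], options))
{
    foreach (var page in document.GetPages())
    {
        // Link annotations with their text content and destination.
        foreach (var link in page.GetHyperlinks())
        {
            Console.WriteLine($"{link.Text} -> {link.Uri}");
        }
    }

    // Returns false when the document has no embedded files.
    if (document.Advanced.TryGetEmbeddedFiles(out var files))
    {
        foreach (var file in files)
        {
            Console.WriteLine($"Embedded file: {file.Name}");
        }
    }
}
```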

0.1.0-beta002 - Jan 08, 2020

Updates the 0.1.0 beta version with many bug fixes.

0.1.0-beta001 - Jan 06, 2020

First release which moves internal numerics from decimal to double where appropriate.

Reorganises internal project structure.

See migration details in the wiki: https://github.com/UglyToad/PdfPig/wiki/Migration-0.0.X-to-0.1.0

0.0.11 - Dec 17, 2019

This release fixes a major performance regression in 0.0.10.

It also adds bug-fixes for several new issues as well as additional methods for the geometry objects PdfPath, PdfLine and PdfRectangle.

v0.0.10 - Dec 09, 2019

This release adds two main new features:

  • Access to form elements (AcroForms) such as text input, checkboxes, radio-buttons, etc. Use document.TryGetForm(out AcroForm form) to get the form for the document if it contains one.
  • Access to bookmarks which define the document structure by linking to chapters, etc. Use document.TryGetBookmarks(out Bookmarks bookmarks) to get the document's bookmarks tree if it contains one.
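
A short sketch of both calls follows. TryGetForm and TryGetBookmarks are named in the notes above; the Fields, GetNodes(), Level and Title members are assumed from the 0.1.x API and may differ between versions.

```csharp
using System;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open(args[0]))
{
    // Both methods return false when the document lacks the feature.
    if (document.TryGetForm(out var form))
    {
        Console.WriteLine($"Form fields: {form.Fields.Count}");
    }

    if (document.TryGetBookmarks(out var bookmarks))
    {
        foreach (var node in bookmarks.GetNodes())
        {
            // Indent each title by its depth in the bookmark tree.
            Console.WriteLine($"{new string(' ', node.Level * 2)}{node.Title}");
        }
    }
}
```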

It also aims to improve performance for most content retrieval operations, with up to twice the speed for the smallest documents.

It also adds bug-fixes, structure analysis tools and small improvements:

  • Adds document.GetPages() as a convenience method to enumerate all pages in a document.
  • Adds hOcr, AltoXml and PageXml format exporters to export the page content to standardized formats which can be used in other tools. These exporters implement the ITextExporter interface and are used to export each page to a compatible string.
  • Improves support for retrieving images from a page. The new page.GetImages() method enumerates all images on a page. Images are either InlineImages or XObjectImages.
  • Adds support for extracting text which is defined in XObject forms (distinct from AcroForms) which was previously skipped, meaning text could have been missing from the page.Text on certain document types.
  • Adds support for vertical writing mode fonts (Japanese, etc).
  • Additional bug fixes.

0.0.9 - Aug 14, 2019

This release fixes a major regression in 0.0.7 which broke consuming documents via streams. It also adds new features:

  • Document Layout Analysis: Adds the Docstrum (Doc Spectrum) algorithm for page segmentation.
  • Document segmentation approaches (Docstrum and RecursiveXYCut) implement the IPageSegmenter interface which now returns a list of TextBlocks. XYLeaf and XYNode are now internal.
  • TextEdgesExtractor is a new class which can be used to detect shared alignment in sections of text.
  • Letters now have a Color property. This is one of the types implementing IColor: GrayColor, RGBColor or CMYKColor. Other color spaces are not currently supported and default to GrayColor.Black.
  • PdfDocument now has a TryGetXmpMetadata(out XmpMetadata metadata) method which will retrieve the XML XMP Metadata object from the document if one is present.

v0.0.7 - Aug 03, 2019

This release primarily focuses on more bug-fixing to improve stability of extracting text content. The main new features are full support for encrypted documents, Document Layout Analysis tools and early-access path information.

  • Fix a bug using DefaultWordExtractor where the Letters collection on all words would be empty.
  • Supports UTF-16 encoded strings in document content, such as document information dictionaries, and in HexToken based strings.
  • Supports all forms of document encryption up to and including revision 6 in PDF 2.0 spec.
  • Prevents crashes where PDF contains circular object references.
  • The new DocumentLayoutAnalysis namespace supports nearest-neighbour word extraction and recursive X-Y cut document segmentation. RecursiveXYCut.GetBlocks implements the Recursive X-Y cut algorithm https://en.wikipedia.org/wiki/Recursive_X-Y_cut. NearestNeighbourWordExtractor can be provided to Page.GetWords for a different word extraction technique.
  • Fix bug where some letters had a width or height of zero.
  • More tolerant search for cross-reference offsets. If the cross-reference offsets are incorrect, the library searches for the corresponding object instead.
  • Handle a case where CidFonts contained hex rather than string tokens for registry-ordering-supplement information.
  • Support cross-reference tables even if they appear after the first %%EOF end of file marker.
  • Support rotated pages. Page now contains a Rotation property indicating if the page is rotated at the top level. Valid values for rotation are 0, 90, 180 and 270. The currently reported PageSize does not take rotation into account yet. This also adds support for properly rotating letters and page content.
  • Change internal letter point size calculation, Page.ExperimentalAccess.GetPointSize(Letter letter) now reports the point size with an updated calculation which handles rotated letters.
  • Map character codes directly to ASCII character values where there's no corresponding Unicode value. This matches PDFBox 1.8/9 behaviour where if no Unicode value can be found, the integer value is mapped directly to a character.
  • Expose PdfPath information from the page's content stream. Early access to path/geometry information parsed from the page's content. Use Page.ExperimentalAccess.Paths to access lines, rectangles, curves, etc declared by the page.
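
The layout analysis tools mentioned above might be combined along these lines. This is a sketch under assumptions: the namespaces and the static Instance members follow the 0.1.x layout (they were organised differently in 0.0.7), and TextBlock.Text is assumed to hold the block's text.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

using (PdfDocument document = PdfDocument.Open(args[0]))
{
    foreach (var page in document.GetPages())
    {
        // Swap the default word extractor for the nearest-neighbour one.
        var words = page.GetWords(NearestNeighbourWordExtractor.Instance);

        // Segment the page's words into blocks using Recursive X-Y cut.
        var blocks = RecursiveXYCut.Instance.GetBlocks(words);

        foreach (var block in blocks)
        {
            Console.WriteLine(block.Text);
            Console.WriteLine("---");
        }
    }
}
```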

v0.0.6 - May 19, 2019

This release focuses on stability improvements and has been tested on far more document types than previous releases. The 2 main new features are support for full framework versions of .NET back to .NET 4.5, which makes this library available to more users, and initial support for encrypted documents using the most basic form of document encryption.

The release may contain a bug in System Font loading which has not been replicated but may make the library crash on some systems. Please file a bug report if you encounter an error on this package version.

  • Adds the ability to access all raw operations in a page's content stream. This is the set of instructions which form the graphical features on the page. Access using page.Operations.
  • Supports defining operations on a PdfPageBuilder directly using builder.Advanced.Operations.
  • Support for full framework .NET versions back to .NET 4.5.
  • Support for Compact Font Format CID fonts.
  • Support for Standard 14 fonts which are incorrectly declared as TrueType fonts.
  • Performance improvements for System Fonts, where the document relies on fonts installed on the host operating system (only tested on Windows).
  • Many stability fixes for all font types and parsing documents.
  • Text direction added to letter and word. Indicates the rotation of the text.
  • Add support for encrypted documents. RC4 encryption is now supported, but documents using the newer AES encryption will still throw. A password may be supplied in ParsingOptions.
  • Support for LZW filters which were the last filter left to be implemented.

v0.0.5 - Dec 30, 2018

Adds new document creation and provides access to per-page annotations.

v0.0.3 - Nov 27, 2018

  • Reworks the public API of Letter to provide height information. See the Letters page on the wiki.
  • Adds support for Type 1 fonts with Compact Font Format fonts and retrieving height information.
  • Bug fixes, stability improvements and performance improvements.
  • PdfDocument now has a Structure property. This is an UglyToad.PdfPig.Structure object which provides access to the tokenized content of the PDF file and the merged Cross Reference Table in the document. Any objects in the PDF file may be accessed by object reference number allowing consumers to work around missing functionality. All tokens used internally when interpreting PDF documents are available on the public API.
  • Page now has an IEnumerable<Word> GetWords() method which uses a default word extractor to attempt merging letters into words based on heuristics using letter positions. Consumers may provide their own IWordExtractor to the method to improve on the very basic approach used in this release or continue using the raw letters.
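
As a sketch of working with the Structure property, the snippet below prints the first few objects listed in the cross-reference table. The CrossReferenceTable.ObjectOffsets, GetObject and Data members are assumptions based on the 0.1.x API and should be verified against the installed version.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open(args[0]))
{
    // Each entry maps an object reference to its byte offset in the file.
    var offsets = document.Structure.CrossReferenceTable.ObjectOffsets;

    foreach (var entry in offsets.Take(10))
    {
        // Look up the tokenized object by its reference.
        var obj = document.Structure.GetObject(entry.Key);
        Console.WriteLine($"{entry.Key} at offset {entry.Value}: {obj.Data}");
    }
}
```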

v0.0.1 - Feb 26, 2018

The first non pre-release version.

v0.0.1-alpha-002 - Jan 22, 2018

Fixes an issue where the only encoding present is embedded in the font program. Supports reading from streams.

v0.0.1-alpha-001 - Jan 10, 2018

The initial alpha release

Library Stats (Aug 05, 2022)

Subscribers: 34
Stars: 795
Forks: 124
Issues: 85
