Showing content from https://blog.idrsolutions.com/understanding-pdf-text-objects/ below:

Understanding PDF text objects

Mark Stephens: Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Updated: May 11, 2022 1 min read


Inside a PDF, each page has a content stream: a sequence of PostScript-like commands which describe the page by drawing its text, images or shapes. You can extract this stream and look at it directly. It looks like this (I have added comments in brackets after each command to explain):

BT (begin a block of text)

/F13 12 Tf (Choose Font F13 and set size to 12)

288 720 Td (Move the text position by 288, 720 relative to where it now is)

(ABC) Tj (Draw the Text ABC)

ET (End the text block)
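The listing above can be read mechanically because each command's operands come before the operator name. The following is a minimal Python sketch of that idea, not a full PDF parser: the function name `parse_operations` and the regular expression are illustrative assumptions, the bracketed comments are left out because they are not part of the stream, and arrays, hex strings, nested brackets and inline images are not handled.

```python
import re

# Simplified tokens: a literal string in brackets, a /Name, or a bare word/number.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|/[^\s()/\[\]<>]+|[^\s()/\[\]<>]+")

def parse_operations(stream):
    """Group a content stream into (operator, operands) pairs."""
    operations, operands = [], []
    for match in TOKEN.finditer(stream):
        token = match.group()
        if token.startswith(b"("):
            # Keep string operands as raw bytes: they are character codes, not text.
            operands.append(token[1:-1])
        elif token.startswith(b"/"):
            operands.append(token.decode("latin-1"))
        else:
            word = token.decode("latin-1")
            try:
                operands.append(float(word))
            except ValueError:
                # Not a number, so treat it as an operator that consumes the operands.
                operations.append((word, operands))
                operands = []
    return operations

for operator, args in parse_operations(b"BT /F13 12 Tf 288 720 Td (ABC) Tj ET"):
    print(operator, args)
```

Note that the operand of Tj is kept as `bytes` rather than converted to a Python string.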

So far so good, but this code is actually rather deceptive. Most people assume from looking at it that Tj takes a text String (ABC), but it does not. The string actually contains a sequence of binary index values. These are then decoded using the font's encoding. It can be one of the Standard Encodings (WIN, MAC, EXPERT, etc.), which are defined in Appendix D of the PDF Reference. For subsetted fonts (where only the characters used in the PDF are included) the values could be any arbitrary set. They will have no meaning until you look them up in the font's custom encoding table (the Differences array in its Encoding object).
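To make the lookup concrete, here is a small Python sketch of how a Differences table could be applied to the bytes of a Tj string. It assumes a simple single-byte font. The names `apply_differences`, `decode_tj` and `GLYPH_TO_UNICODE` are hypothetical, and the glyph table is a tiny stand-in for a real glyph-name list.

```python
# Hypothetical, tiny glyph-name -> Unicode table (a real one is far larger).
GLYPH_TO_UNICODE = {"A": "A", "B": "B", "C": "C", "space": " "}

def apply_differences(base_encoding, differences):
    """Overlay a Differences list such as [1, "A", "B", "C"] on a
    base encoding given as a dict of code -> glyph name."""
    table = dict(base_encoding)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item            # a number sets the next code to assign
        else:
            if code is None:
                raise ValueError("Differences must start with a code")
            table[code] = item     # a name is assigned to the current code
            code += 1              # following names take consecutive codes
    return table

def decode_tj(raw, table):
    """Map each byte of a Tj string to text through the encoding table."""
    glyphs = (table.get(byte, ".notdef") for byte in raw)
    return "".join(GLYPH_TO_UNICODE.get(name, "\ufffd") for name in glyphs)

# A subsetted font that numbers its three glyphs 1, 2 and 3.
subset = apply_differences({}, [1, "A", "B", "C"])
print(decode_tj(b"\x01\x02\x03", subset))   # the bytes 1, 2, 3 mean A, B, C here
print(decode_tj(b"ABC", subset))            # the bytes 65, 66, 67 mean nothing here
```

With this font the text ABC would appear in the stream as three unreadable control bytes, while a stream that literally shows (ABC) would decode to nothing useful.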

The reason they look like text in the example above, and in the examples in the PDF Reference guide, is that the values for WIN encoding happen to be the same as the ASCII character codes. So the binary value for A shows up as A if it is WIN encoded.
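A quick way to see this coincidence, and where it ends, is to decode the same bytes under two different encodings. The sketch below uses Python's `cp1252` and `mac_roman` codecs as stand-ins for the WIN and MAC encodings. That is an assumption made for illustration: the codecs are close to the PDF encodings but not guaranteed identical in every position. The helper name `decode_as` is hypothetical.

```python
def decode_as(raw, codec):
    """Interpret raw Tj bytes using a named Python codec."""
    return raw.decode(codec)

ascii_range = b"ABC"
high_byte = b"\x80"

# In the ASCII range both encodings agree, so the bytes look like plain text.
print(decode_as(ascii_range, "cp1252"), decode_as(ascii_range, "mac_roman"))

# Outside that range the same byte means different characters.
print(decode_as(high_byte, "cp1252"), decode_as(high_byte, "mac_roman"))
```

The first line prints the same letters twice, while the second prints two different characters for the single byte 0x80.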

However, they are not actually text values and should not be treated as such unless you can guarantee that the only PDFs you look at will be WIN encoded. Otherwise you will get a very nasty surprise on some PDFs…

