How to extract text and anchor information from a PDF?

I am trying to generate a text file with target links information.

25th October 2015
admin-marketing

I am looking for a method to extract the text as well as anchor information using iText.

For example: the PDF content is "You can visit our website, XYZ, and do something" where XYZ is a clickable link. The output when extracting this content should be: "You can visit our website, XYZ (www.google.com) and do something".

Basically, I am trying to generate a text file that includes the target of each link.

Posted on StackOverflow on Jul 10, 2014 by user985395

The static text you can see in a PDF file is stored in content streams using PDF syntax as described in Adobe's Imaging Model.

The interactive features you can see in a PDF file are stored outside the content stream of a page, in so-called annotation dictionaries, using the Carousel Object System (COS).

You are probably making the assumption that when you see a clickable word XYZ, there is something like an HTML-style anchor element wrapped around XYZ inside the PDF.

There isn't.

There will be something like:

/F1 12 Tf
(XYZ )Tj

somewhere in the content stream referenced by the /Contents entry of a page.

When you inspect the /Annots of a page, you will find something like:

<<
  /Subtype/Link
  /C[0 0 1]
  /Border[0 0 0]
  /Rect[36 803.52 98.03 814.62]
>>

as an object in your PDF file.

If you want to extract all the links and the corresponding text from a document, you need to loop over all the page dictionaries, get the /Annots, check which annotations are of subtype /Link, get the action (/A), and the coordinates (/Rect).

To know which text corresponds with a link, you need to use iText's text parser classes with a "region text" strategy and extract the text at the positions defined by the /Rect entry.
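The matching step described above can be sketched in plain Java, without iText, to show the idea: a text chunk belongs to a link when the point where it is drawn falls inside that link's /Rect. This is only a model under stated assumptions. The class names TextChunk and Link, the method annotate, and the chunk coordinates are all hypothetical and are not part of the iText API; only the /Rect values and the desired output come from the text above. In a real program, the chunks and their positions would come from iText's text parser, and the rectangles and URIs from the /Annots of each page.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "region text" matching. Not iText API: all names are hypothetical.
final class LinkTextSketch {

    /** A piece of page text and the point where it starts (assumed to come from a text parser). */
    static final class TextChunk {
        final String text;
        final double x;
        final double y;

        TextChunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** A /Link annotation reduced to its /Rect corners and the URI of its /A action. */
    static final class Link {
        final double llx, lly, urx, ury;
        final String uri;

        Link(double llx, double lly, double urx, double ury, String uri) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
            this.uri = uri;
        }

        /** True when the chunk's start point lies inside this link's rectangle. */
        boolean contains(TextChunk chunk) {
            return chunk.x >= llx && chunk.x <= urx && chunk.y >= lly && chunk.y <= ury;
        }
    }

    /** Concatenates the chunks in order, appending " (uri)" after each chunk covered by a link. */
    static String annotate(List<TextChunk> chunks, List<Link> links) {
        StringBuilder out = new StringBuilder();
        for (TextChunk chunk : chunks) {
            out.append(chunk.text);
            for (Link link : links) {
                if (link.contains(chunk)) {
                    out.append(" (").append(link.uri).append(")");
                    break; // one target per chunk is enough for this sketch
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Chunk positions are invented for illustration; the /Rect is the one shown above.
        List<TextChunk> chunks = new ArrayList<>();
        chunks.add(new TextChunk("You can visit our website, ", 36, 820));
        chunks.add(new TextChunk("XYZ", 36, 806));
        chunks.add(new TextChunk(" and do something", 100, 806));

        List<Link> links = new ArrayList<>();
        links.add(new Link(36, 803.52, 98.03, 814.62, "www.google.com"));

        System.out.println(annotate(chunks, links));
    }
}
```

Testing only the start point of a chunk keeps the sketch short. Real text can straddle the edge of a rectangle, so a production version would more likely test for overlap between the chunk's bounding box and the /Rect.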

