White paper

pdf2Data: extract data from PDF documents

In this white paper, you'll learn how to extract data from your PDF documents with this iText 7 add-on. By creating a template with your initial document, you can then programmatically extract data from other similar documents within your workflow.

What is it?

pdf2Data allows you to extract data from PDF documents. Its framework recognizes data inside a PDF document using the areas that you have selected for extraction in your template. pdf2Data works best on documents that share the same layout, such as invoices from the same supplier. This makes it easier to automate your document workflow, reducing both human error and processing time.

Data recognition is based on several rules, which need to be defined in advance for each template field.

 

TYPICAL RULES ARE:

  • the same (horizontal / vertical) position on the page
  • the same font size and style
  • a certain text pattern (numeric, currency sign, etc.)
  • certain keywords on the same line as the required field
  • certain cell(s) in the table

This means that you can create a fully automated solution for data recognition in PDF documents with only a basic setup on the original sample template. The template relies on dynamic field selectors such as font, style, position and text patterns to find the required fields in your documents. To ensure the best possible results, we leverage iText text extraction, which offers a high-fidelity recognition process.
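To make the rule types above more concrete, here is a minimal sketch of how position, font and text-pattern rules could be combined to pick one field out of extracted text. It is written in Python for brevity and does not use the pdf2Data API: `TextChunk`, `FieldRule` and `extract` are hypothetical names invented for this illustration, and the coordinates and sample values are made up.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextChunk:
    """A piece of extracted text with the layout properties the rules look at."""
    text: str
    x: float            # left edge on the page, in points
    y: float            # baseline on the page, in points
    font_size: float
    bold: bool = False


@dataclass
class FieldRule:
    """Hypothetical rule set for one template field; unset conditions are ignored."""
    x_range: Optional[Tuple[float, float]] = None   # horizontal position
    y_range: Optional[Tuple[float, float]] = None   # vertical position
    font_size: Optional[float] = None               # font size
    bold: Optional[bool] = None                     # font style
    pattern: Optional[str] = None                   # text pattern (regex)

    def matches(self, chunk: TextChunk) -> bool:
        if self.x_range is not None and not (self.x_range[0] <= chunk.x <= self.x_range[1]):
            return False
        if self.y_range is not None and not (self.y_range[0] <= chunk.y <= self.y_range[1]):
            return False
        if self.font_size is not None and abs(chunk.font_size - self.font_size) > 0.5:
            return False
        if self.bold is not None and chunk.bold != self.bold:
            return False
        if self.pattern is not None and re.fullmatch(self.pattern, chunk.text) is None:
            return False
        return True


def extract(rule: FieldRule, chunks: List[TextChunk]) -> List[str]:
    """Return the text of every chunk that satisfies all conditions of the rule."""
    return [chunk.text for chunk in chunks if rule.matches(chunk)]


# Made-up chunks standing in for the text of one invoice page.
chunks = [
    TextChunk("Invoice", 72, 760, 18, bold=True),
    TextChunk("Total:", 380, 120, 10),
    TextChunk("$ 1,250.00", 450, 120, 10, bold=True),
    TextChunk("$ 50.00", 450, 200, 10),
]

# The "total" field: bottom-right area, 10 pt text, currency amount.
total_rule = FieldRule(
    x_range=(440, 520),
    y_range=(100, 140),
    font_size=10,
    pattern=r"\$\s?[\d,]+\.\d{2}",
)

print(extract(total_rule, chunks))  # ['$ 1,250.00']
```

Note that the second amount also satisfies the text pattern but is rejected by the position rule, which is why combining several rules per field tends to give more reliable results than any single rule.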

pdf2Data can be integrated through a pure Java API, a CLI (command line interface) or a REST interface. To define the selectors, you can either install the web application included with pdf2Data on your workstation, which lets you define them in a more intuitive way, or use a PDF commenting tool such as Adobe Reader.

How does it work?

RECOGNITION IS BASED ON THE FOLLOWING STEPS:

  1. Select the parts of the template that correspond to your data fields, using the pdf2Data web application or any PDF viewer with commenting functionality.
  2. Define the relevant rules for correct data extraction in the comment attached to each selection.
  3. Upload the template to the website running the template engine, and check whether it recognizes your fields and the data inside them.
  4. Upload any other PDF document that is based on the same template, and check whether the software recognizes your data.

Steps 1 to 3 need to be done only once per template. Step 4 can be repeated for as many documents as needed, provided they are all based on the same template.
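The "define once, apply many times" pattern of these steps can be sketched as follows. This is only an illustration in Python and not the pdf2Data API: the `TEMPLATE` structure, the `recognize` function, the file names and the invoice lines are all assumptions made up for the example, and each document is simplified to a list of text lines. Every field uses a keyword rule (the line must start with the keyword) together with a text-pattern rule.

```python
import re
from typing import Dict, List, Optional, Tuple

# Steps 1 to 3, done once: a hypothetical template that maps each field name
# to (keyword expected at the start of the line, pattern of the value).
TEMPLATE: Dict[str, Tuple[str, str]] = {
    "invoice_number": ("Invoice no.", r"[A-Z]{2}-\d+"),
    "total": ("Total", r"\d+\.\d{2}"),
}


def recognize(template: Dict[str, Tuple[str, str]],
              lines: List[str]) -> Dict[str, Optional[str]]:
    """Apply every field rule of the template to the lines of one document."""
    result: Dict[str, Optional[str]] = {}
    for field, (keyword, pattern) in template.items():
        result[field] = None  # stays None when the field is not found
        for line in lines:
            if line.startswith(keyword):
                match = re.search(pattern, line)
                if match:
                    result[field] = match.group(0)
                    break
    return result


# Step 4, repeated per document: made-up invoices sharing the same layout.
documents = {
    "march.pdf": ["ACME Supplies", "Invoice no. AC-1041", "Total 310.50 EUR"],
    "april.pdf": ["ACME Supplies", "Invoice no. AC-1077",
                  "Subtotal 90.00 EUR", "Total 99.00 EUR"],
}

results = {name: recognize(TEMPLATE, lines) for name, lines in documents.items()}

for name, fields in results.items():
    print(name, fields)
```

A document that does not follow the template simply yields `None` for the fields that cannot be found, which mirrors the requirement that all documents be based on the same template.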

Please note that although this example makes use of our own pdf2Data server, it is of course possible to use your own, or to simply forgo the web interface altogether and define your template using the Adobe Reader commenting facilities.

Continue reading the pdf2Data white paper

We hope you enjoyed this first page of the white paper. Continue reading the full white paper.
