iText pdf2Data for PDF processing
iText pdf2Data is a solution to easily recognize and extract data from documents. It is available for Java and C# (.NET), and as a CLI version.
It offers a framework to intelligently recognize data inside PDF documents, based on selection rules that you define in a template. This offers significant advantages over AI-based alternatives which need extensive training to recognize documents.
And thanks to its intuitive web-based template creator, anyone, from marketers to information managers to HR staff, can create and update templates. You don't need to be a developer to benefit from using iText pdf2Data.
How iText pdf2Data works
Many PDF documents that businesses need to process, such as registration forms and invoices, follow a common structure. In an invoice, for example, addresses, purchase order numbers and similar document elements tend to be located in the same place, and only the content, such as item descriptions, quantities and costs, changes from invoice to invoice.
iText pdf2Data offers an easy way to extract data from such PDF documents by defining areas and rules in a template which correspond to the content you want to extract. The template can then be visually validated with other documents to confirm data is recognized correctly, before being parsed by the pdf2Data SDK to process all subsequent documents matching that template.
Unlike AI-based alternatives, you don’t need hundreds of samples and intensive supervision to train the recognition process. The content recognition is controlled by the template you configure, meaning no training is required before you can begin extracting data. You only need one example document to enable data extraction from all subsequent documents.
AI recognition has other disadvantages too. Any changes to the required output (such as adding a new field) will require models to be retrained, and multiple language support is minimal at best. Documents using the same layout but containing content in different languages can give wildly inconsistent results.
iText pdf2Data, on the other hand, suffers from none of these drawbacks. Modifying a template is quick and easy, and language support is excellent.
Using the pdf2Data template creator
By using the intuitive browser-based pdf2Data template creator, it’s easy to create a template for data extraction. Simply create a template PDF based on a sample document, by defining selectors for areas of interest. Selectors are configurable rules to detect different types of content for extraction.
Many selectors are available, including Price etc., enabling pdf2Data to intelligently recognize and extract data and other content. The selectors can be configured to detect:
- page range and the position on the page
- specific font styles, font color, and text patterns
- fixed keywords next to the data
- table structures, which are recognized automatically
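To make the idea of a selector more concrete, here is a minimal Java sketch of two of the rule types listed above: a text pattern and a fixed keyword next to the data. It is purely illustrative and is not pdf2Data's implementation or API. The class and method names (SelectorSketch, matchPattern, valueAfterKeyword) are hypothetical, the "Invoice date:" label is made up for the example, and the code works on plain strings rather than on PDF content.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of the pdf2Data SDK.
final class SelectorSketch {

    // "Text pattern" rule: return the first substring that matches the regex.
    static Optional<String> matchPattern(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    // "Fixed keyword" rule: return whatever follows the keyword on the same line.
    static Optional<String> valueAfterKeyword(String text, String keyword) {
        for (String line : text.split("\\R")) {
            int index = line.indexOf(keyword);
            if (index >= 0) {
                String value = line.substring(index + keyword.length()).trim();
                if (!value.isEmpty()) {
                    return Optional.of(value);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the text of one page; the label wording is an assumption.
        String page = "Invoice date: 08/12/2016\nFax: +32 92 70 33 75";

        // A date-shaped pattern finds the value wherever it sits in the text.
        System.out.println(matchPattern(page, "\\d{2}/\\d{2}/\\d{4}").orElse("(none)"));

        // A keyword rule finds the value by the label printed next to it.
        System.out.println(valueAfterKeyword(page, "Fax:").orElse("(none)"));
    }
}
```

The two rules fail in different ways, which is why combining them is useful: a pattern keeps working when a label is reworded, while a keyword keeps working when the value changes shape.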
The pdf2Data template creator was designed to allow non-developers such as business users or functional analysts to define and modify templates as required, enabling more collaborative workflows.
Want to try it out? We have an online demo of pdf2Data to test with an example document, or one you upload yourself.
The recognition process is based on the following steps:
Step 1. Upload a sample PDF document (this will become our template).
Step 2. Select data in the document you would like to extract and define relevant extraction rules (selectors) for the correct data extraction.
Step 3. Upload any other PDF document based on the same template and confirm your data was recognized correctly.
Step 4. Start using the template in the pdf2Data server-side component. You can integrate it into your document workflow as a Java or .NET library, or as a command-line application, enabling you to process potentially millions of documents with ease.
Using the pdf2Data SDK to extract data
Below you can see an example of using the pdf2Data SDK to parse a pre-defined template. After loading the license file required to enable pdf2Data to work, you can parse a document against your template and extract the data with just a couple of lines of code.
Java:

    // Make sure to load license file before invoking any code
    LicenseKey.loadLicenseFile(pathToLicenseFile);

    // Parse template into an object that will be used later on
    Template template = Pdf2DataExtractor.parseTemplateFromPDF(pathToPdfTemplate);

    // Create an instance of Pdf2DataExtractor for the parsed template
    Pdf2DataExtractor extractor = new Pdf2DataExtractor(template);

    // Feed file to be parsed against the template. Can be called multiple times for different files
    ParsingResult result = extractor.recognize(pathToFileToParse);

    // Save result to XML or explore the ParsingResult object to fetch information programmatically
    result.saveToXML(pathToOutXmlFile);
C# (.NET):

    // Make sure to load license file before invoking any code
    LicenseKey.LoadLicenseFile(pathToLicenseFile);

    // Parse template into an object that will be used later on
    Template template = Pdf2DataExtractor.ParseTemplateFromPDF(pathToPdfTemplate);

    // Create an instance of Pdf2DataExtractor for the parsed template
    Pdf2DataExtractor extractor = new Pdf2DataExtractor(template);

    // Feed file to be parsed against the template. Can be called multiple times for different files
    ParsingResult result = extractor.Recognize(pathToFileToParse);

    // Save result to XML or explore the ParsingResult object to fetch information programmatically
    result.SaveToXML(pathToOutXmlFile);
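Since the same extractor can be fed many files, a batch run mostly comes down to deciding where each result should be written. The sketch below shows one possible way to plan that using only the JDK. The BatchPlan class and the file names are hypothetical, and the pdf2Data calls appear only as comments because the SDK is assumed not to be on the classpath here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: not part of the pdf2Data SDK.
final class BatchPlan {

    // Maps every *.pdf input to an .xml path with the same base name in outputDir.
    static Map<Path, Path> plan(List<Path> inputs, Path outputDir) {
        Map<Path, Path> plan = new LinkedHashMap<>();
        for (Path input : inputs) {
            String name = input.getFileName().toString();
            if (!name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                continue; // skip anything that is not a PDF
            }
            String base = name.substring(0, name.length() - ".pdf".length());
            plan.put(input, outputDir.resolve(base + ".xml"));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Path> inputs = Arrays.asList(
                Paths.get("in", "invoice-001.pdf"),
                Paths.get("in", "notes.txt"),
                Paths.get("in", "invoice-002.PDF"));

        plan(inputs, Paths.get("out")).forEach((pdf, xml) -> {
            // With the SDK available, the calls from the sample above would go here:
            //   ParsingResult result = extractor.recognize(pdf.toString());
            //   result.saveToXML(xml.toString());
            System.out.println(pdf + " -> " + xml);
        });
    }
}
```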
Data is extracted in XML format, such as the example below:
    <?xml version="1.0" encoding="UTF-8"?>
    <elements>
      <data name="DATE">
        <text x="61.4" y="519.83" width="38.56" height="8.0" page="1">08/12/2016</text>
        <text x="96.28" y="477.57" width="38.56" height="8.0" page="2">16/01/2017</text>
      </data>
      <data name="END_USER_ADDRESS">
        <text x="102.25" y="612.39" width="98.84" height="40.0" page="1">Angela Merkel To the att. of Angela Merkel 059-X025 KucheMacherStraße 71060 Sandelfängen Germany</text>
      </data>
      <data name="FAX">
        <text x="486.04" y="727.53" width="58.16" height="9.0" page="1">+32 92 70 33 75</text>
      </data>
    </elements>
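Because the output is plain XML, it can be read with any standard XML parser. The following sketch uses only the JDK's built-in DOM parser to turn the data and text elements shown above into simple Java objects. The class names (ExtractionXmlReader, TextHit) are hypothetical and not part of the pdf2Data SDK, and the element and attribute names are assumed from the sample output rather than taken from a published schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the XML layout shown in the sample output.
final class ExtractionXmlReader {

    // One <text> element: the field it belongs to, where it was found, and its value.
    static final class TextHit {
        final String field;
        final int page;
        final double x;
        final double y;
        final String value;

        TextHit(String field, int page, double x, double y, String value) {
            this.field = field;
            this.page = page;
            this.x = x;
            this.y = y;
            this.value = value;
        }
    }

    // Parses the XML string and returns one TextHit per <text> element, in document order.
    static List<TextHit> read(String xml) {
        List<TextHit> hits = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList dataNodes = doc.getElementsByTagName("data");
            for (int i = 0; i < dataNodes.getLength(); i++) {
                Element data = (Element) dataNodes.item(i);
                NodeList textNodes = data.getElementsByTagName("text");
                for (int j = 0; j < textNodes.getLength(); j++) {
                    Element text = (Element) textNodes.item(j);
                    hits.add(new TextHit(
                            data.getAttribute("name"),
                            Integer.parseInt(text.getAttribute("page")),
                            Double.parseDouble(text.getAttribute("x")),
                            Double.parseDouble(text.getAttribute("y")),
                            text.getTextContent().trim()));
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read extraction XML", e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<elements>"
                + "<data name=\"DATE\">"
                + "<text x=\"61.4\" y=\"519.83\" width=\"38.56\" height=\"8.0\" page=\"1\">08/12/2016</text>"
                + "</data>"
                + "<data name=\"FAX\">"
                + "<text x=\"486.04\" y=\"727.53\" width=\"58.16\" height=\"9.0\" page=\"1\">+32 92 70 33 75</text>"
                + "</data>"
                + "</elements>";
        for (TextHit hit : read(xml)) {
            System.out.println(hit.field + " (page " + hit.page + "): " + hit.value);
        }
    }
}
```

A field such as DATE can yield several hits, one per occurrence, so the reader returns a flat list and leaves any grouping by field name to the caller.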
Here you will find the needed resources to install, configure and use the iText pdf2Data components. If you’re looking for a demonstration of how iText pdf2Data works, make sure to check out our online demo where you can test it with an example document, or one you upload yourself.
Why use iText 7 pdf2Data?
Data is an important commodity, and you may have more than you realize locked inside your PDF documents. Collecting this data manually could take a lot of time and resources, with the risk of input errors or security issues to consider.
With iText pdf2Data you can automate the process of extracting data in a secure way. By checking documents against your template to validate that data is recognized correctly, you can also ensure consistent results.
If your documents are not PDF, then iText has you covered. The iText 7 add-on pdfOCR turns scanned documents and images into PDF (or PDF/A-3u if you need long-term archiving compliance) ready to be processed by iText pdf2Data.
Automate PDF data extraction from PDF invoices, forms and other documents
Extract and process data from small or large volumes of PDFs by defining the information that is important for your data processes in a template. Automate PDF data extraction programmatically in Java or .NET (C#), or simply use the CLI.
Define which specific data you want to target for PDF data extraction
Easily define the information you want to extract in a template with the pdf2Data template creator. pdf2Data works with all PDF documents, such as invoices, forms and reports, and makes PDF data processing a highly efficient part of your workflow.
Integrate automated PDF data extraction into your existing document process
iText pdf2Data uses open standards, which makes integrating it into existing workflows easy and fast. In addition to the easy-to-use pdf2Data template creator, it also includes developer-focused SDKs for Java and .NET (C#) as well as a command-line interface. PDF data processing for the 21st century.
Better than AI-based alternatives?
Since the content recognition is based on selectors you define in the template, iText pdf2Data requires no prior training to recognize and extract data. The data recognition relies on a number of rules, which need to be defined in advance for each data field. Typical rules use any details from the PDF document that help to ensure correct data extraction.
Core capabilities of iText 7 pdf2Data
pdf2Data works by defining the areas, fonts, patterns, or tables of interest in a template that is used for all PDFs created in the same format, such as an invoice or other commercial documents.
You can then define areas of interest with selectors.
Each selector identifies the important information in a different way, and selectors can be used alone or in combination to meet your needs.
Extract data from PDF documents
Leverage iText 7 Core content extraction for high-fidelity recognition of text and images in PDF data processing.
Intuitive extraction configuration
iText pdf2Data has comprehensive out of the box functionality, with the flexibility to extend and customize. Focus on easy integration and open standards.
Use templates to streamline extraction
Define areas of interest and selection rules to get exactly the content you need.
Integrate in your PDF and/or data workflow
Data is output in a structured, reusable format for further processing, with access to the page coordinates of the extracted content.
iText DITO, a data-driven template-based PDF generator
Now you’ve got data extraction through templating done and dusted, are you interested in a template-based solution for PDF creation from data?