Thank you for your interest in pdf2Data, our data extraction add-on. We hope you will enjoy using the product and share your experiences with us and the iText community. This guide walks you through the installation process, from downloading iText 7 pdf2Data to adding the dependency to your Java build tool.
If you need extra help, have a look at our FAQs or the community discussion on Stack Overflow. Support from our in-house developers and license keys for commercial iText products both require a commercial license.
Before you install
- Make sure you have purchased a commercial license for iText 7 Core and pdf2Data if you use them for commercial purposes. All our closed-source downloads are covered by our commercial license model.
- Install iText 7 Core by following the iText 7 Core installation guide.
- Important remark: the iText 7 Core installation guide uses Maven as the build tool for Java.
The fastest way to start with pdf2Data is to create a data field template in the online editor. Upload your template PDF file in the first step of the process, use the data field editor to mark the entities to be recognized, then download the template and use it in your automated environment to extract data from a series of PDF files.
Refer to the videos section on the pdf2Data demo site to get familiar with the user interface quickly.
If you want to use pdf2Data in your environment, you need to have a license key. The license key is an XML file which you have to load into the license key library before using any API.
If you are using other iText add-ons as well, your license keys might be stored in multiple files, especially if you purchased the add-ons separately. In this case you can load the licenses into the license key library one by one, or all at once by passing an array of license key files to the license key library.
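As a rough illustration of the multi-file case, the sketch below gathers several license key files into a single array. It assumes, purely for the example, that all license files are stored as `*.xml` files in one directory; `LicenseFileCollector` and `collectLicenseFiles` are hypothetical names and not part of any iText library. The actual loading call is left as a comment because its exact signature depends on the license key library version you use.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not an iText class): collects license key files
// so they can be loaded together or one after another.
class LicenseFileCollector {

    // Returns the paths of all *.xml files directly inside licenseDir, sorted by name.
    static String[] collectLicenseFiles(Path licenseDir) throws IOException {
        List<String> paths = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(licenseDir, "*.xml")) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    paths.add(file.toString());
                }
            }
        }
        Collections.sort(paths);
        return paths.toArray(new String[0]);
    }

    // Usage: java LicenseFileCollector <directory with license files>
    public static void main(String[] args) throws IOException {
        String[] licenseFiles = collectLicenseFiles(java.nio.file.Paths.get(args[0]));
        for (String licenseFile : licenseFiles) {
            System.out.println(licenseFile);
        }
        // Next step (not shown): pass licenseFiles to the license key library,
        // either as one array or one file at a time.
    }
}
```

The resulting array can be handed to the license key library in a single call, or iterated over to load each file individually.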
Using pdf2Data in code
The preferred way to set up pdf2Data in Java is to use a build system like Maven or Gradle and download pdf2Data artifacts from the iText Artifactory located at https://repo.itextsupport.com/pdf2data/.
The groupId is com.duallab.pdf2data, and the artifactId is pdf2data.
In Maven, the configuration would look similar to the example below:
<repository>
    <id>pdf2Data</id>
    <name>pdf2Data Maven Repository</name>
    <url>https://repo.itextsupport.com/pdf2data</url>
</repository>

<dependency>
    <groupId>com.duallab.pdf2data</groupId>
    <artifactId>pdf2data</artifactId>
    <version>2.1.3-SNAPSHOT</version>
</dependency>
Example of how pdf2Data can be used in code:
// Make sure to load license file before invoking any code
LicenseKey.loadLicenseFile(pathToLicenseFile);

// Parse template into an object that will be used later on
Template template = Pdf2DataExtractor.parseTemplateFromPDF(pathToPdfTemplate);

// Create an instance of Pdf2DataExtractor for the parsed template
Pdf2DataExtractor extractor = new Pdf2DataExtractor(template);

// Feed file to be parsed against the template. Can be called multiple times for different files
ParsingResult result = extractor.recognize(pathToFileToParse);

// Save result to XML or explore the ParsingResult object to fetch information programmatically
result.saveToXML(pathToOutXmlFile);
Installation instructions for the data fields editor web application (aka pdf2Data template editor)
If you want to use the editor in your own environment, first make sure the following prerequisites are installed:
- Apache Tomcat 7 (≥ 7.0.77) or 8
- Java 8
Installation steps
- Download the war file of the version you are interested in from the iText Artifactory
- Create a properties file with the following contents:
# Set temporary directory for resources
dir.temp=your_folder_for_resources
# Path to iText license file, e.g. licensekey=/home/user/license.xml
licensekey=path_to_license_file.xml
- Create an environment variable PDF2DATA_PROPERTIES and set it to the path of the file from the previous step
- Deploy the application on the installed Tomcat server. In most cases it is sufficient to copy the war file into the webapps subdirectory in the Tomcat directory
- Start the Tomcat server, if it was not running before, and you are ready to go
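Before deploying, it can help to sanity-check the properties file that PDF2DATA_PROPERTIES points to. The following is a minimal sketch, not part of pdf2Data: `EditorConfigCheck` and `findProblems` are hypothetical names, and the check assumes that the temporary directory must already exist, which this guide does not state. Only the two keys shown above, `dir.temp` and `licensekey`, are inspected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-deployment check (not an iText class).
class EditorConfigCheck {

    // Returns human-readable problems; an empty list means both keys look usable.
    static List<String> findProblems(Path propertiesFile) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(propertiesFile)) {
            // Note: java.util.Properties treats backslashes as escape characters,
            // so Windows paths in the file need "/" or doubled backslashes.
            props.load(in);
        }
        List<String> problems = new ArrayList<>();

        String tempDir = props.getProperty("dir.temp");
        if (tempDir == null || tempDir.trim().isEmpty()) {
            problems.add("dir.temp is not set");
        } else if (!Files.isDirectory(Paths.get(tempDir.trim()))) {
            // Assumption: the directory is expected to exist before the editor starts.
            problems.add("dir.temp is not an existing directory: " + tempDir);
        }

        String license = props.getProperty("licensekey");
        if (license == null || license.trim().isEmpty()) {
            problems.add("licensekey is not set");
        } else if (!Files.isRegularFile(Paths.get(license.trim()))) {
            problems.add("licensekey does not point to a file: " + license);
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        String location = System.getenv("PDF2DATA_PROPERTIES");
        if (location == null) {
            System.out.println("PDF2DATA_PROPERTIES is not set");
            return;
        }
        for (String problem : findProblems(Paths.get(location))) {
            System.out.println(problem);
        }
    }
}
```

Running the class in the same shell or service environment that starts Tomcat also confirms that the environment variable is visible to that process.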
Command Line Interface
It is possible to use pdf2Data from the command line as long as you have Java 7 or 8 installed.
You can download the CLI application from the iText Artifactory.
The steps are similar to the ones you would typically do in code. The output format for data extraction is XML.
Creating a template entity from a template PDF:
java -jar cli.jar preprocess -t template.pdf -x template.xml -l license.xml
Parsing a PDF file against the template:
java -jar cli.jar parse -t template.xml -s file_for_parsing.pdf -p recognized.pdf -x recognized.xml
Displaying help for each command:
java -jar cli.jar help preprocess
java -jar cli.jar help parse
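If you want to drive the CLI from a Java program, for example in a scheduled job, the commands above can be assembled and launched with the standard `ProcessBuilder` class. This is only a sketch: `Pdf2DataCli` and its methods are hypothetical names, the flags simply mirror the two commands shown above, and the file names in `main` are placeholders.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper (not an iText class) around the CLI commands shown above.
class Pdf2DataCli {

    // Builds: java -jar <cliJar> preprocess -t <templatePdf> -x <templateXml> -l <licenseXml>
    static List<String> preprocessCommand(String cliJar, String templatePdf,
                                          String templateXml, String licenseXml) {
        return new ArrayList<>(Arrays.asList("java", "-jar", cliJar, "preprocess",
                "-t", templatePdf, "-x", templateXml, "-l", licenseXml));
    }

    // Builds: java -jar <cliJar> parse -t <templateXml> -s <sourcePdf> -p <recognizedPdf> -x <recognizedXml>
    static List<String> parseCommand(String cliJar, String templateXml, String sourcePdf,
                                     String recognizedPdf, String recognizedXml) {
        return new ArrayList<>(Arrays.asList("java", "-jar", cliJar, "parse",
                "-t", templateXml, "-s", sourcePdf, "-p", recognizedPdf, "-x", recognizedXml));
    }

    // Runs the command, forwards its output to this process, and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder file names, taken from the commands above.
        int exit = run(preprocessCommand("cli.jar", "template.pdf", "template.xml", "license.xml"));
        if (exit == 0) {
            exit = run(parseCommand("cli.jar", "template.xml", "file_for_parsing.pdf",
                    "recognized.pdf", "recognized.xml"));
        }
        System.exit(exit);
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when file paths contain spaces.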