Before I was hired, I knew that the iText PDF SDK had been around for over 20 years but I was unaware of just how widespread its usage was. However, after attending our customer event “Shake It, Make It” I had a much better understanding of why we have more than 125 million users worldwide, spread across almost every industry. After getting to see Michaël Demey demonstrate how developers use the iText PDF library to generate PDF documents in the previous session, I was excited to watch Israel Garcia and Pavel Chermyanin show off iText’s collaborative solutions for document generation and data extraction: iText DITO and iText pdf2Data.
I learned that iText isn’t just about generating and manipulating PDFs using code; we also develop fully-fledged document solutions. iText DITO allows designers or non-technical users to easily create documents by building data-driven templates. On the other hand, iText pdf2Data offers a way to automatically recognize and extract data from PDFs by using extraction templates. Both document solutions provide intuitive browser-based interfaces and are very versatile and user-friendly.
They also work well in combination with each other, since you can take data extracted with iText pdf2Data and repurpose it in a brand-new document using iText DITO. Pavel and Israel, who are the Product Managers for iText pdf2Data and iText DITO respectively, used their session to illustrate such a use case. To demonstrate the kind of workflow that can be accomplished with these tools, they extracted some cocktail recipe information from a recipe book and transformed the data into an attractive cocktail menu. And all in under 20 minutes!
Pavel kicked off by loading the PDF of the recipe book into iText pdf2Data’s template editor, so that he could create an extraction template. Being a template-based solution, iText pdf2Data uses these extraction templates to compare (or “parse”) against similar documents, meaning you only need one such document to begin the automated extraction of the data you require. In this case, the data in the recipe book consisted of the name of each cocktail, its ingredients, and an image of what the completed cocktail looked like. By defining each of these data elements, Pavel could extract each of the recipes in a structured JSON format, and then give the extracted data to Israel to work with.
“iText pdf2Data is an intuitive browser-based tool to help any non-tech person to create the extraction template really easily” – Pavel Chermyanin
The next step was to define the “cocktail_name” as a data element. iText pdf2Data allows you to do this using a wide range of configurable selectors which can be used either separately or in combination in a parsing pipeline. Since all the recipes in the book used a common format with names in a specific text style, it was simple to identify them using the font selector.
Pavel then defined the cocktail ingredients. Again, he did this using the font selector, but there was a complication: the recipe book showed the ingredients as a numbered list, which wasn’t something he wanted to extract along with the ingredient data. The solution for this was to add a regular expression selector (or regex) to the parsing pipeline, to trim the leading characters before extracting them.
Finally, he moved onto recognizing the images with the image selector. This didn’t require any specific boundaries or extra parameters to be defined since he wanted to extract all the images from the document.
Now that all the data elements had been defined, there was only one thing left to do. By default, the generated data file would contain the output from all the defined data fields in sequence, e.g., the cocktail names, the ingredients, and then the images. So that the data would be structured in the desired order, Pavel added a “groupBy” statement to the definitions of the ingredients and images so that they would be grouped with the cocktail name data element instead.
With this done, iText pdf2Data could use this extraction template to parse against the actual document and extract the data. Shortly after, a confirmation page was displayed where Pavel could check for any extraction errors. After confirming the data was correct, he saved the data as a JSON file.
With the extracted data now ready, Pavel passed the baton on to Israel to start assembling the custom cocktail menu in iText DITO. First, Israel used the iText DITO Manager to import the JSON file as a data collection, this way iText DITO could use this a data source to populate the template. He also added some extra images as resources, to use for decoration.
“iText DITO is an easy way for designers or users with no coding skills to create PDF templates. It is also an easy way for administrators to manage users, templates, and assets.” – Israel Garcia
Israel then created a new template and began designing the cocktail menu in iText DITO’s template editor. Using the WYSIWYG editing toolbox, he quickly added a table with an iText logo and a background image from the template resources, along with a text heading.
Next, Israel created a separate table for the actual cocktail recipes and used the data binding functionality to connect rich text elements in specific table cells to data elements in the imported JSON data. By simply binding the “cocktail_name”, “cocktail_ingredients”, and “cocktail_image” data elements to the selected table row for the first recipe, the task was completed. After saving and previewing the template, Israel showed that iText DITO had automatically filled in those cells with the required text and image for the “Dark and Stormy” cocktail from the JSON file!
That’s the power of iText pdf2Data and iText DITO. In a matter of minutes Pavel and Israel were able to identify and extract data from a source PDF and generate a new document to present the information in a customized format – all with no coding required.
The iText customer event was the perfect introduction to help me learn some in-depth details about iText and our offerings. It was enlightening to see our solutions in use, and how iText pdf2Data and iText DITO were able to transform data quickly and easily into an amazing looking document.
Next time, we’ll look at a session that focused on one of iText’s recent case studies. Our VP of Products André Lemos sat down with RadBee Co-Founder and CEO Rina Nir, to discuss how she and her development team utilized iText 7, with the pdfHTML, and pdfRender add-ons to create an innovative and disruptive solution to improve Jira workflows in the pharmaceutical industry, and other sectors.
This article was based on a talk given at iText’s 2022 customer event.