XML

Introducing pdf2Data 4.4: Simplified User Experience for Democratized Data Extraction

ian.morris — Tue, 30 Jan 2024 09:24:54 +0000


We are thrilled to kick off a new year by announcing our latest data extraction release: pdf2Data 4.4. This version focuses on two major areas: enhancing the user experience and refining data extraction capabilities. Both are key to eliminating data silos and enabling users of all skill levels to make data-driven decisions.

How pdf2Data Can Streamline Your IDP Workflows

Many businesses need to access and reuse data trapped inside PDFs, such as invoices, statements, or contracts. Recent years have seen the rise of Intelligent Document Processing (IDP) solutions, typically relying on AI or machine learning to recognize and process documents. However, such approaches require extensive training to accurately identify documents, which can be time-consuming and expensive.

In contrast, our pdf2Data solution uses a template-based approach which requires only a single example PDF to get started. Since documents such as invoices from a common supplier will have a standardized layout with only the content changing, pdf2Data allows you to collaboratively build, manage, and reuse extraction templates for specific document types.

With pdf2Data, you can easily automate the extraction of content from PDFs and transform it into reusable, structured data. Using the wide range of selectors, you can quickly build a parsing pipeline to find and extract useful data in documents. Selectors are available to intelligently identify specific text, barcodes, dates, and even multi-page tables.

Thanks to pdf2Data’s flexible on-premises deployment, it can be integrated seamlessly and securely into existing document workflows. With convenient Docker deployment and a RESTful API in addition to native Java and .NET libraries, pdf2Data offers comprehensive cross-platform compatibility, whatever your infrastructure.

In short, pdf2Data can save your business precious time and boost the productivity of modern IDP-focused workflows by easily allowing data in PDFs to be accessed and repurposed.

Enhanced and Intuitive User Experience

Our goal has always been to make data extraction as seamless and code-free as possible so that any authorized user can use pdf2Data with little to no IT requirements. With this release, we are doubling down on that commitment.

Users can now set up even more complex extraction pipelines in the pdf2Data Editor by using new predefined rules and selectors which require little to no coding skill. The pdf2Data Parsing Engine can then use these templates to process your documents, enabling more efficient automated IDP workflows.

Mixing and matching selectors is especially beneficial for extracting data from documents with complex structures or formatting, and as with all pdf2Data 4.x releases, we’ve focused on further improving the access to selector functionality previously only available in expert mode. Let’s take a closer look.

More Intuitive Data Extraction

With pdf2Data 4.4, we’ve taken a significant leap in data extraction technology. The new release offers:

Advanced Parsing: The ability to handle more complex documents with ease, thanks to the new selectors.
Improved Recognition Results: We’ve revised the format of pdf2Data’s recognition results to be more logical and consistent across the different Parsing Engines and JSON/XML output formats. Data fields have been streamlined so that the results are more predictable and grouped results have also been improved. The new format also allows easier introduction of new result types and selectors in the future.

Introducing the Search Area Feature

The new Search area gives greater control over where pdf2Data applies the parsing pipeline in documents. It replaces the previous Page and Boundary selectors, which have now been deprecated. You can restrict the area by specifying a page, page range, or by selecting a specific part of the document. You can then include or exclude parts of the page by clicking on them, as shown below:

You can easily include adjacent parts of the document by clicking on them.

You can also watch a tutorial video demonstrating its usage if you prefer.

Locate Specific Data with Crop Content

The new Crop content selector allows users to define specific areas of interest in a document based on its content. It's a significant improvement for processing multi-section documents and non-static forms.

Using the Crop content selector in conjunction with the Table selector to find and recognize specific table data.

Refine and Restrict Results with the Filter Selector

The Filter selector is a powerful addition that validates and filters extracted values. You can use specific search conditions to exclude unwanted content and ensure only relevant data is captured.

Using the Filter selector to extract a needed clause from a contract.

Together, these selectors improve the parsing process, making it more efficient and user-friendly than ever before.

Why Upgrade to pdf2Data 4.4?

If you are looking to enhance your data extraction processes with minimal coding and maximum efficiency, pdf2Data 4.4 is ready for you to upgrade today. If you are new to pdf2Data or just want to learn more about how data extraction makes your work easier, faster, and more accurate, reach out today.

What’s Next for pdf2Data?

We have big plans for upcoming releases. Coming soon will be a great new feature where pdf2Data will be able to recognize and extract content from images – not just PDF documents. This will significantly extend pdf2Data’s capabilities and strengthen its position as an essential part of cutting-edge, efficient IDP workflows.

Ready to Dive In?

Download pdf2Data 4.4 today and experience improved data extraction and overall user experience. For a full list of pdf2Data improvements, please see the changelog and other documentation on our website. You can also reach out for a full demonstration if you are new to data extraction solutions.

Alternatively, if you're looking for more code-based solutions you can check out our suite of developer-focused Intelligent Data Extraction capabilities on the Apryse website.

Article type

iText news Technical notes


Introducing iText pdf2Data 4.0: Template management and much more!

ian.morris — Tue, 06 Dec 2022 09:17:11 +0000


Introduction

We’re pleased to announce a new release of iText pdf2Data, our user-friendly template-based data extraction solution. We’ve been hard at work since the previous release, and as promised last time there’s a wealth of new stuff to tell you about. Think of it as an early Christmas present from us 😊.

The biggest change is the introduction of the pdf2Data Manager. This is a new component to manage your extraction templates more easily, create and manage users and workspaces, and more besides. We’ve also improved the template creation and data field editing experience to accelerate and support collaboration in document workflows.

As this is a major release, we’ve also taken the opportunity to revise the pdf2Data SDK’s API to make it clearer and more consistent, with some other additions and improvements mixed in.

Since these are some pretty significant additions and improvements to iText pdf2Data, we’re also bumping the version number to 4.0. Let’s get right into it.

What's new

pdf2Data Manager

If you’re familiar with iText DITO, our other user-friendly PDF document solution, then you’ll recognize a lot of similarities with the new pdf2Data Manager. Like iText DITO’s management component, it serves as the central environment for iText pdf2Data. All users must now log in with their credentials to help protect your data and prevent unauthorized access. They are then presented with a clear and user-friendly interface where they can access, import, and export extraction templates.

Since we want to ensure a seamless transition between management and editing, it is tightly integrated with the pdf2Data Editor where you define the data fields and parsing rules in your templates. Once you are done editing, you simply save and exit back to the Manager screen.

The pdf2Data Manager acts not only as a centralized storage for all your extraction templates but also allows the administration of users and multiple workspaces. Administrators can easily create and manage users and user roles, and also assign them to specific workspaces.

By selecting a particular template in the pdf2Data Manager you can also adjust existing parsing rules for templates, and quickly replace the reference PDFs used to verify extraction templates.

Finally, it’s now even easier to get started with iText pdf2Data with the introduction of template blueprints for extracting data from specific document types. Blueprints can reduce time when creating extraction templates, since they have predefined data fields which you can adapt for your own documents by simply replacing the sample file and adjusting the existing fields. In this release we have included an invoice blueprint, although we’ll be building more blueprints soon.

To allow iText pdf2Data to support all this new functionality, we are moving to a new more flexible and reusable format for extraction templates. Don’t worry though; you won’t need to recreate your existing templates since the pdf2Data Manager includes a tool to import and convert your legacy templates into the new format. See the migration guide for more details on converting templates and the new format.

pdf2Data Editor

As noted, the new pdf2Data Manager is integrated with the existing pdf2Data Editor so you can seamlessly switch between template editing and management. For this version though, we’ve made some improvements to the user experience when editing templates and data fields. While in earlier versions you sometimes needed to use the expert mode to get the most out of iText pdf2Data, that is no longer the case.

From now on, all extraction functionality is entirely available from the UI, although fans of the expert mode will be happy to know it still exists. Expert mode users now also get the benefit of a new and more convenient syntax.

pdf2Data SDK

The SDK is the key part of iText pdf2Data that manages the job of document data extraction. While template designers won’t ever have to deal with the SDK itself, developers will be happy to know we’ve made some improvements to its API. This means they need to read less documentation in order to integrate it into workflows. And of course, it has been updated to fully support the new template format.

Extraction updates

On top of its high-volume PDF data extraction capabilities, our built-in extraction algorithms are what makes iText pdf2Data special. These are fine-tuned to recognize common document elements such as tables, paragraphs, dates, and so on. We are adding to and improving these all the time, and this release is no exception.

Table extraction gained improved merging strategies, specifically for tables which span multiple pages. Error messages are now clearer and therefore more useful for debugging. In addition, the overall extraction process is more stable, reducing the chance of exceptions leading to problems.

Want to know more?

As usual, you can find all the technical details in the release notes on our Knowledge Base, along with our revised installation guides and other documentation. If you’re not already an iText pdf2Data customer, you can request a free 30-day online trial to test it out for yourself, or check the product page to learn more about its data extraction capabilities.

Tags

data extraction JSON Invoices tables in PDF XML templates

Article type

iText news Technical notes


iText pdf2Data 3.1.1 is now available!

ian.morris — Mon, 25 Jul 2022 15:36:33 +0000


Introduction

We are proud to announce the release of iText pdf2Data 3.1.1, the latest version of our template-based data extraction solution. iText pdf2Data intelligently recognizes data inside structured and semi-structured PDF documents and extracts them in a structured format.

iText pdf2Data consists of two main components: the browser-based pdf2Data Editor, which enables creation of extraction templates, and the pdf2Data SDK (available for Java, .NET, and as a command-line interface application) that you use to automatically extract data from PDF documents. This data can then be used in customer processes such as business analytics and reporting.

Our main focus in this release is on the SDK side: adding JSON output support to simplify the process of reusing extracted data. We’ve also concentrated on improving the accuracy of our high-level extraction selectors, which help you extract data painlessly without needing any technical knowledge.

What's new

JSON output

First things first, an important innovation in iText pdf2Data 3.1.1 is support for JSON as an output format. From now on, both the native Java and .NET SDK libraries and the CLI variant can output extracted data in JSON as well as XML. This allows more convenient integration into workflows in microservices and cloud-based solutions, where JSON is the de facto standard.

JSON output can also be selected from the template editor.

For anyone who prefers to use XML though, don’t worry! This output option is still available and can be used in exactly the same way as before.

Improved data extraction

A key feature of iText pdf2Data is that, to ease the process of data extraction, it provides high-level selectors which your less-technical employees can use from the intuitive template editor. The accuracy of these selectors, and therefore of the extraction algorithms behind them, is vitally important for our customers.

In this release, we focused on tweaking two of them in particular: Date and Price. As well as being able to manually configure these selectors to improve extraction, we also improved the validation of extracted values so you will get exactly what you expect in the XML or JSON output. You can now avoid getting outputs such as “32nd of July” from the Date selector or prices in Euros when parsing US invoices.
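To illustrate the kind of check that rejects a value like “32nd of July”, here is a minimal sketch of strict date validation that could be applied to extracted values downstream. It is not pdf2Data's implementation: the class name StrictDateCheck, the method isValidDate, and the "d MMMM uuuu" input format are all assumptions made for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the pdf2Data SDK.
class StrictDateCheck {
    // "uuuu" rather than "yyyy": STRICT resolution needs an era for year-of-era.
    private static final DateTimeFormatter FORMAT = DateTimeFormatter
            .ofPattern("d MMMM uuuu", Locale.ENGLISH)
            .withResolverStyle(ResolverStyle.STRICT);

    /** Returns true only if the text is a real calendar date such as "25 July 2022". */
    static boolean isValidDate(String text) {
        try {
            LocalDate.parse(text, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            // Covers both malformed text and impossible dates.
            return false;
        }
    }
}
```

With strict resolution, impossible dates such as 31 June are rejected instead of being silently adjusted to the last valid day of the month.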

The Price selector in action.

Special mention should be made of the improved table selector since it is a favorite selector of many customers. Indeed, iText pdf2Data features one of the best table extraction algorithms around, and so it is a significant reason our customers use iText pdf2Data. We’re always working to raise the bar for the recognition and extraction of tables in PDF though, and this release is no exception.

Bills of lading, purchase orders, invoices and similar documents often use templates which feature predefined, structured layouts, yet suppliers may need to include important notes for specific products. If there was no provision for this when the template was created the supplier would then need to write the notes directly into the table, and so you might end up with a table being split over multiple pages. In certain cases, this would lead to the table being detected as two separate tables. To prevent this, we’ve modified the table detection heuristics to support variable leading (line-spacing) between table rows.

That’s not all though, as another nice improvement to the table selector is that it can now ignore watermarks. Since watermarks tend to have different styling and don’t respect table structure, they could cause problems for the table detection in previous versions.

Watermarks such as the example shown here can now be ignored.

Improved user experience

Users can now expect a better experience while creating extraction templates, as we've been making efforts to reduce the learning curve for new users. In addition to the improved high-level selectors, we’ve also revised the messaging in the pdf2Data Editor to provide users with clearer explanations and make it easier to begin data extraction. Of course, we’ll never stop improving our iText pdf2Data documentation regardless of our release schedule, so make sure you keep tabs on our Knowledge Base.

What else?

We’ve fixed a couple of bugs in this release; one for PDFs which contain unsupported color spaces, and an out of memory exception which could occur when grouping lines with the Paragraph selector. As always, you can check our release notes for more details.

If you’re not already an iText pdf2Data customer, you can explore all its features and capabilities with a free 30-day online trial! Alternatively, check out the product page for a detailed overview of how iText pdf2Data works.

You can also visit our Knowledge Base where we have tutorials and a breakdown of all available pdf2Data selectors, including tips on how to use them effectively.

What's next?

Without giving too much away, you can expect some awesome additions and improvements to iText pdf2Data in the future, covering everything from template creation to data extraction, and perhaps even more 😉.

See you in the next quarterly release!

Tags

data extraction JSON Invoices tables in PDF XML watermark

Article type

iText news Technical notes


How to modify XFA documents before flattening

ian.morris — Thu, 16 Apr 2020 08:42:26 +0000


Intro

XFA is still widely used despite the fact that it was deprecated in PDF 2.0 (published in 2017), and the last update to the XFA specification was in 2012. If you need to flatten XFA to PDF you can use our iText 7 add-on pdfXFA, but one of the challenges XFA documents present is modifying them before flattening. While it's fairly easy to modify them in an application such as Adobe LiveCycle Designer, modifying the JavaScript through code can be challenging, as some structure information is required.

Structure

The XFA document has a few different XML files: template, localeSet, xmpmeta, datasets, config, and xfdf. When modifying the JavaScript, the important file is template. It contains the structure of the XFA document, including all the information about the fields and any JavaScript used, generally stored under a script or a calculate tag. To see the XML structure and navigate the DOM, you can open the document in RUPS and go to the XFA tab in the RUPS window.

What can JavaScript Do?

Below are some basic examples of what JavaScript can do in an XFA form. Because the JavaScript is not restricted, there are countless possibilities as to what can be achieved. Theoretically (although not very practically), entire applications could be written using just JavaScript and XFA.

Buttons

Pop-up Messages

Pop-up messages can be shown any time JavaScript is executed, for example on load, on a button click, when a field is calculated, etc.

Modifying the XFA

Modifying the JavaScript happens in the template branch of the DOM. There are three basic steps to modifying the XFA document:

Extract the XFA XML DOM.
Modify the DOM.
Write the DOM back to the PDF.

Extract the DOM

The XML is stored in a W3C DOM Document object (org.w3c.dom.Document), which can be extracted using the following code:

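// Open the source PDF for reading and set the destination for the modified copy.
PdfReader reader = new PdfReader(inputFileDir + "invoice.pdf");
PdfWriter writer = new PdfWriter(destFile);
PdfDocument pdfDoc = new PdfDocument(reader, writer);
// Get the XFA form from the AcroForm (false: do not create a form if none exists).
XfaForm xfa = PdfAcroForm.getAcroForm(pdfDoc, false).getXfaForm();
// The XFA XML as an org.w3c.dom.Document.
Document domDoc = xfa.getDomDocument();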

This retrieves the DOM, which can be navigated in a few different ways.

Modify the DOM

Once inside the DOM any modification method will work. We'll go over two methods. The first is manually navigating the DOM. The second is using a NodeFilter.

Manual Navigation

The DOM can be navigated manually with the methods getNextSibling() and getFirstChild(). For example, to get to the Template node you can use domDoc.getFirstChild().getFirstChild().getNextSibling().getNextSibling().getNextSibling();.
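A chain of getNextSibling() calls depends on the order of the nodes in a particular file, so it can be more robust to walk the children and match on the element name. The following is a minimal sketch using only the JDK's DOM classes. The class XfaPackets and its methods findChild and parse are hypothetical names for this example, and parse only stands in for the Document you would get from xfa.getDomDocument().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper for illustration; not part of iText.
class XfaPackets {
    /** Local name if the parser was namespace-aware, otherwise the plain node name. */
    static String nameOf(Node n) {
        return n.getLocalName() != null ? n.getLocalName() : n.getNodeName();
    }

    /** Returns the first element child of parent with the given name, or null if none. */
    static Node findChild(Node parent, String name) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            // Skip whitespace text nodes, comments, and other non-element children.
            if (c.getNodeType() == Node.ELEMENT_NODE && name.equals(nameOf(c))) {
                return c;
            }
        }
        return null;
    }

    /** Parses an XML string; a stand-in for the Document returned by the XFA form. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

With a real form, the call would look like XfaPackets.findChild(domDoc.getDocumentElement(), "template"), assuming the template node is a direct child of the document's root element.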

NodeFilter

A NodeFilter is an interface that filters out unwanted nodes. In the following example we skip every node except the calculate elements of fields named amount:

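import org.w3c.dom.Node;
import org.w3c.dom.traversal.NodeFilter;

// Accepts only calculate elements whose parent has the name attribute "amount".
public class CalcCheck implements NodeFilter {
    @Override
    public short acceptNode(Node n) {
        // getLocalName() is null for non-element nodes such as text and comments.
        String localName = n.getLocalName();
        Node parent = n.getParentNode();
        if (localName == null || !localName.equalsIgnoreCase("calculate")
                || parent == null || parent.getAttributes() == null) {
            return NodeFilter.FILTER_SKIP;
        }
        Node nameAttr = parent.getAttributes().getNamedItem("name");
        if (nameAttr != null && "amount".equalsIgnoreCase(nameAttr.getNodeValue())) {
            return NodeFilter.FILTER_ACCEPT;
        }
        return NodeFilter.FILTER_SKIP;
    }
}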

Then you create a NodeIterator that allows for iterating over the accepted nodes:

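// Requires imports from org.w3c.dom and org.w3c.dom.traversal.
public NodeIterator findCalc(Document doc) {
    Node first = doc.getDocumentElement();
    // The Document must also implement DocumentTraversal for this cast to succeed.
    DocumentTraversal docT = (DocumentTraversal) doc;
    return docT.createNodeIterator(first, NodeFilter.SHOW_ALL, new CalcCheck(), true);
}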
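Once you have the iterator, you loop over the accepted nodes and change them. The sketch below shows one way this could look, replacing the script text inside each accepted calculate node. It assumes the script lives in a script child of calculate. CalcRewriter, replaceCalcScripts, amountCalcFilter, and parse are hypothetical names, and the filter repeats the acceptance rule described above as a lambda so that the example stands alone.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

// Hypothetical helper for illustration; not part of iText.
class CalcRewriter {
    static String nameOf(Node n) {
        return n.getLocalName() != null ? n.getLocalName() : n.getNodeName();
    }

    /** Accepts calculate elements whose parent element has name="amount". */
    static NodeFilter amountCalcFilter() {
        return n -> {
            Node parent = n.getParentNode();
            boolean match = "calculate".equalsIgnoreCase(nameOf(n))
                    && parent instanceof Element
                    && "amount".equalsIgnoreCase(((Element) parent).getAttribute("name"));
            return match ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP;
        };
    }

    /** Replaces the script text of every accepted node; returns how many scripts changed. */
    static int replaceCalcScripts(Document doc, NodeFilter filter, String newScript) {
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, filter, true);
        int changed = 0;
        for (Node calc = it.nextNode(); calc != null; calc = it.nextNode()) {
            for (Node c = calc.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c instanceof Element && "script".equalsIgnoreCase(nameOf(c))) {
                    c.setTextContent(newScript);
                    changed++;
                }
            }
        }
        return changed;
    }

    /** Parses an XML string; a stand-in for the Document returned by the XFA form. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

The sketch uses SHOW_ELEMENT instead of SHOW_ALL so the filter is only consulted for element nodes, which is enough when the targets are calculate elements.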

Write the DOM

Writing the DOM doesn't change based on how the DOM was modified. The DOM needs to be set back into the XfaForm, and the XFA form needs to be written back into the PdfDocument. Finally, the document needs to be closed:

xfa.setDomDocument(domDoc);
xfa.write(pdfDoc);
pdfDoc.close();

Examples

Change the Calculate field with a Node Filter: Java or .NET

Modify the JavaScript message manually: Java or .NET

Remove the JavaScript message manually: Java or .NET
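As a rough idea of what removing a message involves, here is a minimal sketch that deletes every script element whose text calls xfa.host.messageBox. It is not the linked example. It assumes the DOM was parsed with namespace awareness and that removing only the script element is acceptable; a real form may need the enclosing event element removed as well. MessageRemover, removeMessageScripts, and parse are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of iText.
class MessageRemover {
    /** Removes each script element that calls xfa.host.messageBox; returns the count. */
    static int removeMessageScripts(Document doc) {
        NodeList scripts = doc.getElementsByTagNameNS("*", "script");
        // The NodeList is live, so collect the matches first and remove them afterwards.
        List<Node> toRemove = new ArrayList<>();
        for (int i = 0; i < scripts.getLength(); i++) {
            Node script = scripts.item(i);
            if (script.getTextContent().contains("xfa.host.messageBox")) {
                toRemove.add(script);
            }
        }
        for (Node script : toRemove) {
            script.getParentNode().removeChild(script);
        }
        return toRemove.size();
    }

    /** Parses an XML string; a stand-in for the Document returned by the XFA form. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```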

Conclusion

The biggest issue with modifying XFA documents is that the flexibility of the embedded JavaScript, along with the many ways fields can be created, makes it difficult to automate the fixing of files. This guide relies on some internal knowledge of the XFA structure and of the specific changes being made. And while a specific change might fix one file, the same issue in a different PDF might not be fixed by the same solution.

After reading all of this you might be thinking:

When would I need to do this?

This is generally only required when there is no control over where the XFA document comes from. Otherwise, it would be easier to modify it in a prefilled document.

Should I still be making XFA documents?

There are still pros and cons to making XFA documents. Generally it should be avoided, but if a system built around XFA forms is already in place, it might make sense. It's important to remember that XFA forms are not part of PDF 2.0 and require knowledge of JavaScript and XFA documents.

Is there an alternative?

If you want a way to process dynamic data and output as PDF then iText DITO is the alternative you're looking for. iText DITO is iText's low-code document generator that simplifies the process of creating and maintaining data-driven forms and templates.

You can learn more about iText DITO in this article, or by visiting the product page.

Tags

XFA XFA forms XML

Article type

Technical notes


iText and Business Process Automation

admin-marketing — Tue, 22 Jan 2013 00:00:00 +0000


No matter which business you run, whether a large corporation with thousands of employees and just as many clients or an SMB with only a handful of employees and a hundred customers, there’s one thing all companies have in common: they all implement business processes, and every business process involves data as well as documents.

The figure above shows an example of how iText can automate such processes, combining machine-readable XML with human-readable PDF documents.

This is how it works: the person to the left registers data. It could be a teacher writing a proposal for a new course, a company submitting a tender for a specific assignment, and so on. On the back end, the data is stored as an XML file, because that's the best way to store data with an unpredictable, variable size. Different people will collaborate on this document. For instance: an administrator can check if the submitted content meets the requirements and return the document with a request for completion if it doesn't. This is part of a typical workflow.

XML is more or less human-readable, but you don't want to present raw XML to people who aren't tech-savvy. One way to make the XML presentable is to create a dynamic form using Adobe LiveCycle Designer, which is a product that ships with Adobe Acrobat. Such a form is based on the XML Forms Architecture (XFA) and the result is a PDF document that acts as a container for XML. With iText, you can programmatically inject your custom XML into such a template, making the many different XML files that are passed around in the workflow much easier for a human being to consume.

However, the document remains dynamic, and that may not be desirable at the end of the process, when you want to persist the document. This is when you'll use iText's XFA Worker in combination with PdfAWriter to convert the XFA form into a PDF/A document. PDF/A is an ISO standard for Archiving.

Finally, to prevent the document from being changed, you can digitally sign it. If long-term preservation is important, you may want to use iText to add a Document Security Store (DSS) and a Document-Level Timestamp on a regular basis to allow long-term validation (LTV) as described in the PAdES-4 standard.

While iText isn't a full-blown BPM product, you can see that iText fills the gap that exists in many existing BPM solutions. The use case as described above has been successfully deployed by different iText customers. Please contact sales for more information.

The images in the figure are courtesy of stockimages, Jomphong, and twobee / FreeDigitalPhotos.net

Tags

iText 5 XML PdfAwriter XMLWorker

Article type

Technical notes

