XML https://itextpdf.com/ en Introducing pdf2Data 4.4: Simplified User Experience for Democratized Data Extraction https://itextpdf.com/blog/itext-news-technical-notes/introducing-pdf2data-44-simplified-data-extraction <span property="schema:name" class="field field--name-title field--type-string field--label-hidden">Introducing pdf2Data 4.4: Simplified User Experience for Democratized Data Extraction</span> <span rel="schema:author" class="field field--name-uid field--type-entity-reference field--label-hidden"><span lang="" about="/users/ianmorris" typeof="schema:Person" property="schema:name" datatype="">ian.morris</span></span> <span property="schema:dateCreated" content="2024-01-30T09:24:54+00:00" class="field field--name-created field--type-created field--label-hidden"><time datetime="2024-01-30T10:24:54+01:00" title="Tuesday, January 30, 2024 - 10:24" class="datetime">Tue, 01/30/2024 - 10:24</time> </span> <div property="schema:text" class="wysiwyg-content clearfix text-formatted field field--name-body field--type-text-with-summary field__items"> <p><img alt="pdf2Data 4.4 blog banner" data-entity-type="file" data-entity-uuid="c2378904-66e2-4451-9f6b-e33b9bf5413c" src="/sites/default/files/inline-images/Blog%20banner%201140x300_Pdf2Data%204.4%20release.png" width="1140" height="300" loading="lazy" /></p> <p>We are thrilled to kick off a new year by announcing our latest data extraction release - <a href="https://apryse.com/products/pdf2data">pdf2Data</a> 4.4. This version is focused on two major areas: enhancing user experience and refining data extraction capabilities. It is key to eliminating data silos and enabling users of all skill levels to make data-driven decisions.</p> <h2>How pdf2Data Can Streamline Your IDP Workflows</h2> <p>Many businesses need to access and reuse data trapped inside PDFs, such as invoices, statements, or contracts. 
Recent years have seen the rise of Intelligent Document Processing (IDP) solutions, typically relying on AI or machine learning to recognize and process documents. However, such approaches require extensive training to accurately identify documents, which can be time-consuming and expensive.</p> <p>In contrast, our pdf2Data solution uses a template-based approach which requires only a single example PDF to get started. Since documents such as invoices from a common supplier will have a standardized layout with only the content changing, pdf2Data allows you to collaboratively build, manage, and reuse extraction templates for specific document types.</p> <p>With pdf2Data, you can easily automate the extraction of content from PDFs and transform it into reusable, structured data. Using the wide range of selectors, you can quickly build a parsing pipeline to find and extract useful data in documents. Selectors are available to intelligently identify specific text, barcodes, dates, and even multi-page tables.</p> <p>Thanks to pdf2Data’s flexible and on-premises deployment, integration into existing document workflows can be seamlessly and securely achieved. With convenient Docker deployment and a RESTful API in addition to native Java and .NET libraries, pdf2Data has comprehensive cross-platform compatibility, whatever your infrastructure.</p> <p>In short, pdf2Data can save your business precious time and boost the productivity of modern IDP-focused workflows by easily allowing data in PDFs to be accessed and repurposed.</p> <h2>Enhanced and Intuitive User Experience</h2> <p>Our goal has always been to make data extraction as seamless and code-free as possible so that any authorized user can use pdf2Data with little to no IT requirements. 
With this release, we are doubling down on that commitment.</p> <p>Users can now set up even more complex extraction pipelines in the <a href="https://pdf2data.apryse.com/documentation/docs/editor-guide/Intro">pdf2Data Editor</a> by using new predefined rules and selectors which require minimal to no coding skills to make full use of. Templates can then be used by the <a href="https://pdf2data.apryse.com/documentation/docs/engine-guide/root/intro">pdf2Data Parsing Engine</a> to process your documents, and enable more efficient automated IDP workflows.</p> <p>Mixing and matching selectors is especially beneficial for extracting data from documents with complex structures or formatting, and as with all pdf2Data 4.x releases, we’ve focused on further improving the access to selector functionality previously only available in expert mode. Let’s take a closer look.</p> <h2>More Intuitive Data Extraction</h2> <p>With pdf2Data 4.4, we’ve taken a significant leap in data extraction technology. The new release offers:</p> <ul><li>Advanced Parsing: The ability to handle more complex documents with ease, thanks to the new selectors.</li> <li>Improved Recognition Results: We’ve revised the format of pdf2Data’s recognition results to be more logical and consistent across the different Parsing Engines and <a href="https://pdf2data.apryse.com/documentation/docs/engine-guide/specification/Recognition%20result%20specification/RecognitionResultSpecificationJSON">JSON</a>/<a href="https://pdf2data.apryse.com/documentation/docs/engine-guide/specification/Recognition%20result%20specification/RecognitionResultSpecificationXML">XML</a> output formats. Data fields have been streamlined so that the results are more predictable and grouped results have also been improved. 
The new format also allows easier introduction of new result types and selectors in the future.</li> </ul><h3>Introducing the Search Area Feature</h3> <p>The new <a href="https://pdf2data.apryse.com/documentation/docs/editor-guide/Using%20editor/Search%20area">Search area</a> gives greater control over where pdf2Data applies the parsing pipeline in documents. It replaces the previous Page and Boundary selectors, which have now been deprecated. You can restrict the area by specifying a page, a page range, or a specific part of the document. You can then include or exclude parts of the page by clicking on them, as shown below:</p> <figure role="group" class="caption caption-img"><img alt="pdf2Data's Search area function allows you to easily restrict the parsing pipeline" data-entity-type="file" data-entity-uuid="42e9164b-a72a-4396-b51f-a26b775da3c5" height="677" loading="lazy" src="/sites/default/files/inline-images/subscription_revenue_4.png" width="1577" /><figcaption>You can easily include adjacent parts of the document by clicking on them.</figcaption></figure><p>You can also watch a <a href="https://pdf2data.apryse.com/documentation/docs/editor-guide/video#search-area">tutorial video</a> demonstrating its usage if you prefer.</p> <h3>Locate Specific Data with Crop Content</h3> <p>The new <a href="https://pdf2data.apryse.com/documentation/docs/editor-guide/Using%20editor/Selectors/Crop">Crop content</a> selector allows users to define specific areas of interest in a document based on its content. 
It's a significant improvement for processing multi-section documents and non-static forms.</p> <figure role="group" class="caption caption-img"><img alt="The Crop content selector allows defining specific areas of interest in a document" data-entity-type="file" data-entity-uuid="7fa39cdf-1089-4631-85bd-a0ea2ccb6616" height="1032" loading="lazy" src="/sites/default/files/inline-images/subscription_revenue_2.png" width="1920" /><figcaption>Using the Crop content selector in conjunction with the Table selector to find and recognize specific table data.</figcaption></figure><h3>Refine and Restrict Results with the Filter Selector</h3> <p>The <a href="https://pdf2data.apryse.com/documentation/docs/editor-guide/Using%20editor/Selectors/Filter">Filter</a> selector is a powerful addition that validates and filters extracted values. You can use specific search conditions to exclude unwanted content and ensure only relevant data is captured.</p> <figure role="group" class="caption caption-img"><img alt="The Filter selector can validate and filter extracted values" data-entity-type="file" data-entity-uuid="7464f924-df7d-4fcd-937a-6fa5b7091a1d" height="799" loading="lazy" src="/sites/default/files/inline-images/filter-selector.png" width="1304" /><figcaption>Using the Filter selector to extract a needed clause from a contract.</figcaption></figure><p>Together, these selectors improve the parsing process, making it more efficient and user-friendly than ever before.</p> <h2>Why Upgrade to pdf2Data 4.4?</h2> <p>If you are looking to enhance your data extraction processes with minimal coding and maximum efficiency, pdf2Data 4.4 is ready for you to upgrade to today. If you are new to pdf2Data or just want to learn more about how data extraction makes your work easier, faster, and more accurate, reach out today.</p> <h2>What’s Next for pdf2Data?</h2> <p>We have big plans for upcoming releases. 
Coming soon will be a great new feature where pdf2Data will be able to recognize and extract content from images – not just PDF documents. This will significantly extend pdf2Data’s capabilities and strengthen its position as an essential part of cutting-edge, efficient IDP workflows.</p> <h2>Ready to Dive In?</h2> <p><a href="https://pdf2data.apryse.com/documentation/docs/downloads">Download</a> pdf2Data 4.4 today and experience improved data extraction and overall user experience. For a full list of pdf2Data improvements, please see the <a href="https://pdf2data.apryse.com/documentation/docs/release-notes/Version%204.x/Version%204.4.0%20Release%20Notes">changelog</a> and other <a href="https://pdf2data.apryse.com/documentation/">documentation</a> on our website. You can also <a href="https://apryse.com/form/pdf2data-trial">reach out</a> for a full demonstration if you are new to data extraction solutions.</p> <p>Alternatively, if you're looking for more code-based solutions you can check out our suite of developer-focused <a href="https://apryse.com/capabilities/extraction">Intelligent Data Extraction</a> capabilities on the Apryse website.</p> </div> <div class="field field--name-field-tags field--type-entity-reference field__items"> <div class="field__label">Tags</div> <a href="/tags/data-extraction" property="schema:about" hreflang="en">data extraction</a> <a href="/tags/json" property="schema:about" hreflang="en">JSON</a> <a href="/tags/invoices" property="schema:about" hreflang="en">Invoices</a> <a href="/tags/tables-pdf" property="schema:about" hreflang="en">tables in PDF</a> <a href="/tags/xml" property="schema:about" hreflang="en">XML</a> <a href="/tags/templates" property="schema:about" hreflang="en">templates</a> </div> 
<div class="field field--name-field-article-type field--type-entity-reference field__items"> <div class="field__label">Article type</div> <a href="/blog-type/itext-news" hreflang="en">iText news</a> <a href="/blog-type/technical-notes" hreflang="en">Technical notes</a> </div> <div class="field field--name-field-main-image field--type-entity-reference field__items"> <div class="field__label">Main image</div> <a href="/resources/media/images/pdf2data-44-blog-teaser" hreflang="en">pdf2Data 4.4 blog teaser</a> </div> Tue, 30 Jan 2024 09:24:54 +0000 ian.morris 15497 at https://itextpdf.com https://itextpdf.com/blog/itext-news-technical-notes/introducing-pdf2data-44-simplified-data-extraction#comments Introducing iText pdf2Data 4.0: Template management and much more! 
https://itextpdf.com/blog/itext-news-technical-notes/introducing-itext-pdf2data-40-template-management-and-much-more <span property="schema:name" class="field field--name-title field--type-string field--label-hidden">Introducing iText pdf2Data 4.0: Template management and much more!</span> <span rel="schema:author" class="field field--name-uid field--type-entity-reference field--label-hidden"><span lang="" about="/users/ianmorris" typeof="schema:Person" property="schema:name" datatype="">ian.morris</span></span> <span property="schema:dateCreated" content="2022-12-06T09:17:11+00:00" class="field field--name-created field--type-created field--label-hidden"><time datetime="2022-12-06T10:17:11+01:00" title="Tuesday, December 6, 2022 - 10:17" class="datetime">Tue, 12/06/2022 - 10:17</time> </span> <div property="schema:text" class="wysiwyg-content clearfix text-formatted field field--name-body field--type-text-with-summary field__items"> <p><img alt="iText pdf2Data 4.0 release" data-entity-type="file" data-entity-uuid="c4f82399-07e8-4ccf-8918-4372c3d07e2c" src="/sites/default/files/inline-images/iText%20pdf2Data4%20Main.png" width="1140" height="300" loading="lazy" /></p> <h2>Introduction</h2> <p>We’re pleased to announce a new release of <a href="https://itextpdf.com/en/node/236/" rel="noopener" target="_blank">iText pdf2Data</a>; our user-friendly template-based data extraction solution. We’ve been hard at work since the <a href="https://itextpdf.com/blog/itext-news-technical-notes/itext-pdf2data-311-now-available" rel="noopener" target="_blank">previous release,</a> and as promised last time there’s a wealth of new stuff to tell you about. Think of it as an early Christmas present from us 😊.</p> <p>The biggest change is the introduction of the pdf2Data Manager. This is a new component to manage your extraction templates more easily, create and manage users and workspaces, and more besides. 
We’ve also improved the template creation and data field editing experience to accelerate and support collaboration in document workflows.</p> <p>As this is a major release, we’ve also taken the opportunity to revise the pdf2Data SDK’s API to make it clearer and more consistent, with some other additions and improvements mixed in.</p> <p>Since these are some pretty significant additions and improvements to iText pdf2Data, we’re also bumping the version number to 4.0. Let’s get right into it.</p> <h2>What's new</h2> <h3>pdf2Data Manager</h3> <p>If you’re familiar with iText DITO, our other user-friendly PDF document solution, then you’ll recognize a lot of similarities with the new pdf2Data Manager. Like iText DITO’s management component, it serves as the central environment for iText pdf2Data. All users must now log in with their credentials to help protect your data and prevent unauthorized access. They are then presented with a clear and user-friendly interface where they can access, import, and export extraction templates.</p> <p>Since we want to ensure a seamless transition between management and editing, it is tightly integrated with the pdf2Data Editor where you define the data fields and parsing rules in your templates. Once you are done editing, you simply save and exit back to the Manager screen.</p> <p>The pdf2Data Manager acts not only as a centralized storage for all your extraction templates but also allows the administration of users and multiple workspaces. 
Administrators can easily create and manage users and user roles, and also assign them to specific workspaces.</p> <p><img alt="iText pdf2Data Manager Settings" data-entity-type="file" data-entity-uuid="02f85123-2df2-4cf0-83be-3c9fdeb53764" src="/sites/default/files/inline-images/chrome_KL3WqDJZ5U.png" width="1862" height="385" loading="lazy" /></p> <p><img alt="iText pdf2Data Manager User Management" data-entity-type="file" data-entity-uuid="8e560a5c-c21e-42ee-af7a-494373a1d4c0" src="/sites/default/files/inline-images/chrome_kNqlu7akge.png" width="1920" height="639" loading="lazy" /></p> <p>By selecting a particular template in the pdf2Data Manager you can also adjust existing parsing rules for templates, and quickly replace the reference PDFs used to verify extraction templates.</p> <p>Finally, it’s now even easier to get started with iText pdf2Data with the introduction of template blueprints for extracting data from specific document types. Blueprints can reduce time when creating extraction templates, since they have predefined data fields which you can adapt for your own documents by simply replacing the sample file and adjusting the existing fields. In this release we have included an invoice blueprint, although we’ll be building more blueprints soon.</p> <p><img alt="iText pdf2Data Blueprint" data-entity-type="file" data-entity-uuid="eac86dad-ca6c-4b96-b59d-6b99e9cf2e38" src="/sites/default/files/inline-images/chrome_5MqS5G43F8.png" width="592" height="578" loading="lazy" /></p> <p>To allow iText pdf2Data to support all this new functionality, we are moving to a new more flexible and reusable format for extraction templates. Don’t worry though; you won’t need to recreate your existing templates since the pdf2Data Manager includes a tool to import and convert your legacy templates into the new format. 
See the <a href="https://kb.itextpdf.com/2data/installation-guidelines/migration-from-v3-to-v4" rel="noopener" target="_blank">migration guide</a> for more details on converting templates and the new format.</p> <h3>pdf2Data Editor</h3> <p>As noted, the new pdf2Data Manager is integrated with the existing pdf2Data Editor so you can seamlessly switch between template editing and management. For this version though, we’ve made some improvements to the user experience when editing templates and data fields. While in earlier versions you sometimes needed to use the expert mode to get the most out of iText pdf2Data, that is no longer the case.</p> <p>From now on, all extraction functionality is entirely available from the UI, although fans of the expert mode will be happy to know it still exists. Expert mode users now also get the benefit of a new and more convenient syntax.</p> <p><img alt="iText pdf2Data Editor new interface" data-entity-type="file" data-entity-uuid="270b105e-2498-4557-aafb-1550b919e66f" src="/sites/default/files/inline-images/chrome_WaBCYGBgFB.png" width="1920" height="1080" loading="lazy" /></p> <h3>pdf2Data SDK</h3> <p>The SDK is the key part of iText pdf2Data that manages the job of document data extraction. While template designers won’t ever have to deal with the SDK itself, developers will be happy to know we’ve made some improvements to its API. This will mean they can read less documentation in order to integrate it into workflows. And of course, it has been updated to fully support the new template format.</p> <h3>Extraction updates</h3> <p>On top of its high-volume PDF data extraction capabilities, our built-in extraction algorithms are what makes iText pdf2Data special. These are fine-tuned to recognize common document elements such as tables, paragraphs, dates, and so on. 
We are adding to and improving these all the time, and this release is no exception.</p> <p>Table extraction gained improved merging strategies, specifically for tables which span multiple pages. Error messages became clearer, so more useful for debugging. In addition, the overall extraction process became more stable, reducing the chance of exceptions leading to problems.</p> <h3>Want to know more?</h3> <p>As usual, you can find all the technical details in the <a href="https://kb.itextpdf.com/2data/releases/release-itext-pdf2data-4-0" rel="noopener" target="_blank">release notes</a> on our Knowledge Base, along with our revised installation guides and other documentation. If you’re not already an iText pdf2Data customer, you can request a free <a href="https://itextpdf.com/en/get-started" rel="noopener" target="_blank">30-day online trial</a> to test it out for yourself, or check the <a href="https://itextpdf.com/en/node/236/" rel="noopener" target="_blank">product page</a> to learn more about its data extraction capabilities.</p> </div> <div class="field field--name-field-tags field--type-entity-reference field__items"> <div class="field__label">Tags</div> <a href="/tags/data-extraction" property="schema:about" hreflang="en">data extraction</a> <a href="/tags/json" property="schema:about" hreflang="en">JSON</a> <a href="/tags/invoices" property="schema:about" hreflang="en">Invoices</a> <a href="/tags/tables-pdf" property="schema:about" hreflang="en">tables in PDF</a> <a href="/tags/xml" property="schema:about" hreflang="en">XML</a> <a href="/tags/templates" property="schema:about" hreflang="en">templates</a> </div> 
<div class="field field--name-field-article-type field--type-entity-reference field__items"> <div class="field__label">Article type</div> <a href="/blog-type/itext-news" hreflang="en">iText news</a> <a href="/blog-type/technical-notes" hreflang="en">Technical notes</a> </div> <div class="field field--name-field-related-products field--type-entity-reference field__items"> <div class="field__label">Related products</div> <a href="/products/pdf2data/extract-your-content-from-pdf" hreflang="en">pdf2Data</a> </div> <div class="field field--name-field-main-image field--type-entity-reference field__items"> <div class="field__label">Main image</div> <a href="/resources/media/images/itext-pdf2data-40-release" hreflang="en">iText pdf2Data 4.0 release</a> </div> <div class="field field--name-field-promoted-to-home-page-text field--type-string field__items"> <div class="field__label">Promoted to home page text</div> Introducing iText pdf2Data 4.0: Template management and much more! </div> Tue, 06 Dec 2022 09:17:11 +0000 ian.morris 15471 at https://itextpdf.com https://itextpdf.com/blog/itext-news-technical-notes/introducing-itext-pdf2data-40-template-management-and-much-more#comments iText pdf2Data 3.1.1 is now available! 
https://itextpdf.com/blog/itext-news-technical-notes/itext-pdf2data-311-now-available <span property="schema:name" class="field field--name-title field--type-string field--label-hidden">iText pdf2Data 3.1.1 is now available!</span> <span rel="schema:author" class="field field--name-uid field--type-entity-reference field--label-hidden"><span lang="" about="/users/ianmorris" typeof="schema:Person" property="schema:name" datatype="">ian.morris</span></span> <span property="schema:dateCreated" content="2022-07-25T15:36:33+00:00" class="field field--name-created field--type-created field--label-hidden"><time datetime="2022-07-25T17:36:33+02:00" title="Monday, July 25, 2022 - 17:36" class="datetime">Mon, 07/25/2022 - 17:36</time> </span> <div property="schema:text" class="wysiwyg-content clearfix text-formatted field field--name-body field--type-text-with-summary field__items"> <p><img alt="iText pdf2Data 3.1.1 release" data-entity-type="file" data-entity-uuid="150c9573-93e5-434a-bd48-5f9dc79c933a" src="/sites/default/files/inline-images/pdf2Data_3_1_1_header-1.png" width="1140" height="300" loading="lazy" /></p> <h2>Introduction</h2> <p>We are proud to announce the release of <a href="https://kb.itextpdf.com/2data/releases/release-itext-pdf2data-3-1-1">iText pdf2Data 3.1.1</a>, the latest version of our template-based data extraction solution. iText pdf2Data intelligently recognizes data inside structured and semi-structured PDF documents and extracts them in a structured format.</p> <p>iText pdf2Data consists of two main components: the browser-based pdf2Data Editor, which enables creation of extraction templates, and the pdf2Data SDK (available for Java, .NET, and as a command-line interface application) that you use to automatically extract data from PDF documents. 
This data can then be used in customer processes such as business analytics and reporting.</p> <p>Our main focus of this release is on the SDK side; adding JSON output support to simplify the process of reusing extracted data. We’ve also concentrated on improving the accuracy of our high-level extraction selectors, which help you extract data painlessly without needing any technical knowledge. </p> <h2>What's new</h2> <h3>JSON output</h3> <p>First things first, an important innovation in iText pdf2Data 3.1.1 is the introduction of support for JSON format for output data. From now on, the native Java and .NET SDK libraries and the CLI variant can all output extracted data in JSON format as well as XML. This will allow more convenient integration into workflows in microservices and cloud-based solutions, as JSON is the de facto standard for these applications.</p> <figure role="group" class="caption caption-img"><img alt="JSON output selection" data-entity-type="file" data-entity-uuid="7e25ac78-b4c6-4f17-bf7e-fb81d6df6859" height="473" loading="lazy" src="/sites/default/files/inline-images/json.png" width="549" /><figcaption>JSON output can also be selected from the template editor.</figcaption></figure><p>For anyone who prefers to use XML though, don’t worry! This output option is still available and can be used in exactly the same way as before.</p> <h3>Improved data extraction</h3> <p>A key feature of iText pdf2Data is that to ease the process of data extraction, it provides high-level selectors which your less-technical employees can use from the intuitive template editor. The accuracy of these selectors, and therefore of the extraction algorithms behind them, is vitally important for our customers. </p> <p>In this release, we focused on tweaking two of them in particular: Date and Price. 
As well as being able to manually configure these selectors to improve extraction, we also improved the validation of extracted values so you will get exactly what you expect in the XML or JSON output. You can now avoid getting outputs such as “32nd of July” from the Date selector or prices in Euros when parsing US invoices.</p> <figure role="group" class="caption caption-img"><img alt="Price selector configuration" data-entity-type="file" data-entity-uuid="41014cad-d987-43d5-b1f3-da8afd6bb78c" height="407" loading="lazy" src="/sites/default/files/inline-images/price.png" width="899" /><figcaption>The Price selector in action.</figcaption></figure><p>Special mention should be made of the improved table selector since it is a favorite selector of many customers. Indeed, iText pdf2Data features one of the best table extraction algorithms around, and so it is a significant reason our customers use iText pdf2Data. We’re always working to raise the bar for the recognition and extraction of tables in PDF though, and this release is no exception.</p> <p>Bills of lading, purchase orders, invoices and similar documents often use templates which feature predefined, structured layouts, yet suppliers may need to include important notes for specific products. If there was no provision for this when the template was created the supplier would then need to write the notes directly into the table, and so you might end up with a table being split over multiple pages. In certain cases, this would lead to the table being detected as two separate tables. To prevent this, we’ve modified the table detection heuristics to support variable leading (line-spacing) between table rows.</p> <p>That’s not all though, as another nice improvement to the table selector is that it can now ignore watermarks. 
Since watermarks tend to have different styling and don’t respect table structure, they could cause problems for the table detection in previous versions.</p> <figure role="group" class="caption caption-img"><img alt="Watermark example" data-entity-type="file" data-entity-uuid="e7c411d5-04f7-4059-90f8-65714a901a63" height="206" loading="lazy" src="/sites/default/files/inline-images/watermark.png" width="724" /><figcaption>Watermarks such as the example shown here can now be ignored.</figcaption></figure><h3>Improved user experience</h3> <p>Users can now expect a better experience while creating extraction templates, as we've been making efforts to reduce the learning curve for new users. In addition to the improved high-level selectors, we’ve also revised the messaging in the pdf2Data Editor to provide users with clearer explanations and make it easier to begin data extraction. Of course, we’ll never stop improving our iText pdf2Data documentation regardless of our release schedule, so make sure you keep tabs on our <a href="https://kb.itextpdf.com/2data">Knowledge Base</a>.</p> <h2>What else?</h2> <p>We’ve fixed a couple of bugs in this release; one for PDFs which contain unsupported color spaces, and an out of memory exception which could occur when grouping lines with the Paragraph selector. As always, you can check our <a href="https://kb.itextpdf.com/2data/releases/release-itext-pdf2data-3-1-1">release notes</a> for more details.</p> <p>If you’re not already an iText pdf2Data customer, you can explore all its features and capabilities with a <a href="https://itextpdf.com/en/get-started">free 30-day online trial</a>! 
Alternatively, check out the <a href="https://itextpdf.com/en/node/236/">product page</a> for a detailed overview of how iText pdf2Data works.</p> <p>You can also visit our Knowledge Base where we have tutorials and a breakdown of all available pdf2Data selectors, including tips on how to use them effectively.</p> <h3>What's next?</h3> <p>Without giving too much away, you can expect some awesome additions and improvements to iText pdf2Data in the future, covering everything from template creation to data extraction, and perhaps even more 😉.</p> <p>See you in the next quarterly release!</p> </div> <div class="field field--name-field-tags field--type-entity-reference field__items"> <div class="field__label">Tags</div> <a href="/tags/data-extraction" property="schema:about" hreflang="en">data extraction</a> <a href="/tags/json" property="schema:about" hreflang="en">JSON</a> <a href="/tags/invoices" property="schema:about" hreflang="en">Invoices</a> <a href="/tags/tables-pdf" property="schema:about" hreflang="en">tables in PDF</a> <a href="/tags/xml" property="schema:about" hreflang="en">XML</a> <a href="/tags/watermark" property="schema:about" hreflang="en">watermark</a> </div> <div class="field field--name-field-article-type field--type-entity-reference field__items"> <div class="field__label">Article type</div> <a href="/blog-type/itext-news" hreflang="en">iText news</a> 
<a href="/blog-type/technical-notes" hreflang="en">Technical notes</a> </div> <div class="field field--name-field-related-products field--type-entity-reference field__items"> <div class="field__label">Related products</div> <a href="/products/pdf2data/extract-your-content-from-pdf" hreflang="en">pdf2Data</a> </div> <div class="field field--name-field-main-image field--type-entity-reference field__items"> <div class="field__label">Main image</div> <a href="/resources/media/images/itext-pdf2data-311-teaser" hreflang="en">iText pdf2Data 3.1.1 teaser</a> </div> <div class="field field--name-field-promoted-to-home-page-text field--type-string field__items"> <div class="field__label">Promoted to home page text</div> iText pdf2Data 3.1.1 is now available! </div> Mon, 25 Jul 2022 15:36:33 +0000 ian.morris 15449 at https://itextpdf.com https://itextpdf.com/blog/itext-news-technical-notes/itext-pdf2data-311-now-available#comments How to modify XFA documents before flattening https://itextpdf.com/blog/technical-notes/how-modify-xfa-documents-flattening <span property="schema:name" class="field field--name-title field--type-string field--label-hidden">How to modify XFA documents before flattening</span> <span rel="schema:author" class="field field--name-uid field--type-entity-reference field--label-hidden"><span lang="" about="/users/ianmorris" typeof="schema:Person" property="schema:name" datatype="">ian.morris</span></span> <span property="schema:dateCreated" content="2020-04-16T08:42:26+00:00" class="field field--name-created field--type-created field--label-hidden"><time datetime="2020-04-16T10:42:26+02:00" title="Thursday, April 16, 2020 - 10:42" class="datetime">Thu, 04/16/2020 - 10:42</time> </span> <div property="schema:text" class="wysiwyg-content clearfix text-formatted field field--name-body field--type-text-with-summary field__items"> <h2><img alt="XFA" data-entity-type="file" data-entity-uuid="d64fb0fd-6fce-49f4-a8e3-65d76ea68ee4" 
src="/sites/default/files/inline-images/blog%20XFA_2.png" width="1140" height="300" loading="lazy" /></h2> <h2>Intro</h2> <p>XFA is still widely used even though it was deprecated in PDF 2.0 (published in 2017) and the XFA specification was last updated in 2012. If you need to flatten XFA to PDF, you can use our iText 7 add-on <a href="/en/products/itext-7/pdfxfa" target="_blank" title="pdfXFA">pdfXFA</a>, but one of the challenges XFA documents present is modifying them before flattening. While it's fairly easy to modify them in an application such as Adobe LiveCycle Designer, modifying the JavaScript through code can be challenging, as some knowledge of the document's structure is required.</p> <h2 id="structure">Structure</h2> <p>The XFA document contains a few different XML files: <code>template</code>, <code>localeSet</code>, <code>xmpmeta</code>, <code>datasets</code>, <code>config</code>, and <code>xfdf</code>. When modifying the JavaScript, the important file is <code>template</code>. It contains the structure of the XFA document, including all the information about the fields and any JavaScript used, which is generally stored under a <code>script</code> or a <code>calculate</code> tag. To see the XML structure and navigate the DOM, you can open the document in <a href="https://itextpdf.com/en/products/rups-pdf-diagnostic-tool-reading-and-updating-pdf-syntax-debugging-pdf-code">RUPS</a> and select the XFA tab in the RUPS window. <img alt="An XFA document viewed in RUPS" src="/sites/default/files/inline-images/rups.PNG" /></p> <h3 id="what-can-javascript-do-">What can JavaScript Do?</h3> <p>Below are some basic examples of what JavaScript can do to an XFA form. Because the JavaScript itself is not restricted, there are countless possibilities as to what can be achieved. 
Theoretically (although not very practically), entire applications could be written using just JavaScript and XFA.</p> <h4 id="buttons">Buttons</h4> <p><img alt="A simple XFA form" src="/sites/default/files/inline-images/order1.png" /></p> <p><img alt="The Additional Order form" src="/sites/default/files/inline-images/order2_0.png" /></p> <h4 id="pop-up-messages">Pop-up Messages</h4> <p>Pop-up messages can be triggered whenever JavaScript is executed, for example on load, on a button click, or when a field is calculated.<br /><img alt="A pop up message called by JavaScript execution" src="/sites/default/files/inline-images/popup.PNG" /></p> <h3 id="modifying-the-xfa">Modifying the XFA</h3> <p>Modifying the JavaScript happens in the <code>template</code> branch of the DOM. There are three basic steps to modifying the XFA document:</p> <ol><li>Extract the XFA XML DOM.</li> <li>Modify the DOM.</li> <li>Write the DOM back to the PDF.</li> </ol><h4 id="extract-the-dom">Extract the DOM</h4> <p>The XML is stored in a W3C DOM Document object (<code>org.w3c.dom.Document</code>), which can be extracted using the following code:</p> <pre>
<code class="language-java">// Open the source PDF for reading and the destination file for writing
PdfReader reader = new PdfReader(inputFileDir + "invoice.pdf");
PdfWriter writer = new PdfWriter(destFile);
PdfDocument pdfDoc = new PdfDocument(reader, writer);
// Get the XFA form (false: do not create an AcroForm if none exists)
XfaForm xfa = PdfAcroForm.getAcroForm(pdfDoc, false).getXfaForm();
// The XFA XML as an org.w3c.dom.Document
Document domDoc = xfa.getDomDocument();</code></pre> <p>This retrieves the DOM, which can be navigated in a few different ways.</p> <h4 id="modify-the-dom">Modify the DOM</h4> <p>Once you have the DOM, any modification method will work. We'll go over two: navigating the DOM manually, and using a <code>NodeFilter</code>.</p> <h4 id="manual-navigation">Manual Navigation</h4> <p>The DOM can be navigated manually with the methods <code>getNextSibling()</code> and <code>getFirstChild()</code>. 
For example, to get to the Template node, you can use <code>domDoc.getFirstChild().getFirstChild().getNextSibling().getNextSibling().getNextSibling();</code>.</p> <h4 id="nodefilter">NodeFilter</h4> <p>A <code>NodeFilter</code> is an interface used to filter out unwanted nodes. In the following example, we skip everything except <code>calculate</code> nodes whose parent field is named <code>amount</code>:</p> <pre>
<code class="language-java">import org.w3c.dom.Node;
import org.w3c.dom.traversal.NodeFilter;

public class CalcCheck implements NodeFilter {
    @Override
    public short acceptNode(Node n) {
        try {
            // Accept only calculate nodes whose parent field is named "amount"
            if (n.getLocalName().equalsIgnoreCase("calculate")) {
                if (n.getParentNode().getAttributes().getNamedItem("name").getNodeValue().equalsIgnoreCase("amount")) {
                    return NodeFilter.FILTER_ACCEPT;
                }
            }
            return NodeFilter.FILTER_SKIP;
        } catch (NullPointerException e) {
            // Nodes without a local name, attributes, or a name attribute are skipped
            return NodeFilter.FILTER_SKIP;
        }
    }
}
</code></pre> <p>Then you create a <code>NodeIterator</code> that allows you to iterate over the accepted nodes:</p> <pre>
<code class="language-java">public NodeIterator findCalc(Document doc) {
    Node first = doc.getDocumentElement();
    // The DOM implementation must support the Traversal feature for this cast to work
    DocumentTraversal docT = (DocumentTraversal) doc;
    return docT.createNodeIterator(first, NodeFilter.SHOW_ALL, new CalcCheck(), true);
}
</code></pre> <h4 id="write-the-dom">Write the DOM</h4> <p>Writing the DOM doesn't change based on how the DOM was modified. The modified DOM needs to be set back on the <code>XfaForm</code>, and the XFA form needs to be written back into the <code>PdfDocument</code>. 
Finally, the document needs to be closed:</p> <pre>
<code class="language-java">xfa.setDomDocument(domDoc);
xfa.write(pdfDoc);
pdfDoc.close();
</code></pre> <h4>Examples</h4> <p>Change the Calculate field with a <code>NodeFilter</code>: <a href="https://git.itextsupport.com/projects/I7JS/repos/examples/browse/src/main/java/com/itextpdf/samples/xfa/XFACalculate.java" target="_blank">Java</a> or <a href="https://git.itextsupport.com/projects/I7NS/repos/samples/browse/itext/itext.samples/itext/samples/sandbox/xfa/XFACalculate.cs" target="_blank">.NET</a></p> <p>Modify the JavaScript message manually: <a href="https://git.itextsupport.com/projects/I7JS/repos/examples/browse/src/main/java/com/itextpdf/samples/xfa/XFAModify.java" target="_blank">Java</a> or <a href="https://git.itextsupport.com/projects/I7NS/repos/samples/browse/itext/itext.samples/itext/samples/sandbox/xfa/XFAModify.cs" target="_blank">.NET</a></p> <p>Remove the JavaScript message manually: <a href="https://git.itextsupport.com/projects/I7JS/repos/examples/browse/src/main/java/com/itextpdf/samples/xfa/XFARemove.java" target="_blank">Java</a> or <a href="https://git.itextsupport.com/projects/I7NS/repos/samples/browse/itext/itext.samples/itext/samples/sandbox/xfa/XFARemove.cs" target="_blank">.NET</a></p> <h2 id="conclusion">Conclusion</h2> <p>The biggest issue with modifying XFA documents is that the flexibility of the embedded JavaScript, along with the many different ways of creating fields, makes it difficult to automate the fixing of files. This guide relies on some internal knowledge of the XFA structure and of the specific changes being made. 
And while a specific change might fix one file, the same issue in a different PDF might not be fixed by the same solution.</p> <p>After reading all of this you might be thinking:</p> <p><em>When would I need to do this?</em></p> <p>This is generally only required when you have no control over where the XFA document comes from; otherwise, it would be easier to modify it in a prefilled document.</p> <p><em>Should I still be making XFA documents?</em></p> <p>There are still pros and cons to making XFA documents. Generally, they should be avoided, but if a system built around XFA forms is already in place, it might make sense to keep using them. It's important to remember that XFA forms are not part of PDF 2.0 and require knowledge of JavaScript and XFA documents.</p> <p><em>Is there an alternative?</em></p> <p>If you want a way to process dynamic data and output it as PDF, then <a href="https://itextpdf.com/en/products/itext-dito" target="_blank">iText DITO</a> is the alternative you're looking for. iText DITO is iText's low-code document generator that simplifies the process of creating and maintaining data-driven forms and templates.</p> <p>You can learn more about iText DITO in <a href="https://itextpdf.com/en/blog/itext-news/itext-dito-alternative-pdf-forms" target="_blank">this article</a>, or by visiting <a href="https://itextpdf.com/en/products/itext-dito" target="_blank">the product page</a>.</p> </div> <div class="field field--name-field-tags field--type-entity-reference field__items"> <div class="field__label">Tags</div> <a href="/tags/xfa" property="schema:about" hreflang="en">XFA</a> <a href="/tags/xfa-forms" property="schema:about" hreflang="en">XFA forms</a> <a href="/tags/xml" property="schema:about" hreflang="en">XML</a> </div>
<div class="field field--name-field-article-type field--type-entity-reference field__items"> <div class="field__label">Article type</div> <a href="/blog-type/technical-notes" hreflang="en">Technical notes</a> </div> <div class="field field--name-field-related-products field--type-entity-reference field__items"> <div class="field__label">Related products</div> <a href="/products/itext-core" hreflang="en">iText Core</a> <a href="/products/itext-suite" hreflang="en">iText Suite</a> <a href="/products/itext-community" hreflang="en">iText Community</a> <a href="/products/rups" hreflang="en">RUPS</a> <a href="/products/itext-dito" hreflang="en">iText DITO® </a> </div> <div class="field field--name-field-main-image field--type-entity-reference field__items"> <div class="field__label">Main image</div> <a href="/resources/media/images/xfa-1" hreflang="en">XFA</a> </div> <div class="field field--name-field-promoted-to-home-page-text field--type-string field__items"> <div class="field__label">Promoted to home page text</div> Blog: How to modify XFA documents before flattening </div> Thu, 16 Apr 2020 08:42:26 +0000 ian.morris 13966 at https://itextpdf.com https://itextpdf.com/blog/technical-notes/how-modify-xfa-documents-flattening#comments iText and Business Process Automation https://itextpdf.com/blog/technical-notes/itext-and-business-process-automation <span property="schema:name" class="field field--name-title field--type-string field--label-hidden">iText and Business Process Automation</span> <span rel="schema:author" class="field field--name-uid field--type-entity-reference field--label-hidden"><span lang="" 
about="/users/admin-marketing" typeof="schema:Person" property="schema:name" datatype="">admin-marketing</span></span> <span property="schema:dateCreated" content="2013-01-22T00:00:00+00:00" class="field field--name-created field--type-created field--label-hidden"><time datetime="2013-01-22T01:00:00+01:00" title="Tuesday, January 22, 2013 - 01:00" class="datetime">Tue, 01/22/2013 - 01:00</time> </span> <div property="schema:text" class="wysiwyg-content clearfix text-formatted field field--name-body field--type-text-with-summary field__items"> <p><a href="https://www.itextpdf.com/node/65"><img alt="" height="324" src="/sites/default/files/xfaworker_signing.preview_0.jpg" width="560" loading="lazy" /></a></p> <p><em>No matter which business you run, whether a large corporation with thousands of employees and as many clients or an SMB with only a handful of employees and a hundred customers, there’s one thing all companies have in common: they all implement business processes, and every business process involves data as well as documents.</em></p> <p><em>The above figure shows an example of how iText can automate such processes, combining machine-readable XML with human-readable PDF documents:</em></p> <p>This is how it works: the person to the left registers data. It could be a teacher writing a proposal for a new course, a company submitting a tender for a specific assignment, and so on. On the back end, the data is stored as an XML file, because that's the best way to store data with an unpredictable, variable size. Different people will collaborate on this document. For instance, an administrator can check whether the submitted content meets the requirements and, if it doesn't, return the document with a request for completion. This is part of a typical workflow.</p> <p>XML is more or less human-readable, but you don't want to present raw XML to people who aren't tech-savvy. 
One way to make the XML presentable is to create a dynamic form using Adobe LiveCycle Designer, which is a product that ships with Adobe Acrobat. Such a form is based on the XML Forms Architecture (XFA), and the result is a PDF document that acts as a container for XML. With iText, you can programmatically inject your custom XML into such a template, making the many different XML files that are passed around in the workflow much easier for a human being to consume.</p> <p>However, the document remains dynamic, and that may not be desirable at the end of the process, when you want to persist the document. This is when you'll use iText's <a href="https://itextpdf.com/Products/xfa-worker">XFA Worker</a> in combination with PdfAWriter to convert the XFA form into a PDF/A document. PDF/A is an ISO standard for archiving.</p> <p>Finally, to prevent the document from being changed unnoticed, you can digitally sign it. If long-term preservation is important, you may want to use iText to add a Document Security Store (DSS) and a Document-Level Timestamp on a regular basis to allow long-term validation (LTV) as described in the PAdES-4 standard.</p> <p>While iText isn't a full-blown BPM product, you can see that iText fills a gap that exists in many BPM solutions. The use case described above has been successfully deployed by several iText customers. 
Please contact sales for more information.</p> <p>The images in the figure are courtesy of stockimages, Jomphong, and twobee / FreeDigitalPhotos.net</p> </div> <div class="field field--name-field-tags field--type-entity-reference field__items"> <div class="field__label">Tags</div> <a href="/tags/itext-5" property="schema:about" hreflang="en">iText 5</a> <a href="/tags/xml" property="schema:about" hreflang="en">XML</a> <a href="/tags/pdfawriter" property="schema:about" hreflang="en">PdfAwriter</a> <a href="/tags/xmlworker" property="schema:about" hreflang="en">XMLWorker</a> </div> <div class="field field--name-field-article-type field--type-entity-reference field__items"> <div class="field__label">Article type</div> <a href="/blog-type/technical-notes" hreflang="en">Technical notes</a> </div> <div class="field field--name-field-related-products field--type-entity-reference field__items"> <div class="field__label">Related products</div> <a href="/products/itext-5-legacy" hreflang="en">iText 5</a> <a href="/products/itext-5-legacy/xml-worker" hreflang="en">XML Worker</a> </div> <div class="field field--name-field-main-image field--type-entity-reference field__items"> <div class="field__label">Main image</div> <a href="/resources/media/images/xfaworkersigningpreview0jpg-0" hreflang="en">xfaworker_signing.preview_0.jpg</a> </div> Tue, 22 Jan 2013 00:00:00 +0000 admin-marketing 461 
at https://itextpdf.com https://itextpdf.com/blog/technical-notes/itext-and-business-process-automation#comments