iText and PDF/UA development

Tags: PDF/UA, interview

ITEXT BRINGING PDF/UA SUPPORT TO JAVA IMPLEMENTATIONS

Interview by Duff Johnson

Owing in no small part to its free and open source licensing model, iText is one of the most popular and widely implemented Java libraries for PDF file creation and manipulation. I interviewed Bruno Lowagie to discuss his thoughts about PDF/UA and how he envisions implementers using iText to manage tagged PDF.

iText has supported creation of tagged PDF for some time, but according to Bruno Lowagie, the original developer of iText, the publication of ISO 14289 (PDF/UA) has provided a solid technical basis for achieving consistent results when implementing tagged PDF. The new standard is helping iText’s developers specify and prioritize additional development of tagged PDF-related features, which will soon translate into more accessible PDF files delivered to end users.

While iText is planning to formally announce support for PDF/UA compliance by Q3 of 2013 once documentation and various convenience features are added, Bruno emphasized that it will already be possible in 13Q1 to use iText to create PDF/UA documents from scratch using iText’s basic building blocks.

Please describe your product or suite of products, and how it (or they) use PDF/UA.

At the core of iText, you’ll find a proven enterprise-grade software library that interfaces with every different aspect of PDF.

iText allows developers to generate documents using high-level objects and convenience methods, but it also permits access to PDF at the lowest level using COS objects and AIM methods, and even lets them rewrite entire content streams.

Creating a PDF/UA document at the lowest level has been possible for a long time, but this required plenty of programming and know-how. More recently, we started supporting the automatic creation of Tagged PDF based on the high-level objects that are used to create the PDF. This was the first step towards PDF/UA support.

How do you see iText’s role in the electronic document industry?

The iText software library is traditionally used to create or manipulate PDF documents in automated processes. Typically an iText project is deployed in web applications, where content needs to be served dynamically to a browser. In these cases content isn’t available in advance: it’s calculated based on user input or real-time database information. iText can be used to build a standalone solution from scratch, but we also have many customers who are using iText to fill a need for advanced PDF technology in Enterprise Information Management (EIM), as well as many other BI/BA, BPM and ECM products.

How does PDF/UA support work in iText?

One of the traditional ways to create documents with iText is by using basic building blocks such as Paragraph, Image, List and so on. All these objects implement the Element interface, but now we’ve also introduced the IAccessibleElement interface with methods that allow you to add Attributes and to set a Role. These attributes and the role are used to create Tagged PDF, the basis for PDF/UA.

We also updated the core of iText to ensure that the natural reading order of the content is aligned with the structure tree. We still need to do some work on merging Structure Tree Roots when concatenating PDFs, and so on.

Does the product create PDF/UA files, process PDF/UA files, or what?

Currently we’re focusing on PDF/UA creation. We’re also working on document manipulation (such as splitting and merging) that maintains PDF/UA status. We already support conversion of Tagged PDF into XML. This functionality is useful in the context of PDF/UA and for now, we think it’s sufficient. We don’t have the ambition to create a PDF Accessibility Checker.

How does iText manipulate existing PDF files with respect to tagged PDF?

When I talk about manipulating PDFs, I mean splitting/merging existing files.

All other manipulations can be done via low-level methods provided by iText, but it goes without saying that one’s knowledge of PDF needs to be really great to do this correctly.

Let’s say we can change structure element attributes, or parse page contents and remove some tagged parts of content. Depending on the demand, you’ll probably see us adding convenient methods for operations that turn out to be useful.

One important note: when flattening a form containing AcroForm fields, we won’t rewrite the complete content stream for now. Flattening an AcroForm form to PDF/UA won’t be supported in the next couple of releases.

As I discussed in a recent blog post, the key requirement of PDF/UA from the assistive technology user’s point of view is: “Content shall be marked in the structure tree with semantically appropriate tags in a logical reading order.” How does iText address this requirement?

This is how it works:

As soon as you set the Tagged flag with the PdfWriter.setTagged() method, you can focus on using iText’s high-level objects (the so-called basic building blocks). iText will use the order in which you add elements (such as Paragraph, Phrase, etc…) to the document to create an appropriate structure tree automatically, so iText implementers are responsible for getting that right. Note that default roles are chosen depending on the type of the high-level object, but you can set custom roles when needed.

With the setTagged() method and four other calls, you can be confident that your PDF will pass as PDF/UA according to the PDF Accessibility Checker. This is available in iText 5.4.0.
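The idea that the order of added elements drives the structure tree can be illustrated with a small, self-contained Java sketch. This is a simplified model, not iText code: the Block class, the defaultRole() mapping and the buildStructure() method are hypothetical names invented for illustration, and the default roles shown (P, L, Figure, Span) are assumed here rather than taken from iText's actual behaviour.

```java
import java.util.ArrayList;
import java.util.List;

class StructureTreeSketch {

    /** Hypothetical stand-in for a high-level building block; not an iText class. */
    static class Block {
        final String type;   // e.g. "Paragraph", "List", "Image"
        final String text;
        String role;         // custom role, or null to fall back to the default

        Block(String type, String text) {
            this.type = type;
            this.text = text;
        }

        Block withRole(String customRole) {
            this.role = customRole;
            return this;
        }
    }

    /** Assumed default mapping from block type to a standard structure type. */
    static String defaultRole(String type) {
        switch (type) {
            case "Paragraph": return "P";
            case "List":      return "L";
            case "Image":     return "Figure";
            default:          return "Span";
        }
    }

    /** Builds a flat structure tree: one entry per block, in the order added. */
    static List<String> buildStructure(List<Block> blocks) {
        List<String> tree = new ArrayList<>();
        for (Block b : blocks) {
            String role = (b.role != null) ? b.role : defaultRole(b.type);
            tree.add(role + ": " + b.text);
        }
        return tree;
    }

    public static void main(String[] args) {
        List<Block> doc = new ArrayList<>();
        // A custom role overrides the default chosen for a Paragraph.
        doc.add(new Block("Paragraph", "Annual report").withRole("H1"));
        doc.add(new Block("Paragraph", "Revenue grew this year."));
        doc.add(new Block("Image", "Revenue chart"));
        for (String entry : buildStructure(doc)) {
            System.out.println(entry);
        }
    }
}
```

In this model, adding the blocks in a different order changes the resulting tree, which is why the implementer remains responsible for a sensible reading order.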

Does iText support the PrintField attribute?

One could add the attribute using low-level operations, but not many people use iText to create forms, so we didn’t plan any support for the PrintField attribute.

We do have the necessary infrastructure to parse all content streams and add the flattened field content at the appropriate place. That could be ready by 13Q3.

Can you say if you are planning to implement the PDF Association’s forthcoming Matterhorn Protocol?

Currently, we base our work on the ISO standard. We’re interested in whatever other document is published, but we can’t tell in advance if we’re going to implement it.

If your product includes verification features, will you require the user to verify each affected object, or address all such objects at once?

Verification is outside the scope of iText.

WCAG 2.0 has been around since 2008. Why didn’t you produce software to support that standard; why did you wait for PDF/UA?

When we first looked at the description of WCAG in the context of PDF, we felt it wasn’t as clear as ISO 32000-1, so we stuck to whatever is explained about accessibility in the PDF standard. Today, the PDF Accessibility Checker recognizes the PDF documents we produce as PDF/UA compliant, not as WCAG 2.0 compliant.

Apart from accessibility, what do you see as the most likely value end users can get from PDF/UA support in creation or processing software?

Things like PDF to HTML conversion are obvious, but the ability to extract useful data from a PDF document, making a document not only readable for humans but also by machines, will be a huge step forward in business analytics and business process automation.

The big challenge for Enterprise Information Management systems today is the existence of a plethora of unstructured documents. At iText, we’re doing projects that involve extracting data from traditional PDF documents. For instance: read all the lines from a bank statement in PDF and store them in a database; find a national number on the first page of a document and route it to the correct destination; and so on. Without structure, these kinds of operations are often difficult to implement and error-prone. Let’s change this by creating documents that contain structure!

In the second edition of my book iText in Action, I present an example where I convert an XML file with the first paragraphs of Moby Dick to PDF and back. Because of the use of Tagged PDF, the resulting XML is identical to the original one (even using the same custom tags). Suppose you create PDF invoices from XML: wouldn’t it be great if the party receiving the PDF invoice were able to extract the original invoice data from the invoice?
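To make the round-trip idea concrete, here is a minimal Java sketch of how a structure tree with custom tags could be written back out as XML. It is a toy model under stated assumptions, not the example from the book and not iText's API: the Node class and the toXml() and escape() methods are hypothetical, the invoice values are made up, and each role is assumed to be a valid XML element name.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedToXmlSketch {

    /** Hypothetical structure element: a role (tag name), optional text, and kids. */
    static class Node {
        final String role;
        final String text;
        final List<Node> kids = new ArrayList<>();

        Node(String role, String text) {
            this.role = role;
            this.text = text;
        }

        Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Depth-first walk: each role becomes an element name, kids keep their order. */
    static String toXml(Node n) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(n.role).append('>');
        if (n.text != null) {
            sb.append(escape(n.text));
        }
        for (Node kid : n.kids) {
            sb.append(toXml(kid));
        }
        sb.append("</").append(n.role).append('>');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative data only; the custom tags survive because they are the roles.
        Node invoice = new Node("invoice", null)
            .add(new Node("number", "2013-001"))
            .add(new Node("total", "150.00"));
        System.out.println(toXml(invoice));
    }
}
```

Because the element names come straight from the roles stored in the structure, custom tags chosen by the sender reappear unchanged on the receiving side, which is the property the invoice scenario relies on.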

I know that all of this was already possible for a long time, if only the documents were created as Tagged PDFs, but I hope that the buzz about PDF/UA will result in a higher percentage of structured documents.

What can you tell me about your release plans?

Better PDF/UA support was scheduled for 13Q2 / 13Q3, but we advanced the development to 2012 due to a customer request which gave the project a higher priority. We’ve made good progress and we’re testing PDF/UA support with a handful of selected customers. The functionality may be released earlier than 13Q3, but we don’t consider a product as released until we’ve documented it, so the release date remains 13Q3.