The Portable Document Format
In 2008, the PDF specification was published as an ISO standard: ISO 32000-1. This wasn't the first ISO standard for PDF. Figure 1.1 shows that there's an umbrella of PDF standards, each having its own specific purpose, often in the context of a specific sector or industry. ISO 32000 was written as the core standard that is used as the basis for all the sub-standards under this umbrella.
Figure 1.1: PDF, an umbrella of standards
The Portable Document Format was originally designed by Adobe. The first version of the specification was released in 1993. New versions were published on a regular basis, adding more and more functionality. ISO 32000-1 was based on version 1.7 (2006) of Adobe's PDF specification.
PDF/A: long-term preservation of documents
ISO 19005, or PDF/A, was originally developed to meet long-term archival needs. Part 1 was released in 2005. It was defined as a subset of version 1.4 of Adobe's PDF specification (which, at that time, wasn't an ISO standard yet). It introduced a series of obligations and restrictions:
The document needs to be self-contained: all fonts need to be embedded; external movie, sound or other binary files are not allowed.
The document needs to contain metadata in the Extensible Metadata Platform (XMP) format: ISO 16684 (XMP) describes how to embed XML metadata into a binary file, so that software that doesn't know how to interpret the binary data format can still extract the file's metadata.
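The idea of extracting metadata without understanding the host format can be illustrated with a short sketch. It assumes the XMP packet is stored uncompressed and unencrypted, and that it is wrapped in the usual "xpacket" begin and end processing instructions. The function name and the sample bytes are made up for illustration; this is not how any particular tool does it.

```python
from typing import Optional

# Marker strings of the XMP packet wrapper (assumed to appear as plain bytes).
PACKET_BEGIN = b"<?xpacket begin="
PACKET_END = b"<?xpacket end="


def extract_xmp_packet(data: bytes) -> Optional[bytes]:
    """Return the first XMP packet found in data, or None if there is none."""
    start = data.find(PACKET_BEGIN)
    if start == -1:
        return None
    end_marker = data.find(PACKET_END, start)
    if end_marker == -1:
        return None
    # The trailing processing instruction closes with "?>".
    close = data.find(b"?>", end_marker)
    if close == -1:
        return None
    return data[start:close + 2]


# Made-up bytes standing in for a real file: opaque binary data around a packet.
sample = (
    b"\x00\x01binary-header\xff"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>"
    b'<?xpacket end="w"?>'
    b"\xfe\xfdmore-binary"
)
packet = extract_xmp_packet(sample)
```

The scanner never parses the surrounding bytes; it only looks for the packet markers, which is why it works regardless of the host file format.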
From the start, it was determined that approved parts of ISO 19005 could never become invalid. New, subsequent parts would only define new, useful features.
ISO 19005-1:2005 (PDF/A-1) defined two conformance levels:
Level B ("basic"): ensures that the visual appearance of a document will be preserved for the long term.
Level A ("accessible"): ensures that the visual appearance of a document will be preserved for the long term, but also introduces structural and semantic properties. The PDF needs to be a Tagged PDF.
ISO 19005-2:2011 (PDF/A-2) was introduced to have a PDF/A standard that was based on the ISO standard (ISO 32000-1) instead of on Adobe's PDF specification. PDF/A-2 also adds a handful of features that were introduced in PDF 1.5, 1.6 and 1.7:
Useful additions include: support for JPEG2000, Collections, object-level XMP, and optional content.
Useful improvements include: better support for transparency, comment types and annotations, and digital signatures.
PDF/A-2 also defines an extra level besides Level A and Level B:
Level U ("Unicode"): ensures that the visual appearance of a document will be preserved for the long term, and that all text is stored in Unicode.
ISO 19005-3:2012 (PDF/A-3) was an almost identical copy of PDF/A-2 (even the typos were copied). There was only one difference from PDF/A-2: in PDF/A-3, attachments don't need to be PDF/A. You can attach any file to a PDF/A-3 document, for instance an XLS file containing calculations whose results are used in the document, the original Word document that was used to create the PDF document, and so on. The document itself needs to conform to all the obligations and restrictions of the PDF/A specification, but these obligations and restrictions do not apply to its attachments.
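The parts and conformance levels described in this section can be summarized as a small lookup table. The sketch below only encodes what the text states; the names are hypothetical, and PDF/A-3 is assumed to share the levels of PDF/A-2 because the text describes it as a near-identical copy.

```python
# Conformance levels per PDF/A part, as described in this section.
CONFORMANCE_LEVELS = {
    1: {"A", "B"},       # PDF/A-1: accessible, basic
    2: {"A", "B", "U"},  # PDF/A-2 adds level U (Unicode)
    3: {"A", "B", "U"},  # PDF/A-3: assumed identical to PDF/A-2
}


def is_known_conformance(part: int, level: str) -> bool:
    """True if the part/level combination is one of those listed above."""
    return level.upper() in CONFORMANCE_LEVELS.get(part, set())


def allows_arbitrary_attachments(part: int) -> bool:
    """Among the three parts covered here, only PDF/A-3 allows non-PDF/A attachments."""
    if part not in CONFORMANCE_LEVELS:
        raise ValueError(f"PDF/A part {part} is not covered by this sketch")
    return part == 3
```

For example, is_known_conformance(1, "U") is false because level U was only introduced with PDF/A-2.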
PDF as a format for invoices
All the qualities of the PDF/A standard are also highly desirable for invoices. Alas, PDF/A alone doesn't solve the problem of processing invoices (yet). Few PDF invoices today are structured; they lack the information that is required for automatic extraction of key data such as the amount due, addresses, taxes, wiring information, and so on.
If the PDF is Tagged, you already get some information about the semantics of the different content elements in the document. You could apply some artificial intelligence to detect the subtotal, the taxes and the grand total. But this will only work to a certain extent, because Tagged PDF wasn't designed for this purpose. The process will never be flawless, especially if Optical Character Recognition (OCR) is involved. Numbers can easily be misinterpreted: a zero can be read as the letter O, and a 5 can be scanned as a 6. It's an error-prone process in an area where there's little tolerance for errors.
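A small sketch shows how a single misread character spoils extracted data. It relies on the fact that the amounts on an invoice are tied together arithmetically; the function name and all amounts are made up for illustration.

```python
from decimal import Decimal, InvalidOperation


def totals_are_consistent(subtotal: str, tax: str, grand_total: str) -> bool:
    """Check that subtotal + tax equals the grand total, using exact decimals."""
    try:
        return Decimal(subtotal) + Decimal(tax) == Decimal(grand_total)
    except InvalidOperation:
        # A value such as "1O0.00" (letter O instead of zero) is not a number.
        return False


printed = ("150.00", "28.50", "178.50")          # what the invoice says
digit_misread = ("160.00", "28.50", "178.50")    # the 5 was scanned as a 6
letter_misread = ("15O.00", "28.50", "178.50")   # a zero was read as the letter O
```

A check like this flags both misreads, but it cannot tell which field is wrong, and it offers no protection for values that are not part of a sum, such as an account number.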
Ideally, we'd create our invoice as a PDF/A-3 invoice and attach a document that contains all the necessary data in a format that allows a machine to interpret the invoice without any human intervention. To find a format that meets this requirement, let's take a look at how large corporations exchange data.
Electronic Data Interchange
For decades, large corporations have used EDI to exchange information with each other in the form of structured data that is transmitted with minimal or no manual intervention. This required bilateral arrangements between the companies, defining which data is to be exchanged and which format is to be used. Implementing EDI isn't trivial. In the case of small and medium businesses, the low volume of transactions doesn't justify the cost of putting an EDI system in place. Several standardization organizations have tried to reduce this cost by introducing industry standards.
Electronic Business eXtensible Markup Language
In 1999, the United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT) and the Organization for the Advancement of Structured Information Standards (OASIS) started an initiative that resulted in a suite of standards that were approved by the International Organization for Standardization (ISO) in 2004. They were released under the general title ISO 15000: Electronic Business eXtensible Markup Language (ebXML).
This standard enables enterprises in any industry, of any size, anywhere in the world to conduct business over the internet. Originally, ISO 15000 consisted of these four parts:
ISO 15000-1:2004: ebXML Collaboration-Protocol Profile and Agreement Specification
ISO 15000-2:2004: ebXML Messaging Service Specification
ISO 15000-3:2004: ebXML Registry Information Model
ISO 15000-4:2004: ebXML Registry Services Specification
The goal for these standards was to make EDI less expensive and less difficult to implement, by providing companies with a standard method to exchange business messages, conduct trading relationships, communicate data in common terms and define and register business processes.
OASIS and UN/CEFACT also developed a common set of semantic building blocks that represent general types of business data. Existing business vocabularies were restructured and new business vocabularies were created. These were published in a fifth part of the ISO standard ISO 15000-5:2005, the ebXML Core Components Technical Specification (CCTS).
Universal Business Language and the Core Components Library
OASIS then went on to produce a data format in full conformance with the CCTS: the Universal Business Language (UBL). UBL became the foundation of a number of successful international frameworks such as ePrior, PEPPOL, and many other specifications. It's an XML-only specification whose data model is not normative.
UN/CEFACT released several versions of a Core Components Library (CCL) based on ISO 15000-5. A CCL is a repository of easily reused generic business data components. It provides templates describing postal addresses, tax information, payment information, and so on. UBL uses Core Components, but it's an XML-only specification. The Core Components of the CCL are syntax-independent. They can be used to create syntax solutions other than XML.
In 2014, ISO released an update of part 5 of ISO 15000: ISO 15000-5:2014 ebXML Core Components Specification (CCS). Unlike UBL, this specification is normative and syntax-neutral.
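To make "syntax-independent" more concrete, here is a minimal sketch of one component definition rendered in two syntaxes. The PostalAddress class and its field names are hypothetical and are not taken from the CCL; the sketch only shows that the data model can be defined once and serialized in more than one way.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PostalAddress:
    """A hypothetical, syntax-neutral component: just names and values."""
    street: str
    postcode: str
    city: str
    country_code: str


def to_xml(address: PostalAddress) -> str:
    """Render the component as XML, one child element per field."""
    root = ET.Element("PostalAddress")
    for name, value in asdict(address).items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def to_json(address: PostalAddress) -> str:
    """Render the same component as JSON."""
    return json.dumps(asdict(address), sort_keys=True)


address = PostalAddress("Example Street 1", "12345", "Sample City", "DE")
```

Both to_xml(address) and to_json(address) carry the same four values; only the syntax differs.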
The Cross Industry Invoice standard
ISO 15000-5:2014 and the CCL were used by UN/CEFACT as the basis for specific business document models, such as the Cross Industry Order (CIO), the Cross Industry Order Response (CIOR), the Cross Industry Invoice (CII), and so on.
The goal of these uniform, standardized models is to permit the exchange of data electronically in a syntax-independent, interoperable way, without any human intervention. Using the CII standard, companies can easily process any number of invoices in an automated way, based on mutual agreements on which data should be shared, and in which form, for instance using XML.
At the European level, the European Committee for Standardization (CEN) derived different Message User Guides (MUGs) from these standards, such as the Core Invoice Data Model MUG, which is a subset derived from the CII standard. CEN Workshop Agreement (CWA) documents CWA 16356-1, -2 and -3 describe the setup, content and data structures of a minimum scope in the context of sending invoice data. The Core Invoice Data Model defines about 100 field types related to an invoice.
Importance of the evolution of EDI standards
Electronic Data Interchange has been standardized and simplified to the point where it is no longer out of reach for small and medium businesses. This has been recognized by the German Forum for Electronic Invoicing (FeRD), which used the CII and CEN's Message User Guides as the basis for the ZUGFeRD Model.
Zentraler User Guide des Forums elektronische Rechnung Deutschland
The "Forum elektronische Rechnung Deutschland" (FeRD) is a German platform of associations, ministries and companies promoting electronic invoicing. In 2014, FeRD published the Central User Guide for Electronic Invoicing (ZUGFeRD) as a new standard for invoicing. This standard consists of the ZUGFeRD data model describing which contents make up an invoice (the semantics) and the ZUGFeRD format describing how these contents will be transferred.
The ZUGFeRD Data Model
For the data model, ZUGFeRD uses the CII standard and CEN's Message User Guides, to which the UN/CEFACT Naming and Design Rules (NDR) are applied, resulting in the ZUGFeRD XML schema. Every invoice needs to contain an XML file that validates against this schema.
Figure 1.2 shows the three different profiles that are supported.
Figure 1.2: Semantic profile of the ZUGFeRD standard
The Comfort profile supports the posting, payment and checking of invoices. The information required to do so is present either in structured form or as qualified text. Other data can be included as free text. Free text imposes no requirements in terms of coding of the information itself, but qualified text needs to be accompanied by a code qualifying the content.
The Basic profile is a subset of the Comfort profile. It reduces the requirements for structured data to the absolute minimum, such as posting and payment information. Other information can be added as free text.
The Extended profile is a superset of the Comfort profile. It covers the cross-industry requirements as fully as possible. All relevant data is available either in structured form, or as a qualified text field. Other data, such as a note on an advertising campaign, can be included as free text.
ZUGFeRD Basic only supports commercial invoices, notifications and credit notes (code 380). ZUGFeRD Comfort also supports debit and credit notes related to financial adjustments (code 84). ZUGFeRD Extended also supports self-billed invoices and self-billed credit notes (code 389).
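Because each profile extends the previous one, the type codes listed above can be modeled as cumulative sets. The sketch below assumes nothing beyond the three codes named in the text; the function and constant names are made up for illustration and do not come from the ZUGFeRD specification or from iText.

```python
from typing import Set

# Profiles ordered from the smallest to the largest.
PROFILE_ORDER = ("BASIC", "COMFORT", "EXTENDED")

# Type codes each profile adds on top of the one before it (from the text).
ADDED_TYPE_CODES = {
    "BASIC": {"380"},     # commercial invoices
    "COMFORT": {"84"},    # debit and credit notes for financial adjustments
    "EXTENDED": {"389"},  # self-billed invoices and credit notes
}


def supported_type_codes(profile: str) -> Set[str]:
    """Return every type code supported by a profile, including inherited ones."""
    key = profile.upper()
    if key not in PROFILE_ORDER:
        raise ValueError(f"unknown profile: {profile}")
    codes: Set[str] = set()
    for name in PROFILE_ORDER:
        codes |= ADDED_TYPE_CODES[name]
        if name == key:
            break
    return codes


def minimal_profile_for(type_code: str) -> str:
    """Return the smallest profile that supports the given type code."""
    for name in PROFILE_ORDER:
        if type_code in supported_type_codes(name):
            return name
    raise ValueError(f"type code {type_code} is not covered by this sketch")
```

With this table, a document with type code 84 would need at least the Comfort profile, while code 380 fits into Basic.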
We'll go into much more detail, explaining which data is required and what these codes are about, when we discuss how to use iText to create invoices that conform to the Basic and the Comfort profile.
The ZUGFeRD Format
PDF/A-3 was chosen for the ZUGFeRD format and, as a rule, each PDF/A-3 file contains one and only one invoice. This PDF file will contain all data used for automated processing as an embedded XML file that conforms to the requirements of the Basic, the Comfort, or the Extended profile. Due to the nature of PDF/A-3, invoicees who want to process the PDF manually aren't hindered by this additional file; they can open the document in one of the many PDF viewers that come preinstalled on virtually all PCs, smartphones and other devices. The fact that ZUGFeRD uses PDF/A also means that the visual representation of the document is preserved for the long term.
We will take a closer look at how the XML file is embedded into the PDF file in the examples that demonstrate how to create ZUGFeRD invoices with iText.
Bridging the Gap between SMBs and EDI
The aim of ZUGFeRD is to close the gap between the simple exchange of invoices as printed pages or image files, and EDI processes, which consist purely of structured data. Figure 1.3, taken from the ZUGFeRD specification, shows the spectrum. The more EDI-friendly your process is, the more it can be automated; but without a PDF page as part of the invoice, the document becomes harder for humans to consume.
Figure 1.3: Closing the gap between paper and EDI
FeRD created ZUGFeRD to lower the threshold for small and medium businesses to implement EDI-like invoicing processes, even if they only issue and receive a small number of invoices. Thanks to ZUGFeRD, they will be able to exchange invoices with companies of any size without any prior consultation or agreement.
iText wants to contribute to this evolution. Every small and medium business is already able to create invoices in the PDF format. With a tool such as iText, it's not difficult to comply with the PDF/A standard and to add an XML attachment. Thanks to iText, even small and medium businesses can afford to create ZUGFeRD-compliant invoices.
World-wide implementation of the ZUGFeRD standard could yield financial, technical and operational benefits across the entire economy, regardless of organization size or nationality.