The PDF standard for long-term preservation of documents

Archiving digital documents with PDF/A

Highlights for archiving and preservation of digital documents

  • Archiving digital documents remains an essential part of document management. In many industries documents need to be stored for decades. Consistency is key to keeping them accessible.
  • Digitally archiving your PDF documents can be achieved with PDF/A, the standard for long-term document preservation. The Portable in PDF already guarantees that a user will view a file in a consistent way, no matter the environment. 
  • PDF/A adds extra requirements to make sure the document remains consistent over longer periods of time.
  • PDF/A is of great use, and often required by law, in governments and the public sector, manufacturing and construction, the financial sector and healthcare.
  • PDF/A has different parts and conformance levels. Some even let you embed other files and open up possibilities for advanced machine reading and hybrid archiving.
  • With the iText Core library for Java and .NET (C#) you can scale up your PDF/A compliant document generation.

What is digital archiving?

While the carrier has transformed from physical to digital, archiving documents remains an essential and inevitable part of today’s document management. For various industries and authorities long-term archiving is even a legislated requirement. Often documents need to be stored for decades. And if you are going to go through all that trouble, you might as well go for a solution that does so consistently.

Why PDF/A compliance?

When it comes to the file format and standard to use for digital archiving, there is a widespread acceptance of one specific standard: PDF/A. And for good reason. PDF/A is based on the general PDF specifications, a standard that was already built to display documents on a wide range of devices and environments in a consistent way. This means that unlike many other document file formats, you don’t have to worry about other users seeing something different than you see.

PDF/A introduces a number of extra restrictions to guarantee the document also remains consistent over longer periods of time.

Self-containment

This key aspect of PDF/A ensures all content and info is embedded in the file. This includes the displayed content, fonts, and color information (ICC color profiles).
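To make "color information" more concrete: an ICC profile is a block of binary data that describes a color space, and it is this data that ends up embedded in the file. The sketch below is only an illustration using the JDK's built-in sRGB profile (via `java.awt.color.ICC_Profile`); it does not write a PDF, and the class and method names are made up for this example.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;

/** Illustrative only: inspects the kind of ICC data a PDF/A file carries. */
final class IccProfileCheck {

    /** Returns true when the profile describes a three-component RGB color space. */
    static boolean isRgb(ICC_Profile profile) {
        return profile.getColorSpaceType() == ColorSpace.TYPE_RGB
                && profile.getNumComponents() == 3;
    }

    public static void main(String[] args) {
        // The JDK ships a standard sRGB profile; a real workflow would
        // usually load an .icc/.icm file chosen for the document instead.
        ICC_Profile srgb = ICC_Profile.getInstance(ColorSpace.CS_sRGB);

        // getData() returns the raw profile bytes, i.e. what would be embedded.
        byte[] raw = srgb.getData();

        System.out.println("RGB profile: " + isRgb(srgb));
        System.out.println("Profile size in bytes: " + raw.length);
    }
}
```

Because the profile travels with the document, a viewer decades from now does not have to guess which color space the author intended.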

Content restrictions

Video and audio content is not allowed: Since these rely on external software to be rendered, there is no guarantee this content will remain consistent.

Encryption is not allowed: PDF/A doesn’t allow for encryption of the document or embedded content.

JavaScript and executable file launcher restrictions: Since these actions could alter the content of the PDF, these are forbidden. PDF/A-4 allows for limited use of JavaScript.

Standardized metadata

Using the XMP format is required. This metadata can hold (but is not limited to) copyright information and the indication that the PDF is a PDF/A.
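The "indication that the PDF is a PDF/A" is carried by two XMP properties, `pdfaid:part` and `pdfaid:conformance`. The following sketch builds a stripped-down packet with only those two properties and reads them back with the JDK's XML parser. It is a simplified illustration: a real XMP packet also has an `xpacket` wrapper and usually many more properties, and the class and method names here are hypothetical.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative only: the PDF/A identification part of an XMP packet. */
final class PdfAIdXmp {

    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    /** Builds a minimal XMP fragment, e.g. part 3 and conformance "U" for PDF/A-3u. */
    static String packet(int part, String conformance) {
        return "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
             + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
             + "  <rdf:Description rdf:about=\"\" xmlns:pdfaid=\"" + PDFAID_NS + "\">\n"
             + "   <pdfaid:part>" + part + "</pdfaid:part>\n"
             + "   <pdfaid:conformance>" + conformance + "</pdfaid:conformance>\n"
             + "  </rdf:Description>\n"
             + " </rdf:RDF>\n"
             + "</x:xmpmeta>";
    }

    /** Reads one pdfaid property ("part" or "conformance") from the XMP text. */
    static String read(String xmp, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to look up elements by namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmp)));
            return doc.getElementsByTagNameNS(PDFAID_NS, localName)
                    .item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable XMP fragment", e);
        }
    }

    public static void main(String[] args) {
        String xmp = packet(3, "U");
        String flavour = "PDF/A-" + read(xmp, "part")
                + read(xmp, "conformance").toLowerCase(Locale.ROOT);
        System.out.println(flavour); // prints "PDF/A-3u"
    }
}
```

Validators typically start from these two values to decide which set of rules to check the file against.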

Access the power of PDF

With PDF/A conformance you can access a world of possibilities that are unique to PDF, such as digital signing, data extraction, redaction, and optimizing documents for size and speed.

Who needs archiving?

Many industry-specific compliance regulations and needs offer a challenge that can often be resolved with the use of PDF/A.

Governments and public institutions

Many public authorities recommend PDF/A and some even make it a hard requirement. For example, the Dutch, Swiss and Danish governments all enforce the use of PDF/A for non-editable documents.

Among archival institutions, PDF/A is also one of the preferred formats of the Smithsonian, the New York State Archives, the National Archives of the Netherlands, the National Archives of the UK and others.

Manufacturing and construction

Industry documentation often needs to be kept available for future reference and liability issues. A famous example is that of aeroplane manufacturer Airbus. Aeroplane blueprints must be preserved for at least 99 years. Even before PDF/A was created, the Airbus team developed a “minimal PDF” to avoid the pitfalls of general PDF.

PDF/A-3 has the added advantage that any sort of file such as 3D models can be embedded into the PDF container.

Financial sector

The financial and insurance sector sometimes requires documents to be retained for 50 or more years. Additionally, financial records and sensitive client information need to be kept in a contained and secure format.

Another great use case is e-invoicing. The ZUGFeRD format uses the PDF/A-3 container capabilities to embed invoice data in a machine-readable XML format alongside the original invoice. This allows suitable software to process the invoice automatically.

Healthcare

As a general rule, medical documents need to be preserved for up to 30 years. Think patient records, medical statements, reports, and imagery like X-rays. Digital signatures and time stamps are often added to attach an audit trail to documents.

Medical documents archived as PDF/A can also be a useful resource for research on the long-term effects of medication and treatments. PDF/A assures a reliable visual representation for decades, and since text can be stored as Unicode, content can be searched and easily extracted for reuse.

Understanding PDF/A: Parts and conformance levels

PDF/A-1

The strictest of the four parts. It forbids transparency, layers, and JPEG2000 and LZW compression.

PDF/A-2

Lifts several of the PDF/A-1 restrictions: transparency, layers and JPEG2000 compression are allowed, while LZW compression remains forbidden. It also introduces the embedding of other PDF/A files, making it an effective container for multiple PDF documents.

PDF/A-3

The biggest change here is that one can now embed any type of file in a PDF/A-3 document. This proves especially handy to add a machine readable (XML) copy or to include the original data the document was based on in a form of hybrid archiving.

PDF/A-4

Published in 2020, Part 4 is not yet widely used. Its biggest change is that it allows some of the new features of PDF 2.0, such as page-level output intents (an output intent tells the processor how to interpret the colors used in the document).

Conformance levels

Level b (“basic”): ensures that the visual appearance of a document will be preserved for the long term.

Level a (“accessible”): ensures that the visual appearance of a document will be preserved for the long term, but also introduces structural and semantic properties. The PDF needs to be a Tagged PDF.

Level u (“Unicode”): ensures that the visual appearance of a document will be preserved for the long term, and that all text is stored in Unicode. This facilitates the searchability of text. This is also the preferred output format for OCR-generated PDFs.

The f and e levels of PDF/A-4 are less conformance levels than functional profiles that extend the specification. PDF/A-4f allows embedding any other file, making it a successor to PDF/A-3. PDF/A-4e supports Rich Media and 3D annotations.
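Not every level exists for every part: PDF/A-1 defines only levels a and b, PDF/A-2 and PDF/A-3 define a, b and u, and PDF/A-4 has only the base standard plus the e and f profiles. The following lookup is a small sketch of those combinations, not part of any library; the class name, method names and the use of an empty string for plain PDF/A-4 are choices made for this example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: which conformance levels exist for each PDF/A part. */
final class PdfAFlavours {

    // Part number -> levels defined for that part.
    // PDF/A-4 has no a/b/u levels; "" stands for plain PDF/A-4.
    private static final Map<Integer, List<String>> LEVELS = Map.of(
            1, List.of("a", "b"),
            2, List.of("a", "b", "u"),
            3, List.of("a", "b", "u"),
            4, List.of("", "e", "f"));

    /** True when the part/level pair names an existing PDF/A flavour. */
    static boolean isValid(int part, String level) {
        List<String> levels = LEVELS.get(part);
        return levels != null && levels.contains(level.toLowerCase(Locale.ROOT));
    }

    /** True when the flavour may embed files of any type (PDF/A-3 and PDF/A-4f). */
    static boolean allowsAnyAttachment(int part, String level) {
        return isValid(part, level)
                && (part == 3 || (part == 4 && level.equalsIgnoreCase("f")));
    }

    public static void main(String[] args) {
        System.out.println(isValid(1, "u"));             // false: level u starts with PDF/A-2
        System.out.println(isValid(3, "U"));             // true
        System.out.println(allowsAnyAttachment(2, "b")); // false: PDF/A-2 embeds only PDF/A files
        System.out.println(allowsAnyAttachment(4, "f")); // true
    }
}
```

A check like this can be useful early in a pipeline, for example to reject a request for "PDF/A-1u" before any document is generated.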

Making PDF/A compliant documents with iText Core

The open-source iText Core library gives us all the tools we need to make a PDF/A compliant document. See the following links to jump right into our tutorial:

Java tutorial .NET(C#) tutorial

Scanned documents to a PDF/A-3u document with iText pdfOCR

// Java: OCR a scanned image into a PDF/A-3u document with iText pdfOCR
import com.itextpdf.kernel.pdf.PdfOutputIntent;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.pdfocr.OcrPdfCreator;
import com.itextpdf.pdfocr.OcrPdfCreatorProperties;
import com.itextpdf.pdfocr.tesseract4.Tesseract4LibOcrEngine;
import com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

public class JDoodle {

    // Folder with the Tesseract language data; placeholder value, adjust to your setup
    private static final String TESS_DATA_FOLDER = "tessdata";
    private static final List<File> LIST_IMAGES_OCR = Arrays.asList(new File("invoice_front.jpg"));
    private static final String OUTPUT_PDF = "/myfiles/hello.pdf";
    private static final String DEFAULT_RGB_COLOR_PROFILE_PATH = "profiles/sRGB_CS_profile.icm";

    public static void main(String[] args) throws IOException {

        // Configure the OCR engine before creating it
        final Tesseract4OcrEngineProperties tesseract4OcrEngineProperties = new Tesseract4OcrEngineProperties();
        tesseract4OcrEngineProperties.setPathToTessData(new File(TESS_DATA_FOLDER));
        final Tesseract4LibOcrEngine tesseractReader = new Tesseract4LibOcrEngine(tesseract4OcrEngineProperties);

        OcrPdfCreatorProperties properties = new OcrPdfCreatorProperties();
        properties.setPdfLang("en"); // we need to define a language to make it PDF/A compliant

        OcrPdfCreator ocrPdfCreator = new OcrPdfCreator(tesseractReader, properties);
        try (PdfWriter writer = new PdfWriter(OUTPUT_PDF)) {
            // Run OCR on the images and write the result with an embedded output intent
            ocrPdfCreator.createPdfA(LIST_IMAGES_OCR, writer, getRGBPdfOutputIntent()).close();
        }
    }

    // Builds the output intent from the sRGB ICC profile that gets embedded in the PDF/A
    public static PdfOutputIntent getRGBPdfOutputIntent() throws IOException {
        InputStream is = new FileInputStream(DEFAULT_RGB_COLOR_PROFILE_PATH);
        return new PdfOutputIntent("", "",
                "", "sRGB IEC61966-2.1", is);
    }

}
// .NET (C#): the same example
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Pdfocr;
using iText.Pdfocr.Tesseract4;

public class Program
{
    // Folder with the Tesseract language data; placeholder value, adjust to your setup
    private const string TESS_DATA_FOLDER = "tessdata";
    private const string OUTPUT_PDF = "/myfiles/hello.pdf";
    private const string DEFAULT_RGB_COLOR_PROFILE_PATH = @"profiles\sRGB_CS_profile.icm";
    private static readonly IList<FileInfo> LIST_IMAGES_OCR = new List<FileInfo>
    {
        new FileInfo("invoice_front.jpg")
    };

    static void Main()
    {
        // Configure the OCR engine before creating it
        var tesseract4OcrEngineProperties = new Tesseract4OcrEngineProperties();
        tesseract4OcrEngineProperties.SetPathToTessData(new FileInfo(TESS_DATA_FOLDER));
        var tesseractReader = new Tesseract4LibOcrEngine(tesseract4OcrEngineProperties);

        var properties = new OcrPdfCreatorProperties();
        properties.SetPdfLang("en"); // we need to define a language to make it PDF/A compliant

        var ocrPdfCreator = new OcrPdfCreator(tesseractReader, properties);
        using (var writer = new PdfWriter(OUTPUT_PDF))
        {
            // Run OCR on the images and write the result with an embedded output intent
            ocrPdfCreator.CreatePdfA(LIST_IMAGES_OCR, writer, GetRgbPdfOutputIntent()).Close();
        }
    }

    // Builds the output intent from the sRGB ICC profile that gets embedded in the PDF/A
    static PdfOutputIntent GetRgbPdfOutputIntent()
    {
        Stream @is = new FileStream(DEFAULT_RGB_COLOR_PROFILE_PATH, FileMode.Open, FileAccess.Read);
        return new PdfOutputIntent("", "", "", "sRGB IEC61966-2.1", @is);
    }
}