How to add metadata to a PDF using pdfHTML?

Busted! This isn't a frequently asked question. I made it up because I thought the answer was interesting enough for this book.

The metadata.html HTML file has some metadata added inside its <head> section. The visible content of the page is a heading and a paragraph:

<h1>Metadata</h1>
<p>Please check the document properties of the PDF file for the metadata.</p>

Now let's convert this simple HTML file to PDF using the C07E08_Metadata.java example.

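public void createPdf(String src, String dest) throws IOException {
    // addXmpMetadata() adds an XMP stream on top of the Info dictionary
    PdfWriter writer = new PdfWriter(dest, new WriterProperties().addXmpMetadata());
    // Close the HTML input stream once the conversion is done
    try (FileInputStream html = new FileInputStream(src)) {
        HtmlConverter.convertToPdf(html, writer);
    }
}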

As you can see, we've made sure that we also add metadata in the XMP format.

When we look at the Description tab of the Document Properties, we see the title of the page, the author, the subject, the keywords, and the application. The rest of the metadata, such as the creation date, the modification date, and the PDF producer, is added automatically.

Metadata taken from the Info dictionary

Most of the metadata shown above is information taken from the Info dictionary. This approach of storing metadata is deprecated in PDF 2.0 in favor of XMP metadata.

XMP is metadata stored as a clear text (uncompressed) XML stream inside the PDF. Since we also added XMP metadata in our example, we can click the "Additional Metadata" button. When we go to the "Advanced" view, we can see the structure of an XML file.

Metadata taken from the XMP stream

XMP uses slightly different terminology. We see three namespaces: dc (Dublin Core), pdf, and xmp. We recognize the dc:creator, the dc:description, and the dc:title. The keywords are stored twice, once as dc:subject, and once as pdf:Keywords. Finally, there's the xmp:CreatorTool.
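To make the namespace terminology concrete, here is a minimal sketch that reads individual properties from an XMP document using only the XML parser that ships with the JDK. The class name XmpFields and its two helper methods are made up for this illustration; they are not part of iText. The sketch assumes that each property is serialized as a child element; XMP also allows simple values to be written as attributes of rdf:Description, and that form is not handled here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not an iText class.
final class XmpFields {
    // Namespace URIs behind the dc, pdf, and xmp prefixes.
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String PDF = "http://ns.adobe.com/pdf/1.3/";
    static final String XMP = "http://ns.adobe.com/xap/1.0/";

    /** Parses an XMP document with namespace support switched on. */
    static Document parse(String xmpXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            byte[] bytes = xmpXml.getBytes(StandardCharsets.UTF_8);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed XMP document", e);
        }
    }

    /**
     * Returns the trimmed text of the first element with the given namespace
     * and local name, or null if there is no such element. For container
     * properties such as dc:title, this is the text of the nested rdf:li items.
     */
    static String text(Document doc, String namespace, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(namespace, localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

With a parsed document, XmpFields.text(doc, XmpFields.DC, "title") would return the title, and XmpFields.text(doc, XmpFields.PDF, "Keywords") the keywords. Looking elements up by namespace URI rather than by prefix matters, because an XMP producer is free to choose different prefixes.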

That XML file is also visible when you open the PDF document in a text editor.

The XMP stream inside the PDF file

The use of XMP isn't limited to PDF. The goal of XMP is to allow content management systems to extract metadata from every binary file without being aware of the specific syntax of the binary file's format.

For instance: you could have an XMP stream inside a JPEG. Any content management system would be able to extract that XMP from the JPEG without having to use a JPEG library.

The same goes for PDF: you don't need a PDF library to read the XMP metadata from a PDF file. All you have to do is look for the uncompressed XMP stream inside the otherwise binary file.
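As a rough illustration of that idea, the following sketch scans the raw bytes of a file for the xpacket markers that wrap an XMP packet and returns whatever sits between them. It uses no PDF library at all. The class name XmpScanner is made up for this example, and the code rests on a few assumptions: the packet is stored uncompressed, it is UTF-8 encoded, and only the first packet in the file is of interest.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration only; not an iText class.
final class XmpScanner {

    /** Returns the first XMP packet found in the given bytes, or empty if there is none. */
    static Optional<String> extractXmp(byte[] fileBytes) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so indexes in this string are also byte offsets in the file.
        String raw = new String(fileBytes, StandardCharsets.ISO_8859_1);

        // An XMP packet opens with an xpacket "begin" processing instruction...
        int start = raw.indexOf("<?xpacket begin");
        if (start < 0) {
            return Optional.empty();
        }
        // ...and closes with an xpacket "end" processing instruction.
        int trailer = raw.indexOf("<?xpacket end", start);
        if (trailer < 0) {
            return Optional.empty();
        }
        int close = raw.indexOf("?>", trailer);
        if (close < 0) {
            return Optional.empty();
        }
        int end = close + 2;

        // Decode only the packet itself, assuming it is UTF-8 encoded.
        return Optional.of(new String(fileBytes, start, end - start, StandardCharsets.UTF_8));
    }
}
```

A caller could pass in the result of Files.readAllBytes for the converted PDF and hand the returned string to any XML parser. The same scan works unchanged on a JPEG or any other file that embeds an uncompressed XMP packet, which is exactly the point of the format.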


