How pdfHtml maps HTML tags onto iText objects

Tue - 03/07/2017

This blog post will give you an overview of how pdfHTML maps the HTML model onto the iText model, and how you can influence this process.

Share this article

This blog post will give you an overview of how pdfHTML maps the HTML model onto the iText model, and how you can influence this process.

About the iText document model and Layout objects

To give you a quick idea of what our model stands for, hereby a quick overview of the key features:

  • it allows developers to quickly grasp how to use iText without having to understand the PDF specification.

  • it closely resembles HTML, and is based on how documents are formed structurally.

  • it is designed using concepts that most people can easily understand, such as paragraphs and tables.

During the design process for the iText 7 layout module, we noticed that there are a lot of similarities between the iText document model and the HTML model. For example, iText uses objects - such as paragraph and tables - to simplify the creation of PDF documents. HTML uses similar objects to let you construct HTML files.

Considering this overlap, we took a few ideas of the HTML model and implemented these into our model. Also, can the object models be easily mapped onto one another. We have tried to map every HTML tag - that makes sense in a PDF file - to our iText layout object. This is easy for some objects, for example: a span is still a span.

Nevertheless, there are a few HTML tags that don't make sense in the context of PDF, but most of these do correspond to an iText equivalent. They were a modified version of what we already implemented. For instance, an "article" tag is mapped onto a "div" element. The iText 7 layout model doesn't know the concept of an article, so we mapped the article onto a div. After all, when you squint your eyes, an article can be seen as a div under a different name. Some tags were not included because they were not relevant in a PDF, such as Audio.

This table gives you a brief overview of the default mapping we provided:

A

ATagWorker

Article   

DivTagWorker

BDI

SpanTagWorker

BDO

SpanTagWorker

A technical overview of the mapping process

Let's have a look at how the mapping process is handled. In the diagram "Processing nodes" you can see how PDFHTML works.

There are 4 steps that pdfHTML goes through to make an iText layout object:

  1. Create the layout object
    The first step is the most obvious one: if an HTML tag corresponds to an iText Layout object then at one point you should make an iText Layout object. This is done through the ITagWorker interface. We'll dive into the TagWorkers later in this post.

  2. Process children
    pdfHTML processes the tags depth-first: if this tag has any child tags it will loop over all the child tags and go through this flow again with the child tags before finishing the flow for the current tag. Meaning that if a child tag has more children, it will go over them as well, and so on.

  3. Apply the CSS
    After processing the child tags, pdfHTML will apply the CSS to the iText Layout object. This is done through CSS Appliers.

  4. Return to the parent layout object
    The Layout object is now complete. It contains its child elements and its CSS is also applied. This object will now be returned to its parent tag for further handling.

Tag processing flox

Figure 1: pdfHTML tag processing

ITagWorkers and ITagWorkerFactory

The actual processing of the HTML tags into iText objects happens in the ITagWorker implementations. We've provided standard implementations for every HTML tag we support. The mapping of HTML tags onto ITagWorkers happens in the ITagWorkerFactory. Our default implementation is the DefaultTagWorkerFactory and it can easily be extended to support your own tags.

It is important to know that for each tag the HtmlProcessor encounters, by default a new TagWorker instance will be created. So, when you're creating your own custom tags, do keep in mind that it was our intent to have no state in a TagWorker.

Customizing the mapping

While our set of tags is quite large and covers most needs, pdfHTML doesn't support every tag. Some tags are not relevant in a PDF context or maybe you added your own custom tags to your HTML files for processing purposes. You might want to process an unsupported tag and at least add some of its content to the output PDF. Or maybe you want to map every table tag onto a span object, just because you can.

Well, you can! pdfHTML allows for some very extensive customization concerning the mapping and there's a few options:

  1. You can implement the ITagWorkerFactory interface.

  2. You can extend the DefaultTagWorkerFactory and override the getCustomTagWorker method.

We always advise you to use the second option. This option allows you to take advantage of our own internal mapping we made. It also allows you to remap existing tags onto other ITagWorkers as the getCustomTagWorker method has priority over our internal mapping.

The following code sample maps everything onto a SpanTagWorker.

public class CustomTagWorkerFactory extends DefaultTagWorkerFactory {
    @Override
    public ITagWorker getCustomTagWorker(IElementNode tag, ProcessorContext context) {
        return new SpanTagWorker(tag, context);
    }
}

Don't try this at home! There are more useful things to do with a custom mapping.

customTagWorkers

Let's say that in our converted PDF files we want to insert a QR Code based on the data in our database. You could insert this data into the HTML in a custom tag:

1
2
3
4
5
6
7
8
9
10
11
</p>
 
<p>QRCode Example</p>
 
<p>QR Code below,Q</p>
 
<p>With great power comes great current squared times resistance</p>
 
<p>QR Code below, L</p>
 
<p>With great power comes great current squared times resistance 

For this tag to be picked we'll need to write a TagWorker. This class will parse and process the information inside the tag and its attributes.

QRCodeTagWorker

public class QRCodeTagWorker implements ITagWorker {
    private static String[] allowedErrorCorrection = {"L","M","Q","H"};
    private static String[] allowedCharset = {"Cp437","Shift_JIS","ISO-8859-1","ISO-8859-16"};
    private BarcodeQRCode qrCode;
    private Image qrCodeAsImage;
    public QRCodeTagWorker(IElementNode element, ProcessorContext context){
        //Retrieve all necessary properties to create the barcode
        Map hints = new HashMap();
        //Character set
        String charset = element.getAttribute("charset");
        if(checkCharacterSet(charset )){
            hints.put(EncodeHintType.CHARACTER_SET, charset);
        }
        //Error-correction level
        String errorCorrection = element.getAttribute("errorcorrection");
        if(checkErrorCorrectionAllowed(errorCorrection)){
            ErrorCorrectionLevel errorCorrectionLevel = getErrorCorrectionLevel(errorCorrection);
            hints.put(EncodeHintType.ERROR_CORRECTION, errorCorrectionLevel);
        }
        //Create the QR-code
        qrCode = new BarcodeQRCode("placeholder",hints);
    }
    @Override
    public void processEnd(IElementNode element, ProcessorContext context) {
        //Transform barcode into image
        qrCodeAsImage = new Image(qrCode.createFormXObject(context.getPdfDocument()));
    }
    @Override
    public boolean processContent(String content, ProcessorContext context) {
        //Add content to the barcode
        qrCode.setCode(content);
        return true;
    }
    @Override
    public boolean processTagChild(ITagWorker childTagWorker, ProcessorContext context) {
        return false;
    }
    @Override
    public IPropertyContainer getElementResult() {
        return qrCodeAsImage;
    }
    private static boolean checkErrorCorrectionAllowed(String toCheck){
        for(int i = 0; i

 

Because pdfHTML doesn't know about the QRCodeTagWorker, we'll need to plug it in into our custom DefaultTagWorkerFactory implementation:

public class CustomTagWorkerFactory extends DefaultTagWorkerFactory {
    @Override
    public ITagWorker getCustomTagWorker(IElementNode tag, ProcessorContext context) {
        if ( "qr".equalsIgnoreCase(tag.name()) ) {
            return new QRCodeTagWorker(tag, context);
        }
        return null;
    }
}

Now, all we need to do it register this into the pdfHTML workflow:

ConverterProperties converterProperties = new ConverterProperties();
converterProperties.setTagWorkerFactory(new CustomTagWorkerFactory());
HtmlConverter.convertToPdf(
        new FileInputStream("C:\\Temp\\qr\\qrcode.html"),
        new FileOutputStream("C:\\Temp\\qr\\out.pdf"),
        converterProperties);

Run this code sample and you'll get the following output:

Summary/Conclusion

Now that we've reached the end of this blog post, let's look at what we've learned of how pdfHTML:

  • The layout model has a lot of similarities to the HTML model, so mapping is relatively easy

  • pdfHTML iterates over the HTML structure depth-first to create its iText layout structure

  • You can extend or customize the mapping using ITagWorkers and ITagWorkerFactory

Try pdfHTML for yourself with a Demo or download and try it free for 30 Days with our Free Trial.



Contact

Still have questions? 

We're happy to answer your questions. Reach out to us and we'll get back to you shortly.

Contact us
Stay updated

Join 11,000+ subscribers and become an iText PDF expert by staying up to date with our new products, updates, tips, technical solutions and happenings.

Subscribe Now