Due to the popularity of the World Wide Web, its underlying technologies have become ubiquitous. HTML and CSS in particular are widely used to present text and information in a visually pleasing manner. At the same time, many companies rely on the PDF format for document workflows, internal documentation and archiving. The main draws of PDF are the reliability and portability of the format, and its commitment to presenting the same appearance across different devices, which is an explicit goal of the PDF specification.
PDF, however, is not an easily editable format, and PDF documents are often converted from other formats as a final step in document creation. HTML can be one such source format. pdfHTML lets development teams apply their existing HTML and CSS skills to PDF creation: it provides the engine that converts HTML/CSS content to a well-formatted, well-structured PDF document, without the need to know the technical details of the PDF format.
HTML/CSS to PDF transformation
In HTML, content is wrapped in HTML tags. Each tag corresponds to a conceptual element (e.g. paragraph, table), and many tags can be nested. These tags can be organized into a hierarchical model, such as the DOM (Document Object Model) of the HTML document, detailing the structure and semantics of the content.
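To make the hierarchical model concrete, the following minimal sketch builds a simplified tree from nested tags using Python's standard `html.parser` module. It is an illustration only, not how pdfHTML or a browser builds its DOM; the names `Node`, `TreeBuilder`, `build_tree` and `outline` are hypothetical, and the sketch assumes well-formed input in which every tag is explicitly closed.

```python
from html.parser import HTMLParser


class Node:
    """One element in a simplified, DOM-like tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from nested tags.

    Assumption: the input is well-formed and every tag is explicitly
    closed, so void elements such as a bare <br> are not handled.
    """

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # A start tag opens a new child of the current element.
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # An end tag returns to the parent element.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        # Text content is attached to the element that encloses it.
        self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tag hierarchy as a list of indented lines."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


SAMPLE = "<html><body><p>Hello</p><table><tr><td>1</td></tr></table></body></html>"

if __name__ == "__main__":
    print("\n".join(outline(build_tree(SAMPLE))))
```

Running the script prints one tag per line, indented by nesting depth, with `p` and `table` as siblings under `body`.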
Styling and visual representation for HTML content are provided by Cascading Style Sheets (CSS). CSS declarations define styling and layout information (e.g. font, font size, margins, borders, color, alignment) for the various HTML tags and their content. These declarations can be found in the HTML file itself or in a separate style sheet file.
A web browser parses and interprets the HTML file and accompanying CSS to create a visual representation of the content, calculating the rendering and layout on the fly and visualizing the various contents according to their CSS declarations and the renderer's own settings.
In contrast, at its core, a PDF document is not inherently structured and semantic. Its content consists of a set of instructions that result in painting at absolute positions on a large canvas. The concept of a line of text, for example, does not exist at this basic level. We only infer that visually because characters appear next to each other at the same vertical position.
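The inference described above can be sketched in a few lines of Python. The example assumes glyphs have already been extracted as hypothetical `(x, y, character)` tuples in PDF user space, where the y axis points upward by default; the function name `infer_lines` and the `tolerance` parameter are illustrative and not part of any library. Real text extraction also has to handle fonts, rotation, word spacing and more.

```python
from collections import defaultdict


def infer_lines(glyphs, tolerance=1.0):
    """Group absolutely positioned glyphs into lines of text.

    glyphs: iterable of (x, y, char) tuples (a hypothetical input format).
    Glyphs whose y coordinates fall into the same bucket of size
    `tolerance` are treated as sharing a baseline.
    """
    rows = defaultdict(list)
    for x, y, char in glyphs:
        rows[round(y / tolerance)].append((x, char))

    lines = []
    # In default PDF user space a larger y is higher on the page,
    # so rows are visited from the largest y to the smallest.
    for _, row in sorted(rows.items(), reverse=True):
        # Within a row, reading order is assumed to be left to right.
        lines.append("".join(char for _, char in sorted(row)))
    return lines


# The painting order in the file need not match the reading order.
GLYPHS = [
    (72, 700, "H"), (80, 700, "i"),
    (72, 686, "y"), (86, 686, "u"), (79, 686, "o"),
]

if __name__ == "__main__":
    print(infer_lines(GLYPHS))  # ['Hi', 'you']
```

Nothing in the input says that "H" and "i" belong together; the grouping is reconstructed purely from their coordinates, which is why untagged PDF content is hard to extract reliably.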
PDF does offer an additional layer of functionality to store semantic and structural information, using concepts similar to those of HTML: tagging pieces of content according to their roles in the document and constructing a hierarchical tree. To support features like proper content extraction, repurposing of content, search indexing and accessibility, it’s crucial to augment the visual-only representation of PDF documents with this additional information.
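As a rough illustration of how close the two tagging models are, the sketch below maps a nested HTML element tree onto PDF standard structure types such as `P`, `H1`, `Table` and `LI`. The role names are standard structure types from the PDF specification, but the mapping table, the `(tag, children)` input format, the `Div` fallback and the function name `to_structure_tree` are assumptions made for this example; this is not pdfHTML's actual mapping.

```python
# Hypothetical mapping from HTML tag names to PDF standard structure types.
ROLE_MAP = {
    "p": "P",
    "h1": "H1",
    "h2": "H2",
    "table": "Table",
    "tr": "TR",
    "th": "TH",
    "td": "TD",
    "ul": "L",
    "ol": "L",
    "li": "LI",
    "img": "Figure",
    "a": "Link",
}


def to_structure_tree(element):
    """Convert a (tag, children) tuple into a (role, children) tuple.

    Tags without an entry in ROLE_MAP fall back to the generic "Div"
    role, which is a simplification chosen for this sketch.
    """
    tag, children = element
    role = ROLE_MAP.get(tag, "Div")
    return (role, [to_structure_tree(child) for child in children])


HTML_TREE = ("body", [
    ("h1", []),
    ("table", [("tr", [("td", [])])]),
])

if __name__ == "__main__":
    print(to_structure_tree(HTML_TREE))
```

The hierarchy is preserved one-to-one; only the vocabulary of roles changes, which is what makes HTML a convenient source for tagged PDF.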
Since HTML documents inherently contain semantic and structural information, they are an excellent source to convert to rich, smart PDF documents. This is where pdfHTML comes in.