I have an application that extracts headings from PDF files. The documents the application is supposed to work with all have a more or less coherent structure and formatting, so telling whether a text chunk is bold is very important. Recently I came across a bunch of files in which some chunks visually appear bold but do not have "bold" in the string representation of the font. I know there is another way of making text appear bold, for instance by changing the render mode. However, in my case calling GetTextRenderMode() does not help either: it returns 0, as if it were normal text. Are there any other ways of making text appear bold, and is it possible to detect them using iTextSharp?
You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.
This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:
[Image: Fonts inside a PDF]
We see that the font is of subtype /TrueType, we see that the /ItalicAngle is 0, and... we see that the 3rd bit of the /Flags is set. Let's check the PDF reference to find out what this tells us:
PDF Reference 1.7, section 5.7.1, describes bit 3 (Symbolic) as: "The font contains glyphs outside the Adobe standard Latin character set."
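To make the flag lookup concrete, here is a minimal sketch in Python (not iTextSharp code) that decodes a /Flags integer into the flag names listed in PDF Reference 1.7, section 5.7.1. The function name `decode_font_flags` and the example value 4 are assumptions for illustration; 4 is simply what the integer would be if only the 3rd bit were set.

```python
# Font descriptor flag bits from PDF Reference 1.7, section 5.7.1.
# Bit positions are 1-based, as in the specification.
FONT_FLAGS = {
    1: "FixedPitch",
    2: "Serif",
    3: "Symbolic",
    4: "Script",
    6: "Nonsymbolic",
    7: "Italic",
    17: "AllCap",
    18: "SmallCap",
    19: "ForceBold",
}


def decode_font_flags(flags):
    """Return the names of the flag bits that are set in a /Flags integer."""
    return [name for bit, name in FONT_FLAGS.items() if flags & (1 << (bit - 1))]


# Assumed value: only the 3rd bit is set, as described for this font.
example_flags = 4
print(decode_font_flags(example_flags))
```

Note that none of these flags states "this font is bold". The closest one, ForceBold, only asks the renderer to keep bold stems visibly bold at small sizes, and it is not set here.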
The glyphs look bold because their outlines are drawn so that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the
In short: iTextSharp has no indication that the font is bold.
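As a rough illustration of why detection fails here, the sketch below (plain Python, not iTextSharp code) gathers the three machine-readable hints discussed in this thread: the font name, the text render mode, and the ForceBold flag. The function name `bold_signals` is hypothetical, the subset name is written in the usual `PREFIX+FontName` form, and the /Flags value 4 is an assumption based on only the 3rd bit being set.

```python
def bold_signals(font_name, render_mode, flags):
    """Collect weak, machine-readable hints that text might be bold.

    font_name   -- font name as reported by the PDF library
    render_mode -- text rendering mode (0 = fill, 2 = fill then stroke)
    flags       -- integer value of the font descriptor's /Flags entry
    """
    return {
        # The name is only a convention; nothing forces a bold font to say so.
        "name_mentions_bold": "bold" in font_name.lower(),
        # Fill-then-stroke is a common way to fake bold with a regular font.
        "fill_and_stroke": render_mode == 2,
        # ForceBold is bit 19 of /Flags (1-based), i.e. 1 << 18.
        "force_bold_flag": bool(flags & (1 << 18)),
    }


# Values for the font discussed above; flags = 4 is an assumption.
signals = bold_signals("JOJJAH+TT116t00", 0, 4)
print(signals)
print(any(signals.values()))
```

For this font every hint comes back negative, which matches what the question reports: the boldness exists only in the shape of the glyphs.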