How can I display text in different languages in a single PDF?
Earlier versions of the iText PDF library were already able to render Chinese, Japanese and Korean glyphs in PDF documents, but to correctly display right-to-left scripts like Hebrew and Arabic, we needed the information provided by OpenType fonts to help with handling the complexities of all the world's writing systems. So, for iText 7 we went back to the drawing board to provide OpenType support for advanced font features in PDF documents.
However, we decided to go a step further and created pdfCalligraph, a commercially licensed add-on module for the iText 7 library which was specifically designed to support many more languages and writing systems, and like iText 7 it's available for both the Java and .NET (C#) platforms. For detailed information about the inherent difficulties of supporting multiple languages and writing systems in the PDF standard, and the powerful and unique solutions pdfCalligraph provides, we recommend reading the pdfCalligraph white paper.
In this article, we’ll demonstrate how you can use pdfCalligraph with iText to create a PDF containing text using different languages. But first, here’s a short explanation of how pdfCalligraph works.
How does pdfCalligraph integrate with iText 7?
The iText layout module will automatically look for pdfCalligraph in its dependencies if text if a language or writing system that requires it is encountered by the Renderer Framework. For example, when iText encounters text that contain Indic texts, or a script that's written from right to left, iText checks if pdfCalligraph is available and will then use its functionality to provide the correct glyph shapes to write to the PDF file. However, as the typography logic is complex and can be resource-heavy even for documents that don’t require this functionality, iText won't attempt any advanced shaping operations if the pdfCalligraph module has not been loaded as a binary dependency.
pdfCalligraph features
- Automatic detection of writing systems
- Wider language support
- Right-to-left support
- Ligatures
- Kerning
- Glyph substitution
- Available for Java and .NET (C#)
Using pdfCalligraph
To use pdfCalligraph you simply load the correct binaries into your project, make sure your valid license file is loaded, and iText 7 will automatically use the pdfCalligraph code when it is required by a document. If you don't have a commercial license for pdfCalligraph, you can get a free trial of the iText 7 Suite which includes the iText 7 Core library, plus all the add-ons.
Usage example
For this example, we'll demonstrate using pdfCalligraph to correctly render text in different languages. First, let’s start with a simple English sentence, which we've translated into three different languages using Google Translate.
This is an example sentence.
Let's see how that looks in Arabic:
.هذه هي الجملة المستخدمة في المثال
Now let’s see how it looks in Hindi:
यह एक उदाहरण वाक्य है।
And finally in Tamil:
இது ஒரு எடுத்துக்காட்டு வாக்கியம்.
Those of you familiar with any of the languages in question may notice the translations are not perfect, but they will be sufficient for our purposes here.
Now we’ll save each piece of text as separate XML files, english.xml
, arabic.xml
, hindi.xml
and tamil.xml
. Note that because Arabic is written right-to-left, in order for pdfCalligraph to display the Arabic text starting from the right side of the PDF you’ll need to specify this in the XML. However, the languages will be detected and handled by pdfCalligraph automatically.
To render the text correctly in our PDF, we’ll be making use of the Google Noto fonts which you can download using the link. You may have noticed that sometimes when text is rendered by a computer, certain characters are displayed as little boxes. This indicates your device doesn’t have a font that's able display the text, so unrecognized characters are rendered as these boxes (or “tofu”).
The Noto fonts are Google’s answer to tofu and the name “noto” was chosen to convey the idea that Google’s goal is to see “no more tofu”. The Noto fonts are free to use and have multiple styles and weights available. The Noto font family is comprised of over 100 individual fonts which have been designed to cover all the scripts encoded in the Unicode standard with a harmonious look and feel.
We’ll be using the NotoNaskhArabic-Regular
and NotoSansTamil-Regular
fonts to render our Arabic, Hindi and Tamil texts as they are intended to appear.
In the following code, we take the text from all four source files and display them as separate paragraphs in a single PDF document. You can click the button in the top-right of the code window to switch between Java and .NET (C#) code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
final String[] sources = {"english.xml", "arabic.xml", "hindi.xml", "tamil.xml"};
final PdfWriter writer = new PdfWriter(DEST);
final PdfDocument pdfDocument = new PdfDocument(writer);
final Document document = new Document(pdfDocument);
final FontSet set = new FontSet();
set.addFont("fonts/NotoNaskhArabic-Regular.ttf");
set.addFont("fonts/NotoSansTamil-Regular.ttf");
set.addFont("fonts/FreeSans.ttf");
document.setFontProvider(new FontProvider(set));
document.setProperty(Property.FONT, new String[]{"MyFontFamilyName"});
for (final String source : sources) {
final File xmlFile = new File(source);
final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
final DocumentBuilder builder = factory.newDocumentBuilder();
final org.w3c.dom.Document doc = builder.parse(xmlFile);
final Node element = doc.getElementsByTagName("text").item(0);
final Paragraph paragraph = new Paragraph();
final Node textDirectionElement = element.getAttributes().getNamedItem("direction");
boolean rtl = textDirectionElement != null && textDirectionElement.getTextContent()
.equalsIgnoreCase("rtl");
if (rtl) {
paragraph.setTextAlignment(TextAlignment.RIGHT);
}
paragraph.add(element.getTextContent());
document.add(paragraph);
}
document.close();
pdfDocument.close();
writer.close();
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
string[] sources = new string[] { "english.xml", "arabic.xml", "hindi.xml", "tamil.xml" };
PdfWriter writer = new PdfWriter(DEST);
PdfDocument pdfDocument = new PdfDocument(writer);
Document document = new Document(pdfDocument);
FontSet set = new FontSet();
set.AddFont("NotoNaskhArabic-Regular.ttf");
set.AddFont("NotoSansTamil-Regular.ttf");
set.AddFont("FreeSans.ttf");
document.SetFontProvider(new FontProvider(set));
document.SetProperty(Property.FONT, new String[] { "MyFontFamilyName" });
foreach (string source in sources)
{
XmlDocument doc = new XmlDocument();
var stream = new FileStream(source, FileMode.Open);
doc.Load(stream);
XmlNode element = doc.GetElementsByTagName("text").Item(0);
Paragraph paragraph = new Paragraph();
XmlNode textDirectionElement = element.Attributes.GetNamedItem("direction");
Boolean rtl = textDirectionElement != null && textDirectionElement.InnerText.Equals("rtl");
if (rtl)
{
paragraph.SetTextAlignment(TextAlignment.RIGHT);
}
paragraph.Add(element.InnerText);
document.Add(paragraph);
}
document.Close();
This will create a PDF containing the following text in the four specified languages, as illustrated below:
Results
You can see our example PDF here.
Resources
Supported languages and scripts
This table shows the additional languages and scripts support enabled by pdfCalligraph, as well as the ones natively supported in iText 7:
Language | Script | Module |
---|---|---|
Arabic, Persian, Kurdish, Azerbaijani, Sindhi, Pashto, Lurish, Urdu, Mandinka, Punjabi and others | ARABIC | pdfCalligraph |
Hebrew, Yiddish, Judaeo-Spanish, and Judeo-Arabic | HEBREW | pdfCalligraph |
Bengali | BENGALI | pdfCalligraph |
Hindi, Sanskrit, Pali, Awadhi, Bhojpuri, Braj Bhasha, Chhattisgarhi, Haryanvi, Magahi, Nagpuri, Rajasthani, Bhili, Dogri, Marathi, Nepali, Maithili, Kashmiri, Konkani, Sindhi, Bodo, Nepalbhasa, Mundari and Santali | DEVANAGARI, NAGARI | pdfCalligraph |
Gujarati and Kutchi | GUJARATI | pdfCalligraph |
Punjabi | GURMUKHI | pdfCalligraph |
Kannada, Konkani and others | KANNADA | pdfCalligraph |
Khmer (Cambodia) | KHMER | pdfCalligraph |
Malayalam | MALAYALAM | pdfCalligraph |
Odia | ORIYA | pdfCalligraph |
Tamil | TAMIL | pdfCalligraph |
Telugu (Dravidian language) | TELUGU | pdfCalligraph |
Thai | THAI | pdfCalligraph |
Chinese | Core | |
Japanese | Core | |
Korean | Core | |
WESTERN | Core | |
Russian, Ukrainian, Belarussian, Bulgarian and others | CYRILLIC | Core |
Greek | GREEK | Core |
Armenian | ARMENIAN | Core |
Georgian | GEORGIAN | Core |