pdfOCR

Intro

iText 7 pdfOCR

iText pdfOCR offers Optical Character Recognition (OCR) functionality to convert your scanned documents, PDFs and images into fully ISO-compliant PDF or PDF/A-3u files, making it possible to access and process the text they contain.

Every day we receive huge numbers of scanned documents and images containing printed text. Without machine-readable text, however, their content cannot be edited, searched, indexed or processed.


How it works

Take a look at how easy it is to OCR a list of images and create a PDF file!

Don't forget to specify the path to your local Tesseract Data files using TESS_DATA_FOLDER in the code below. You can always find the most accurate trained LSTM models here.

// Java
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.pdfocr.OcrPdfCreator;
import com.itextpdf.pdfocr.tesseract4.Tesseract4LibOcrEngine;
import com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class JDoodle {
    // Placeholder value: replace with the path to your local Tesseract data files.
    private static final String TESS_DATA_FOLDER = "/path/to/tessdata";

    static final Tesseract4OcrEngineProperties tesseract4OcrEngineProperties = new Tesseract4OcrEngineProperties();
    private static List<File> LIST_IMAGES_OCR = Arrays.asList(new File("invoice_front.jpg"));
    private static String OUTPUT_PDF = "/myfiles/hello.pdf";

    public static void main(String[] args) throws IOException {
        // Point the engine properties at the trained data before creating the engine.
        tesseract4OcrEngineProperties.setPathToTessData(new File(TESS_DATA_FOLDER));
        final Tesseract4LibOcrEngine tesseractReader = new Tesseract4LibOcrEngine(tesseract4OcrEngineProperties);

        OcrPdfCreator ocrPdfCreator = new OcrPdfCreator(tesseractReader);
        try (PdfWriter writer = new PdfWriter(OUTPUT_PDF)) {
            // Run OCR on every image in the list, then close the resulting PDF document.
            ocrPdfCreator.createPdf(LIST_IMAGES_OCR, writer).close();
        }
    }
}
// C# (.NET)
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Pdfocr;
using iText.Pdfocr.Tesseract4;

public class Program
{
    // Placeholder value: replace with the path to your local Tesseract data files.
    private const string TESS_DATA_FOLDER = "/path/to/tessdata";

    private static readonly Tesseract4OcrEngineProperties tesseract4OcrEngineProperties = new Tesseract4OcrEngineProperties();
    private static string OUTPUT_PDF = "/myfiles/hello.pdf";
    private static IList<FileInfo> LIST_IMAGES_OCR = new List<FileInfo> { new FileInfo("invoice_front.jpg") };

    static void Main()
    {
        // Point the engine properties at the trained data before creating the engine.
        tesseract4OcrEngineProperties.SetPathToTessData(new FileInfo(TESS_DATA_FOLDER));
        var tesseractReader = new Tesseract4LibOcrEngine(tesseract4OcrEngineProperties);

        var ocrPdfCreator = new OcrPdfCreator(tesseractReader);
        using (var writer = new PdfWriter(OUTPUT_PDF))
        {
            // Run OCR on every image in the list, then close the resulting PDF document.
            ocrPdfCreator.CreatePdf(LIST_IMAGES_OCR, writer).Close();
        }
    }
}
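A missing or mistyped Tesseract data path is an easy mistake to make, so it may help to check the folder before running the sample above. The sketch below uses only the Java standard library and is not part of the pdfOCR API. The class name TessDataCheck, the default folder name "tessdata" and the default language code "eng" are assumptions made for illustration. It relies only on the Tesseract convention that trained models are stored as <language>.traineddata files.

```java
import java.io.File;

// Hypothetical helper, not part of iText pdfOCR: checks that a Tesseract
// data folder contains the trained model for a given language code.
final class TessDataCheck {

    // Tesseract stores each trained model as "<language>.traineddata".
    static File modelFileFor(File tessDataFolder, String language) {
        return new File(tessDataFolder, language + ".traineddata");
    }

    // True only if the folder exists and the model file is present in it.
    static boolean hasModel(File tessDataFolder, String language) {
        return tessDataFolder.isDirectory()
                && modelFileFor(tessDataFolder, language).isFile();
    }

    public static void main(String[] args) {
        // Both defaults are placeholders; pass your own folder and language code.
        File folder = new File(args.length > 0 ? args[0] : "tessdata");
        String language = args.length > 1 ? args[1] : "eng";

        if (hasModel(folder, language)) {
            System.out.println("Found " + modelFileFor(folder, language));
        } else {
            System.out.println("Missing " + modelFileFor(folder, language));
        }
    }
}
```

If the check reports a missing file, fix TESS_DATA_FOLDER before starting the OCR run.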
Benefits

Why use iText 7 pdfOCR?

One of the major challenges in document management is dealing with inaccessible data: data that is locked away in non-editable documents. Scanning a document containing printed text does not make it editable or searchable, however; you just have a scanned image of the content.

Optical Character Recognition (OCR) can help to unlock this data. One of the most common use cases for OCR is to produce documents which can be searched, processed, or archived. While some word processing and PDF applications now offer OCR functionality to make PDFs editable, manually doing this for more than a few documents is impractical.

iText pdfOCR provides a way to automate the OCR process, and integrate it into document workflows.

Automate text recognition

  • iText pdfOCR lets you automate text recognition as part of a document workflow.
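As a rough illustration of what such automation could look like, the sketch below plans a batch job by pairing each scanned image with the PDF it should produce. It uses only the Java standard library. The class name OcrBatchPlanner, the folder names and the file names are hypothetical, and the OCR call itself is left as a comment because it depends on your pdfOCR setup.

```java
import java.io.File;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of iText pdfOCR: maps input images to output PDFs.
final class OcrBatchPlanner {

    // Maps e.g. "scan.jpg" to "<outputDir>/scan.pdf".
    static File targetPdfFor(File image, File outputDir) {
        String name = image.getName();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return new File(outputDir, base + ".pdf");
    }

    // Builds the job list in input order: image -> target PDF.
    static Map<File, File> plan(List<File> images, File outputDir) {
        Map<File, File> jobs = new LinkedHashMap<>();
        for (File image : images) {
            jobs.put(image, targetPdfFor(image, outputDir));
        }
        return jobs;
    }

    public static void main(String[] args) {
        // Placeholder inputs; a real workflow might list an incoming-scans folder instead.
        List<File> images = Arrays.asList(
                new File("inbox/invoice_front.jpg"),
                new File("inbox/invoice_back.png"));

        for (Map.Entry<File, File> job : plan(images, new File("archive")).entrySet()) {
            // This is where the OCR step would run for each job.
            System.out.println(job.getKey() + " -> " + job.getValue());
        }
    }
}
```

Keeping the planning step separate from the OCR call makes it easier to log, retry or schedule individual jobs.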

Ideal for long-term archiving

  • iText pdfOCR can generate PDF/A-3u compliant files. PDF/A is the accepted standard for long-term archiving and preservation of PDF electronic documents.
  • Documents can also be secured with digital signatures, based on the PAdES standard.

Process and transform data using iText

OCR enables you to perform additional processing and data transformation. Some examples of using iText pdfOCR in combination with other iText software:

  • Define specific document elements for extraction with iText pdf2Data.
  • Securely redact recognized text with iText pdfSweep.
  • Use extracted text to populate PDF form fields using iText 7 Core.
  • Merge text into HTML templates for iText pdfHTML conversion to PDF.
  • Use recognized text with iText DITO and add data binding and conditional formatting to PDF templates.
Key features

Core capabilities of pdfOCR

The output can be configured as plain text, as a PDF with separate layers for the source image data and the recognized text, or as a flattened PDF with the layers merged. If you need documents suitable for long-term archive storage, PDF/A-3u output is also supported.

Powered by the open source Tesseract 4 engine

  • Tesseract 4 is a stable release of the popular open source OCR engine.
  • It uses a Long Short-Term Memory (LSTM) neural network to improve the speed and accuracy of text recognition.

Simple, yet flexible API

  • The API is simple to use, and consistent with common practices for both Java and .NET.
  • It is also abstracted, to allow support for different OCR engines with little or no effort from users.

Supports multiple input images

  • Can process single images, or a list of images at once.
  • Accepts BMP, PNM, PNG, JFIF, JPEG or TIFF formats.
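When a folder mixes scans with other files, it can be useful to keep only the images in one of the formats listed above before handing them to the OCR engine. The sketch below is one way to do that with the Java standard library. The class name SupportedImageFilter and the mapping from file extensions to formats are assumptions, not taken from the pdfOCR documentation, and a file extension is only a hint about the real file content.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of iText pdfOCR: filters files by extension.
final class SupportedImageFilter {

    // Assumed extensions for the BMP, PNM, PNG, JFIF, JPEG and TIFF formats.
    private static final Set<String> EXTENSIONS = new HashSet<>(Arrays.asList(
            "bmp", "pnm", "png", "jfif", "jpg", "jpeg", "tif", "tiff"));

    // Case-insensitive check of the text after the last dot in the file name.
    static boolean hasSupportedExtension(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return false; // no extension at all
        }
        return EXTENSIONS.contains(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Returns the supported files in their original order.
    static List<File> keepSupported(List<File> candidates) {
        List<File> kept = new ArrayList<>();
        for (File candidate : candidates) {
            if (hasSupportedExtension(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder file names for illustration.
        List<File> input = Arrays.asList(
                new File("invoice_front.jpg"),
                new File("notes.docx"),
                new File("scan.TIFF"));
        System.out.println(keepSupported(input));
    }
}
```

The filtered list can then be passed on in place of a hand-written list of images.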

Text only extraction option

  • iText pdfOCR can recognize text in documents and export it as a text file.
  • The exported text can be used to populate external databases or be passed to other tools.