I was recently tasked with reducing the file sizes of PDFs generated from blank office documents. The images are mostly blank, but they carry a variety of company letterheads (in color), borders and footers. Some are generated by software (and therefore have very clean pixels); others are scanned on desktop scanners. By "blank", I mean that the center of the page (everything more than two inches from each margin) is completely blank and white.
My boss wants to keep these PDFs in color, but wouldn't mind making them fuzzier as long as they are not too ugly. I have tested many file-size reduction schemes: different compression filters (FlateDecode, LZWDecode, DCTDecode, ...), different JPEG quality settings, reducing the pixel width and height of the JPEG and stretching it back up only when the PDF is displayed, cutting the image content into smaller image patches, ...
So far, I have found that the third method (reducing pixel dimensions) is more effective than reducing the JPEG quality setting (say, scaling the image down 50% as opposed to dropping the quality from 50 to 20).
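For anyone who wants to reproduce this kind of comparison outside a PDF, here is a minimal sketch that uses only the JDK's ImageIO. The class and method names (JpegSizeComparison, samplePage, encodeJpeg, scale) and the synthetic test page are assumptions made for illustration; real letterheads and scans will give different numbers.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

class JpegSizeComparison {

    // Hypothetical test page: white, with a noisy coloured band at the top
    // standing in for a scanned letterhead.
    static BufferedImage samplePage(int width, int height) {
        BufferedImage page = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, width, height);
        g.dispose();
        Random random = new Random(42);
        for (int y = 0; y < height / 10; y++) {
            for (int x = 0; x < width; x++) {
                page.setRGB(x, y, random.nextInt(0x1000000));
            }
        }
        return page;
    }

    // Encodes the image as JPEG with an explicit quality between 0.0 and 1.0.
    static byte[] encodeJpeg(BufferedImage image, float quality) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    // Returns a copy of the image with both pixel dimensions multiplied by factor.
    static BufferedImage scale(BufferedImage source, double factor) {
        int w = Math.max(1, (int) Math.round(source.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(source.getHeight() * factor));
        BufferedImage scaled = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(source, 0, 0, w, h, null);
        g.dispose();
        return scaled;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage page = samplePage(850, 1100);
        System.out.println("full size, quality 50: " + encodeJpeg(page, 0.5f).length + " bytes");
        System.out.println("full size, quality 20: " + encodeJpeg(page, 0.2f).length + " bytes");
        System.out.println("half size, quality 50: " + encodeJpeg(scale(page, 0.5), 0.5f).length + " bytes");
    }
}
```

Which of the two reductions wins depends on the image content, so the sketch only prints the sizes and leaves the comparison to the reader's own samples.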
However, one approach has eluded me: some of the sample PDFs we collected from other companies contain JPEG images that are filtered in multiple stages. What I mean is that:
- The Filter entry is an array of two names, listed in the order in which they are applied when decoding: /FlateDecode, /DCTDecode
- When the PDF was generated, JPEG compression was applied first, and the result was then deflated. When the PDF is viewed, the data is first inflated and then JPEG-decompressed into pixels.
I applied the first step of decompression (the FlateDecode) to the multi-stage filtered data, which gave me the original JPEG data. I inspected the JPEG data in a hex viewer and found that, in the blank areas of the image, most of the bytes form repeated patterns. This explains why it is advantageous to apply a secondary compression on top of a JPEG file, provided one knows that the JPEG image is mostly blank.
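This observation can be checked without any PDF library. The sketch below is only an illustration under stated assumptions: the class FlateOverJpegDemo, its methods, and the synthetic "letterhead" page are made up for the example. It encodes a mostly white page as a JPEG with the JDK's built-in encoder, deflates the JPEG bytes with java.util.zip (the same zlib/deflate format that FlateDecode uses), and inflates them again to recover the JPEG.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import javax.imageio.ImageIO;

class FlateOverJpegDemo {

    // Hypothetical test page: white, with a solid coloured band at the top
    // standing in for a letterhead. Encoded with the JDK's default JPEG settings.
    static byte[] blankPageJpeg(int width, int height) throws IOException {
        BufferedImage page = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, width, height);
        g.setColor(new Color(0, 70, 140));
        g.fillRect(0, 0, width, height / 20);
        g.dispose();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(page, "jpg", out)) {
            throw new IOException("No JPEG writer available");
        }
        return out.toByteArray();
    }

    // Second stage when writing: deflate the already JPEG-compressed bytes.
    static byte[] deflate(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater)) {
            dos.write(data);
        } finally {
            deflater.end();
        }
        return out.toByteArray();
    }

    // First stage when reading: inflate to get the JPEG bytes back.
    static byte[] inflate(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] jpeg = blankPageJpeg(850, 1100);
        byte[] deflated = deflate(jpeg);
        System.out.println("JPEG only:        " + jpeg.length + " bytes");
        System.out.println("JPEG + deflate:   " + deflated.length + " bytes");
        System.out.println("Round trip equal: " + java.util.Arrays.equals(jpeg, inflate(deflated)));
    }
}
```

On a page with real content in the middle, the gain from the second stage is likely to be much smaller, because JPEG entropy-coded data for busy areas has little repetition left for deflate to exploit.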
Apparently those PDFs were also created by iText. However, it is not clear to me whether there is an instance of the iTextSharp.text.Image class that supports this combination of two-stage filtering.
In case iText does not have built-in support for creating such two-stage compressed images, would it be possible to insert an image if I handle the two-stage compression and use something like
Please, take a look at the FlateCompressJPEG1Pass example.
    Image image = new Image(ImageDataFactory.create(imgSrc));
    image.getXObject().getPdfObject().setCompressionLevel(CompressionConstants.BEST_COMPRESSION);
As you are passing a JPEG to iText, the filter will be /DCTDecode. If you then use the setCompressionLevel() method, the stream will be deflated too. You'll have two entries in the filter: /FlateDecode and /DCTDecode. This is also shown in the FlateCompressJPEG2Passes example:
    // Use one reader both for the document and for reading the raw stream bytes
    PdfReader reader = new PdfReader(src);
    PdfDocument pdfDoc = new PdfDocument(reader, new PdfWriter(dest));

    // Assumption: the first page has a single XObject and that XObject is a JPEG image
    PdfDictionary pageDict = pdfDoc.getFirstPage().getPdfObject();
    PdfDictionary pageResources = pageDict.getAsDictionary(PdfName.Resources);
    PdfDictionary pageXObjects = pageResources.getAsDictionary(PdfName.XObject);
    PdfName imgName = pageXObjects.keySet().iterator().next();
    PdfStream imgStream = pageXObjects.getAsStream(imgName);

    // Keep the raw (still JPEG-encoded) bytes as the stream data
    imgStream.setData(reader.readStreamBytesRaw(imgStream));

    // Declare both filters, in decoding order: inflate first, then JPEG-decode
    PdfArray array = new PdfArray();
    array.add(PdfName.FlateDecode);
    array.add(PdfName.DCTDecode);
    imgStream.put(PdfName.Filter, array);

    pdfDoc.close();
This code sample post-processes an existing document. It requires more code and it is more error-prone: in this short snippet, I assume that there is a single XObject and that this single object is an image. This may not be the case for your PDFs, so I recommend that you use the first sample.