I was recently tasked with reducing the file sizes of PDFs generated from blank office documents. The images are mostly blank, but they carry a variety of company letterheads (in color), borders and footers. Some are generated by software (and therefore have very clean pixels); others are scanned on desktop scanners. By "blank" I mean that the center of the page (everything more than two inches from each margin) is completely white.
My boss wants to keep these PDFs in color, but wouldn't mind making them fuzzier as long as they are not too ugly. I have tested many size-reduction schemes: different compression filters (FlateDecode, LZWDecode, DCTDecode, ...), different JPEG quality settings, reducing the pixel width and height of the JPEG and stretching it back up only when the PDF is displayed, cutting the image content into smaller image patches, ...
So far, I have found that the third method (reducing pixel dimensions) is more effective than reducing the JPEG quality setting (say, scaling the image down by 50% as opposed to dropping the quality from 50 to 20).
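For anyone who wants to reproduce that kind of comparison outside of iText, here is a rough sketch using only the JDK's `javax.imageio`. The class `JpegSizeComparison`, its method names and the synthetic "letterhead" page are all made up for illustration; real scans will behave differently, so treat the printed numbers as indicative only.

```java
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class JpegSizeComparison {

    // Hypothetical test page: white background with a colored band on top.
    static BufferedImage syntheticPage(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, width, height);
        g.setColor(new Color(0, 70, 140));
        g.fillRect(0, 0, width, height / 10);
        g.dispose();
        return img;
    }

    // Resamples the image by the given factor (0.5 halves width and height).
    static BufferedImage scale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }

    // Encodes the image as JPEG; quality is in the range 0.0 (worst) to 1.0 (best).
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        BufferedImage page = syntheticPage(850, 1100);
        System.out.println("full size, quality 0.5: " + encodeJpeg(page, 0.5f).length + " bytes");
        System.out.println("full size, quality 0.2: " + encodeJpeg(page, 0.2f).length + " bytes");
        System.out.println("half size, quality 0.5: "
                + encodeJpeg(scale(page, 0.5), 0.5f).length + " bytes");
    }
}
```

The downscaled JPEG can then be placed in the PDF at the original page dimensions, so the viewer stretches it back up on display.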
However, one approach has eluded me. In some of the sample PDFs we collected from other companies, the JPEG images are multi-stage filtered. What I mean is that:
- The Filter entry is an array of two names, listed in decoding order: /FlateDecode, /DCTDecode
- When the PDF was generated, JPEG compression was applied first, and then the result was Deflated. Upon viewing the PDF, the data is first Inflated and then JPEG-decompressed into pixels.
I applied the first step of decompression (the FlateDecode) to the multi-stage filtered data, which gave me the original JPEG data. I used a hex viewer to inspect the JPEG data, and found that in the blank areas of the JPEG image, most of the bytes are repeated patterns. This explains why it is advantageous to apply a secondary compression on top of a JPEG file, provided one knows that the JPEG image is mostly blank.
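The effect can be sketched with plain `java.util.zip`, whose `Deflater` produces by default the zlib format that FlateDecode expects. Note the assumptions: the class `FlateOverJpeg` and its helpers are hypothetical, and the input is not a real JPEG but a stand-in byte array (a small block of random bytes for the "letterhead" followed by an arbitrary repeating 4-byte pattern for the blank area). To measure an actual file, pass its bytes to `deflate()` instead.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateOverJpeg {

    // Second encoding stage: Deflate the already JPEG-compressed bytes.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // First decoding stage: Inflate to get the JPEG bytes back.
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                out.write(buffer, 0, n);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or invalid Flate data");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Invalid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Stand-in for JPEG data (assumption, not a real JPEG): 512 random bytes
    // followed by 16 KB of an arbitrary repeating 4-byte pattern.
    static byte[] standInJpegBytes() {
        byte[] data = new byte[512 + 16 * 1024];
        byte[] head = new byte[512];
        new Random(42).nextBytes(head);
        System.arraycopy(head, 0, data, 0, head.length);
        byte[] pattern = {0x28, (byte) 0xA2, (byte) 0x8A, 0x00};
        for (int i = head.length; i < data.length; i++) {
            data[i] = pattern[(i - head.length) % pattern.length];
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] jpegLike = standInJpegBytes();
        byte[] deflated = deflate(jpegLike);
        System.out.println("before Flate: " + jpegLike.length + " bytes");
        System.out.println("after Flate:  " + deflated.length + " bytes");
        System.out.println("round trip ok: " + Arrays.equals(jpegLike, inflate(deflated)));
    }
}
```

Because Flate is lossless, the second stage never changes the pixels; it only pays off when the JPEG byte stream itself is repetitive.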
Apparently those PDFs were also created by iText. However, it is not clear to me whether the iTextSharp.text.Image class supports this combination of two-stage filtering.
In case iText does not have built-in support for creating such two-stage compressed images, would it be possible to insert an image if I handle the two-stage compression and use something like
I've created two examples: one that will work with old iText versions, and one that will only work with iText 5.5.1 and later:
    Image img = Image.getInstance("some.jpg");
    img.setCompressionLevel(PdfStream.BEST_COMPRESSION);
As you are passing a JPEG to iText, the filter will be /DCTDecode. If you then use the setCompressionLevel() method, the stream will be deflated too. You'll have two entries in the filter: /FlateDecode and /DCTDecode. This is shown in the first example.
I've also created the FlateCompressJPEG2Passes example for people who are stuck with an older iText version:
    PdfReader reader = new PdfReader(src);
    // We assume that there's a single large picture on the first page
    PdfDictionary page = reader.getPageN(1);
    PdfDictionary resources = page.getAsDict(PdfName.RESOURCES);
    PdfDictionary xobjects = resources.getAsDict(PdfName.XOBJECT);
    PdfName imgName = xobjects.getKeys().iterator().next();
    PRStream imgStream = (PRStream)xobjects.getAsStream(imgName);
    imgStream.setData(PdfReader.getStreamBytesRaw(imgStream), true);
    PdfArray array = new PdfArray();
    array.add(PdfName.FLATEDECODE);
    array.add(PdfName.DCTDECODE);
    imgStream.put(PdfName.FILTER, array);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
This code sample post-processes an existing document. It requires more code and it's more error-prone: in this short snippet, I assume that there's a single XObject and that this single object is an image. This may not be the case for your PDFs.