Converting images of scanned documents into proper PDF files is quite a hard task. What I usually want is
- put the images on a page of a well-defined size (e.g. A4 or Letter)
- don’t resample the image data
- have precise control over compression – in particular, I want to use JPEG images as-is, without any recompression
This sounds simple and reasonable, but I’ve yet to find a tool that does exactly that. Adobe Acrobat handles the latter two constraints well, but I don’t know how to set the paper size when importing an image. This is no problem when using a normal vector graphics or page layout tool, but then you usually don’t have much influence on what nasty things the PDF output code does to your images. Furthermore, you mostly end up with useless cruft in the PDF files, like XML metadata or even fonts (even though there’s not a single letter of text anywhere in the document). So I decided to end this mess once and for all and write my own image-to-PDF converter. Here it is: pdfgen.
pdfgen is a Python 2.x script that can be used either as a command-line tool or as a Python library. It uses the Python Imaging Library (PIL) to import image files, but apart from that, it only depends on modules that are part of the standard Python runtime library. The PDF generation code is completely contained in the program itself, as is the documentation. Generating an A4-sized PDF file from a JPEG file is as simple as
python pdfgen.py example.jpg
For any further details, please have a look into the code or just call
pdfgen -h on the command line.
So, without further ado, here’s the code: