Postcard from Building PDF files

It's not often that I get really enthusiastic about a piece of software, but today is the day it happens.

First a bit of background. I maintain a Wiki on one of the servers here, and it gets news articles and useful or interesting material added most days. As of this morning there are 11,586 pages in the system. I'm developing software to parse this corpus and pull out recognised names to build an ontology. The large and growing number of files makes manually skimming more than a tiny portion of the Wiki impossible, so I rely heavily on the parser.

The next stage has always been to build PDFs from specific subsets of the pages, but until this afternoon, any software that I found was either commercial and expensive, or broken in varying ways. I don't have a problem paying for software, but it really has to work well before I'm willing to fork out money. By "well", I really mean it has to do what I want in the way I want it done.

Yesterday I came across wkhtmltopdf and downloaded it as the next thing to test. Guess what? It actually works brilliantly. It can merge multiple HTML files into one very professional-looking PDF. You have control over headers and footers, it can build a Table of Contents for you if that's what you want, and in general the options available provide good, detailed control.
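To give a feel for the kind of control on offer, here is a minimal sketch in Python that assembles a wkhtmltopdf command line for merging several HTML files. It is an illustration only, not necessarily the exact invocation used here: the helper names `build_command` and `run`, the header text and the file names are all made up for the example. The flags (`--header-center`, `--header-line`, `--footer-center`, the `toc` object and the `[page]`/`[topage]` placeholders) are standard wkhtmltopdf options, but check them against the version you have installed.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def build_command(html_files: Sequence[str], output_pdf: str,
                  with_toc: bool = True,
                  header_text: Optional[str] = None) -> List[str]:
    """Assemble a wkhtmltopdf argument list that merges several HTML files.

    Hypothetical helper for illustration; nothing is executed here.
    """
    if not html_files:
        raise ValueError("need at least one HTML file")

    cmd = ["wkhtmltopdf"]

    # Optional centred header with a rule underneath it.
    if header_text:
        cmd += ["--header-center", header_text, "--header-line"]

    # wkhtmltopdf substitutes [page] and [topage] with the current
    # page number and the number of the last page.
    cmd += ["--footer-center", "Page [page] of [topage]"]

    # "toc" is an object, like a page: it goes before the input files
    # so the Table of Contents appears at the front of the PDF.
    if with_toc:
        cmd.append("toc")

    # Input pages are merged in the order given; the output file is last.
    cmd += list(html_files)
    cmd.append(output_pdf)
    return cmd


def run(cmd: List[str]) -> None:
    """Run the command, failing clearly if wkhtmltopdf is not on the PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("wkhtmltopdf not found on PATH")
    subprocess.run(cmd, check=True)
```

Calling `run(build_command(["page1.html", "page2.html"], "subset.pdf", header_text="Wiki extract"))` would then produce one PDF with a Table of Contents followed by the two pages. Building the argument list separately from running it makes it easy to print and inspect the command before committing to a long conversion.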

So for me the process works like this:

