|Postcard from Building PDF files|
It's not often that I get really enthusiastic about a piece of software, but today is the day it happens.
First a bit of background. I maintain a Wiki on one of the servers here, and it gets news articles and useful or interesting material added most days. As of this morning there are 11,586 pages in the system. I'm developing software to parse this corpus and pull out recognised names to build an ontology. The large and growing number of files makes manually skimming more than a tiny portion of the Wiki impossible, so I rely heavily on the parser.
The next stage has always been to build PDFs from specific subsets of the pages, but until this afternoon, any software that I found was either commercial and expensive, or broken in varying ways. I don't have a problem paying for software, but it really has to work well before I'm willing to fork out money. By "well", I really mean it has to do what I want in the way I want it done.
Yesterday I came across wkhtmltopdf and downloaded it as the next thing to test. Guess what? It actually works brilliantly. It can merge multiple HTML files into one very professional-looking PDF. You have control over headers and footers, it can build a Table of Contents for you if that's what you want, and in general the options available provide good, detailed control.
So for me the process works like this:
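Roughly: have the parser pick out the subset of wiki pages I want, export them as HTML, and hand the lot to wkhtmltopdf in one go. As an illustration of that last step, here is a hedged sketch of the kind of invocation involved; the page filenames are hypothetical, and the footer and ToC options shown are just two of the many available:

```shell
# Merge several exported wiki pages into a single PDF, with a
# generated table of contents and "page N of M" footers.
# (Filenames are placeholders, not the actual wiki pages.)
wkhtmltopdf \
  --footer-right '[page] of [topage]' \
  --footer-line \
  toc \
  page-one.html page-two.html page-three.html \
  combined.pdf
```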
Colour me impressed at the ease and speed of the whole process.
Addendum - It's a couple of weeks later, and I'm not quite as impressed. The software still works well building a PDF from pages in the local wiki, but fails when I try to create one from pages in Wikipedia. If I create individual PDFs and then use a Python script to combine them all, that works fine, but that's not what I want. Apart from anything else, page numbers and the ToC have disappeared.