my package of the day – htmldoc – for converting html to pdf on the fly

PDF creation got actually fairly easy. OpenOffice.org, the Cups printing system, KDE provide methods for easily printing nearly everything to a PDF file right away. A feature that even outperforms most Windows setups today. But there are still PDF related task that are not that simple. One I often run into is automated PDF creation on a web server. Let’s say you write a web application and want to create PDF invoices on the fly.

There are, of course, PDF frameworks available. Let’s take PHP as an example: If you want to create a PDF from a php script, you can choose between FPDF, Dompdf, the sophisticated Zend Framework and more (and commercial solutions). But to be honest, they are all either complicated (as you often have to use a specific syntax) to use or just quite limited in their possibilities to create a pdf file (as you can only use few design features). As I needed a simple solution for creating a 50+ pages pdf file with a huge table on the fly I tested most frameworks and failed with most of them (often just as I did not have enough time to write dozens of line of code).

So I hoped to find a solution that allowed me just to convert a simple HTML file to a PDF file on the fly providing better compatibility than Dompdf for instance. The solution was … uncommon. It was no PHP class but a neat command line tool called „htmldoc“ available as a package. If you want to give it a try just install it by calling „aptitude install htmldoc“.

You can test htmldoc by saving some html files to disk and call „htmldoc –webpage filename.html“. There a lot of interesting features like setting font size, font type, the footer, color and greyscale mode and so on. But let’s use htmldoc from PHP right away. The following very simple script uses the PHP output buffer for minimizing the need for a write to disk to one file only (if somebody knows a way of using this without any temporary files from a script, let me know):

// start output buffer for pdf capture
 
ob_start();
?>
your normal html output will be places here either by
dumping html directly or by using normal php code
<?php
// save output buffer
$html=ob_get_contents();
// delete Output-Buffer
ob_end_clean();
// write the html to a file
$filename = './tmp.html';
if (!$handle = fopen($filename, 'w')) {
	print "Could not open $filename";
	exit;
}
if (!fwrite($handle, $html)) {
	print "Could not write $filename";
	exit;
}
fclose($handle);
// htmldoc call
$passthru = 'htmldoc --quiet --gray --textfont helvetica \
--bodyfont helvetica --logoimage banner.png --headfootsize 10 \
--footer D/l --fontsize 9 --size 297x210mm -t pdf14 \
--webpage '.$filename;
 
// write output of htmldoc to clean output buffer
ob_start();
passthru($passthru);
$pdf=ob_get_contents();
ob_end_clean();
 
// deliver pdf file as download
header("Content-type: application/pdf");
header("Content-Disposition: attachment; filename=test.pdf");
header('Content-length: ' . strlen($pdf));
echo $pdf;

As you can see, this is neither rocket science nor magic. Just a wrapper for htmldoc enabling you to forget about the pdf when writing the actual content of the html file. You’ll have to check how htmldoc handles your html code. You should make it as simple as possible, forget about advanced css or nested tables. But it’s actually enough for a really neat pdf file and it’s fast: The creating of 50+ page pdf files is fast enough in my case to make the on demand access of htmldoc feel like static file usage.

Please note: Calling external programs and command line tools from a web script is always a security issue and you should carefully check input and updates for the program you are using. The code provided should be easily ported to another web language/framework like Perl and Rails.