20120228

Slick PDFs from scanned documents

It is convenient to have digital versions of your most important documents. While they will probably not hold up as a replacement should the originals disappear, the ability to produce a copy at any time with a simple print command is a huge benefit on its own. Being able to attach a "physical" document to an email or a job application is also handy and looks professional.
Let's have a look at how to generate PDFs from scanned documents, which are:
  • small in file size
  • one file per (potentially multi-page) document
  • fit to generate a good copy of the original
First, start with a high-quality scan of the document. Make sure the page is positioned correctly to avoid skew and shading. A resolution of 300 dpi is enough in my experience. If the document consists solely of text, use line art mode to reduce the file size. This will save you some work later.
On Linux, this command will do the scan for you:
scanimage --mode=Lineart --resolution=300 --format=tiff > in.tiff
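For documents with several pages, it helps to give every scan a numbered file name right away. The following is only a sketch: scan_page is a hypothetical helper name, and the zero-padded pageNN.tiff naming is my assumption. The scanimage options are the ones from the command above; the available mode names depend on your scanner's backend, so "Lineart" may need adjusting.

```shell
# Sketch: scan one page into a zero-padded, numbered TIFF file.
# scan_page is a hypothetical helper, not part of SANE.
scan_page() {
    page_number=$1
    # Zero-pad the number so that page10 sorts after page09.
    out_file=$(printf 'page%02d.tiff' "$page_number")
    # Same options as the single-page command above.
    scanimage --mode=Lineart --resolution=300 --format=tiff > "$out_file" || return 1
    # Report which file was written.
    echo "$out_file"
}
```

Call it as scan_page 1, swap the sheet, then scan_page 2, and so on. Pass plain numbers such as 8, not 08, since printf would try to read a leading zero as octal.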
Second, improve the resulting image. I recommend cropping the image to the page and adjusting the levels to get a clear white background and crisp black text. If you are using GIMP for this, choose Colors->Levels and use the black and white point tools.
[Image: The same number of colors creates a much clearer picture if the input image contained fewer colors.]
Also, consider reducing the colors in parts of the image. Documents often contain color in just a small area (like a signature). Reducing the colors in the rest of the document lets the compression achieve good results with a smaller total number of colors. Choose black and white only for text! Don't worry about aliasing; the resolution will take care of that.
The secret to achieving small file sizes is to reduce the number of colors required to save the image. Change the color mode to indexed (Image->Mode->Indexed... in GIMP) and have it generate a palette with a small number of colors. 16 entries per color should be plenty. Leave out black and white in that calculation if you reduced them to one gray value each. So, for text-only documents you should arrive at a 1-bit (black and white) palette; for text with a single additional color, 16 entries are more than enough.
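To see why the palette size matters so much, here is a rough back-of-the-envelope calculation of the raw, uncompressed size of one page at different bit depths. The assumptions are mine: an A4 page at 300 dpi, rounded to 2480 x 3508 pixels, with 1 bit per pixel for black and white, 4 bits for a 16-entry palette, 8 bits for grayscale and 24 bits for full color.

```shell
# Raw (uncompressed) size of one A4 page at 300 dpi for several bit depths.
width=2480    # 210 mm at 300 dpi, rounded (assumption: A4 paper)
height=3508   # 297 mm at 300 dpi, rounded
pixels=$((width * height))

# Print the number of bytes needed for the given bits per pixel.
raw_bytes() {
    echo $(( pixels * $1 / 8 ))
}

for bits in 1 4 8 24; do
    echo "$bits bits per pixel: $(raw_bytes "$bits") bytes"
done
```

That works out to about 1.1 MB at 1 bit, 4.3 MB at 4 bits and 26 MB at 24 bits, all before any compression is applied. Fewer colors shrink the starting point, and large uniform areas then compress much better on top of that.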
Save the result as a TIFF file (use "Save as..." so that you get to choose the compression parameters) and select LZW compression. (A scanned page has no use for transparency, so remove the alpha channel from each layer first by right-clicking the layer and choosing "Remove Alpha Channel".)
The resulting image should be far smaller than the original. Less than 200 KB is realistic for black and white; for anything with more colors, not much more than 1 MB is feasible.
Next, if your document consists of more than one page, repeat the steps above for each. Then join them together using the command tiffcp:
tiffcp -c lzw page1.tiff page2.tiff ... out.tiff
Of course, replace the pageX.tiff with the filenames you actually chose.
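If your pages follow a fixed naming scheme, the shell can collect them for you. This is a minimal sketch, assuming the pages are named page01.tiff, page02.tiff and so on in the current directory; join_pages is a hypothetical helper name, and the tiffcp options are the ones shown above.

```shell
# Sketch: merge all pageNN.tiff files into one multi-page TIFF.
# join_pages is a hypothetical helper; the naming scheme is an assumption.
join_pages() {
    out_file=${1:-out.tiff}
    # The shell expands the pattern in sorted order, so zero-padded
    # names keep the pages in the right sequence.
    set -- page[0-9][0-9].tiff
    # An unmatched pattern stays literal, so check that a file exists.
    if [ ! -e "$1" ]; then
        echo "no page files found" >&2
        return 1
    fi
    tiffcp -c lzw "$@" "$out_file"
}
```

Running join_pages without an argument writes out.tiff; join_pages certificate.tiff picks a different output name. Without zero padding, page10.tiff would sort before page2.tiff and the pages would end up in the wrong order.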
Finally, to turn this TIFF file into a PDF, use tiff2pdf. This should work:
tiff2pdf -z -pa4 -F out.tiff -o out.pdf
You can also specify the author information (your name, presumably) using the -a parameter and the title using -t. So maybe:
tiff2pdf -z -pa4 -F out.tiff -o out.pdf -a "John Doe" -t "Master's Certificate"
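If you convert documents regularly, the two variants can be folded into one small function. Again, this is just a sketch: tiff_to_pdf is a hypothetical helper name, and it only uses the tiff2pdf options already shown above. Author and title are optional and are left out when empty.

```shell
# Sketch: convert a TIFF to an A4 PDF, with optional author and title.
# tiff_to_pdf is a hypothetical wrapper around the tiff2pdf calls above.
tiff_to_pdf() {
    in_file=$1
    out_file=$2
    author=${3:-}
    title=${4:-}
    # Start with the fixed options, then add metadata only if given.
    set -- -z -pa4 -F
    if [ -n "$author" ]; then
        set -- "$@" -a "$author"
    fi
    if [ -n "$title" ]; then
        set -- "$@" -t "$title"
    fi
    # Options go before the input file, as in the tiff2pdf synopsis.
    tiff2pdf "$@" -o "$out_file" "$in_file"
}
```

For example, tiff_to_pdf out.tiff out.pdf "John Doe" "Master's Certificate" runs the same conversion as the second command above, and tiff_to_pdf out.tiff out.pdf matches the first.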
And that should be it!
Did I miss anything? Do you have more tricks up your sleeve for how to make a really high quality digital version of your documents? Share them in the comments!

1 comment:

Ruby said...

Some scanned documents today don't need to be edited or improved, thanks to the advanced programs and services we now have. In fact, there are computer programs that let you create a PDF file directly from the image of a scanned paper document. I think it is better to create high-quality scans right from the start to avoid spending time and effort on editing. However, if editing cannot be avoided, there is also photo manipulation software available to edit out flaws in the scans.
Ruby Badcoe