Scanning papers and converting them to PDFs

Here is how I scan a printed paper and convert it to a PDF file with free software utilities (Netpbm and Ghostscript).

First scan at 300 dpi in black and white, and save each page as a separate GIF file. GIF is lossless for black-and-white images, and its LZW compression keeps the files much smaller than uncompressed TIFFs. Be sure to frame each page to avoid bad margins.

Most scanner software can do this much. The rest requires either expensive software (like Adobe Acrobat) or a few free software commands. You should be able to install all of these commands on Windows with Cygwin.

I name the images after page numbers, so that they sort easily: 01.gif, 02.gif, etc.

I convert each GIF file to a PostScript file with these commands (scripted inside a loop):
  $ giftopnm < 01.gif | pnmtops > 01.ps
  $ giftopnm < 02.gif | pnmtops > 02.ps
...
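
The loop itself might look like the following. This is only a minimal sketch, assuming a POSIX shell and page images named with leading digits as described above; the helper name ps_name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map a page image name such as 01.gif to 01.ps.
ps_name() {
    printf '%s\n' "${1%.gif}.ps"
}

# Convert every numbered GIF page in the current directory to PostScript.
for gif in [0-9]*.gif; do
    [ -e "$gif" ] || continue   # skip the literal pattern if nothing matched
    giftopnm < "$gif" | pnmtops > "$(ps_name "$gif")"
done
```

Deriving the output name from the input name keeps the page order intact, so the .ps files sort the same way as the .gif files.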

Then I join the separate PostScript files into a single PostScript document with
  $ ghostscript -dNOPAUSE -sDEVICE=pswrite -dBATCH \
       -sOutputFile=combined.ps 01.ps 02.ps [...]

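psmerge is simpler but is pickier about its input. The ghostscript command may be installed as gs. Newer Ghostscript releases have dropped the pswrite device; use -sDEVICE=ps2write there instead.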
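
Because the page files sort by name, the list of inputs does not have to be typed by hand. The sketch below assumes a POSIX shell; the function name join_pages and the GS variable are hypothetical, introduced only so that the command name (ghostscript or gs) can be chosen at run time. It keeps the pswrite device from the command above.

```shell
#!/bin/sh
# Hypothetical helper: join the numbered page files in directory $1
# into $1/combined.ps. Set GS=gs where Ghostscript is installed as gs.
join_pages() {
    dir=$1
    set -- "$dir"/[0-9]*.ps     # the shell expands the pattern in sorted order
    [ -e "$1" ] || { echo "no page files in $dir" >&2; return 1; }
    "${GS:-ghostscript}" -dNOPAUSE -dBATCH -sDEVICE=pswrite \
        -sOutputFile="$dir/combined.ps" "$@"
}
```

For example, `join_pages .` would join the pages in the current directory. The [0-9]* pattern deliberately leaves out combined.ps, so rerunning the command does not feed the previous output back in.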

Finally I convert the PostScript document into a PDF with
  $ ps2pdf combined.ps combined.pdf
or 
  $ ghostscript -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -sOutputFile=combined.pdf -c .setpdfwrite -f combined.ps 
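
The choice between the two commands can also be left to a script. This is a rough sketch, assuming a POSIX shell; the function names first_available and ps_to_pdf are hypothetical. It also leaves out -c .setpdfwrite, on the assumption that recent Ghostscript releases no longer need it; put it back for older versions.

```shell
#!/bin/sh
# Hypothetical helper: print the first command from the argument list
# that is installed, or fail if none of them is.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Hypothetical wrapper: convert a PostScript file ($1) to a PDF ($2),
# preferring ps2pdf and falling back to Ghostscript under either name.
ps_to_pdf() {
    if command -v ps2pdf >/dev/null 2>&1; then
        ps2pdf "$1" "$2"
    elif gs_cmd=$(first_available gs ghostscript); then
        "$gs_cmd" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
            -sOutputFile="$2" -f "$1"
    else
        echo "neither ps2pdf nor Ghostscript found" >&2
        return 1
    fi
}
```

It would be called as `ps_to_pdf combined.ps combined.pdf`.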

If you want the PDF to contain text rather than page images, you will need some expensive software that performs optical character recognition, and you will still have to correct many errors by hand.

Bill Harlan, 2002

