Feb 08

Imagine you have a PDF file that you want to run OCR on. Take a scenario where your Linux PC should do the job automatically, e.g. for every file in a folder.

I chose tesseract-ocr as the OCR program. It is easy to install and use, and good enough for my purposes. Unfortunately it only accepts TIFF as input file type, so we have to convert the PDF to TIFF first.

To create a TIFF file from the PDF with Ghostscript:

gs -q -r300 -dBATCH -dNOPAUSE -sDEVICE=tiff24nc  -sOutputFile=Dokument2.tif Dokument1.pdf

Now start the OCR with tesseract-ocr (you may have to install it first).

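 tesseract Dokument2.tif doc -l deu

-l deu selects German (“Deutsch”) as the recognition language. tesseract appends .txt to the output name by itself, so the text ends up in doc.txt.
-r300 in the Ghostscript command sets the resolution to 300 DPI.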

Now it is easy to create a script that automatically checks a folder for new files and starts the OCR.

E.g. you can write a shell script and run it via a cron job. Maybe I will write an example here later.
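
A minimal sketch of such a script could look like the following. It assumes that a PDF counts as "new" as long as no .txt file with the same base name sits next to it, and the function names ocr_pdf and ocr_folder are made up for this illustration. It reuses the two commands from above and is meant as a starting point, not a finished solution.

```shell
#!/bin/sh
# Sketch: OCR every PDF in a folder that has no matching .txt file yet.
# The "new file" rule and the function names are assumptions for illustration.

# Convert one PDF to TIFF with Ghostscript, then run tesseract on the TIFF.
ocr_pdf() {
    pdf=$1
    base=${pdf%.pdf}          # strip the .pdf extension
    tif=$base.tif

    # Same Ghostscript options as in the command above: 300 DPI, tiff24nc.
    gs -q -r300 -dBATCH -dNOPAUSE -sDEVICE=tiff24nc \
        -sOutputFile="$tif" "$pdf" || return 1

    # tesseract appends ".txt" to the output name by itself.
    tesseract "$tif" "$base" -l deu || return 1

    rm -f "$tif"              # the intermediate TIFF is no longer needed
}

# Process every PDF in the folder that has not been recognised yet.
ocr_folder() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue             # folder contains no PDFs
        [ -e "${pdf%.pdf}.txt" ] && continue  # text file exists, skip
        ocr_pdf "$pdf" || echo "OCR failed for $pdf" >&2
    done
}

# Run only when a folder is given on the command line.
if [ "$#" -gt 0 ]; then
    ocr_folder "$1"
fi
```

Saved as, say, /home/user/bin/ocr-folder.sh (a hypothetical path) and made executable, it could be started every ten minutes with a crontab entry like this, where /home/user/scans is again only a placeholder:

 */10 * * * * /home/user/bin/ocr-folder.sh /home/user/scans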
