The specifics of installing Tesseract on your machine will depend on your operating system, but in general there are three basic steps required:
- Installing the Leptonica library, which handles the image analysis and processing required by Tesseract. Files are located here: http://www.leptonica.com/download.html
- Installing Tesseract source; files can be downloaded here.
- Installing the Tesseract language data, available from the same download link above. It is possible to download multiple languages and call them simultaneously (more on this below). For my OCR testing process I downloaded and used the English data files.
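Once the three pieces are in place, a quick check from the shell can confirm that the tools are reachable. The following is only a sketch: the function names have_tool, check_ocr_tools and show_tesseract_info are hypothetical helpers, and it assumes a Tesseract release that understands the --version and --list-langs options.

```shell
# Succeeds if the named command can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print "<tool>: found" or "<tool>: missing" for each argument.
check_ocr_tools() {
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# Show the installed version and language data, if Tesseract is present.
# Assumes a release that understands --version and --list-langs.
show_tesseract_info() {
    have_tool tesseract || { echo "tesseract: missing"; return 1; }
    tesseract --version
    tesseract --list-langs
}
```

Calling check_ocr_tools tesseract convert prints one status line per tool. If the English data files were copied correctly, show_tesseract_info should list eng among the available languages.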
If you’ve installed the Tesseract language data as described above, you’ve already seen the substantial list of supported languages. Once you’ve copied the language files to the tessdata folder, they can be invoked on the command line with the -l parameter. For example:
tesseract example.jpg out -l eng
In my tests I only used the English language training data, but what if your corpus contains texts with mixed languages? You should be able to invoke two languages at once like so:
tesseract example.jpg out -l eng+spa
This obviously requires that you either have a list of expected languages that you invoke each time, or that you choose the language at the time of processing, perhaps based on the metadata of an object. I expect processing times will increase if you invoke Tesseract with a long list of languages, since it would have to check each token against several dictionaries. That’s just a guess, though.
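Choosing the languages at processing time could look something like the sketch below. It assumes the object metadata carries plain language names such as "English" or "Spanish". The lang_code and join_langs helpers are hypothetical, and falling back to eng for unknown names is an arbitrary choice made for illustration.

```shell
# Map a metadata language name to a Tesseract language code.
# The fallback to eng is an assumption, not a Tesseract default.
lang_code() {
    case "$1" in
        English) echo eng ;;
        Spanish) echo spa ;;
        French)  echo fra ;;
        *)       echo eng ;;
    esac
}

# Join any number of language codes with "+" to form the -l value.
# The subshell keeps the IFS change from leaking into the caller.
join_langs() {
    ( IFS=+; printf '%s\n' "$*" )
}

langs=$(join_langs "$(lang_code English)" "$(lang_code Spanish)")
echo "tesseract example.jpg out -l $langs"
# prints: tesseract example.jpg out -l eng+spa
```

The command is only echoed here so the result can be inspected first; dropping the echo would run Tesseract with the assembled language list.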
By default Tesseract produces plain text files. I did, however, confirm that the software can produce hOCR. To achieve this, you need to create a config file (here called myconfig) in the same directory as Tesseract with the single line:
tessedit_create_hocr 1
Then you can invoke tesseract with:
tesseract input.tiff output -l eng +myconfig
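The two steps around that invocation can be scripted. This is a rough sketch that assumes the config file holds the tessedit_create_hocr 1 setting; write_hocr_config and looks_like_hocr are hypothetical helper names. The check relies on the ocr_page class that hOCR uses to mark each page.

```shell
# Write a one-line Tesseract config file that switches on hOCR output.
# $1 is the path of the config file to create (e.g. ./myconfig).
write_hocr_config() {
    printf 'tessedit_create_hocr 1\n' > "$1"
}

# Succeeds if the file contains the ocr_page class that hOCR uses
# to mark each page; a plain text result will not contain it.
looks_like_hocr() {
    grep -q 'ocr_page' "$1"
}
```

After running write_hocr_config myconfig and then Tesseract, looks_like_hocr can be pointed at the result. Depending on the Tesseract version, the output file may be named output.html or output.hocr.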
The hOCR output might be a helpful step in generating ‘smart’ PDFs from your images. There are even several scripts out there to do just that; for example, hocr-pdf.
There is not much documentation available for Tesseract, but there is some anecdotal information available on sites like StackOverflow. In general, it is suggested to do some preprocessing of your images before running them through Tesseract. For example, full-color magazine pages might benefit from ‘grayscale’ processing, and certain more vintage documents, where the text is often blurred, might benefit from increased contrast. A popular solution for this kind of image conversion is ImageMagick. A simple grayscale invocation of the ImageMagick tool looks like:
convert example.jpg -type grayscale example-gray.jpg
If you want to adjust the image’s DPI (say to 300, which is often recommended for optimal OCR) try:
convert example.jpg -density 300 -type grayscale example-gray.jpg
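For more than one file it helps to wrap the conversion in a small function. The sketch below makes a few assumptions: gray_name and to_gray_300dpi are hypothetical names, the output is written as PNG next to the input, and -units PixelsPerInch is added so the 300 is read as dots per inch. For a raster image, -density only records the resolution in the file. If the pixels themselves need rescaling, -resample is the option to look at.

```shell
# Derive an output name such as scans/page1_gray.png from scans/page1.jpg.
gray_name() {
    printf '%s_gray.png\n' "${1%.*}"
}

# Convert one image to grayscale and tag it as 300 pixels per inch.
# -density only records the resolution in the file; it does not resample.
to_gray_300dpi() {
    convert "$1" -units PixelsPerInch -density 300 -type grayscale "$(gray_name "$1")"
}
```

With this in place, to_gray_300dpi example.jpg would write example_gray.png alongside the original.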
Also recommended is a script called textcleaner, available from Fred’s ImageMagick Scripts. Much of what is contained in this script eludes me, but it seems to merge a dozen or so different conversions, all with the intent of producing the most readable text possible. I saw Fred’s work cited often within OCR discussions. Here’s how to invoke textcleaner with the added grayscale step, where $i holds the input filename and the second argument names the output file:
textcleaner -g "$i" "clean_$i"
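Putting the cleaning and OCR steps together for a whole folder might look like the following. Treat it as a sketch under assumptions: run and ocr_batch are hypothetical helpers, the inputs are assumed to be .jpg files, textcleaner is assumed to be on the PATH, and Tesseract is assumed to have been built with PNG support. Setting DRY_RUN=1 prints the commands instead of running them, which is a convenient way to check the file names first.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clean every .jpg in a directory with textcleaner, then OCR the result.
# $1 is the directory, $2 the Tesseract language string (e.g. eng or eng+spa).
ocr_batch() {
    dir=$1
    lang=$2
    for i in "$dir"/*.jpg; do
        [ -e "$i" ] || continue              # skip when the glob matched nothing
        base=${i%.jpg}
        run textcleaner -g "$i" "${base}_clean.png" &&
        run tesseract "${base}_clean.png" "$base" -l "$lang"
    done
}
```

For example, DRY_RUN=1 followed by ocr_batch scans eng lists the textcleaner and tesseract commands that would be run for each image in a folder called scans.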
As I mentioned above, Tesseract itself does not come with much documentation, and the more ephemeral community-produced literature is fragmented and often hard to find. So, I hope this is helpful to someone who is just getting started with Tesseract.