Getting started with Tesseract

Setup

The specifics of installing Tesseract on your machine will depend on your operating system, but in general there are three basic steps required:

  • Installing the Leptonica library, which handles the image analysis and processing Tesseract requires. Files are located here: http://www.leptonica.com/download.html
  • Installing the Tesseract source, which can be downloaded from the project site.
  • Installing the Tesseract language data, available from the same download location. It is possible to download multiple languages and call them simultaneously (more on this below). For my OCR testing process I downloaded and used the English data files.
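On a Debian- or Ubuntu-style system the three steps above can often be collapsed into package installs. The package names below are assumptions and vary by distribution and release; building from source follows the usual configure/make routine instead:

```shell
# Hypothetical package names for a Debian/Ubuntu system.
sudo apt-get install libleptonica-dev    # Leptonica image library
sudo apt-get install tesseract-ocr       # Tesseract itself
sudo apt-get install tesseract-ocr-eng   # English language data
```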

Languages

If you’ve installed the Tesseract language data as described above, you’ve already seen the quite substantial list of supported languages. Once you’ve copied the languages to the tessdata folder, they can be invoked on the command line with the -l parameter. For example:

tesseract example.jpg out -l eng

In my tests I only used the English language training data, but what if your corpus contains texts in mixed languages? You should be able to invoke two languages at once like so:

tesseract example.jpg out -l eng+spa

This obviously requires that you either have a list of expected languages that you invoke each time, or that you choose the language at processing time, perhaps based on an object’s metadata. I expect there will be an impact on processing times if you invoke Tesseract with a long list of languages, since it would have to check each token against several dictionaries. That’s just a guess, though.
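That metadata-driven selection could be as simple as mapping a stored language code to a -l argument. A minimal sketch, in which both the metadata codes and the lang_args helper are my own invention, not anything Tesseract defines:

```shell
# Map a hypothetical per-object metadata language code to a
# Tesseract -l value.
lang_args() {
  case "$1" in
    en)    echo "eng" ;;
    es)    echo "spa" ;;
    en-es) echo "eng+spa" ;;
    *)     echo "eng" ;;   # default to English when unsure
  esac
}

# Usage: tesseract scan.jpg scan -l "$(lang_args "en-es")"
```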

Outputs

By default Tesseract produces plain text files. I did, however, confirm that the software can also produce hOCR. To achieve this, create a config file in the same directory as Tesseract containing the single line:

tessedit_create_hocr 1

Then you can invoke tesseract with:

tesseract input.tiff output -l eng +myconfig

The hOCR output might be a helpful step in generating ‘smart’ PDFs from your images. There are even several scripts out there to do just that; for example, hocr-pdf.
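As a sketch of that workflow: run Tesseract with the hOCR config for each page, then hand the results to hocr-pdf. The directory layout and file names here are my assumptions, and the output extension varies by Tesseract version; check the hocr-tools README for the exact pairing rules:

```shell
# Assumed layout: book/ holds one .jpg and one hOCR file per page.
mkdir -p book
tesseract page-001.tiff book/page-001 -l eng +myconfig   # emits hOCR
cp page-001.jpg book/
hocr-pdf book > book.pdf   # pairs images with hOCR by file name
```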

Image preprocessing

There is not much documentation available for Tesseract, but there is some anecdotal information available on sites like StackOverflow. In general, it is suggested that you do some preprocessing of your images before running them through Tesseract. For example, full-color magazine pages might benefit from grayscale conversion, and certain more vintage documents, where the text is often blurred, might benefit from added contrast. A popular solution for this kind of image conversion is ImageMagick. A simple grayscale invocation of ImageMagick’s convert tool looks like:

convert example.jpg -type Grayscale example-gray.jpg

If you want to adjust the image’s DPI (say, to 300, which is often recommended for optimal OCR) try:

convert example.jpg -density 300 -type Grayscale example-gray.jpg
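Scaled up to a directory of scans, that conversion is just a loop. The gray_name helper and its -gray suffix are my own convention, not anything ImageMagick prescribes:

```shell
# Derive an output name for the preprocessed copy, e.g.
# page.jpg -> page-gray.jpg
gray_name() {
  echo "${1%.*}-gray.${1##*.}"
}

# Then, for a folder of scans (convert invocation as above):
# for f in scans/*.jpg; do
#   convert "$f" -density 300 -type Grayscale "$(gray_name "$f")"
# done
```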

Also recommended is a script called textcleaner, available from Fred’s ImageMagick Scripts. Much of what the script does eludes me, but it seems to chain together a dozen or so different conversions, all with the intent of producing the most readable text possible. I saw Fred’s work cited often within OCR discussions. Here’s how to invoke textcleaner with the added grayscale step (the -g flag), supplying input and output files:

textcleaner -g input.jpg output.jpg
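Putting the pieces together, a cleanup-then-OCR pass over a folder might look like the following. The file layout is an assumption, and textcleaner is presumed to be on your PATH:

```shell
# Clean each scan with textcleaner, then OCR the cleaned copy.
for f in scans/*.jpg; do
  clean="${f%.jpg}-clean.jpg"
  textcleaner -g "$f" "$clean"           # grayscale + cleanup
  tesseract "$clean" "${f%.jpg}" -l eng  # writes ${f%.jpg}.txt
done
```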

Summary

As I mentioned above, Tesseract itself does not come with much documentation, and the more ephemeral community-produced literature is fragmented and often hard to find. So, I hope this is helpful to someone who is just getting started with Tesseract.

A brief survey of current OCR solutions

The last time I had occasion to select an OCR solution for a project was sometime in 2008. Apparently, not much has changed since then, or so it seems from a brief survey I conducted recently. Though this is perhaps due to the kind of application I was looking for, which would be 1) free or relatively affordable, 2) supportive of standard input and output formats, and 3) scriptable via an API or an open-source programming language like Python (i.e. not a standalone desktop application). I’m sure the NSA has some sweet OCR tools, but I doubt they’re interested in sharing.

I found it helpful to group the tools into three loose categories:

  • Commercial (i.e. ‘paid’), usually Windows-based, desktop applications
  • Open-source (i.e. ‘free’) software solutions, usually run in a Linux environment
  • Wildcard: generally imperfect solutions sought out by folks looking for an easy and free fix for an immediate problem

Tools in the first category make up for their expense somewhat by being all-in-one packages that deliver image preprocessing (e.g. de-skew, auto-rotation, layout analysis), support for multiple languages, and a host of input and output formats. Their disadvantages include a frequent requirement to run on Windows, difficulty of use in an automated fashion, and their cost, which can range up to several thousand dollars a year (or more if there are page limits).

A highly rated application in this category is ABBYY FineReader. Their primary product is a personal-use application, but ABBYY also offers its OCR tools at the enterprise level, either as a Windows/Windows Server application or as a developer’s SDK. FineReader claims to support the recognition of 198 languages and offers a rather long list of input and output formats. I used this tool once upon a time and it really is quite powerful and accurate. But it doesn’t really meet the criteria I defined above.

Let me take a quick detour into the third category. People looking for a quick OCR fix, without paying a lot of money or installing a bunch of untested software, have found some luck using software they already happen to have installed. Adobe Acrobat Pro offers an OCR feature, as does Microsoft Word (or at least it did; that neat Word feature does not exist in recent versions of the software). I also saw some discussion of repurposing Google Drive’s OCR capability. Evernote also performs OCR on submitted images. With some imagination one could come up with a little workflow in which images are POSTed to Drive or Evernote and their text representations retrieved once processing is complete. Though these ideas have some appeal to me, especially the web services, I have to assume that one would be breaking the terms of service by using such tools to generate large amounts of OCR in a production environment.

The remaining category is where I focused most of my attention. This is what I wanted after all: an open source solution that supports common input/output standards, that could be run in a Linux environment and be easily integrated into an existing production workflow. In this category, the two most discussed open options are OCRopus (seemingly sponsored by Google) and Tesseract (developed once upon a time by HP, but possibly used by Google currently). Of the two, Tesseract seems to be the only one with any recent development (OCRopus’s last update was in 2009), and I have seen several statements indicating that it is currently in use in the Google Books project (though I haven’t seen any definitive proof). I have also seen some indications that it is used by the Google Drive service, which we discussed a bit above.

Unlike many of the other open source options I’ve seen, Tesseract does include some level of image processing by incorporating the Leptonica library. Leptonica supports a wide array of pre-processing operations such as deskew, grayscaling, segmentation of pages with images and text, and much more.

Tesseract is a Linux-based command-line application, though there exist several GUI front ends developed for use with it. There is also a Python wrapper for Tesseract, which might make it easy to then fold in Python’s OpenCV library for pre-OCR image cleanup. Tesseract purportedly supports some 60 languages and will output both plain text and hOCR. Other outputs (such as XHTML and PDF) are certainly possible, but might require a custom transformation step. There has already been quite a bit of work in this area which could be reused; for example, see the Python hocr-tools library, which contains tools for converting from hOCR+JPEG to smart PDFs.

Tesseract seems pretty popular right now, though much of the discussion goes well above my head. Still, I’ll be giving it a little bit of a trial in the coming weeks and will report back here.