The last time I had occasion to select an OCR solution for a project was sometime in 2008. Apparently, not much has changed since then, or so it seems from a brief survey I conducted recently. Though perhaps that is due to the kind of application I was looking for, which would 1) be free or relatively affordable, 2) support standards for input and output formats, and 3) be scriptable via an API or an open-source programming language like Python (i.e. not a standalone desktop application). I’m sure the NSA has some sweet OCR tools, but I doubt they’re interested in sharing.
I found it helpful to group the tools into three loose categories:
- Commercial (i.e. ‘paid’), usually Windows based, desktop applications
- Open source (i.e. ‘free’) software solutions, usually run in a Linux environment
- Wildcard, generally imperfect solutions sought out by folks looking for an easy and free fix for an immediate problem
Tools in the first category make up for their expense somewhat by being all-in-one packages that deliver image pre-processing (e.g. de-skew, auto-rotation, layout analysis), support for multiple languages, and a host of input and output formats. Their disadvantages include a frequent requirement to run on Windows, difficulty of use in an automated fashion, and an expense that can range up to several thousand dollars a year (or more if there are page limitations).
A highly rated application in this category is Abbyy FineReader. Their primary product is a personal use application, but Abbyy also offers their OCR tools at an enterprise level, either as a Windows/Windows Server application, or as a developer’s SDK. FineReader claims to support the recognition of 198 languages and offers a rather long list of input and output formats. I used this tool once upon a time and it really is quite powerful and accurate. But it doesn’t really meet the criteria I defined above.
Let me take a quick detour into the third category. People looking for a quick OCR fix, without paying a lot of money or installing a bunch of untested software, have found some luck using software they already happen to have installed. Adobe Acrobat Pro offers an OCR feature, as does Microsoft Word (or at least it did; that neat feature does not exist in recent versions of the software). I also saw some discussion of reusing Google Drive’s OCR capability, described here. Evernote also performs OCR on submitted images. With some imagination, one could come up with a little workflow in which images are POSTed to Drive or Evernote and their text representations retrieved once processing is complete. Though these ideas have some appeal to me–especially the web services–I have to assume that using such tools to generate large amounts of OCR in a production environment would break their terms of service.
The remaining category is where I focused most of my attention. This is what I wanted after all: an open source solution that supports common input/output standards, that could be run in a Linux environment and be easily integrated into an existing production workflow. In this category, the two most discussed open options are OCRopus (seemingly sponsored by Google) and Tesseract (developed once upon a time by HP, but possibly used by Google currently). Of the two, Tesseract seems to be the only one with any recent development (OCRopus’s last update was in 2009), and I have seen several statements indicating that it is currently in use in the Google Books project (though I haven’t seen any definitive proof). I have also seen some indications that it is used by the Google Drive service, which we discussed a bit above.
Unlike many of the other open source options I’ve seen, Tesseract does include some level of image processing by incorporating the Leptonica library. Leptonica supports a wide array of pre-processing operations such as deskew, grayscaling, segmentation of pages with images and text, and much more.
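Leptonica itself is a C library, but two of its simplest cleanup steps, grayscale conversion and global binarization, are easy to sketch in Python with Pillow. This is just an illustration of the kind of pre-processing involved, not how Tesseract actually calls Leptonica, and the function name here is my own:

```python
from PIL import Image

def preprocess_for_ocr(img, threshold=160):
    """Approximate two basic Leptonica-style cleanup steps:
    grayscale conversion followed by global thresholding."""
    gray = img.convert("L")  # 8-bit grayscale
    # point() maps every pixel: anything brighter than the
    # threshold becomes white, everything else black
    return gray.point(lambda p: 255 if p > threshold else 0, mode="1")

# Usage: preprocess_for_ocr(Image.open("scan.png")).save("scan-clean.png")
```

Real-world cleanup (deskew, adaptive thresholding, despeckling) is considerably more involved, which is exactly why having Leptonica bundled in is a point in Tesseract's favor.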
Tesseract is a Linux-based command line application, though several GUI front-ends have been developed for it. There is also a Python wrapper for Tesseract, which might make it easy to fold in Python’s OpenCV library for pre-OCR image cleanup. Tesseract purportedly supports some 60 languages and will output both plain text and hOCR. Other outputs are certainly possible (such as XHTML and PDF), but might require a custom transformation step. There has already been quite a bit of work in this area which could be reused; for example, see the Python hocr-tools library, which contains functions for converting from hOCR+JPEG to searchable (“smart”) PDFs.
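Because it is a plain command line tool, Tesseract is also easy to script without any wrapper library. Here is a minimal sketch of driving it from Python via subprocess; the helper names are mine, and note that the hOCR output extension has varied across Tesseract releases:

```python
import subprocess
from pathlib import Path

def tesseract_cmd(image, out_base, lang="eng", hocr=False):
    """Build the argv for a tesseract run. Passing hocr=True appends
    the stock 'hocr' config, which switches output from plain text
    to hOCR markup."""
    cmd = ["tesseract", str(image), str(out_base), "-l", lang]
    if hocr:
        cmd.append("hocr")
    return cmd

def run_ocr(image, out_base, **kw):
    """Invoke tesseract and raise if it fails. Plain-text runs write
    out_base + '.txt'; hOCR runs write '.hocr' (or '.html' on some
    older Tesseract versions)."""
    subprocess.run(tesseract_cmd(image, out_base, **kw), check=True)
    return Path(str(out_base))
```

Wrapping every page image in a call like this is about all the glue a batch workflow needs, which is a big part of Tesseract's appeal for production use.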
Tesseract seems pretty popular right now, though much of the discussion goes well above my head. Still, I’ll be giving it a little bit of a trial in the coming weeks and will report back here.