A brief survey of current OCR solutions

The last time I had occasion to select an OCR solution for a project was sometime in 2008. Judging from a brief survey I conducted recently, not much has changed since then. Though perhaps that’s due to the kind of application I was looking for, which would be 1) free or relatively affordable, 2) supportive of standard input and output formats, and 3) scriptable via an API or an open-source programming language like Python (i.e. not a standalone desktop application). I’m sure the NSA has some sweet OCR tools, but I doubt they’re interested in sharing.

I found it helpful to group the tools into three loose categories:

  • Commercial (i.e. ‘paid’) desktop applications, usually Windows-based
  • Open source (i.e. ‘free’) software solutions, usually run in a Linux environment
  • Wildcard, generally imperfect solutions sought out by folks looking for an easy and free fix for an immediate problem

Tools in the first category make up for their expense somewhat by being all-in-one packages that deliver image pre-processing (e.g. de-skew, auto-rotation, layout analysis), support for multiple languages, and a host of input and output formats. Their disadvantages include a frequent requirement that they be run on Windows, difficulty of use in an automated fashion, and an expense that can range up to several thousand dollars a year (or more if there are page limits).

A highly rated application in this category is Abbyy FineReader. Their primary product is a personal use application, but Abbyy also offers their OCR tools at an enterprise level, either as a Windows/Windows Server application, or as a developer’s SDK. FineReader claims to support the recognition of 198 languages and offers a rather long list of input and output formats. I used this tool once upon a time and it really is quite powerful and accurate. But it doesn’t really meet the criteria I defined above.

Let me take a quick detour into the third category. People looking for a quick OCR fix, without paying a lot of money or installing a bunch of untested software, have found some luck using software they already happen to have installed. Adobe Acrobat Pro offers an OCR feature, as does Microsoft Word (or at least it did; that neat MS Word feature does not exist in recent versions of the software). I also saw some discussion of reusing Google Drive’s OCR API, described here. Evernote also performs OCR on submitted images. With some imagination one could come up with a little workflow in which images are POSTed to Drive or Evernote and their text representations retrieved once processing is complete. Though these ideas have some appeal to me (especially the web services), I have to assume that one would be breaking the terms of service by using such tools to generate large amounts of OCR in a production environment.

The remaining category is where I focused most of my attention. This is what I wanted, after all: an open source solution that supports common input/output standards, runs in a Linux environment, and can be easily integrated into an existing production workflow. In this category, the two most discussed options are OCRopus (seemingly sponsored by Google) and Tesseract (developed once upon a time by HP, but possibly used by Google currently). Of the two, Tesseract seems to be the only one with any recent development (OCRopus’s last update was in 2009), and I have seen several statements indicating that it is currently in use in the Google Books project (though I haven’t seen any definitive proof). I have also seen some indications that it is used by the Google Drive service, discussed a bit above.

Unlike many of the other open source options I’ve seen, Tesseract does include some level of image processing by incorporating the Leptonica library. Leptonica supports a wide array of pre-processing operations, such as deskew, grayscaling, segmentation of pages that mix images and text, and much more.

Tesseract is a Linux-based command line application, though several GUI front-ends have been developed for it. There is also a Python wrapper for Tesseract, which might make it easy to fold in Python’s OpenCV library for pre-OCR image cleanup. Tesseract purportedly supports some 60 languages and will output both plain text and hOCR. Other outputs (such as XHTML and PDF) are certainly possible, but might require a custom transformation step. There has already been quite a bit of work in this area which could be reused; for example, see the Python hocr-tools library, which contains functions for converting from hOCR+JPEG to smart PDFs.
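To make that idea concrete, here is a minimal sketch of such a pipeline. I’m assuming the pytesseract wrapper here (one of several Python wrappers for Tesseract), and the file names are illustrative:

import cv2
import pytesseract
from PIL import Image

# Hypothetical input file; any scanned page image will do.
image = cv2.imread('scan.png')

# Pre-OCR cleanup with OpenCV: grayscale, then Otsu binarization.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite('scan_clean.png', binary)

# Hand the cleaned-up image to Tesseract for recognition.
text = pytesseract.image_to_string(Image.open('scan_clean.png'), lang='eng')
print(text)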

Tesseract seems pretty popular right now, though much of the discussion goes well above my head. Still, I’ll be giving it a little bit of a trial in the coming weeks and will report back here.

Generating synonyms

If you spend any time managing vocabularies, there may come a day when you need to quickly generate a set of synonyms for your existing terms. After all, synonyms are quite useful. Thanks to synonyms, a Google search for ‘theatre’ should also return content with the word ‘theater,’ or perhaps even ‘cinema.’ We can also use synonyms to account for common misspellings like ‘loose’ for ‘lose’ (and vice versa).

Synonyms are also useful for people, places and organizations. Consider a text classification engine that only looks for the full term name ‘Apple, Inc.’ rather than the shorter and frequently used ‘Apple.’ Or perhaps, an auto-suggest search box that does not know that I really mean ‘President Barack Obama’ when I type ‘Obama.’

What dark arts must we employ to quickly generate a substantial set of synonyms? Let’s explore.

Bing Synonyms API

The Synonyms API from Bing returns alternate forms of products, people, locations and more. The free version is limited to 5000 calls per day, and the terms of service indicate that users of the service should not copy, store, or cache any Synonyms results. This pretty much excludes the service for my purposes, but it is a good demonstration of the kind of service we will be considering here.

This is a RESTful web API that can be invoked rather simply:

https://api.datamarket.azure.com/Bing/Synonyms/v1/GetSynonyms?Query=%27zimbabwe%27

This will return an Atom feed containing entries like the following:

<entry>
  <id>https://api.datamarket.azure.com/Data.ashx/Bing/Synonyms/v1/GetSynonyms?Query='zimbabwe'&amp;$skip=4&amp;$top=1</id>
  <title type="text">GetSynonymsEntitySet</title>
  <updated>2013-07-11T15:28:55Z</updated>
  <link rel="self"
        href="https://api.datamarket.azure.com/Data.ashx/Bing/Synonyms/v1/GetSynonyms?Query='zimbabwe'&amp;$skip=4&amp;$top=1"/>
  <content type="application/xml">
    <m:properties>
      <d:Synonym m:type="Edm.String">republic of zimbabwe</d:Synonym>
    </m:properties>
  </content>
</entry>
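Out of curiosity, here is a rough sketch of the same call from Python. I’m assuming the usual Datamarket conventions (HTTP Basic auth with the account key as both username and password, and the OData ‘d’/‘results’ JSON envelope), so treat the details as illustrative:

import requests

# Placeholder key; Azure Datamarket issues one per account.
ACCOUNT_KEY = 'YOUR-ACCOUNT-KEY'

url = 'https://api.datamarket.azure.com/Bing/Synonyms/v1/GetSynonyms'
params = {'Query': "'zimbabwe'", '$format': 'json'}

# Datamarket convention: the account key serves as both username and password.
resp = requests.get(url, params=params, auth=(ACCOUNT_KEY, ACCOUNT_KEY))
for entry in resp.json()['d']['results']:
    print(entry['Synonym'])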

Wordnet

Wordnet is a large lexical database for the English language. Among other things, it groups words into sets of synonyms, each expressing a single concept. It is also a free resource and its data can be used as needed. Perfect, let’s use it!

To pull synonyms from the Wordnet database I used the NLTK Python library. A brief description of the Wordnet interface included with NLTK is available here.

This makes getting synonyms for any term as simple as:

from nltk.corpus import wordnet

def synonyms(word):
    syns = []
    # Each synset groups the lemmas that express a single concept.
    for synset in wordnet.synsets(word):
        for syn in synset.lemma_names():  # a method as of NLTK 3
            syns.append(syn)
    return sorted(set(syns))
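For example, synonyms('theatre') should return a list that includes 'theater', 'house', and 'dramaturgy'. Note that multi-word lemma names come back with underscores (e.g. 'dramatic_art'), which you may want to replace with spaces before adding them to a vocabulary.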

Freebase

Freebase is a community curated collection of structured data, including entries for well-known people, places and things. Freebase’s data is licensed under an open, Creative Commons Attribution (CC-BY) license.

To return synonyms from Freebase I am passing an MQL (Metaweb Query Language) query to the MQL Read API.

The MQL itself is expressed in JSON and is fairly simple. Here I am asking for all aliases of the concept named ‘Zimbabwe’, regardless of type (types in Freebase provide a level of disambiguation between concepts, so that ‘War’ the subject can be distinguished from ‘War’ the band):

[{
  "id": null,
  "name": "Zimbabwe",
  "/common/topic/alias": [],
  "type": []
}]

The following is a snippet of what is returned for this query:

{
  "result": [{
    "id": "/en/zimbabwe",
    "/common/topic/alias": ["The Republic of Zimbabwe"],
    "name": "Zimbabwe",
    "type": [
      "/common/topic",
      "/location/location",
      "/location/country"
    ]
  }]
}

We can use the ‘type’ values to tell us something about each alias (in this case the alias is a location, as we would expect, but it could have been a band or a person’s name).
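A minimal sketch of that request in Python might look like the following (the mqlread endpoint is the documented one; API keys and error handling are omitted):

import json
import requests

# The same MQL query as above, expressed as a Python structure.
query = [{
    "id": None,
    "name": "Zimbabwe",
    "/common/topic/alias": [],
    "type": []
}]

resp = requests.get(
    'https://www.googleapis.com/freebase/v1/mqlread',
    params={'query': json.dumps(query)})

for result in resp.json()['result']:
    # Each alias is a candidate synonym; 'type' hints at what kind.
    for alias in result['/common/topic/alias']:
        print(alias)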

DBpedia

I’ve saved my favorite source for last. It’s my favorite because it combines structured data with the haphazard efforts (both intentional and unintentional) of Wikipedia’s many users.

Like Freebase, DBpedia is a community curated collection of structured data, but it differs in that its data has been extracted from Wikipedia alone (whereas Freebase combines data from multiple sources with the contributions of its users).

To return synonyms from DBpedia I am making use of the page redirect information stored for each resource. If you have no idea what I mean by ‘page redirect’, bring up Wikipedia in your browser and search for ‘President Clinton.’ Now look at the page’s heading: ‘Bill Clinton.’ How did we get here? Magic. No wait, it was a page redirect. Now let’s take a look at the DBpedia version of the Bill Clinton resource:

http://dbpedia.org/page/Bill_Clinton

Scroll down the page to dbpedia-owl:wikiPageRedirects to see all of the RDF redirect triples for this resource. For example:

dbpedia:Bill_Clinton  dbpedia-owl:wikiPageRedirects   dbpedia:William_Jefferson_Clinton

This is a great way to get a bunch of synonyms for ‘Bill Clinton,’ but what if our term form is actually ‘William Jefferson Clinton’? Well, then we need to reverse the query. Here is a sample SPARQL pattern to do just that:

?x <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/Bill_Clinton>
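Wrapped up as a complete query that can be run against DBpedia’s public SPARQL endpoint (http://dbpedia.org/sparql), it looks like this:

SELECT ?x
WHERE {
  ?x <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/Bill_Clinton> .
}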

We can also get redirects of our redirects for extra synonym goodness:

<http://dbpedia.org/resource/Bill_Clinton> <http://dbpedia.org/ontology/wikiPageRedirects> ?y.
?x <http://dbpedia.org/ontology/wikiPageRedirects> ?y.

A UNION of all of the queries above would yield the following list (note the misspellings):

  • Bill Clinton
  • William Jefferson Clinton
  • BillClinton
  • Billl Clinton
  • 42nd President of the United States
  • William Jefferson Blythe III
  • Bull Clinton
  • William clinton
  • William Jefferson “Bill” Clinton
  • Bill Blythe IV
  • William Jefferson Blythe IV
  • Buddy (Clinton’s dog)
  • Bill clinton
  • William J. Clinton
  • Clinton
  • Bill
  • President Bill Clinton
  • President Clinton
  • Clinton Gore Administration
  • Bill Jefferson Clinton
  • Bill J. Clinton
  • Bil Clinton
  • WilliamJeffersonClinton
  • William Blythe III
  • William J. Blythe
  • William J. Blythe III
  • William J Clinton
  • Bill Clinton’s Post-Presidency
  • Bill Clinton’s Post Presidency
  • Bill Clinton\
  • Klin-ton

Huh, how did Buddy get in there?

Update 2016-08-24

I’ve created a Gist of the code I used to generate synonyms using DBpedia:

import sys
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia(term):
    term = term.strip()
    nterm = term.capitalize().replace(' ', '_')
    # Collect the English labels of redirects pointing to and from the
    # resource, following the redirect chains one and two hops out.
    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label
    WHERE
    {
      {
        <http://dbpedia.org/resource/VALUE> <http://dbpedia.org/ontology/wikiPageRedirects> ?x.
        ?x rdfs:label ?label.
      }
      UNION
      {
        <http://dbpedia.org/resource/VALUE> <http://dbpedia.org/ontology/wikiPageRedirects> ?y.
        ?x <http://dbpedia.org/ontology/wikiPageRedirects> ?y.
        ?x rdfs:label ?label.
      }
      UNION
      {
        ?x <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/VALUE>.
        ?x rdfs:label ?label.
      }
      UNION
      {
        ?y <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/VALUE>.
        ?x <http://dbpedia.org/ontology/wikiPageRedirects> ?y.
        ?x rdfs:label ?label.
      }
      FILTER (lang(?label) = 'en')
    }
    """
    nquery = query.replace('VALUE', nterm)
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(nquery)
    sparql.setReturnFormat(JSON)
    try:
        results = sparql.query().convert()
    except Exception as e:
        return "Problem communicating with the server: " + str(e)
    if len(results["results"]["bindings"]) == 0:
        return "No results found"
    rterms = []
    for result in results["results"]["bindings"]:
        rterms.append(result["label"]["value"])
    return ', '.join(rterms)

if __name__ == "__main__":
    print(dbpedia(sys.argv[1]))
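Run it from the command line with a term as the lone argument:

python dbpedia_redirects.py 'Bill Clinton'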
