Do-It-Yourself Optical Character Recognition (OCR)

An application that does Optical Character Recognition (OCR), that is, extracts text from raster images, will usually set you back a few bob, but if you are not averse to the terminal, you can have a sound OCR solution for nothing. It’s all thanks to Google’s initiative to index all of the information in existence. Some information takes the form of text embedded in images; humans can read it without any problem, but computers find it more of a challenge. This has led Google to sponsor the open-source project OCRopus.

OCRopus is not supported on Mac OS X at this point, but it is based on an older tool called Tesseract, and Tesseract is quite easy to compile and run. Here’s how:

  1. Download the latest source code. At the time of writing, it’s version 2.03.
  2. From the same download page, download language data for any language you want to use OCR for. The latest pack for English is called English language data for Tesseract (2.00 and up).
  3. Unpack the source code bundle, open Terminal, and change into the root directory (‘tesseract-2.03’ at time of writing).
  4. Issue the standard UNIX build command sequence, and enter your password when prompted.

    ./configure
    make
    sudo make install
    
  5. Unpack the language data, and move or copy each item in the tessdata directory into the directory /usr/local/share/tessdata/. Replace the files already in /usr/local/share/tessdata/ — which are just placeholders — with the ones you unpacked.

    cd ~/Downloads
    sudo cp tessdata/* /usr/local/share/tessdata/
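
Step 4 assumes that a C++ compiler and make are already installed; on Mac OS X they come with Apple's developer tools (Xcode). If ./configure stops with "C++ compiler cannot create executables", or the shell reports "make: command not found", they are probably missing. Here is a minimal sketch of a check you could run first. Note that missing_tools is a hypothetical helper name chosen for this example, not something that ships with Tesseract.

```shell
# Print the name of each listed command that cannot be found on the PATH.
# (missing_tools is a made-up helper for illustration.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Run `missing_tools g++ make` before step 4. No output means both tools were found; any name that is printed needs to be installed before the build can work.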
    

Your installation is now complete. Time to test it.

Tesseract works only with TIFF images, so if you have another format, you need to use an application like Preview to convert it to TIFF. Once you have a TIFF image with some text in it — and it must have the extension .tif — you can use Tesseract to extract the text like this:

    /usr/local/bin/tesseract someimage.tif someimage_text

This should produce a text file called someimage_text.txt.
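
If you have a whole folder of scans, the same command can be wrapped in a loop. The following is only a sketch: ocr_all is a hypothetical function name, and it assumes that tesseract is on your PATH (true when /usr/local/bin is on it) and that every image already has the .tif extension.

```shell
# Run tesseract on every .tif file in a directory (default: current directory).
# For an image called foo.tif the text ends up next to it in foo.txt,
# because tesseract appends .txt to the output name it is given.
ocr_all() {
    dir=${1:-.}
    for image in "$dir"/*.tif; do
        [ -e "$image" ] || continue    # the pattern matched no files
        tesseract "$image" "${image%.tif}"
    done
}
```

For example, `ocr_all ~/Desktop/scans` would process each TIFF in that (made-up) folder in turn.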

My tests have shown that Tesseract does a reasonable job of extracting text, but it only works well with images of reasonably high resolution. If the resolution is too low, you end up with gobbledygook.

Lastly, you can use Automator to make Tesseract a bit more user-friendly. I’ve created a workflow that prompts the user to select an image, converts it into TIFF format, runs Tesseract, and presents the text to the user in their default text editor.
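
The same convert, recognise, and display steps can also be approximated in the shell. This is not the Automator workflow itself, just a rough sketch under a few assumptions: ocr_image is a hypothetical function name, the conversion uses the sips tool that ships with Mac OS X, tesseract is on your PATH, and `open -t` hands the result to your default text editor.

```shell
# Convert an image to TIFF, run tesseract on it, and open the resulting text.
# The text is written next to the original, e.g. picture.png -> picture.png.txt
# (ocr_image is a made-up name for illustration.)
ocr_image() {
    src=$1
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/ocr.XXXXXX") || return 1
    tif="$workdir/page.tif"            # tesseract expects the .tif extension

    sips -s format tiff "$src" --out "$tif" >/dev/null &&
        tesseract "$tif" "$src" &&     # tesseract appends .txt to "$src"
        open -t "$src.txt"
    rc=$?

    rm -rf "$workdir"                  # remove the temporary TIFF
    return $rc
}
```

Calling `ocr_image picture.png` (any image format that sips can read) should leave picture.png.txt beside the original and open it for you.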

Comments

OCRopus: people are saying it works in Mac OS X

Did some research this weekend on the exact same topic, and found that some people have actually built OCRopus on Mac OS X:

http://code.google.com/p/ocropus/issues/detail?id=1

Should there be a difference in performance?
Are you happy with your results?

Re: OCRopus

I didn't test extensively, so I don't think I can draw major conclusions regarding speed.
I'm sure you could make OCRopus work, but it is not supported and seems to have a number of dependencies that need to be installed first. Tesseract was quite easy to install, so I used that. In the long run, OCRopus will probably be the better choice.

Drew

------------------------
Drew McCormack
http://www.maccoremac.com
http://www.macanics.net
http://www.macresearch.org

Funny...

Funny, I tried compiling OCRopus two weeks ago, but failed after installing many additional packages and adjusting the makefiles here and there. I also compiled Tesseract while trying to get OCRopus to run, since I thought OCRopus used it. Only now, after your article, do I realize that Tesseract also works on its own. :D :-)

Please help me, I want to understand.

log dump:
Last login: Mon Feb 2 09:31:55 on console
eoins-macbook:~ Eoin$ cd desktop
eoins-macbook:desktop Eoin$ cd tesseract-2.03
eoins-macbook:tesseract-2.03 Eoin$ ./configure
checking build system type... i686-apple-darwin9.6.0
checking host system type... i686-apple-darwin9.6.0
checking for cl.exe... no
checking for g++... no
checking for C++ compiler default output file name... configure: error: C++ compiler cannot create executables
See `config.log' for more details.
eoins-macbook:tesseract-2.03 Eoin$ ./configure
checking build system type... i686-apple-darwin9.6.0
checking host system type... i686-apple-darwin9.6.0
checking for cl.exe... no
checking for g++... no
checking for C++ compiler default output file name... configure: error: C++ compiler cannot create executables
See `config.log' for more details.
eoins-macbook:tesseract-2.03 Eoin$ make
-bash: make: command not found
eoins-macbook:tesseract-2.03 Eoin$ sudo make install
Password:
sudo: make: command not found
eoins-macbook:tesseract-2.03 Eoin$

I'm dying to get this working; I have read loads about Tesseract.

Any tips?

OCRopus is now ported to Mac OS X via TakOCR

For those interested in using OCRopus on Mac OS X, check out TakOCR:

http://stuporglue.org/tako/