Saturday, July 17, 2010

Optical Character Recognition (Gambas2 & Tesseract/OCROpus)


Introduction:

This topic is quite interesting and a lot of fun. To tell you frankly, I had only a little experience using such applications before, but this time I have a simple way to use these freely available OCR engines. Of course, back in our school days we had no cameras, mobile phones, or inexpensive digicams to use with this kind of software. Jotting everything down by hand during class hours was cumbersome, and a tool like this would have saved hours of copying notes!

Basically, this post complements my project "Robo-Book Scanner". What we will be discussing here is the main software that makes it easier to use an OCR engine (Tesseract or OCROpus) as a back-end application, together with Gambas (a VB-like language for Linux), which has the advantage of making it easy to design a front-end GUI.

As I said a while ago, open source software often requires a lot of tweaking before we can use it at our own convenience. Anyway, that is how we patronize and support the open community: freedom for all digital applications (my own term). Optical character recognition (OCR) is a way of converting scanned images of printed or handwritten text into machine-readable text. OCR software works by analyzing a document and comparing it with fonts stored in its database and/or by noting features typical of characters. Some OCR software also runs the result through a spell checker to “guess” unrecognized words. 100% accuracy is difficult to achieve, but a close approximation is what most software strives for.

To make it short, we need to work out how Gambas 2 and Tesseract can work together in one application, and see how well each performs as part of the Robo-Book Scanner software, which is an OpenGL release OCR software tool. So our intention is to use Gambas as the IDE/GUI for OCR, with Tesseract or OCROpus as its engine.

Lastly, here is a brief technical idea of what is involved:

1) We need to hack the Tesseract engine. I have read the article "Hacking Tesseract V0.04" on this website: http://tesseract-ocr.repairfaq.org/ If we could also test OCROpus, whose documentation can be found here: http://code.google.com/p/ocropus/wiki/InstallTranscript then so much the better!

2) We need to study how Gambas interfaces with C/C++, especially with shared libraries on Linux. See "How To Program Components In C/C++" on this website: http://gambasdoc.org/help/dev/overview

3) We need to know how to open image files in Gambas 2 (especially TIFF, JPEG, and BMP) and which free PDF converters are available. See this website: http://gamblis.com/2009/08/08/sharpdevelop-tutorial-how-to-rotate-an-image/


Objectives:


Requirements:

g++
scons
svn
libpng12-dev
libjpeg62-dev
libtiff4-dev
libavcodec-dev
libavformat-dev
libsdl-gfx1.2-dev
libsdl-image1.2-dev
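
Before building, it may help to confirm that the command-line tools in the list above are actually on the PATH. The snippet below is only a minimal sketch: missing_tools is a hypothetical helper name, and it checks the three build tools (g++, scons, svn) only, not the -dev library packages.

```shell
#!/bin/sh
# missing_tools NAME...: print each command that is not found on PATH.
# (Hypothetical helper for illustration; not part of any package.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the three build tools from the requirements list.
missing=$(missing_tools g++ scons svn)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing
else
    echo "All build tools found."
fi
```

Anything it reports as missing can then be installed with apt-get before continuing.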

Methodology:


Installing iuLib

Download the iulib 0.3 package from iulib's Google Code page: http://code.google.com/p/iulib/
To get any missing libraries, run: sudo apt-get install libpng12-dev libjpeg62-dev libtiff4-dev libavcodec-dev libavformat-dev libsdl-gfx1.2-dev libsdl-image1.2-dev

If you get errors downloading any of these libraries, change your package download server. See the note at the top of this document.

Run: sudo scons install (This will help avoid an error in the next step. Make sure to have scons installed.)
Run: sudo make install

Installing Tesseract

svn co http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr
cd tesseract-ocr
./configure
make
sudo make install
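
Once Tesseract is installed, a front end (for example a Gambas GUI) could call it as an external command. The sketch below assumes the usual command-line form "tesseract IMAGE OUTPUTBASE", where Tesseract appends ".txt" to the output base itself. The helper names ocr_text_path and ocr_page are hypothetical, and the image name is assumed to have a file extension.

```shell
#!/bin/sh
# ocr_text_path IMAGE: print the name of the text file Tesseract would write
# for IMAGE, i.e. the image name with its extension replaced by ".txt".
# (Hypothetical helper; assumes IMAGE has an extension such as .tif.)
ocr_text_path() {
    printf '%s.txt\n' "${1%.*}"
}

# ocr_page IMAGE: run Tesseract on one image and print the resulting
# text file name. Returns 127 if tesseract is not installed.
ocr_page() {
    image=$1
    base=${image%.*}
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 127
    fi
    tesseract "$image" "$base" && ocr_text_path "$image"
}
```

For example, "ocr_page scans/page01.tif" would be expected to leave the recognized text in scans/page01.txt, which the GUI can then load and display.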

Installing OCROpus


# pick one of the following

release="-r ocropus-0.4.4" # this selects release 0.4.4
release=""

# download everything

hg clone $release https://iulib.googlecode.com/hg/ iulib
hg clone $release https://ocropus.googlecode.com/hg/ ocropus
hg clone $release https://ocroswig.ocropus.googlecode.com/hg/ ocroswig
hg clone $release https://ocropy.ocropus.googlecode.com/hg/ ocropy
wget -nd http://mohri-lt.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.1.tar.gz
hg clone $release https://pyopenfst.googlecode.com/hg/ pyopenfst
date;

# compile iulib

cd iulib
sudo sh uninstall
sudo sh ubuntu-packages
scons -j 4 sdl=1
sudo scons -j 4 sdl=1 install
cd ..
date;

# compile ocropus

cd ocropus
sudo sh uninstall
sudo sh ubuntu-packages
scons -j 4 omp=1
sudo scons -j 4 omp=1 install
cd ..
date;

# compile openfst

tar -zxvf openfst-1.1.tar.gz
cd openfst-1.1
./configure
make -j 4
sudo make install
cd ..
date;

# compile ocroswig

cd ocroswig
make
cd ..
date;

# compile ocropy

cd ocropy
sudo python setup.py install
cd ..
date;

# compile Python bindings for openfst

cd pyopenfst
make
cd ..
date;
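
The transcript above prints the date after each stage but keeps going even if a stage fails. One possible refinement, shown only as a sketch, is a small wrapper that logs a timestamp around each command and reports the exit status when something breaks. The name run_step is hypothetical and not part of the OCROpus instructions.

```shell
#!/bin/sh
# run_step NAME COMMAND...: run one build step, print timestamps around it,
# and return the command's exit status if it fails.
# (Hypothetical helper for illustration.)
run_step() {
    name=$1
    shift
    echo "== $name: started $(date)"
    if "$@"; then
        echo "== $name: finished $(date)"
    else
        status=$?
        echo "== $name: FAILED with exit status $status" >&2
        return "$status"
    fi
}
```

For example, "run_step iulib scons -j 4 sdl=1 || exit 1" would stop the whole script at the first failing stage instead of continuing with a broken build.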

You may have to update your LD_LIBRARY_PATH to include /usr/local/lib, for example:

export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib:/lib
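
The export above replaces whatever LD_LIBRARY_PATH already contained. If other entries need to be kept, a variant like the following sketch may be safer: it prepends a directory only when it is not already listed. The function name add_lib_dir is hypothetical.

```shell
#!/bin/sh
# add_lib_dir DIR: prepend DIR to LD_LIBRARY_PATH unless it is already
# listed, keeping any existing entries. (Hypothetical helper.)
add_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

add_lib_dir /usr/local/lib
```

Putting these lines in a shell startup file such as ~/.bashrc would apply the setting to every new terminal session.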

Remarks:


Conclusions:
