Thursday, July 4, 2013

HOCR2PDF - ExactCode-ExactImage in Linux(Fedora 1X)

Introduction:

Hello folks! Once again, this tutorial is about building special-purpose tools for image processing: OCR, barcode recognition, grey-scaling and so on. This guide doesn't really belong on a techno-blog, but I felt compelled to put it here to avoid another headache. It's been a while since I last debugged the ExactImage libraries, so I had forgotten the how-tos needed for a quick set-up, and I didn't bother taking notes back then. Recently, when I decided to upgrade my deployed OS to a newer version, I didn't expect that compiling those object modules and libraries would take me almost 3 days. A real headache. So this time I've realized that it's bad practice to skip even small notes (patches, revisions, repositories and so on) about the files patched into a program, especially when the software is free.

Anyway, let me share the usefulness of the ExactCode software. The software is a fast, modern and generic image processing library. It includes codecs and allows library users to implement their own data sources and destinations, such as in-memory locations or network transfers. It is known as a viable alternative to ImageMagick. Its authors prototyped the code they needed in C++, just for speed, and achieved processing times of about 1/20th of what ImageMagick consumed. It also explores several new algorithms, e.g. for de-screening, data-dependent triangulation scaling, lossless JPEG transforms and others needed for fast image processing.

Below are the instructions for installing and building ExactImage, which includes programs for fast image processing. I've also attached a video showing how the OCR program (hocr2pdf) lets you search for different texts in a PDF viewer. You can run each program from the command line; the how-tos and instructions are in the Testing portion of this post. You may cut and paste all the included examples and see for yourself whether it does its job better than expected. But hey, don't forget to jot down some notes before you forget the procedure. Otherwise you will get a headache in the future when you want to try it once more. Just a piece of advice, folks!

Requirements:
Linux OS: Fedora 18, 64-bit
Server, Core i7
ExactCode image processing library
Cuneiform  (installed)
Tesseract   (installed)

Methodology:

Download
root@localhost#  wget http://exactcode.de/exact-image.0.8.x.tar.bz2
root@localhost#  svn co https://exactcode.de/exact-image/trunk exact-image.8.x

Installation:

root@localhost# yum install \
gcc gcc-c++ libstdc++ \
libXrender libXrender-devel \
libaa libaa-devel \
libX11 libX11-devel \
agg agg-devel \
freetype2 freetype2-devel \
evas evas-devel \
libjpeg libjpeg-devel \
libtiff libtiff-devel \
libpng libpng-devel \
libungif libungif-devel \
jasper jasper-devel \
expat expat-devel \
openexr openexr-devel \
lcms lcms-devel \
barcode barcode-devel \
swig swig-devel \
lua lua-devel \
perl perl-devel perl-ExtUtils-Embed \
php php-devel \
python python-devel \
ruby ruby-devel
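
Before running ./configure, it can save time to confirm that the basic build tools are actually on the PATH. The following is only a minimal sketch and not part of ExactImage: the helper name missing_tools and the list of tools it checks are assumptions chosen for illustration.

```shell
#!/bin/sh
# Print the name of every command passed as an argument that cannot be
# found on the PATH. "missing_tools" is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check the tools used in this guide.
# The tool list is an assumption; extend it as needed.
missing=$(missing_tools gcc g++ make svn tar)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
fi
```

If nothing is printed, all of the listed tools were found. This only checks for commands, not for the -devel header packages.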

root@localhost# tar -jxvf exact-image.0.8.x.tar.bz2
root@localhost# cd exact-image.8.x/
root@localhost# ./configure --prefix=/usr/local/scanner
root@localhost# make && make install
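
Because of the non-standard --prefix, the shell will probably not find the newly installed programs by default. Here is a small sketch of one way to handle that. It assumes the binaries end up in /usr/local/scanner/bin (the usual bin subdirectory of the prefix), so check where make install actually put hocr2pdf. The helper name prepend_path is hypothetical.

```shell
#!/bin/sh
# Put a directory at the front of PATH unless it is already listed.
# "prepend_path" is a hypothetical helper name.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Assumption: "make install" placed the programs in <prefix>/bin.
prepend_path /usr/local/scanner/bin
```

Adding these lines to ~/.bashrc would make the change permanent for that user.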

Testing:
Note:
This CLI-based program can create a searchable PDF from hOCR input.

hocr2pdf is a command-line front-end for the image processing library. It creates well laid-out, searchable PDF files from hOCR (annotated HTML) input obtained from an OCR system.

(1) The hOCR (annotated HTML) input must be provided on STDIN, and the image data is read from the file named by the -i or --input argument. For example:

root@localhost# hocr2pdf -i scan.tiff -o test.pdf < cuneiform-out.hocr

(2) By default the text layer is hidden behind the real image data. Including the image data can be disabled with -n or --no-image, so that only the text recognized by the OCR is visible, e.g. for debugging or to save storage space:
root@localhost# hocr2pdf -i scan.tiff -n -o test.pdf < cuneiform-out.hocr

(3) Too many gaps between the letters of individual words can be a problem with imprecise OCR data or with justified text that has huge gaps. hocr2pdf in ExactImage includes a special mode, activated with the command-line argument -s or --sloppy-text, that groups the glyphs between whitespace into words. This can help PDF viewers produce better results when copying and pasting text:

root@localhost# hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr
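
The three examples above all start from an existing hOCR file. As a rough sketch of the whole round trip, the script below runs the OCR step first and then feeds the result to hocr2pdf. The function names ocr_to_pdf and pdf_has_text are made up for this example. It assumes that Cuneiform accepts "-f hocr -o FILE IMAGE" and can read the image format you give it, and that pdftotext (from poppler-utils) is installed for the final check.

```shell
#!/bin/sh
# ocr_to_pdf IMAGE PDF [extra hocr2pdf options, e.g. -s or -n]
# Hypothetical wrapper: OCR one image, then build a searchable PDF.
ocr_to_pdf() {
    image=$1
    pdf=$2
    shift 2
    hocr="${pdf%.pdf}.hocr"

    # Step 1: let the OCR engine write its result as hOCR.
    # Assumption: this Cuneiform build supports "-f hocr" and "-o".
    cuneiform -f hocr -o "$hocr" "$image" || return 1

    # Step 2: hocr2pdf reads the hOCR from STDIN and the image from -i.
    hocr2pdf -i "$image" "$@" -o "$pdf" < "$hocr"
}

# pdf_has_text PDF WORD
# Succeeds if WORD (case-insensitive) appears in the PDF's text layer.
pdf_has_text() {
    pdftotext "$1" - | grep -qi -- "$2"
}
```

For example, "ocr_to_pdf scan.tiff test.pdf -s" would correspond to example (3), and "pdf_has_text test.pdf invoice && echo found" would check that a word you expect in the scan is really searchable.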


Details:

(0) Exact-Image.8.8 files

(1) Plain image (TIFF file)

(2) hOCR-generated text

(3) hocr2pdf script, called for each OCR run

(4) OCR-searchable text in a PDF viewer

Remarks:
(1)Troubles:
png error [1]

Shooting:
Note: ExactImage's use of libpng12 is deprecated and breaks the compilation, so it is better to delete "png.hh" and "png.cc" from the codecs directory of the source tree.
root@localhost# cd codecs
root@localhost# rm -rf png.*

(2)Troubles:
/usr/bin/ld: cannot find -lXrender

Shooting:
Note: locate the Xrender files

root@localhost#  locate Xrender
/usr/lib/libXrender.so.1
/usr/lib/libXrender.so.1.3.0
/usr/lib/vmware/lib/libXrender.so.1
/usr/lib/vmware/lib/libXrender.so.1/libXrender.so.1
/usr/lib/vmware-installer/1.1/lib/lib/libXrender.so.1
/usr/lib/vmware-installer/1.1/lib/lib/libXrender.so.1/libXrender.so.1

Note: As far as the linker is concerned, you do not have libXrender.so (even though you have libXrender.so.1 and libXrender.so.1.3.0).
This is easy to fix, though. All you need is a symbolic link to the latest version. su to root (or use sudo if you prefer) and then run:
root@localhost# ln -s /usr/lib/libXrender.so.1.3.0 /usr/lib/libXrender.so
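
The same fix can be written a little more defensively, so that it does nothing if the link already exists and refuses to link to a file that is not there. This is only a sketch. The function name link_unversioned_so is hypothetical, and the paths in the usage comment are the ones from the locate output above, which may differ on your system.

```shell
#!/bin/sh
# link_unversioned_so VERSIONED_LIB LINK_NAME
# Create LINK_NAME as a symlink to VERSIONED_LIB unless it already exists.
# "link_unversioned_so" is a hypothetical helper name.
link_unversioned_so() {
    versioned=$1
    link=$2

    if [ ! -e "$versioned" ]; then
        echo "missing: $versioned" >&2
        return 1
    fi
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "already present: $link"
        return 0
    fi
    ln -s "$versioned" "$link"
}

# Usage (as root), with the paths found by locate:
#   link_unversioned_so /usr/lib/libXrender.so.1.3.0 /usr/lib/libXrender.so
```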

(3)Troubles:
make: *** [objdir/frontends/optimize2bw] Error 1

Shooting:
Adding "LDFLAGS += -lgif" to the Makefile fixes that.


Conclusion:
This open-source scanning software really helps advance OCR by producing readable and searchable text from images grabbed or captured on any source device.


Video(OCR processing)