Introduction:
Hello folks! Once again, this tutorial is all about building special functions that can be used to process images: OCR, barcode reading, grey-scaling and so on. Strictly speaking, this guide does not belong in the techno-blog, but I felt compelled to put the document here to avoid going through the same headache again. It has been a while since I last debugged the exact-image libraries, so I had forgotten the how-tos needed for a quick set-up, and I had not bothered to take notes back then. Just recently, when I decided to upgrade my deployed OS to a newer version, I did not expect that it would take me almost 3 days to compile those object modules and libraries. Really a headache. This time I have realized that it is not good practice to skip even small notes (patches, revisions, repositories and so on) about which files were patched into a program, especially if it is free.
Anyway, let me share what makes the ExactCode software useful. It is a fast, modern and generic image processing library. It includes codecs that allow library users to implement their own data sources and destinations, such as in-memory locations or network transfers. It is known as a viable alternative to ImageMagick. The needed code was prototyped in C++, just for speed, and achieved processing times of about 1/20th of what ImageMagick consumed. It also explores several new algorithms, e.g. for de-screening, data-dependent triangulation scaling, lossless JPEG transforms and others needed for fast image processing.
Below are the instructions for installing and building exactimage, which includes programs for fast image processing. I have also attached a video showing how the OCR program (hocr2pdf) makes text searchable in a PDF viewer. You can use each program from the command line; I have written how-tos and instructions in the testing portion of this post. Feel free to cut and paste the included examples and see for yourself whether it does its job better than expected. But hey, don't forget to jot down some notes before you forget the procedure. Otherwise you will be in for a headache when you want to try it again. Just a piece of advice, folks!
Requirements:
Linux OS: Fedora 18 64 bit
Server, i7 core
ExactCode image processing library
Cuneiform (installed)
Tesseract (installed)
Methodology:
Download
root@localhost# wget http://exactcode.de/exact-image.0.8.x.tar.bz2
root@localhost# svn co https://exactcode.de/exact-image/trunk exact-image.8.x
Installations:
root@localhost# yum install \
    gcc gcc-c++ libstdc++ \
    libXrender libXrender-devel \
    libaa libaa-devel \
    libX11 libX11-devel \
    agg agg-devel \
    freetype2 freetype2-devel \
    evas evas-devel \
    libjpeg libjpeg-devel \
    libtiff libtiff-devel \
    libpng libpng-devel \
    libungif libungif-devel \
    jasper jasper-devel \
    expat expat-devel \
    openexr openexr-devel \
    lcms lcms-devel \
    barcode barcode-devel \
    swig swig-devel \
    lua lua-devel \
    perl perl-devel perl-ExtUtils-Embed \
    php php-devel \
    python python-devel \
    ruby ruby-devel
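Before running a long install command like this one, it can help to see which packages are still missing. The following is only a sketch for illustration: the function name missing_packages and the PKG_QUERY variable are made up here, and it assumes an RPM-based system where "rpm -q" succeeds only for installed packages.

```shell
#!/bin/sh
# Print every package from the argument list that the query command
# does not report as installed.
# PKG_QUERY is a hypothetical override; the default is `rpm -q`.
# Note: if the query command itself is unavailable, every package
# is reported as missing.
missing_packages() {
    for pkg in "$@"; do
        if ! ${PKG_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
        fi
    done
}
```

For example, "missing_packages libtiff-devel libpng-devel agg-devel" prints only the names that still need to be installed, which also makes a handy line for your notes.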
root@localhost# tar -jxvf exact-image.0.8.x.tar.bz2
root@localhost# cd exact-image.8.x/
root@localhost# ./configure --prefix=/usr/local/scanner
root@localhost# make && make install
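Since the whole point of this post is to keep notes, one option is to capture the output of every build step in a log file. This is a minimal sketch and not part of ExactImage: run_logged and BUILD_LOG are names made up for illustration.

```shell
#!/bin/sh
# File that collects the build notes; BUILD_LOG is a hypothetical name.
BUILD_LOG="${BUILD_LOG:-build-notes.log}"

# Run a command, append the command line and all of its output to the
# log, and return the exit status of the command itself.
run_logged() {
    echo "### $(date) : $*" >> "$BUILD_LOG"
    "$@" >> "$BUILD_LOG" 2>&1
}

# Possible use inside the source directory (not executed here):
#   run_logged ./configure --prefix=/usr/local/scanner &&
#   run_logged make &&
#   run_logged make install
```

If a step fails, the chain stops and the end of the log file shows the error that needs a patch or a note.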
Testing:
Note:
This CLI-based program can create a searchable PDF from hOCR input.
hocr2pdf: a command line front-end for the image processing library that creates perfectly laid-out, searchable PDF files from hOCR (annotated HTML) input obtained from an OCR system.
(1) The hOCR (annotated HTML) input must be provided on STDIN, and the image data is read from the file named by the -i or --input argument. For example:
root@localhost# hocr2pdf -i scan.tiff -o test.pdf < cuneiform-out.hocr
(2) By default the text layer is hidden by the real image data. Including the image data can be disabled via the -n, --no-image option, so that just the recognized text from the OCR is visible, e.g. for debugging or to save storage space:
root@localhost# hocr2pdf -i scan.tiff -n -o test.pdf < cuneiform-out.hocr
(3) Too many gaps between the letters of individual words can be a problem with imprecise OCR data or with justified text that has huge gaps. hocr2pdf in ExactImage includes a special mode, activated with the command line argument -s, --sloppy-text, that groups the glyphs between whitespace into words, which can help PDF viewers produce better results when cutting and pasting text:
root@localhost# hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr
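The three calls above differ only in their extra options, so they can be wrapped in one small helper. This is a sketch under assumptions: make_searchable_pdf and the HOCR2PDF variable are hypothetical names, and only the options already described here (-i, -o, -n, -s) are used.

```shell
#!/bin/sh
# Program to call; HOCR2PDF is a hypothetical override, useful if the
# binary is not on the PATH (for example after a custom --prefix).
HOCR2PDF="${HOCR2PDF:-hocr2pdf}"

# Usage: make_searchable_pdf IMAGE HOCR OUTPUT [extra options such as -s or -n]
make_searchable_pdf() {
    if [ "$#" -lt 3 ]; then
        echo "usage: make_searchable_pdf IMAGE HOCR OUTPUT [options]" >&2
        return 2
    fi
    image=$1; hocr=$2; output=$3
    shift 3
    # Stop early if an input file is missing.
    if [ ! -f "$image" ] || [ ! -f "$hocr" ]; then
        echo "missing input: $image or $hocr" >&2
        return 1
    fi
    # The hOCR goes to STDIN; the image is named with -i.
    "$HOCR2PDF" -i "$image" "$@" -o "$output" < "$hocr"
}
```

With this in place, "make_searchable_pdf scan.tiff cuneiform-out.hocr test.pdf -s" corresponds to example (3), and leaving out the last argument corresponds to example (1).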
Details:
(0) Exact-Image.8.8 files
(1) Plain image (TIFF file)
(2) hOCR generated text
(3) hocr2pdf script, which is called each time OCR is processed
(4) OCR searchable text in a PDF viewer
Remarks:
(1) Troubles:
png error [1]
Shooting:
Note: the libpng12 code in ExactImage is deprecated and breaks the compilation, so it is better to delete "png.hh" and "png.cc" from the codecs directory of the source tree:
root@localhost# cd /codecs
root@localhost# rm -rf png.*
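Deleting source files is hard to undo, and it is exactly the kind of detail that gets forgotten later. As a gentler alternative, this sketch moves the two PNG codec files into a backup directory instead of removing them. The function name disable_png_codec and the default backup directory name are hypothetical.

```shell
#!/bin/sh
# Move png.hh and png.cc out of a codecs directory into a backup folder.
# Usage: disable_png_codec CODECS_DIR [BACKUP_DIR]
disable_png_codec() {
    codecs_dir=$1
    # Default backup location next to the codecs directory (assumed name).
    backup_dir=${2:-$codecs_dir/../disabled-codecs}
    mkdir -p "$backup_dir" || return 1
    for f in png.hh png.cc; do
        # Skip files that are already gone so the function can be re-run.
        if [ -f "$codecs_dir/$f" ]; then
            mv "$codecs_dir/$f" "$backup_dir/" || return 1
        fi
    done
}
```

Moving the files back restores the original tree, which is handy once a fixed release comes along.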
(2) Troubles
Video (OCR processing)