InspireOCR
INSPIRE Optical Character Recognition (OCR)
Introduction
Optical character recognition (OCR) is the translation of scanned documents into machine-encoded text. The CERN library and many other digital repositories hold large numbers of scanned documents for which no textual information is available. It is therefore not possible to search for words or phrases in these documents or to apply techniques such as text mining. OCR has often been done using commercial services and tools, but powerful open source OCR tools are now available. With these tools the OCR process can be carried out on a single workstation or divided into many parallel grid computing jobs. The OCR tool used here is OCRopus.
OCRopus is a document analysis and OCR system featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. It is released under the Apache License and has a modular design through the use of plugins. OCRopus is also well suited to large-scale batch processing, so OCR tasks can be divided into independent grid jobs. A typical OCR process consists of selecting a set of scanned documents in pdf format and performing document layout analysis, line recognition and character identification. The output is in hOCR format (an HTML document), which can be converted into pdf format. All the tools needed for OCR are available in one package (a tar.gz file) that can be sent with the grid job; alternatively, the tools can be pre-installed on the grid nodes where the jobs are executed.
JDLAdaptor Examples
JDLAdaptor input files needed:
- JDL file
- resource file
- OCR code (available as a ".tar.gz" file)
- OCR job script
JDL file example:
[
Type = "Job";
JobType = "Normal";
Executable = "ocrjob.sh";
Arguments = "None";
StdOutput = "job.out";
StdError = "job.err";
VirtualOrganisation = "d4science.research-infrastructures.eu";
InputSandbox = {"NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job"};
OutputSandbox = {"NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"};
Requirements = other.GlueHostOperatingSystemName == ScientificSL && other.GlueHostOperatingSystemRelease == 5.0;
]
Resource file example:
scope # /d4science.research-infrastructures.eu/INSPIRE
jdl # /home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocr.jdl
chokeProgressEvents # false
chokePerformanceEvents # false
storePlans # true
NobelAnnounce.pdf#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/NobelAnnounce.pdf
ocrjob.sh#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocrjob.sh
ocropus-0.3.1-i386-JK.tar.gz#url#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz
The output is a pdf file with textual content and a file in hOCR format.
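The JDL example above is written for one specific file, NobelAnnounce.pdf. As a minimal sketch, the same description could be generated for another pdf with a small shell function. The name `make_jdl` is hypothetical, the script and package names are copied from the example, and the Requirements line is left out. It is also an assumption that the job script processes whichever pdf is present in the working directory.

```shell
#!/bin/sh
# make_jdl (hypothetical helper): print a JDL description for one pdf file.
# All fixed values are copied from the JDL example; only the pdf name varies.
make_jdl() {
    pdf=$1
    cat <<EOF
[
Type = "Job";
JobType = "Normal";
Executable = "ocrjob.sh";
Arguments = "None";
StdOutput = "job.out";
StdError = "job.err";
VirtualOrganisation = "d4science.research-infrastructures.eu";
InputSandbox = {"${pdf}", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job"};
OutputSandbox = {"${pdf}", "${pdf}.hocr", "job.out", "job.err"};
]
EOF
}
```

For example, `make_jdl scan42.pdf > ocr.jdl` would write a description for scan42.pdf. The resource file still has to list the same pdf.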
GridAdaptor Examples
GridAdaptor input files needed:
- JDL file
- resource file
- OCR code (available as a ".tar.gz" file)
- OCR job script
JDL file example:
[
Type = "Job";
JobType = "Normal";
Executable = "ocrjob.sh";
Arguments = "None";
StdOutput = "job.out";
StdError = "job.err";
VirtualOrganisation = "d4science.research-infrastructures.eu";
InputSandbox = {"NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job", "pdfopt"};
OutputSandbox = {"NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"};
]
Resource file example:
scope # /d4science.research-infrastructures.eu/INSPIRE
chokeProgressEvents # false
chokePerformanceEvents # false
storePlans # true
timeout # -1
pollPeriod # 60000
jdl # ocr.jdl # /home/jklem/d4s_process_engine/wsclient/GridAdaptorOCR/ocr.jdl
userProxy # userProxy # /home/jklem/d4s_process_engine/wsclient/userProxy
inData#pdfopt#/usr/bin/pdfopt#local
inData#NobelAnnounce.pdf#/home/jklem/gridsubmit/gridsubmit/NobelAnnounce.pdf#local
inData#ocrjob.sh#/home/jklem/d4s_process_engine/wsclient/GridAdaptorOCR/ocrjob.sh#local
inData#ocropus-0.3.1-i386-JK.tar.gz#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz#url
inData#job_d4science_grid.job#/path/job_d4science_grid.job#local
outData#NobelAnnounce.pdf
outData#NobelAnnounce.pdf.hocr
outData#job.out
outData#job.err
The output is a pdf file with textual content and a file in hOCR format.
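In the GridAdaptor resource file each input appears as "inData#name#location#local" (or "#url") and each output as "outData#name". The entries that depend on the pdf file could be generated as in the sketch below. The function names are hypothetical, and the field layout is inferred only from the example above, not from GridAdaptor documentation.

```shell
#!/bin/sh
# Hypothetical helpers that print GridAdaptor resource file entries.
# The field order is taken from the resource file example.

# grid_in_local <path>: entry for an input file stored on the local disk.
grid_in_local() {
    printf 'inData#%s#%s#local\n' "$(basename "$1")" "$1"
}

# grid_out <name>: entry for a file to be retrieved after the job.
grid_out() {
    printf 'outData#%s\n' "$1"
}

# grid_pdf_lines <path>: the three entries that depend on the pdf file,
# namely the input pdf, the output pdf and the hOCR output.
grid_pdf_lines() {
    name=$(basename "$1")
    grid_in_local "$1"
    grid_out "$name"
    grid_out "$name.hocr"
}
```

Calling `grid_pdf_lines /home/jklem/gridsubmit/gridsubmit/NobelAnnounce.pdf` prints the three NobelAnnounce entries of the example. The remaining entries (scope, jdl, userProxy, job script, package) stay the same between jobs.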
OCR Job Script Examples
OCR job script for one pdf file (ocrjob.sh):
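#!/bin/sh
# -*- coding: utf-8 -*-

# Mark the start of the job on both stdout and stderr.
echo "Start."
echo "Start." 1>&2

# The package name must match the archive listed in the InputSandbox.
OCROPUS_PACKAGE=./ocropus-0.3.1-i386-JK.tar.gz
OCROPUS_PATH=./ocropus-0.3.1-i386
PYTHON_MAIN=$OCROPUS_PATH/run.py

# Point OCRopus and Tesseract at the unpacked scripts, data and libraries.
export OCROSCRIPTS=$OCROPUS_PATH/share/ocropus/scripts
export OCRODATA=$OCROPUS_PATH/share/ocropus
export TESSDATA_PREFIX=$OCROPUS_PATH/share/
export LD_LIBRARY_PATH=$OCROPUS_PATH/lib

# Prepend the bundled Python modules, keeping any existing PYTHONPATH.
if [ -n "$PYTHONPATH" ] ; then
    export PYTHONPATH=$OCROPUS_PATH/python:$PYTHONPATH
else
    export PYTHONPATH=$OCROPUS_PATH/python
fi

# Unpack the tools, run the OCR, then remove the tools and temporary files.
tar -xzf $OCROPUS_PACKAGE
python $PYTHON_MAIN
rm -rf $OCROPUS_PATH conversion* tmp*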
The OCR job script for many pdf files is shown below.
In this case the pdf files are zipped into many_pdfs_in.zip, which replaces the pdf input file of the other examples.
The files can be zipped with the "zip many_pdfs_in.zip *.pdf" command.
The file many_pdfs_out.zip replaces the pdf and hocr output files of the other examples.
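#!/bin/sh
# -*- coding: utf-8 -*-

# Mark the start of the job on both stdout and stderr.
echo "Start."
echo "Start." 1>&2

# The package name must match the archive listed in the InputSandbox.
OCROPUS_PACKAGE=./ocropus-0.3.1-i386-JK.tar.gz
OCROPUS_PATH=./ocropus-0.3.1-i386
PYTHON_MAIN=$OCROPUS_PATH/run.py

# Point OCRopus and Tesseract at the unpacked scripts, data and libraries.
export OCROSCRIPTS=$OCROPUS_PATH/share/ocropus/scripts
export OCRODATA=$OCROPUS_PATH/share/ocropus
export TESSDATA_PREFIX=$OCROPUS_PATH/share/
export LD_LIBRARY_PATH=$OCROPUS_PATH/lib

# Prepend the bundled Python modules, keeping any existing PYTHONPATH.
if [ -n "$PYTHONPATH" ] ; then
    export PYTHONPATH=$OCROPUS_PATH/python:$PYTHONPATH
else
    export PYTHONPATH=$OCROPUS_PATH/python
fi

# Unpack the input pdf files and the tools, then run the OCR.
unzip many_pdfs_in.zip
tar -xzf $OCROPUS_PACKAGE
python $PYTHON_MAIN
rm -rf $OCROPUS_PATH conversion* tmp*

# Collect all results into one archive and remove the loose files.
zip many_pdfs_out.zip *.pdf *.hocr
rm *.pdf *.hocr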
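After a batch job it can be useful to check that every pdf received an hOCR file. The sketch below assumes that many_pdfs_out.zip has been unpacked into a directory and that each hOCR file is named after its pdf with ".hocr" appended, as in the single-file example (NobelAnnounce.pdf.hocr). The function name `missing_hocr` is hypothetical.

```shell
#!/bin/sh
# missing_hocr <dir> (hypothetical helper): print every pdf in <dir>
# that has no matching <name>.pdf.hocr file next to it.
missing_hocr() {
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue           # the glob matched no file
        [ -e "$pdf.hocr" ] || echo "$pdf"   # report a pdf without hOCR output
    done
}
```

An empty result means that every pdf in the directory has an hOCR counterpart.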
Local OCR execution
The OCR process can be tested locally by executing the OCR job script. Local pdf files are processed and textual content is added to them.
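Because the job scripts end by deleting files in the working directory, a local test is safer in a scratch copy. The following is a minimal sketch, not part of the original tool set. The function name `run_local` is hypothetical, and the output names job.out and job.err simply mirror the StdOutput and StdError settings of the JDL examples.

```shell
#!/bin/sh
# run_local <job script> [input files...] (hypothetical helper):
# copy the job script and its inputs into a fresh scratch directory,
# run the script there, and print the directory that holds the results.
run_local() {
    script=$1
    shift
    workdir=$(mktemp -d) || return 1
    cp "$script" "$@" "$workdir"/ || return 1
    (
        cd "$workdir" || exit 1
        # Capture the output in the files a grid job would produce.
        sh "./$(basename "$script")" > job.out 2> job.err
    )
    status=$?
    echo "$workdir"
    return "$status"
}
```

A call such as `run_local ocrjob.sh NobelAnnounce.pdf ocropus-0.3.1-i386-JK.tar.gz` would leave the originals untouched. The package has to be passed as an input because the job script unpacks it from the working directory.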
OCR languages
The language for OCR processing can be selected with the "@ lang = " setting in the ".job" file. The software can also try different OCR languages automatically. The tested languages include English (eng) and French (fr).
An example ".job" file:
@ lang = fr
@ jobgroup = test
@ script = /path/ocring/ocrjob.sh
@ package = ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz
@ grid = d4science_grid
@ scope = /d4science.research-infrastructures.eu/INSPIRE
@ vo = d4science.research-infrastructures.eu
@ proxy = /path/gridsubmit/userProxy
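A single setting can be read from such a file with a few lines of shell. This is a minimal sketch, assuming one "@ key = value" setting per line with spaces around the key and the equals sign, as in the example. The function name `job_get` is hypothetical and not part of the OCR tools.

```shell
#!/bin/sh
# job_get <key> <job file> (hypothetical helper): print the value of the
# first "@ key = value" line with the given key. Prints nothing if absent.
job_get() {
    awk -v key="$1" '
        $1 == "@" && $2 == key && $3 == "=" {
            # Strip the "@ key = " prefix and keep the rest as the value.
            sub(/^@[ \t]*[^ \t=]+[ \t]*=[ \t]*/, "")
            print
            exit
        }' "$2"
}
```

For the file above, `job_get lang ocr.job` would print "fr", where ocr.job is an assumed file name.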
JDL requirements, choosing location for job execution
The OCR process needs some executables (such as pdfopt) to be present on the worker node. A suitable worker node can be selected by adding requirements to the jdl file.
The following example (to be added to the jdl file) was the recommended setting as of August 2011:
Requirements = other.GlueCEUniqueID == "cream-ce.research-infrastructures.eu:8443/cream-pbs-d4science";
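Adding the line by hand is simple, but for generated JDL files it can be scripted. The sketch below inserts a Requirements line just before the closing bracket of a JDL file and prints the result. The function name `add_requirement` is hypothetical, and the sketch assumes that the file does not already contain a Requirements attribute (as in the GridAdaptor example).

```shell
#!/bin/sh
# add_requirement <jdl file> <expression> (hypothetical helper):
# print the JDL with "Requirements = <expression>;" inserted before
# the final closing bracket. Fails if no closing bracket is found.
add_requirement() {
    awk -v req="$2" '
        { buf = buf $0 "\n" }                 # read the whole file
        END {
            i = match(buf, /\][ \t\n]*$/)     # position of the final "]"
            if (i == 0) exit 1
            printf "%sRequirements = %s;\n%s", substr(buf, 1, i - 1), req, substr(buf, i)
        }' "$1"
}
```

With the expression shown above, the call would be `add_requirement ocr.jdl 'other.GlueCEUniqueID == "cream-ce.research-infrastructures.eu:8443/cream-pbs-d4science"' > ocr_ce.jdl`, where the output file name is only an example.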