InspireOCR

From Gcube Wiki
Revision as of 16:25, 29 June 2011 by Jukka.klem (Talk | contribs) (JDLAdaptor Examples)

INSPIRE Optical Character Recognition (OCR)

Introduction

Optical character recognition (OCR) is the translation of scanned documents into machine-encoded text. The CERN library and many other digital repositories hold large numbers of scanned documents for which no textual information is available. It is therefore not possible to search these documents for words or phrases, or to apply techniques such as text mining. OCR has often been done with commercial services and tools, but powerful open source OCR tools are now available. With these tools, the OCR process can be carried out on a single workstation or divided into many parallel grid computing jobs. The OCR tool used here is OCRopus.

OCRopus is a document analysis and OCR system featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multilingual capabilities. It is released under the Apache License and has a modular, plugin-based design. OCRopus is also well suited to large-scale batch processing, so OCR tasks can be divided into independent grid jobs. A typical OCR process consists of selecting a set of scanned documents in PDF format and performing document layout analysis, line recognition and character identification. The output is in hOCR format (an HTML document), which can be converted into PDF format. All the tools needed for OCR are available in one package (a tar.gz file) that can be sent with the grid job; alternatively, the tools can be pre-installed on the grid nodes where the jobs are executed.
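Because hOCR is ordinary HTML, the recognised text can be pulled out of it with a standard HTML parser. The following is a minimal sketch, not part of the OCRopus package: it assumes that text lines are marked with the class ocr_line, as described in the hOCR specification, and the sample markup and the names HocrLineExtractor and hocr_to_text are made up for illustration. The markup actually produced by a given OCRopus version may differ.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HocrLineExtractor(HTMLParser):
    """Collect the text of every element whose class includes 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines of text
        self._depth = 0    # nesting depth inside the current ocr_line element
        self._parts = []   # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1           # nested element, e.g. a word span
        elif "ocr_line" in classes:
            self._depth = 1            # a new line starts here
            self._parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace before storing the line.
            self.lines.append(" ".join("".join(self._parts).split()))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def hocr_to_text(hocr):
    """Return the text of all ocr_line elements, one line per element."""
    parser = HocrLineExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(parser.lines)


# Hypothetical hOCR fragment, used only to exercise the sketch.
sample = """<html><body><div class='ocr_page'>
<span class='ocr_line' title='bbox 10 10 200 30'>Nobel Prize</span>
<span class='ocr_line' title='bbox 10 40 200 60'><span class='ocrx_word'>in</span> <span class='ocrx_word'>Physics</span></span>
</div></body></html>"""

extracted = hocr_to_text(sample)
print(extracted)
```

The extracted text could then be fed to a search index or a text mining tool, which is the use case motivating the OCR step.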

JDLAdaptor Examples

JDLAdaptor input files needed:

  • JDL file
  • resource file
  • OCR code
  • OCR job file


JDL file example:

[
    Type = "Job";
    JobType = "Normal";
    Executable = "ocrjob.sh";
    Arguments = "None";
    StdOutput = "job.out";
    StdError = "job.err";
    VirtualOrganisation = "d4science.research-infrastructures.eu";
    InputSandbox = {"NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job"};
    OutputSandbox = {"NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"};
    Requirements = other.GlueHostOperatingSystemName == "ScientificSL" &&
                   other.GlueHostOperatingSystemRelease == 5.0;
]
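Every file named in InputSandbox has to reach the job somehow, so it can be handy to list the sandbox entries programmatically. The sketch below is an illustration only: it uses a regular expression rather than a real JDL parser, so it only copes with simple list attributes like the ones in the example above, and the function name sandbox_files is an assumption made for this page.

```python
import re


def sandbox_files(jdl_text, attribute):
    """Return the quoted file names listed in a JDL attribute such as InputSandbox."""
    # Match: <attribute> = { ... }   (the list may span several lines)
    pattern = rf"\b{re.escape(attribute)}\s*=\s*\{{(.*?)\}}"
    match = re.search(pattern, jdl_text, re.S)
    if match is None:
        return []
    # Every double-quoted string inside the braces is one file name.
    return re.findall(r'"([^"]*)"', match.group(1))


# Shortened copy of the JDL example above.
jdl = """[
    Type = "Job";
    Executable = "ocrjob.sh";
    InputSandbox = {"NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job"};
    OutputSandbox = {"NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"};
]"""

input_files = sandbox_files(jdl, "InputSandbox")
output_files = sandbox_files(jdl, "OutputSandbox")
print(input_files)
print(output_files)
```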


Resource file example:

scope # /d4science.research-infrastructures.eu/INSPIRE
jdl # /home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocr.jdl
chokeProgressEvents # false
chokePerformanceEvents # false
storePlans # true
NobelAnnounce.pdf#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/NobelAnnounce.pdf
ocrjob.sh#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocrjob.sh
ocropus-0.3.1-i386-JK.tar.gz#url#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz
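The resource file above mixes two kinds of lines: settings of the form "key # value" and file entries of the form "name#kind#location", where kind is local or url. The following sketch reads such a file into two dictionaries. The layout is inferred from this example alone, and the function name parse_jdl_adaptor_resources is made up for illustration; the adaptor's own parser may accept other forms.

```python
def parse_jdl_adaptor_resources(text):
    """Split a JDLAdaptor-style resource file into settings and file entries."""
    settings = {}   # e.g. {"scope": "/d4science.research-infrastructures.eu/INSPIRE"}
    files = {}      # e.g. {"ocrjob.sh": ("local", "/path/ocrjob.sh")}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # At most three fields, so a '#' inside the location would be preserved.
        fields = [field.strip() for field in line.split("#", 2)]
        if len(fields) == 2:
            key, value = fields
            settings[key] = value
        elif len(fields) == 3:
            name, kind, location = fields
            files[name] = (kind, location)
        else:
            raise ValueError(f"unrecognised resource line: {raw!r}")
    return settings, files


# Lines copied from the resource file example above.
resource_text = """scope # /d4science.research-infrastructures.eu/INSPIRE
jdl # /home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocr.jdl
storePlans # true
ocrjob.sh#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocrjob.sh
ocropus-0.3.1-i386-JK.tar.gz#url#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz
"""

settings, files = parse_jdl_adaptor_resources(resource_text)
print(settings["scope"])
print(files["ocrjob.sh"])
```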


The output is a PDF file with textual content and a file in hOCR format.

GridAdaptor Examples

GridAdaptor input files needed (detailed examples will be added):

  • JDL file
  • resource file
  • OCR code
  • OCR job file


JDL file example:

[
    Type = "Job";
    JobType = "Normal";
    Executable = "ocrjob.sh";
    Arguments = "None";
    StdOutput = "job.out";
    StdError = "job.err";
    VirtualOrganisation = "d4science.research-infrastructures.eu";
 
    InputSandbox = {"NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job", "pdfopt"};
    OutputSandbox = {"NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"};
 
    Requirements = other.GlueCEUniqueID == "cream-ce.research-infrastructures.eu:8443/cream-pbs-d4science" &&
                   other.GlueHostOperatingSystemName == "ScientificCERNSLC" &&
                   other.GlueHostOperatingSystemRelease >= 5.0;
 
]


Resource file example:

scope # /d4science.research-infrastructures.eu/INSPIRE
chokeProgressEvents # false
chokePerformanceEvents # false
storePlans # true
timeout # -1
pollPeriod # 60000
jdl # ocr.jdl # /home/jklem/d4s_process_engine/wsclient/GridAdaptorOCR/ocr.jdl
userProxy # userProxy # /home/jklem/d4s_process_engine/wsclient/userProxy
inData#pdfopt#/usr/bin/pdfopt#local
inData#NobelAnnounce.pdf#/home/jklem/gridsubmit/gridsubmit/NobelAnnounce.pdf#local
inData#ocrjob.sh#/home/jklem/d4s_process_engine/wsclient/GridAdaptorOCR/ocrjob.sh#local
inData#ocropus-0.3.1-i386-JK.tar.gz#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz#url
inData#job_d4science_grid.job#/path/job_d4science_grid.job#local
outData#NobelAnnounce.pdf
outData#NobelAnnounce.pdf.hocr
outData#job.out
outData#job.err
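In the GridAdaptor resource file, each inData line has the form "inData#name#location#kind" and each outData line names one output file, so the entries should line up with the InputSandbox and OutputSandbox lists in the JDL. The sketch below parses the file and reports sandbox files that have no matching entry. The layout is inferred from this example only, and the function names are assumptions made for illustration, not part of the adaptor.

```python
def parse_grid_adaptor_resources(text):
    """Split a GridAdaptor-style resource file into settings, inputs and outputs."""
    settings, inputs, outputs = {}, {}, []
    for raw in text.splitlines():
        fields = [field.strip() for field in raw.split("#")]
        if fields == [""]:
            continue                       # blank line
        key = fields[0]
        if key == "inData" and len(fields) == 4:
            _, name, location, kind = fields
            inputs[name] = (kind, location)
        elif key == "outData" and len(fields) == 2:
            outputs.append(fields[1])
        elif len(fields) in (2, 3):
            # "key # value" or "key # name # path": keep the last field.
            settings[key] = fields[-1]
        else:
            raise ValueError(f"unrecognised resource line: {raw!r}")
    return settings, inputs, outputs


def sandbox_mismatches(sandbox, declared):
    """Return sandbox names that have no matching resource entry."""
    return [name for name in sandbox if name not in declared]


# Lines copied from the resource file example above.
resource_text = """scope # /d4science.research-infrastructures.eu/INSPIRE
timeout # -1
pollPeriod # 60000
jdl # ocr.jdl # /home/jklem/d4s_process_engine/wsclient/GridAdaptorOCR/ocr.jdl
inData#pdfopt#/usr/bin/pdfopt#local
inData#NobelAnnounce.pdf#/home/jklem/gridsubmit/gridsubmit/NobelAnnounce.pdf#local
inData#ocrjob.sh#/home/jklem/d4s_process_engine/wsclient/GridAdaptorOCR/ocrjob.sh#local
inData#ocropus-0.3.1-i386-JK.tar.gz#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz#url
inData#job_d4science_grid.job#/path/job_d4science_grid.job#local
outData#NobelAnnounce.pdf
outData#NobelAnnounce.pdf.hocr
outData#job.out
outData#job.err
"""

# Sandbox lists copied from the JDL example above.
input_sandbox = ["NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz",
                 "job_d4science_grid.job", "pdfopt"]
output_sandbox = ["NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"]

settings, inputs, outputs = parse_grid_adaptor_resources(resource_text)
missing_inputs = sandbox_mismatches(input_sandbox, inputs)
missing_outputs = sandbox_mismatches(output_sandbox, outputs)
print("missing inputs:", missing_inputs)
print("missing outputs:", missing_outputs)
```

An empty list from sandbox_mismatches means every sandbox file has a corresponding inData or outData line.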


The output is a PDF file with textual content and a file in hOCR format.

Local OCR execution

The OCR process can be tested locally by executing the OCR job file script. Local PDF files are processed and textual content is added to them.