Difference between revisions of "OCR Service"

From Gcube Wiki

Revision as of 23:41, 20 June 2011

This is a stateful Web Service that wraps the Optical Character Recognition (OCR) application developed by the INSPIRE team.


Notes to administrator:

For the OCR service to work in a scope, that scope must contain a Scientific Linux 5 (SL5) execution node with the ocropus software installed. An SL5 node that can execute OCR must fulfill the following requirements:

a) be an SL5 node and declare this in the $GLOBUS_LOCATION/conf/GHNLabels.xml file with the following XML elements:

       <Variable>
       <Key>other.GlueHostOperatingSystemName</Key>
       <Value>ScientificSL</Value>
       </Variable>
       <Variable>
       <Key>other.GlueHostOperatingSystemRelease</Key>
       <Value>5.0</Value>
       </Variable>

b) have the ocropus-0.3.1-i386 directory under $GLOBUS_LOCATION and declare this in the $GLOBUS_LOCATION/conf/GHNLabels.xml file with the following XML element:

       <Variable>
       <Key>software.ocropus</Key>
       <Value>true</Value>
       </Variable>
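
As a rough way to check requirements (a) and (b), the sketch below parses a set of Variable elements and tests for the three key/value pairs listed above. The wrapping `<Labels>` root element, the class name `GhnLabelCheck`, and both method names are assumptions made for illustration only; the actual layout of GHNLabels.xml may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class GhnLabelCheck {

    /** Reads every Variable element into a key -> value map. */
    static Map<String, String> readVariables(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse labels XML", e);
        }
        Map<String, String> vars = new HashMap<>();
        NodeList nodes = doc.getElementsByTagName("Variable");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element variable = (Element) nodes.item(i);
            NodeList keys = variable.getElementsByTagName("Key");
            NodeList values = variable.getElementsByTagName("Value");
            if (keys.getLength() > 0 && values.getLength() > 0) {
                vars.put(keys.item(0).getTextContent().trim(),
                         values.item(0).getTextContent().trim());
            }
        }
        return vars;
    }

    /** True when the node declares SL5 and the ocropus software label. */
    static boolean canRunOcr(Map<String, String> vars) {
        return "ScientificSL".equals(vars.get("other.GlueHostOperatingSystemName"))
            && "5.0".equals(vars.get("other.GlueHostOperatingSystemRelease"))
            && "true".equals(vars.get("software.ocropus"));
    }

    public static void main(String[] args) {
        // <Labels> is a hypothetical root element used only to make the fragment parseable.
        String xml = "<Labels>"
            + "<Variable><Key>other.GlueHostOperatingSystemName</Key><Value>ScientificSL</Value></Variable>"
            + "<Variable><Key>other.GlueHostOperatingSystemRelease</Key><Value>5.0</Value></Variable>"
            + "<Variable><Key>software.ocropus</Key><Value>true</Value></Variable>"
            + "</Labels>";
        System.out.println(canRunOcr(readVariables(xml)));
    }
}
```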
      

In addition, the ocrjob.sh script is expected to exist under the $GLOBUS_LOCATION directory of the node on which the OCRService service is running, so that it can be sent to the execution node (see File:Ocrjob.tar.gz).


If no SL5 node with ocropus installed is found in the scope, the OCR job submission fails. As a backup plan, you can use the following client to upload the ocropus.tar.gz file to the Content Management System:

Usage:

java org.gcube.execution.ocrservice.tests.UploadOcropusClient <scope> <location of ocropus.tar.gz file>

e.g.

java org.gcube.execution.ocrservice.tests.UploadOcropusClient /gcube/devNext $GLOBUS_LOCATION/ocropus.tar.gz

Make sure that the Content Management jars are on your classpath. A unique collection will be created in that scope, containing only the provided ocropus.tar.gz. The OCR Service will find that file and send it along with the other resources when a new OCR job is submitted through the JDL adaptor. Note that this backup plan has not worked as expected, because:

a) uploading the ocropus.tar.gz file takes around 50 minutes;

b) downloading the ocropus.tar.gz file might fail because the file is too large (18.2 megabytes), and even when the download succeeds, the OCR process takes much longer.


Notes to developer:

When the OCR service factory receives a call from a user, it tries to find a Workflow Engine instance in that scope, which it will use to submit a new job through the JDL adaptor. If it succeeds, it creates a Web Service resource (WS-resource) for that job, containing information such as the job name, execution id, and workflow engine endpoint. A background thread runs periodically: it collects all WS-resources, polls the workflow engine for the jobs that are still running, and updates the corresponding WS-resources.
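
The background update cycle described above could be sketched roughly as follows. This is not the service's actual code: `WorkflowEngine`, `JobRecord`, and `JobStatusUpdater` are hypothetical stand-ins for the workflow engine client, the per-job WS-resource, and the background thread, and the polling period is left to the caller.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class JobStatusUpdater {

    /** Hypothetical stand-in for the workflow engine's status query. */
    interface WorkflowEngine {
        boolean isCompleted(String executionId);
    }

    /** Hypothetical stand-in for the WS-resource kept per job. */
    static final class JobRecord {
        final String jobName;
        final String executionId;
        volatile boolean completed;
        volatile Instant lastPoll;

        JobRecord(String jobName, String executionId) {
            this.jobName = jobName;
            this.executionId = executionId;
        }
    }

    private final Map<String, JobRecord> jobs = new ConcurrentHashMap<>();
    private final WorkflowEngine engine;

    JobStatusUpdater(WorkflowEngine engine) {
        this.engine = engine;
    }

    /** Called after a successful submission to start tracking the job. */
    void register(JobRecord job) {
        jobs.put(job.executionId, job);
    }

    /** One pass: poll the engine for every job that is still running. */
    void pollOnce() {
        for (JobRecord job : jobs.values()) {
            if (job.completed) {
                continue; // finished jobs are not polled again
            }
            try {
                job.completed = engine.isCompleted(job.executionId);
                job.lastPoll = Instant.now();
            } catch (RuntimeException e) {
                // Keep the loop alive; the job is retried on the next pass.
                System.err.println("poll failed for " + job.executionId + ": " + e);
            }
        }
    }

    /** Repeats pollOnce() on a single background thread. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::pollOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

In this sketch a caller registers a record after each successful submission and calls `start(...)` once; the returned scheduler should be shut down when the service stops.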


Notes to user:


The OCR service can be consumed through the org.gcube.execution.ocrservice.tests.TestOCRService client. The client submits an OCR job by providing the PDF file (through an http, ftp, or cms reference, or by sending the payload of the file if it exists in the client's filesystem), optionally the language of the PDF, and a job name. It then polls the status of the job until completion.

java org.gcube.execution.ocrservice.tests.TestOCRService <ocr factory address> <gcube scope> <resource key> <resource access> <value> <language> <optional job name in >=0 words >

Example of use 1: The input file is an http reference

java org.gcube.execution.ocrservice.tests.TestOCRService http://jazzman.di.uoa.gr:8080/wsrf/services/gcube/execution/ocrservice/OCRServiceFactory /d4science.research-infrastructures.eu/INSPIRE NobelAnnounce.pdf Reference http://dl.dropbox.com/u/19792897/NobelAnnounce.pdf eng OCR on NobelAnnouce 2 page document

Example of use 2: The input file is in the Content Management System

java org.gcube.execution.ocrservice.tests.TestOCRService http://jazzman.di.uoa.gr:8080/wsrf/services/gcube/execution/ocrservice/OCRServiceFactory /d4science.research-infrastructures.eu/INSPIRE NobelAnnounce.pdf CMSReference cms://ab2f4d80-87b1-11e0-9fbc-f078a392f5cc/b1beaf60-87b1-11e0-9fbc-f078a392f5cc eng OCR on NobelAnnouce 2 page document

Example of use 3: The input file is in the local filesystem of the client

java org.gcube.execution.ocrservice.tests.TestOCRService http://jazzman.di.uoa.gr:8080/wsrf/services/gcube/execution/ocrservice/OCRServiceFactory /d4science.research-infrastructures.eu/INSPIRE NobelAnnounce.pdf InMessageBytes /home/stefanos/NobelAnnounce.pdf eng OCR on NobelAnnouce 2 page document
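
To make the positional layout of the commands above easier to read, here is a minimal sketch of how such an argument list could be split, following the order given in the usage line. The class `OcrClientArgs` and its field names are hypothetical and are not taken from the TestOCRService source; the only behaviour assumed is that every word after the language is joined into the job name.

```java
import java.util.Arrays;

public class OcrClientArgs {
    final String factoryAddress;
    final String scope;
    final String resourceKey;
    final String resourceAccess; // e.g. Reference, CMSReference, InMessageBytes
    final String value;
    final String language;
    final String jobName;        // empty when no job name words are given

    private OcrClientArgs(String[] a) {
        factoryAddress = a[0];
        scope = a[1];
        resourceKey = a[2];
        resourceAccess = a[3];
        value = a[4];
        language = a[5];
        // Everything after the sixth argument forms the job name (zero or more words).
        jobName = String.join(" ", Arrays.copyOfRange(a, 6, a.length));
    }

    static OcrClientArgs parse(String[] args) {
        if (args.length < 6) {
            throw new IllegalArgumentException(
                "expected at least 6 arguments, got " + args.length);
        }
        return new OcrClientArgs(args);
    }

    public static void main(String[] args) {
        OcrClientArgs parsed = parse(args);
        System.out.println("Access:   " + parsed.resourceAccess);
        System.out.println("Value:    " + parsed.value);
        System.out.println("Language: " + parsed.language);
        System.out.println("JobName:  " + parsed.jobName);
    }
}
```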


Example of results of the client:

TestOCRService client started with arguments:
OCR Factory Service: http://jazzman.di.uoa.gr:8081/wsrf/services/gcube/execution/ocrservice/OCRServiceFactory
Scope:               /gcube/devNext
Key:                 Frence.pdf
Access:              Reference
Value:               http://dl.dropbox.com/u/19792897/ocr_job_files/french.pdf
Language:            nld
JobName:             AN OCR in French bu nl lang 
 
Getting factory stub...
Preparing input of OCR job...
Submitting OCR job...
 
OCR job was submitted
Calling status() until completion
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:33:45 EEST 2011
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:33:45 EEST 2011
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:33:45 EEST 2011
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:33:45 EEST 2011
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:34:04 EEST 2011
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:34:04 EEST 2011
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:34:04 EEST 2011
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:34:04 EEST 2011
 
Description:    OCR job is running
Last poll date: Fri Jun 17 17:34:04 EEST 2011
 
Description:    OCR job finished with no reported errors
Last poll date: Fri Jun 17 17:34:29 EEST 2011
 
 
OCR Job has finished:
-----------------------------------------------------
Job Name:        AN OCR in French bu nl lang 
Description:     OCR job finished with no reported errors
Submited:        Fri Jun 17 17:33:45 EEST 2011
Last Poll:       Fri Jun 17 17:34:29 EEST 2011
Error:           null
ErrorDetails:    null
hocr   ssid: 	 null
pdf    ssid: 	 cms://14c1fb40-9116-11e0-90f7-ca34f60d2e2d/e1243600-98ee-11e0-9145-ca34f60d2e2d
stdout ssid: 	 cms://14c1fb40-9116-11e0-90f7-ca34f60d2e2d/dcbe6ae0-98ee-11e0-9145-ca34f60d2e2d
stderr ssid: 	 cms://14c1fb40-9116-11e0-90f7-ca34f60d2e2d/dea9c020-98ee-11e0-9145-ca34f60d2e2d

Different adaptors:

You can use this client to submit jobs to both gCube and gLite nodes.

To submit a job with the JDL adaptor, set the TestOCRService.subType field to SubmissionType.jdlAdaptor.

To submit a job with the grid adaptor, set the TestOCRService.subType field to SubmissionType.gridAdaptor and store your proxy in the filesystem at the location given by TestOCRService.proxyPath.
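
A pre-submission check for the two settings above might look like the sketch below. The enum here only mirrors the two values named in the text, and `SubmissionCheck` with its `check` method is a hypothetical helper, not part of TestOCRService; it simply assumes that the grid adaptor needs a readable proxy file while the JDL adaptor does not.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class SubmissionCheck {

    /** Mirrors the two values named in the text; the real enum may hold more. */
    enum SubmissionType { jdlAdaptor, gridAdaptor }

    /** Fails fast when the grid adaptor is selected without a readable proxy file. */
    static void check(SubmissionType subType, String proxyPath) {
        if (subType == SubmissionType.gridAdaptor
                && (proxyPath == null || !Files.isReadable(Paths.get(proxyPath)))) {
            throw new IllegalStateException(
                "gridAdaptor needs a readable proxy at: " + proxyPath);
        }
    }

    public static void main(String[] args) {
        check(SubmissionType.jdlAdaptor, null); // no proxy needed for the JDL adaptor
        System.out.println("jdlAdaptor configuration accepted");
    }
}
```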