InspireOCR

From Gcube Wiki
Revision as of 11:51, 10 June 2011 by Jukka.klem

INSPIRE Optical Character Recognition (OCR)

Introduction

Optical character recognition (OCR) is the conversion of scanned document images into machine-encoded text. The CERN Library and many other digital repositories hold large numbers of scanned documents for which no textual information is available. As a result, these documents cannot be searched for words or phrases, and techniques such as text mining cannot be applied to them. OCR has often been carried out with commercial services and tools, but powerful open source OCR tools are now available. With these tools, the OCR process can run on a single workstation or be divided into many parallel grid computing jobs. The OCR tool used here is OCRopus.
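As a rough illustration of dividing the work into independent jobs, the sketch below partitions a list of scanned documents into fixed-size batches, one batch per job. The function name `split_into_batches`, the batch size, and the file names are hypothetical and chosen only for this example; they are not part of OCRopus or of any grid middleware, and job submission itself is not shown.

```python
from typing import List, Sequence


def split_into_batches(documents: Sequence[str], batch_size: int) -> List[List[str]]:
    """Partition a list of document names into consecutive batches.

    Each batch is meant to be handled by one independent job.
    The last batch may be smaller than batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(documents[start:start + batch_size])
        for start in range(0, len(documents), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical input: seven scanned PDFs, three documents per job.
    scans = [f"scan_{n:03d}.pdf" for n in range(1, 8)]
    for job_id, batch in enumerate(split_into_batches(scans, 3)):
        print(job_id, batch)
```

Because the batches do not share any state, each one could be processed on a different node, or one after another on a single workstation, without changing the result.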

OCRopus is a document analysis and OCR system featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multilingual capabilities. It is released under the Apache License and has a modular, plugin-based design. OCRopus is also well suited to large-scale batch processing, so OCR tasks can be divided into independent grid jobs. A typical OCR process consists of selecting a set of scanned documents in PDF format and then performing document layout analysis, line recognition, and character identification. The output is in hOCR format (an HTML document), which can be converted into PDF format. All the tools needed for OCR are available in one package (a tar.gz file) that can be sent with the grid job; alternatively, the tools can be pre-installed on the grid nodes where the jobs are executed.
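Since hOCR output is ordinary HTML, the recognized text can be recovered with a standard HTML parser, which is what makes the documents searchable. The following is a minimal sketch using only the Python standard library. It assumes that text lines are marked with the hOCR class `ocr_line`; the class `HocrLineExtractor`, the helper `extract_lines`, and the sample markup are made up for this example and are not part of OCRopus.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HocrLineExtractor(HTMLParser):
    """Collect the plain text of every element whose class includes 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0    # nesting depth inside the current ocr_line (0 = outside)
        self._buffer = []  # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth += 1
        elif "ocr_line" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join fragments and collapse runs of whitespace.
                self.lines.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_lines(hocr_html: str) -> list:
    """Return the text of each ocr_line element, in document order."""
    parser = HocrLineExtractor()
    parser.feed(hocr_html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    # Hypothetical hOCR fragment, not real OCRopus output.
    sample = (
        '<div class="ocr_page">'
        '<span class="ocr_line" title="bbox 10 10 200 30">Optical character recognition</span>'
        '<span class="ocr_line" title="bbox 10 40 200 60">second line</span>'
        "</div>"
    )
    for line in extract_lines(sample):
        print(line)
```

The extracted lines could then be passed to a full-text index. The `bbox` values in the `title` attribute are ignored here, although they would be needed to place the text back over the page image when producing a searchable PDF.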