InspireIndexing

From Gcube Wiki
Revision as of 16:30, 16 June 2011 by Jan.iwaszkiewicz (Talk | contribs) (INSPIRE Parallel Indexing: added basic info and arguments desc)


INSPIRE Parallel Indexing

The parallel indexing module is designed for indexing very large collections of full-text documents (hundreds of thousands or millions). The implementation uses the Hadoop and Lucene libraries and is meant to be executed on a Hadoop facility. Both the input documents and the computed indexes are located on the Hadoop DFS. An indexing job takes the following arguments:

  • Input directory
  • Output directory
  • Number of workers
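
As an illustration only, the three arguments above could be validated before a job is submitted, as in the minimal sketch below. The class name IndexingJobArgs, the positional order of the arguments (taken from the order of the list) and the rule that at least one worker is required are assumptions made for this example; they are not taken from the module's actual code, and the sketch does not call Hadoop or Lucene.

```java
/**
 * Hypothetical holder for the three indexing job arguments.
 * The name, the argument order and the validation rules are
 * assumptions for illustration, not the module's real interface.
 */
final class IndexingJobArgs {
    final String inputDir;   // directory on the Hadoop DFS holding the input documents
    final String outputDir;  // directory on the Hadoop DFS that receives the computed indexes
    final int workers;       // number of parallel workers

    private IndexingJobArgs(String inputDir, String outputDir, int workers) {
        this.inputDir = inputDir;
        this.outputDir = outputDir;
        this.workers = workers;
    }

    /** Parses <input dir> <output dir> <number of workers>; throws on malformed input. */
    static IndexingJobArgs parse(String[] args) {
        if (args == null || args.length != 3) {
            throw new IllegalArgumentException(
                "expected: <input dir> <output dir> <number of workers>");
        }
        if (args[0].isEmpty() || args[1].isEmpty()) {
            throw new IllegalArgumentException("input and output directories must not be empty");
        }
        int workers;
        try {
            workers = Integer.parseInt(args[2].trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "number of workers must be an integer: " + args[2], e);
        }
        if (workers < 1) {
            // Assumed rule: a job needs at least one worker.
            throw new IllegalArgumentException("number of workers must be at least 1");
        }
        return new IndexingJobArgs(args[0], args[1], workers);
    }

    public static void main(String[] args) {
        try {
            IndexingJobArgs parsed = parse(args);
            System.out.println("input=" + parsed.inputDir
                + " output=" + parsed.outputDir
                + " workers=" + parsed.workers);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
            System.exit(2);
        }
    }
}
```

Checking the arguments up front means a malformed worker count or a missing directory is reported immediately, rather than after the job has been scheduled on the Hadoop facility.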