Difference between revisions of "InspireIndexing"

From Gcube Wiki
(Created page with '=== INSPIRE Parallel Indexing === Indexing documentation.')
 
(INSPIRE Parallel Indexing: added basic info and arguments desc)
 
Line 1: Line 1:
 
=== INSPIRE Parallel Indexing ===
 
=== INSPIRE Parallel Indexing ===
  
- Indexing documentation.
+ The parallel indexing module is designed for indexing very large collections of full text documents (hundreds of thousands or millions)
+ The implementation uses Hadoop and Lucene libraries and is meant to be executed on a Hadoop facility.
+ The input documents as well as computed indexes are located on the Hadoop DFS.
+ The arguments for an indexing job:
+ * Input directory
+ * Output directory
+ * number of workers

Latest revision as of 16:30, 16 June 2011

INSPIRE Parallel Indexing

The parallel indexing module is designed for indexing very large collections of full-text documents (hundreds of thousands or millions). The implementation uses the Hadoop and Lucene libraries and is meant to be executed on a Hadoop facility. Both the input documents and the computed indexes are stored on the Hadoop Distributed File System (HDFS).

An indexing job takes the following arguments:

  • Input directory
  • Output directory
  • Number of workers
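
The three arguments above could be collected and validated as in the following minimal sketch. It is an illustration only, not the module's actual command-line interface: the program name `inspire-indexing`, the field names, the argument order, and the example paths are all assumptions, and the real module (built on Hadoop and Lucene) may accept its arguments in a different form.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexingJobArgs:
    """The three job arguments listed above (field names are hypothetical)."""
    input_dir: str    # DFS directory holding the documents to index
    output_dir: str   # DFS directory where the computed indexes are written
    num_workers: int  # number of parallel workers


def parse_job_args(argv):
    """Parse [input_dir, output_dir, num_workers] into an IndexingJobArgs."""
    # "inspire-indexing" is a placeholder program name, not a real command.
    parser = argparse.ArgumentParser(prog="inspire-indexing")
    parser.add_argument("input_dir", help="input directory on the DFS")
    parser.add_argument("output_dir", help="output directory on the DFS")
    parser.add_argument("num_workers", type=int, help="number of workers")
    ns = parser.parse_args(argv)
    if ns.num_workers < 1:
        # parser.error prints a usage message and exits with status 2.
        parser.error("num_workers must be a positive integer")
    return IndexingJobArgs(ns.input_dir, ns.output_dir, ns.num_workers)


# Example call with made-up paths and a made-up worker count.
job = parse_job_args(["/user/inspire/docs", "/user/inspire/index", "8"])
print(job)
```

Rejecting a worker count below one before the job is submitted avoids starting a Hadoop job that cannot do any work.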