Difference between revisions of "Indexer Service"

From Gcube Wiki

Revision as of 17:36, 16 June 2011

This is a stateful Web Service that serves as a wrapper around the Parallel Indexing application developed by the INSPIRE team.


Notes to administrator:

Software requirements for the execution node cannot currently be specified through the Hadoop Adaptor. For the Indexer Service to work in a scope, the indexer jar file must therefore be uploaded manually to the Content Management System before an Indexer job is run. Use the client org.gcube.execution.indexerservice.tests.UploadIndexerJarClient for this purpose. It is invoked as:

java org.gcube.execution.indexerservice.tests.UploadIndexerJarClient <scope> <location of jar file>

e.g.

java org.gcube.execution.indexerservice.tests.UploadIndexerJarClient /gcube/devNext $GLOBUS_LOCATION/sample-indexing-mod7.jar

Make sure that the Content Management jars are on your classpath. The client creates a unique collection in scope "/gcube/devNext" that contains only the jar file "$GLOBUS_LOCATION/sample-indexing-mod7.jar" from your filesystem. The Indexer Service will find that file and send it along with the other resources when a new Indexer job is submitted through the Hadoop Adaptor.


If the Hadoop installation in the scope requires fully qualified names (e.g. it needs hdfs://node1.hadoop.research-infrastructures.eu:8020/user/INSPIRE/smalldata/ instead of /user/INSPIRE/smalldata/), add the following XML element to the $GLOBUS_LOCATION/GHNLabels.xml file of the container on the machine that hosts the Hadoop gateway:

       <Variable>
           <Key>hadoopLocationPrefix</Key>
           <Value>hdfs://node1.hadoop.research-infrastructures.eu:8020</Value>
       </Variable>
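
The effect of such a prefix can be illustrated with a minimal sketch. How the service actually applies hadoopLocationPrefix is not described here, so the class HadoopLocations and its qualify method are hypothetical names, and the rule used (prepend the prefix unless the location already carries a scheme) is an assumption:

```java
// Hypothetical helper, not part of the Indexer Service code base.
final class HadoopLocations {
    private HadoopLocations() {}

    /**
     * Returns the location unchanged if it already contains a scheme
     * (e.g. "hdfs://..."); otherwise prepends the configured prefix.
     * This rule is an assumption made for illustration.
     */
    static String qualify(String prefix, String location) {
        if (prefix == null || prefix.isEmpty() || location.contains("://")) {
            return location;
        }
        // Normalise the slash between prefix and path so it appears exactly once.
        String base = prefix.endsWith("/") ? prefix.substring(0, prefix.length() - 1) : prefix;
        String path = location.startsWith("/") ? location : "/" + location;
        return base + path;
    }

    public static void main(String[] args) {
        String prefix = "hdfs://node1.hadoop.research-infrastructures.eu:8020";
        System.out.println(qualify(prefix, "/user/INSPIRE/smalldata/"));
    }
}
```

With the prefix from the XML element above, the relative location /user/INSPIRE/smalldata/ becomes the fully qualified name given in the example, while a location that is already fully qualified is left untouched.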


Notes to developer:

When the Indexer Service factory receives a call from a user, it looks for a Workflow Engine instance in that scope, which it will use to submit a new job through the Hadoop Adaptor. On success, it creates a Web Service resource (WS-resource) for that job, holding information such as the job name, the execution id and the Workflow Engine endpoint. A background thread runs periodically: it collects all WS-resources, polls the Workflow Engine for the jobs that are still running, and updates the corresponding WS-resources.
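
The polling pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not the service's implementation: JobStatusPoller, JobResource, the status strings "RUNNING" and "COMPLETED", and the lookup function standing in for the Workflow Engine call are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch of a periodic status poller; all names are assumptions.
final class JobStatusPoller {

    /** Stand-in for the WS-resource that describes one submitted job. */
    static final class JobResource {
        final String jobName;
        final String executionId;
        final String engineEndpoint;
        volatile String status = "RUNNING"; // assumed initial status

        JobResource(String jobName, String executionId, String engineEndpoint) {
            this.jobName = jobName;
            this.executionId = executionId;
            this.engineEndpoint = engineEndpoint;
        }
    }

    private final Map<String, JobResource> resources = new ConcurrentHashMap<>();
    // Stand-in for the remote call that asks the Workflow Engine for a job's status.
    private final Function<JobResource, String> engineStatusLookup;

    JobStatusPoller(Function<JobResource, String> engineStatusLookup) {
        this.engineStatusLookup = engineStatusLookup;
    }

    void register(JobResource resource) {
        resources.put(resource.executionId, resource);
    }

    /** One polling pass: refresh only the jobs that are still running. */
    void pollOnce() {
        for (JobResource resource : resources.values()) {
            if ("RUNNING".equals(resource.status)) {
                resource.status = engineStatusLookup.apply(resource);
            }
        }
    }

    /** Starts the background thread; the caller shuts the executor down. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(this::pollOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

Skipping jobs that are no longer running keeps each pass proportional to the number of active jobs, so finished jobs cost no further calls to the Workflow Engine.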


Notes to user:

The Indexer Service can be consumed through the org.gcube.execution.indexerservice.tests.TestIndexerService client. That client submits a Parallel Indexing job, given the input location, the output location and the number of shards, and polls the status of the job until completion. The output of the job is a directory in HDFS.

Usage:

java org.gcube.execution.indexerservice.tests.TestIndexerService <indexer factory address> <gcube scope> <input location> <output location> <shards number> <optional job name of zero or more words>

Example of use:

java org.gcube.execution.indexerservice.tests.TestIndexerService http://jazzman.di.uoa.gr:8080/wsrf/services/gcube/execution/indexerservice/IndexerServiceFactory /gcube/devNext hdfs://node1.hadoop.research-infrastructures.eu:8020/user/INSPIRE/smalldata hdfs://node1.hadoop.research-infrastructures.eu:8020/tmp/output 5 Indexing by John in /user/INSPIRE/smalldata/texts
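
To make the argument layout concrete, the sketch below shows one way the command line above could be split into its parts. It is not the source of TestIndexerService: the class IndexerClientArgs and its field names are hypothetical, and joining all words after the shards number into the job name is an assumption based on the usage line.

```java
import java.util.Arrays;

// Hypothetical holder for the client's command-line arguments.
final class IndexerClientArgs {
    final String factoryAddress;
    final String scope;
    final String inputLocation;
    final String outputLocation;
    final int shards;
    final String jobName; // empty when no job name words are given

    private IndexerClientArgs(String factoryAddress, String scope, String inputLocation,
                              String outputLocation, int shards, String jobName) {
        this.factoryAddress = factoryAddress;
        this.scope = scope;
        this.inputLocation = inputLocation;
        this.outputLocation = outputLocation;
        this.shards = shards;
        this.jobName = jobName;
    }

    static IndexerClientArgs parse(String[] args) {
        if (args.length < 5) {
            throw new IllegalArgumentException(
                "expected: <indexer factory address> <gcube scope> <input location> "
                + "<output location> <shards number> [job name words...]");
        }
        int shards = Integer.parseInt(args[4]);
        // Every word after the shards number is treated as part of the job name.
        String jobName = String.join(" ", Arrays.copyOfRange(args, 5, args.length));
        return new IndexerClientArgs(args[0], args[1], args[2], args[3], shards, jobName);
    }
}
```

Under these assumptions, the example command yields 5 shards, the output location hdfs://node1.hadoop.research-infrastructures.eu:8020/tmp/output and the job name "Indexing by John in /user/INSPIRE/smalldata/texts".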