WorkflowHadoopAdaptor

Overview

This adaptor, as part of the adaptors offered by the WorkflowEngine, constructs an Execution Plan that mediates the submission of a job written under the MapReduce design pattern and against the utilities offered by the Hadoop infrastructure. After its submission the job is monitored for its status and, once it has completed, the output files are retrieved and stored in the StorageSystem. The resources that are provided and need to be moved to the Hadoop infrastructure are all transferred through the StorageSystem: they are stored once the plan is constructed and are then retrieved once the execution is started. The Hadoop infrastructure utilized resides in a cloud infrastructure with which the ExecutionEngine negotiates resource availability.
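
The job that the adaptor submits is an ordinary Hadoop MapReduce program: a jar containing a class whose main method configures and launches the computation. As a point of reference only, the sketch below shows the canonical word-count job written against the Hadoop MapReduce API; it illustrates the kind of jar and main class the adaptor expects and is not code shipped with the adaptor itself.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Canonical word-count job: an example of the kind of jar/main class the
  // adaptor submits, not a component of the adaptor itself.
  public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private final Text word = new Text();

      // Emit (token, 1) for every token of the input line
      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      // Sum the counts collected for each token
      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    // The main method whose containing class is named when the job is submitted
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on the HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on the HDFS
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Such a jar, together with the name of its main class and any attached resources, is what the plan described below submits on the Hadoop UI node.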

Plan Template

The entire execution process takes place on the Hadoop UI node. This node is picked from the InformationSystem and is currently chosen randomly from all the available ones. Once the node has been picked, the execution currently cannot be moved to a different one, even if there is a problem communicating with that node. The execution is a series of steps performed sequentially; these steps include the following (a rough sketch of the corresponding operations is given after the list):

  • Contact the remote node
  • Retrieve the data stored in the StorageSystem; this includes the resources marked as Archive, Configuration, Files, Libraries, Jar and Input resources
    • If some of these resources are declared as stored in the Hadoop HDFS instead of the StorageSystem, calls are made to the HDFS to retrieve the marked content
  • Move the Input resources to the HDFS so that they are available to the MapReduce processes
  • Submit the job using the provided Jar file, specifying the Class containing the main method to execute and, optionally, any resources provided, such as configuration, properties, files, library jars, archives and arguments
  • Once the job has executed, and even if some error occurred, retrieve the standard output and standard error of the job and store them to the StorageSystem
  • If the job terminated successfully and it has been requested, retrieve the directory containing the output
    • A call to the HDFS is made to retrieve the directory; a tar.gz archive of this directory is then created and stored to the StorageSystem
  • Finally, remove the input and output directories if this has been requested
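
The steps above correspond to a fairly small sequence of HDFS and job-submission operations. The sketch below illustrates that sequence using the plain Hadoop Java API and a hadoop jar invocation; all paths, file names and the main class are hypothetical examples, and the plan actually generated by the adaptor may name, order and parameterize these operations differently.

  import java.io.File;
  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Illustrative only: paths, file names and the main class are made-up
  // examples, not identifiers used by the adaptor's generated plan.
  public class HadoopJobSteps {
    public static void main(String[] args) throws IOException, InterruptedException {
      Configuration conf = new Configuration();
      FileSystem hdfs = FileSystem.get(conf);
      Path input = new Path("/user/workflow/job-0001/input");
      Path output = new Path("/user/workflow/job-0001/output");

      // Move the Input resources to the HDFS so the MapReduce tasks can read them
      hdfs.copyFromLocalFile(new Path("records.txt"), input);

      // Submit the job through the provided jar and main class; extra files,
      // library jars and archives go through the standard generic options
      // (honoured when the main class parses them, e.g. via ToolRunner)
      Process job = new ProcessBuilder(
          "hadoop", "jar", "job.jar", "org.example.WordCount",
          "-files", "lookup.properties",
          "-libjars", "dependency.jar",
          "-archives", "resources.tar.gz",
          input.toString(), output.toString())
          .redirectOutput(new File("job.stdout"))  // later stored to the StorageSystem
          .redirectError(new File("job.stderr"))   // stored even if the job failed
          .start();
      int exitCode = job.waitFor();

      // On success, pull the output directory back so it can be packed as a
      // tar.gz archive and stored to the StorageSystem
      if (exitCode == 0) {
        hdfs.copyToLocalFile(output, new Path("job-output"));
      }

      // Finally remove the input and output directories if requested
      hdfs.delete(input, true);
      hdfs.delete(output, true);
    }
  }

The creation of the tar.gz archive from the retrieved output directory and the interaction with the StorageSystem are left out of the sketch for brevity.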