WorkflowCondorAdaptor

From Gcube Wiki
Jump to: navigation, search

Overview

This adaptor as part of the adaptors offered by the WorkflowEngine constructs an Execution Plan that can mediate to submit a job described through a Condor submit file using a Condor gateway node. After its submission the job is monitored for its status and once completed the workspace contents are retrieved and stored in the Storage System in the form of a gzip tarball. The resources that are provided and need to be moved to the Condor gateway are all transfered through the Storage System. They are stored once the plan is constructed and are then retrieved once the execution is started.

Plan Template

The entire execution process takes place in the Condor gateway node. This node is picked from the Information System and is currently chosen randomly from all the available ones. Currently once the node has been picked, the execution cannot be moved to a different one even if there is a problem communicating with that node. The execution that takes place is a series of steps executed sequentially. These steps include the following:

  • Contact the remote node
  • Retrieval of the data stored in the Storage System and these include the resources marked as Submit file, Input Data, and Executables
  • Submit the job using the provided job submission file
  • Go into a loop until either the job is completed or a timeout has expired (If a timeout has been set)
    • Wait for a defined period
    • Retrieve the job status
    • Optionally retrieve the job class-ad
    • Process the results of the above two steps
  • Check the reason the loop ended
  • If a timeout happened, cancel the job
  • If the job terminated successfully retrieve the workspace files of the job

Highlights

Execution timeout

A timeout can be set so that if after some period the execution is not completed, the job is canceled and the execution aborted

Polling interval

After the job is submitted, in regular intervals the status as well as optionally class ad info associated with the job is polled to decide on the way to continue the workflow and to report back on the progress. This interval is configurable by the client.

Processing filters

Submit filter

Taking advantage the extensible filtering capabilities of the Filters the WorkflowEngine defines an external filter that can process the output of the condor_submit and condor_submit_dag commands. This way, outputs of the following format can be parsed and the assigned job cluster can be extracted and recognized to be used for job status polling

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 43.

Job Status info filter

Taking advantage the extensible filtering capabilities of the Filters the WorkflowEngine defines an external filter that can process the output of the condor_q command. This way, outputs indicating the progress of specific clusters of jobs can be parsed to signal the continuation of the workflow

Job Class-ad info filter

Taking advantage the extensible filtering capabilities of the Execution Events mechanism, the WorkflowEngine defines an event emiting filter that processes the output of a condor_q -xml command for specific job clusters that will present the class ad of the job clusters to the submiter. Since the only available clusters of the submited jobs are the ones returned by condor_submit, in case of a dag job, the class ad of the dag manager is retrieved and that of the underlying jobs.

Usage

The following snippets demonstrate the usage of the adaptor.

As this is the "Hello World" equivalent for the WorkflowGridAdaptor, lets assume we want to execute the following script on the Grid:

#include <stdio.h>
/*simple.c*/
main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;
 
    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input      = atoi(argv[2]);
 
        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return failure;
}

The respective condor submission file to describe this job can be the following:

Universe   = vanilla
Executable = simple
Arguments  = 120 10
Log        = simple.log
Output     = simple.out
Error      = simple.error
Queue

To submit the job, firstly we will need to define the resources that are needed for this job. These resources can include one of the following types Submit - The condor submission file describing this job (mandatory, exactly one resource of this type must be present) InData - Files that need to be present in the Condor gateway node when the job submission takes place (optional, if not input files are needed for the job) Executable - The executable that is part of the job. The only difference between this resource and the InData resources is that using this resource, the +x flag will be attributed to the file to make it executable (optional) Command - Additional command that can be placed in the command line invocation of condor_submit instead of been placed inside the submission file (optional)

For our case, the following snippet will declare the resources we need. The first argument of the AttachedCondorResource is the name that we want the resource to have once it has been moved to the condor UI node for the resources of type Submit, InData and Executable. The second argument represents the path to the resource that we want to attach and is stored in the machine that we are running the adaptor code.

AdaptorCondorResources attachedResources=new AdaptorCondorResources();
attachedResources.Resources.add(new AttachedCondorResource("submit.condor","local path/submit.condor",AttachedCondorResource.ResourceType.Submit));
attachedResources.Resources.add(new AttachedCondorResource("simple","local path/simple",AttachedCondorResource.ResourceType.Executable));

To create the actual ExecutionPlan using the adaptor the following snippet is enough

WorkflowCondorAdaptor adaptor=new WorkflowCondorAdaptor();
adaptor.SetAdaptorResources(attachedResources);
adaptor.CreatePlan();

The plan that is now created can be submitted to the ExecutionEngine for execution and after the execution is completed, we can retrieve the output of the job, which in our case will be simple.out, simple.error and simple.log with the following snippet

ExecutionPlan plan=adaptor.GetCreatedPlan();
for(IOutputResource res : adaptor.GetOutput()){
  OutputSandboxCondorResource out=(OutputSandboxCondorResource)res;
  System.out.println("Output file "+out.Key+" is hosted in the StorageSystem with identifier "+plan.Variables.Get(out.VariableID));
}

Known limitations

Some of the known limitations of the currently created plan are listed below. These limitations are mainly because of the simplicity of the plan and not because off the lack of constructs to cover them. This list will be updated in later versions of the adaptor that will enrich the produced plan.

  • Condor gateway node selection
    Have a more elaborate selection strategy for the node submitting the jobs
  • Relocation of execution
    Once the UI node has been picked it cannot be moved. This means that if after picking it the node becomes unreachable, the whole workflow is lost. Allow relocation, cancellation of previous submission and resubmission, continue monitoring of previously submitted job, etc
  • Error Handling
    Now all errors are fatal. Be more resilient when errors are non critical
  • Monitoring in cases of DAG jobs for the underlying jobs and not just the dag manager job
  • Allow user cancellation
  • Output is retrieved as tar.gz archives and not as individual files. These include the entire workspace including the input files