WorkflowCondorAdaptor
Overview
This adaptor, as part of the adaptors offered by the WorkflowEngine, constructs an Execution Plan that mediates the submission of a job described through a Condor submit file using a Condor gateway node. After its submission, the job is monitored for its status and, once it has completed, the workspace contents are retrieved and stored in the StorageSystem in the form of a gzip tarball. The resources that are provided and need to be moved to the Condor gateway are all transferred through the StorageSystem. They are stored once the plan is constructed and are retrieved once the execution is started.
Plan Template
The entire execution process takes place on the gLite Grid UI node. This node is picked from the InformationSystem and is currently chosen randomly from all the available ones. Once the node has been picked, the execution cannot currently be moved to a different one, even if there is a problem communicating with that node. The execution consists of the following steps, performed sequentially:
- Contact the remote node
- Retrieve the data stored in the StorageSystem; this includes the resources marked as Configuration, Input Data, and JDL description
- Submit the job using the provided JDL file, the provided user proxy certificate, and optionally any additional configuration provided
- Loop until either the job is completed or a timeout has expired (if a timeout has been set):
  - Wait for a defined period
  - Retrieve the job status
  - Retrieve the job logging info
  - Process the results of the above two steps
- Check the reason the loop ended:
  - If the timeout expired, cancel the job
  - If the job terminated successfully, retrieve the output files of the job
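The monitoring loop at the heart of the steps above can be sketched as follows. This is a minimal illustration only; all names are hypothetical and not the adaptor's actual API, and the real plan performs the status and logging-info retrieval through glite commands on the remote node.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the poll-until-done-or-timeout step of the plan.
public class JobMonitorSketch {
    public enum Status { RUNNING, DONE, ABORTED }

    /** Polls the job until it leaves the RUNNING state or the timeout
        expires; returns null on timeout so the caller can cancel the job. */
    public static Status monitor(Supplier<Status> poll,
                                 long pollIntervalMillis, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            Status s = poll.get();                // status + logging-info retrieval
            if (s != Status.RUNNING) {
                return s;                         // job completed or aborted
            }
            try {
                Thread.sleep(pollIntervalMillis); // wait for the defined period
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;                              // timeout expired: cancel the job
    }
}
```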
Highlights
Error tolerance
While the execution engine is polling the progress of the job, a configurable tolerance factor can be applied. Should any problem arise during either status polling or logging-info polling, the adaptor can be set to retry the commands a configurable number of times, waiting a configurable period between attempts.
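A minimal sketch of such a retry wrapper is shown below. The names and signatures are illustrative assumptions, not the adaptor's actual API:

```java
import java.util.function.Supplier;

// Hypothetical retry helper illustrating the tolerance factor described above.
public class RetrySketch {
    /** Runs the command up to 1 + maxRetries times, waiting delayMillis
        between attempts; rethrows the last failure if all attempts fail. */
    public static <T> T withRetries(Supplier<T> command,
                                    int maxRetries, long delayMillis) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return command.get();      // e.g. a status or logging-info poll
            } catch (RuntimeException e) {
                last = e;                  // remember the failure, then wait
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;
    }
}
```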
Execution timeout
A timeout can be set so that, if the execution has not completed after the given period, the job is canceled and the execution aborted.
Polling interval
After the job is submitted, the status and the logging info associated with the job are polled at regular intervals to decide how to continue the workflow and to report back on the progress. This interval is configurable by the client.
Processing filters
Status filter
Taking advantage of the extensible filtering capabilities of the Filters, the WorkflowEngine defines an external filter that can process the output of the glite-wms-job-status command. This way, outputs of the following format can be parsed and the status of the job can be extracted and recognized as one of the known statuses (Submitted, Waiting, Ready, Scheduled, Running, Done, Cleared, Aborted, Canceled):
*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://some_host:some_port/some_identifier
Current Status:     Done (Success)
Logged Reason(s):
    - Job got an error while in the CondorG queue.
    - Job terminated successfully
Exit code:          0
Status Reason:      Job terminated successfully
Destination:        some_host:some_port/some_job_manager
Submitted:          some_date
*************************************************************
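For illustration, the essential work of such a filter, pulling the status token out of the command output, can be sketched as follows. This is a simplified stand-in, not the actual filter; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of status extraction from glite-wms-job-status output.
public class StatusFilterSketch {
    // Matches lines such as "Current Status:     Done (Success)".
    private static final Pattern STATUS =
            Pattern.compile("Current Status:\\s*(\\w+)");

    /** Returns the status token (e.g. "Done"), or null if none is found. */
    public static String extractStatus(String output) {
        Matcher m = STATUS.matcher(output);
        return m.find() ? m.group(1) : null;
    }
}
```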
Logging info filter
Taking advantage of the extensible filtering capabilities of the Filters, the WorkflowEngine defines an external filter that can process the output of the glite-wms-job-logging-info command. This way, outputs of the following format can be parsed:
**********************************************************************
LOGGING INFORMATION:

Printing info for the Job: https://some_host:some_port/some_identifier

- - -
Event: RegJob
- source    = UserInterface
- timestamp = Fri Feb 20 10:30:16 2004
- - -
Event: Transfer
- destination = NetworkServer
- result      = START
- source      = UserInterface
- timestamp   = Fri Feb 20 10:30:16 2004
- - -
Event: Transfer
- destination = NetworkServer
- result      = OK
- source      = UserInterface
- timestamp   = Fri Feb 20 10:30:19 2004
- - -
Event: Accepted
- source    = NetworkServer
- timestamp = Fri Feb 20 10:29:17 2004
- - -
Event: EnQueued
- result    = OK
- source    = NetworkServer
- timestamp = Fri Feb 20 10:29:18 2004
[...]
**********************************************************************
The filter scans through the output and parses it into a series of records/events. Every event extracted is sent back to the caller using the Execution Events.
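A simplified sketch of this record extraction is shown below. The names are hypothetical and this is not the actual filter, which additionally routes every extracted event back through the Execution Events.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of splitting glite-wms-job-logging-info output
// into one record per "Event:" block.
public class LoggingInfoFilterSketch {
    /** Parses "Event: X" blocks followed by "- key = value" attribute lines. */
    public static List<Map<String, String>> parseEvents(String output) {
        List<Map<String, String>> events = new ArrayList<>();
        Map<String, String> current = null;
        for (String rawLine : output.split("\n")) {
            String line = rawLine.trim();
            if (line.startsWith("Event:")) {
                // Start a new record named after the event.
                current = new LinkedHashMap<>();
                current.put("event", line.substring("Event:".length()).trim());
                events.add(current);
            } else if (current != null && line.startsWith("-") && line.contains("=")) {
                // Attribute line: "- key = value" (separator lines have no '=').
                String[] kv = line.substring(1).split("=", 2);
                current.put(kv[0].trim(), kv[1].trim());
            }
        }
        return events;
    }
}
```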
Usage
The following snippets demonstrate the usage of the adaptor.
As this is the "Hello World" equivalent for the WorkflowGridAdaptor, let's assume we want to execute the following script on the Grid:
#!/bin/sh
# test.sh
echo $*
The respective JDL file describing this job can be the following:
# test.jdl
Executable = "test.sh";
Arguments = "Hello world of D4Science!";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"test.sh"};
OutputSandbox = {"std.out", "std.err"};
VirtualOrganisation = "d4science.research-infrastructures.eu";
Should we want to override the default WMS configuration in the Grid UI node, we could also include a configuration file such as the following:
[
    WmsClient = [
        virtualorganisation = "d4science.research-infrastructures.eu";
        requirements = other.GlueCEStateStatus == "Production";
        WMProxyEndpoints = {"https://some_wms_endpoint:some_port/glite_wms_wmproxy_server"};
        ListenerStorage = "/tmp";
        ErrorStorage = "/tmp";
        ShallowRetryCount = 10;
        AllowZippedISB = false;
        PerusalFileEnable = false;
        rank = -other.GlueCEStateEstimatedResponseTime;
        OutputStorage = "/tmp";
        RetryCount = 3;
        MyProxyServer = "some_myproxy_server";
    ];
]
To submit the job, we first need to define the resources needed for it. These resources can be of the following types:
- JDL - The jdl file that describes the job (mandatory, exactly one resource of this type must be present)
- UserProxy - the user proxy certificate (mandatory, exactly one resource of this type must be present)
- Config - The WMS configuration file we may want to use as an override (optional; at most one resource of this type may be present)
- InData - Files that need to be present in the Grid UI node when the job submission takes place (optional if no input files are needed for the job)
- OutData - Files that we want to be retrieved after the execution of the job (optional if no output needs to be retrieved)
For our case, the following snippet declares the resources we need. For resources of type JDL, UserProxy, Config, and InData, the first argument of the AttachedGridResource is the name the resource will have once it has been moved to the Grid UI node. For OutData resources, it is the name of the output file to be retrieved when glite-wms-job-output is called, so that the engine retrieves only the output files we want. The second argument is the path to the resource we want to attach, as stored on the machine where the adaptor code runs.
AdaptorGridResources attachedResources = new AdaptorGridResources();
attachedResources.Resources.add(new AttachedGridResource("test.jdl", "local path/test.jdl", AttachedGridResource.ResourceType.JDL));
attachedResources.Resources.add(new AttachedGridResource("userProxy", "local path/userProxy", AttachedGridResource.ResourceType.UserProxy));
attachedResources.Resources.add(new AttachedGridResource("wms.test.conf", "local path/wms.test.config", AttachedGridResource.ResourceType.Config));
attachedResources.Resources.add(new AttachedGridResource("test.sh", "local path/test.sh", AttachedGridResource.ResourceType.InData));
attachedResources.Resources.add(new AttachedGridResource("std.err", null, AttachedGridResource.ResourceType.OutData));
attachedResources.Resources.add(new AttachedGridResource("std.out", null, AttachedGridResource.ResourceType.OutData));
To create the actual ExecutionPlan using the adaptor, the following snippet is enough:
WorkflowGridAdaptor adaptor = new WorkflowGridAdaptor();
adaptor.SetAdaptorResources(attachedResources);
adaptor.CreatePlan();
The plan that is now created can be submitted to the ExecutionEngine for execution. After the execution is completed, we can retrieve the output of the job, which in our case will be std.err and std.out, with the following snippet:
ExecutionPlan plan = adaptor.GetCreatedPlan();
for (IOutputResource res : adaptor.GetOutput()) {
    OutputSandboxGridResource out = (OutputSandboxGridResource) res;
    System.out.println("Output file " + out.Key + " is hosted in the StorageSystem with identifier " + plan.Variables.Get(out.VariableID));
}
Known limitations
Some of the known limitations of the currently created plan are listed below. These limitations exist mainly because of the simplicity of the plan and not because of a lack of constructs to cover them. This list will be updated in later versions of the adaptor, which will enrich the produced plan.
- gLite Grid UI node selection
- Have a more elaborate selection strategy for the node submitting the jobs
- Relocation of execution
- Once the UI node has been picked, the execution cannot be moved. This means that if the node becomes unreachable after it has been picked, the whole workflow is lost. Allow relocation, cancellation of a previous submission and resubmission, continued monitoring of a previously submitted job, etc.
- Error Handling
- Currently all errors are fatal. Be more resilient when errors are non-critical
- Add SSL option in communication
- Allow multiple JDLs and collection style submission
- Cleanup files stored in StorageSystem as intermediate steps after completion
- Allow user cancellation
- There is currently no web service interface to access the adaptor. The creation of the plan is done through command line utilities.
- The JDL resource also needs to be declared as an InData resource in order to be moved to the UI node