Workflow Engine Specification

== Overview ==
The Workflow Engine is in charge of processing workflows, i.e. high level plans that bind together conceptual operations for the implementation of a task. As a workflow is a higher level sketch of the activity to be performed, potentially without any reference to executables or resource usage, the Workflow Engine operates on top of the [[Execution Engine Specification | ExecutionEngine]]. Its purpose is to abstract over the low level details that are needed by the ExecutionEngine and the Execution Plan it is provided with.
  
When the Workflow Engine is acting in the context of the gCube platform, it provides a gCube compliant Web Service interface. This Web Service acts as the front end not only to Workflow definition facilities, but is also the "face" of the component with respect to the gCube platform. The Running Instance profile of the service is the placeholder where the underlying Workflow Engine instance's Execution Environment Providers push information that needs to be made available to other engine instances. Configuration properties that are used throughout the Workflow Engine instance are retrieved by the appropriate technology specific constructs and are used to initiate services and providers once the service starts.
  
 
=== Key features ===
 
;Orchestration for external computational and storage infrastructures.
:Allows users and systems to exploit computational resources that reside on multiple eInfrastructures in a single complex process workflow.
 
;Native computational infrastructure provider and manager.
:Allows full exploitation of the computational capacity of the D4Science Infrastructure by executing tasks directly on the latter.
;Control and monitoring of a processing flow execution.
:PE2ng provides progress reporting and control constructs based on an event mechanism for tasks operating both on the D4Science and external infrastructures.
 
;Handling of data staging among different storage providers.
:All Workflow Engine Adaptors handle data staging in a transparent way, without requiring any kind of external intervention.
;Handling of data streaming among computational elements.
:PE2ng exploits the high throughput, point to point, on demand communication facilities offered by [[Result Set Components|gRS2]].
;Expressive and powerful execution plan language
:The execution plan elements comprising the language are expressive enough to describe virtually any computational task.
 
;Unbound extensibility via providers for integration with different environments.
:The system is designed in an extensible manner, allowing transparent integration with a variety of providers for storage, resource registries, reporting systems, etc., depending on the environment in which the engine is operating.
;Alignment with cloud computing principles
:Application as a Service (AaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS).
;User Interface
:PE2ng offers a User Interface to launch and monitor commonly used workflows and aims to provide a fully fledged Graphical User Interface for the graphical composition of arbitrary workflows.
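The event mechanism behind the progress reporting feature listed above can be illustrated with a minimal sketch. All names below (<code>ProgressListener</code>, <code>MonitoredTask</code>) are hypothetical, chosen only to show the pattern of listeners receiving progress events during task execution; they are not the actual PE2ng event API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of event-based progress reporting. The names are
// hypothetical and do not reflect the real PE2ng event API.
interface ProgressListener {
    void onProgress(String taskName, int percentDone);
}

class MonitoredTask {
    private final String name;
    private final List<ProgressListener> listeners = new ArrayList<>();

    MonitoredTask(String name) {
        this.name = name;
    }

    void addListener(ProgressListener listener) {
        listeners.add(listener);
    }

    // Simulate a task made of 'units' steps; after each step, every
    // registered listener is notified with the completion percentage.
    void run(int units) {
        for (int i = 1; i <= units; i++) {
            int percent = (i * 100) / units;
            for (ProgressListener listener : listeners) {
                listener.onProgress(name, percent);
            }
        }
    }
}
```

A monitoring console would simply register such a listener on each task it tracks, regardless of whether the task runs on the D4Science or on an external infrastructure.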
  
 
== Design ==
 
=== Philosophy ===
The Workflow Engine is designed to allow specific third party languages to be parsed and translated into internally used constructs. This task is performed by the Adaptors, which understand these kinds of languages. In this way, the Workflow Engine broadens its usability, since existing workflows already defined in third party languages can be easily incorporated. Additionally, the learning curve for anyone wishing to use the execution and workflow capabilities of the system is greatly reduced, as, depending on one's needs, one can simply focus on whichever of the existing supported languages best matches the job at hand. Furthermore, for the same language, more than one adaptor can be implemented, each offering a different type of functionality, either by modifying the semantics of the produced Execution Plan or even by incorporating external components into the evaluation.
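The translation role described above can be sketched as a simple interface. The names (<code>WorkflowAdaptor</code>, <code>ExecutionPlan</code>, <code>LineListAdaptor</code>) and the toy line-oriented "language" are illustrative assumptions, not the real gCube interfaces; the sketch only shows the pattern of parsing a third party definition into an internal plan.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the adaptor pattern: each adaptor understands
// one third party workflow language and translates it into the internal
// plan representation. Names are illustrative, not the real gCube API.
class ExecutionPlan {
    final List<String> steps = new ArrayList<>();
}

interface WorkflowAdaptor {
    // Parse a workflow definition written in the adaptor's language and
    // produce a plan that the execution layer could evaluate.
    ExecutionPlan createPlan(String definition);
}

// Toy adaptor for a line-oriented "language": every non-empty line of
// the definition becomes one step of the resulting plan.
class LineListAdaptor implements WorkflowAdaptor {
    @Override
    public ExecutionPlan createPlan(String definition) {
        ExecutionPlan plan = new ExecutionPlan();
        for (String line : definition.split("\n")) {
            if (!line.trim().isEmpty()) {
                plan.steps.add("execute: " + line.trim());
            }
        }
        return plan;
    }
}
```

Two adaptors for the same language would share the parsing step but differ in the plan they emit, which is how different semantics can be offered for the same input.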
Since the Workflow Engine is a constituent part of the PE2ng system, it is designed with extensibility and adaptability in mind, being able to operate in a variety of environments. For example, in the context of gCube, the gCube Storage, Information and Account Systems are used.
  
 
=== Architecture ===

The Workflow Engine comprises a number of adaptors. The following list of adaptors is currently provided:

;WorkflowJDLAdaptor
:This adaptor parses a Job Description Language (JDL) definition block and translates the described job or DAG of jobs into an Execution Plan which can be submitted to the ExecutionEngine for execution. The plan is executed directly on the working nodes of the D4Science Infrastructure, in this way exploiting the computational resources of the native infrastructure.

;WorkflowGridAdaptor
:This adaptor constructs an Execution Plan that can contact a gLite UI node, then submit, monitor and retrieve the output of a grid job.

;WorkflowCondorAdaptor
:This adaptor constructs an Execution Plan that can contact a Condor gateway node, then submit, monitor and retrieve the output of a Condor job.

;WorkflowHadoopAdaptor
:This adaptor constructs an Execution Plan that can contact a Hadoop UI node, then submit, monitor and retrieve the output of a MapReduce job.
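As an illustration of the kind of input the WorkflowJDLAdaptor parses, a minimal JDL job description in the usual gLite ClassAd style might look like the following (the attribute values are a generic example, not taken from an actual gCube workflow):

```
[
  Type = "Job";
  Executable = "/bin/hostname";
  StdOutput = "std.out";
  StdError = "std.err";
  OutputSandbox = {"std.out", "std.err"};
]
```

The adaptor would translate such a block, or a DAG of such jobs, into the corresponding Execution Plan nodes.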
 
In addition to the above Adaptors, which are all contactable through the Web Service interface of the Workflow Engine, two more adaptors have been implemented to satisfy the special purpose requirements of certain applications. These adaptors are contactable only through the client services which employ them and are the following:
 
;WorkflowSearchAdaptor
:This adaptor is used by the [[Search Planning and Execution Specification|planner]] of the gCube Search System in order to translate abstract search plans into concrete Execution Plans, which materialize search tasks executed on the native infrastructure.

;WorkflowDTSAdaptor
:This adaptor is defined and used by the [[Data Transformation Service Specification | Data Transformation Service]] and translates abstract data transformation paths into concrete Execution Plans which materialize transformation tasks and are executed on the native infrastructure.
  
 
== Deployment ==
The deployment of the Workflow Engine is tightly coupled with that of the [[Execution Engine Specification | Execution Engine]]. Specifically, the Execution Engine should be deployed on each node on which the Workflow Engine is deployed, regardless of whether the execution takes place locally or on one or more remote nodes. An important point in the deployment of the Workflow Engine is that, in order for the latter to be discoverable in the infrastructure, it should be deployed as a gCube Web Service. Special purpose adaptors can either be packaged in their own libraries or become part of the core Workflow Engine library (as of now, the former is the preferred method). In either case, only the packaging unit containing the adaptor should be deployed on such nodes, unless processing of external workflows is also desirable, or the application itself mandates service deployment.
  
 
=== Large deployment ===
When the rate of workflow processing tasks to be serviced is expected to be high, more than one Workflow Engine instance, along with its Web Service interface, should be deployed in the infrastructure.
[[File:WorkflowEngine_LargeDeployment.jpg|700px|center|Workflow Engine large deployment]]
  
 
=== Small deployment ===
In infrastructures with low requirements for workflow processing, a single Workflow Engine instance can be deployed. One can even avoid deploying Workflow Engine instances entirely if the needs for workflow processing are limited to special purpose applications which employ their own locally deployed libraries.
 
[[File:WorkflowEngine_SmallDeployment.jpg|700px|center|Workflow Engine small deployment]]
 
== Use Cases ==
 
  
 
=== Well suited Use Cases ===
*Execution of JDL based workflows in the D4Science and gLite Infrastructures.
**The Workflow Engine has been successfully used to execute a number of such jobs, such as OCRing, Reference Extraction and Text Extraction.
*Execution of arbitrary workflows in the D4Science Infrastructure.
**Through the corresponding adaptors, the Workflow Engine has been successfully used by the gCube Search System and the gCube Data Transformation Service as the means of translating their workflows to parallel distributed Execution Plans.
*Execution of workflows targeting the Hadoop infrastructure.
**An example of such an application is Bibliometric Indexing.
  
 
=== Less well suited Use Cases ===
The Workflow Engine does not currently offer advanced features for automatic task parallelization, meaning that this logic must be integrated directly into the implementations of Adaptors. This not only makes the implementation of such Adaptors a non-trivial task, but also means that such logic must be implemented for each case separately. Although this has not posed a serious problem up until now, due to the nature of the tasks for which Adaptors were employed, this kind of functionality will be included in the next major version of the Engine, becoming one of its core parts alongside the set of Adaptors.
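The per-adaptor parallelization logic mentioned above usually boils down to partitioning the input of a task and emitting one plan branch per partition. A minimal sketch of that partitioning step follows; the names (<code>ParallelPlanner</code>, <code>partition</code>) are hypothetical and not part of the actual Engine API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the per-adaptor parallelization logic: split the
// inputs of a task into roughly equal partitions, one per parallel branch
// of the produced plan. Names are illustrative, not the real Engine API.
class ParallelPlanner {
    static List<List<String>> partition(List<String> inputs, int branches) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < branches; i++) {
            parts.add(new ArrayList<>());
        }
        // Round-robin assignment keeps partition sizes within one element
        // of each other.
        for (int i = 0; i < inputs.size(); i++) {
            parts.get(i % branches).add(inputs.get(i));
        }
        // Drop empty partitions when there are fewer inputs than branches.
        parts.removeIf(List::isEmpty);
        return parts;
    }
}
```

Making such a step a core, reusable part of the Engine, rather than re-implementing it in every Adaptor, is precisely what the planned next major version aims at.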

Revision as of 20:45, 30 April 2012
