Tabular Data Flow Manager

Revision as of 11:25, 21 November 2013

The goal of this facility is to realise an integrated environment supporting the definition and management of workflows of tabular data. Each workflow consists of a number of tabular data processing steps where each step is realised by an existing service component conceptually offered by a gCube-based infrastructure.

In the following, the design rationale, key features, high-level architecture, as well as the deployment scenarios are described.

== Overview ==

The goal of this service is to offer facilities for tabular data workflow management, execution and monitoring. The workflow can involve a number of data manipulation steps, each performed by potentially different service components, to produce the desired output.

=== Key features ===

The subsystem provides for:

;hidden workflow
:Instead of being asked to describe the workflow itself, the user provides a table template, i.e. a set of properties the target table should comply with. The service translates templates into workflows;

;flexible and open workflow definition mechanism
:The set of workflow steps can be enriched, providing wider capabilities for template descriptiveness;

;user-friendly interface
:The subsystem offers a graphical user interface where users can define table templates. Moreover, the environment allows users to actually perform a workflow by applying a template to an imported table.
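The "hidden workflow" idea above can be sketched in a few lines of code. This is purely illustrative: the property names, the step structure, and the function are assumptions for the example, not the actual gCube API. A template lists properties the target table must satisfy, and the service emits one workflow step per property the table does not already meet.

```python
# Illustrative sketch only: names and structures are assumptions,
# not the real service interface.

def template_to_workflow(template, table_properties):
    """Translate a table template into an ordered list of workflow steps.

    Each template entry maps a property name to its required value;
    a step is emitted only for properties the table does not satisfy.
    """
    steps = []
    for prop, required in template.items():
        if table_properties.get(prop) != required:
            steps.append({"operation": "set-" + prop, "target": required})
    return steps

# Hypothetical template: the resulting table must use a given encoding
# and bind a 'country' column to a reference codelist.
template = {"encoding": "UTF-8", "country-codelist": "ISO-3166"}
current = {"encoding": "latin-1"}  # imported table's current properties

workflow = template_to_workflow(template, current)
# Two steps are produced, one per unmet property.
```

The user never sees these steps; they only describe the desired end state, which is what makes the workflow "hidden".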

== Design ==

=== Philosophy ===

Tabular Data Flow Manager offers a service for tabular data workflow creation, management and monitoring. The underlying idea is to give the service client the means to command multiple operations by providing a table template. A table template is defined in terms of a set of properties the table resulting from the workflow should comply with. Table templates can be created by the end user with the UI and saved for later reuse. Applying a template to a target table results in the materialisation of a set of workflow steps on the service, which can be monitored remotely. Each step is managed by a single software component which can also be invoked individually. This approach aims at maximising the exploitation and reuse of components offering data manipulation facilities.

=== Architecture ===

The subsystem comprises the following components:

* '''Flow Service''': a subset of Tabular Data Service functionality that allows workflow creation, management, execution and monitoring;

* '''Flow UI''': the user interface of this functional area. It provides users with a web-based interface for creating, executing and monitoring workflows;

* '''Workflow Orchestrator''': a service component that ''unpacks'' a table template into a sequence of operations to be performed on a target table;

* '''Operation modules''': a set of software modules, each one managing a specific operation (transformation, validation, import, export).
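The interplay between the Workflow Orchestrator and the operation modules can be sketched as follows. All module names, signatures, and the progress log are assumptions for illustration only; the real components are remote gCube services, not in-process functions.

```python
# Illustrative sketch only: module names and signatures are assumptions.
# The orchestrator dispatches each unpacked operation to its module and
# records progress so a client can monitor the workflow remotely.

def import_op(table, params):    # stand-in for the import operation module
    return dict(table, imported=True)

def validate_op(table, params):  # stand-in for the validation operation module
    return dict(table, valid=True)

MODULES = {"import": import_op, "validate": validate_op}

def run_workflow(table, steps, log):
    """Execute steps in order, appending one progress entry per step."""
    for step in steps:
        table = MODULES[step["op"]](table, step.get("params", {}))
        log.append(step["op"] + ": done")
    return table

log = []
result = run_workflow({"name": "catch_stats"},
                      [{"op": "import"}, {"op": "validate"}], log)
# 'log' now holds one progress entry per completed step.
```

Keeping each operation behind a uniform module interface is what allows a single step to be invoked individually, as noted in the Philosophy section.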

A diagram of the relationships between these components is reported in the following figure:

''Tabular Data Flow Manager, internal Architecture''

== Deployment ==

The service should be deployed on a single node along with the operation modules. The user interface can be deployed in the infrastructure portal along with the needed client library.

== Use Cases ==

=== Well suited Use Cases ===

This component fits well all cases where it is necessary to manage a defined flow of data manipulation steps. An example is a data flow that lets a user curate uncurated data provided periodically by a data provider: a set of default transformation and validation procedures is applied to each incoming chunk, and all the curated data chunks are merged together at the end of the process.
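The curation use case above can be sketched concretely. The transformation and validation rules here (trimming/lower-casing values, dropping empty rows) are invented stand-ins chosen only to make the chunk-then-merge shape of the flow visible.

```python
# Illustrative sketch of the periodic-curation use case; the concrete
# transformation and validation rules are assumptions for the example.

def transform(chunk):
    """Default transformation: normalise each value."""
    return [row.strip().lower() for row in chunk]

def validate(chunk):
    """Default validation: drop rows that are empty after transformation."""
    return [row for row in chunk if row]

def curate_and_merge(chunks):
    """Apply the same curation steps to every chunk, then merge the results."""
    curated = [validate(transform(c)) for c in chunks]
    merged = []
    for c in curated:
        merged.extend(c)
    return merged

# Two data deliveries from a provider, each curated independently,
# then merged at the end of the process.
chunks = [["  Cod ", ""], ["Tuna", " Herring "]]
merged = curate_and_merge(chunks)
# merged == ["cod", "tuna", "herring"]
```

Because every chunk passes through the same template-driven steps, the curation procedure is codified once and re-applied whenever new data arrives.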