Data Transfer Scheduler & Agent components

Overview

This class of components manages transfer capabilities among gCube infrastructure nodes. In particular, though not exclusively, it handles data transfers between Data Sources and Data Storages by exploiting the interfaces and services implemented under the Data Access and Storage Facilities subsystem.

This document outlines the design rationale, key features, and high-level architecture of these components, together with their deployment options and some representative use cases.

Key features

The components belonging to this class are responsible for:

reliable data transfer between Infrastructure Data Sources and Data Storages
achieved by exploiting the uniform access interfaces provided by gCube together with standard transfer protocols.
structured and unstructured Data Transfer
both Tree-based and File-based transfers are guaranteed, covering all possible iMarine use cases (see the sketch after this list).
transfers to local nodes for data staging
data staging for particular use cases can be enabled on each node of the infrastructure.
advanced transfer scheduling and transfer optimization
a dedicated gCube service is responsible for transfer scheduling, combined with transfer optimization at the level of protocols and access interfaces.
transfer statistics availability
transfers are logged by the system and made available to interested consumers.
transfer shares per scopes and users
a management interface is used to configure transfer shares per scope and user at the level of Data Sources and Data Storages.
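
The distinction between structured and unstructured transfers referenced above can be illustrated with a minimal sketch. The names below (TransferType, TransferRequest) are assumptions introduced for this example and are not part of the actual gDT interfaces.

  // Minimal sketch (assumed names, not the real gDT API): the two transfer
  // flavours offered by the components, modelled as a simple request type.
  public class TransferTypesSketch {

      enum TransferType {
          FILE_BASED,  // unstructured transfer: plain files / byte streams
          TREE_BASED   // structured transfer: tree-shaped records from a Data Source
      }

      static final class TransferRequest {
          final TransferType type;
          final String sourceId;   // gCube Data Source identifier
          final String storageId;  // gCube Data Storage identifier

          TransferRequest(TransferType type, String sourceId, String storageId) {
              this.type = type;
              this.sourceId = sourceId;
              this.storageId = storageId;
          }
      }

      public static void main(String[] args) {
          // A file-based transfer request between a source and a storage.
          TransferRequest request =
                  new TransferRequest(TransferType.FILE_BASED, "mySourceId", "myStorageId");
          System.out.println("Requested " + request.type + " transfer from "
                  + request.sourceId + " to " + request.storageId);
      }
  }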

Design

Philosophy

Data transfer on a distributed infrastructure has to guarantee, first of all, transfer reliability and optimization in terms of resource usage (minimizing network load while avoiding storage overload). In addition, compared to most solutions developed for data transfer, the designed solution has to take into account not only standard "unstructured" data transfer (file transfer) but also the "structured" data transfer capability peculiar to the iMarine data infrastructure.

Architecture

The main components forming this class of Data transfer facilities are two gCube services plus the related libraries.

The Data Transfer Scheduler service (gDT Scheduler)
The service is responsible for the transfer scheduling activity, delegating the transfer logic to, and spawning transfers on, the gDT Agents deployed on the infrastructure. It relies on Messaging to consume transfer results from the gDT Agents. The service has two main porttypes: one for transfer scheduling and one for the management of transfer shares per scopes and users.
The Data Transfer Scheduler DB interface (gDT Scheduler DB interface)
A component that models the gDT Scheduler DB, abstracting the particular DB technology underneath.
The Data Transfer Agent service (gDT Agent)
The component implementing the transfer logic. It accesses Data Sources through the interfaces made available by the gCube Data Access facilities and transfers data locally within the infrastructure by relying on gCube Data Storages. It handles several transfer protocols by exploiting the facilities provided by the gCube Result Set components. It relies on Messaging to publish transfer results, which are consumed by the gDT Scheduler. The service is also responsible for data staging to infrastructure nodes.
The Data Transfer Scheduler Library (gDT Scheduler Lib)
The library exploited by clients to schedule data transfers among Data Sources and Storages. It offers a uniform interface for both structured and unstructured data transfers (a usage sketch follows this list).
The Data Transfer Agent Library (gDT Agent Lib)
The library used to contact a gDT Agent and stage data to infrastructure nodes.
The Data Transfer widget (gDT Widget)
A GUI component that can be integrated into any gCube Portlet to enable Data Transfer facilities. It relies on both the gDT Scheduler Library and the gDT Agent Library.
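
A client-side sketch of how the gDT Scheduler Library might be used to schedule a transfer is given below. The SchedulerClient interface and its methods are assumptions made for illustration and do not reproduce the real library API.

  // Hypothetical usage sketch of the gDT Scheduler Lib (assumed API).
  public class ScheduleTransferSketch {

      /** Stand-in abstraction for the gDT Scheduler Lib client (assumed, not the real API). */
      interface SchedulerClient {
          /** Submits a transfer between a Data Source and a Data Storage; returns a transfer id. */
          String schedule(String sourceId, String storageId, boolean treeBased);
          /** Returns the last known status, as consumed from the gDT Agent via Messaging. */
          String getStatus(String transferId);
      }

      static void run(SchedulerClient scheduler) {
          // Schedule an unstructured (file-based) transfer; the scheduler delegates
          // the actual work to one of the gDT Agents deployed on the infrastructure.
          String transferId = scheduler.schedule("mySourceId", "myStorageId", false);

          // The client only tracks the outcome; results are published asynchronously
          // by the agent and consumed by the scheduler.
          System.out.println("Transfer " + transferId + " status: " + scheduler.getStatus(transferId));
      }

      public static void main(String[] args) {
          // Dummy in-memory implementation, only to make the sketch runnable.
          run(new SchedulerClient() {
              public String schedule(String sourceId, String storageId, boolean treeBased) {
                  return "transfer-42";
              }
              public String getStatus(String transferId) {
                  return "QUEUED";
              }
          });
      }
  }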

The following diagram depicts the dependencies among the described components:

Data Transfer Scheduler & Agent Architecture

Deployment

The components of the subsystem can be deployed, according to their nature, in different execution environments:

  • The gDT Widget can be included in any Web Application, therefore it is deployable on a WebApp container (Tomcat, Jetty, JBoss, ...).
  • The gDT Scheduler and Agent services can be deployed on gCube-enabled containers.
  • The gDT Scheduler and Agent Libraries can be integrated into other gCube Services or run as standalone libraries.

Large Deployment

When all the functionalities provided by the Data Transfer components are exploited in a wide-area infrastructure, the deployment of the components needs to be arranged as in the picture below. The gDT Scheduler service is deployed in conjunction with a series of gDT Agent services: the scheduler can dynamically fetch the information regarding the gDT Agents, Data Sources and Storages available on the infrastructure, and schedule an optimized transfer plan. The gDT Scheduler DB can be deployed either locally to the service or remotely.
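
As a purely illustrative sketch of the kind of decision taken in this deployment, the fragment below picks the least-loaded agent among those discovered on the infrastructure. The AgentInfo type and the load-based criterion are assumptions, not the documented scheduling algorithm of the gDT Scheduler.

  import java.util.Comparator;
  import java.util.List;
  import java.util.Optional;

  // Illustrative only: a possible agent-selection step when building an
  // optimized transfer plan. The real gDT Scheduler logic is not shown here.
  public class AgentSelectionSketch {

      static final class AgentInfo {
          final String endpoint;
          final int pendingTransfers;

          AgentInfo(String endpoint, int pendingTransfers) {
              this.endpoint = endpoint;
              this.pendingTransfers = pendingTransfers;
          }
      }

      /** Picks the least-loaded gDT Agent among those discovered in the current scope. */
      static Optional<AgentInfo> pickAgent(List<AgentInfo> discoveredAgents) {
          return discoveredAgents.stream()
                  .min(Comparator.comparingInt((AgentInfo a) -> a.pendingTransfers));
      }

      public static void main(String[] args) {
          List<AgentInfo> agents = List.of(
                  new AgentInfo("http://node1.example.org/gdt-agent", 3),
                  new AgentInfo("http://node2.example.org/gdt-agent", 1));

          pickAgent(agents).ifPresent(a ->
                  System.out.println("Would delegate the transfer to " + a.endpoint));
      }
  }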

Data Transfer Scheduler & Agent Large Deployment schema

Small Deployment

Data transfer can also be achieved by deploying a subset of the components described above. The gDT Agent service can be contacted by any client (through the gDT Agent Library) to move data from one local node to another, or from Data Sources to Data Storages, without the scheduling and optimization facilities offered by the gDT Scheduler.
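
A minimal sketch of this direct interaction, under the assumption of a hypothetical AgentClient abstraction (not the actual gDT Agent Library API), could look like this:

  // Hypothetical sketch of the small deployment: a client stages data onto an
  // infrastructure node by contacting a gDT Agent directly, bypassing the scheduler.
  public class DirectStagingSketch {

      /** Assumed client abstraction standing in for the gDT Agent Library. */
      interface AgentClient {
          /** Asks the agent to copy data from a Data Source to a path on its local node. */
          void stageTo(String sourceId, String nodeLocalPath);
      }

      static void run(AgentClient agent) {
          // Stage the content of a Data Source onto the node hosting the agent,
          // e.g. as input for a locally-running data process.
          agent.stageTo("mySourceId", "/tmp/staged-input");
      }

      public static void main(String[] args) {
          // Dummy implementation, only to make the sketch runnable.
          run((sourceId, nodeLocalPath) ->
                  System.out.println("Would stage " + sourceId + " to " + nodeLocalPath));
      }
  }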

The picture below describes this minimalistic deployment schema.

Data Transfer Agent Small Deployment schema

Use Cases

Well suited use cases

The Data Transfer use cases most suitable for the series of components that are part of this subsystem are the following:

  • Data Staging
  • Data Caching/Import

In the case of Data Staging, the components are exploited to move data to/from nodes of the infrastructure in order to implement complex data workflows or to stage data as input for data processes. This use case can be implemented by VO/VRE administrators who are aware of the data requirements, or by other components of the infrastructure. On the other hand, the possibility to move data "within" the infrastructure boundaries is a much-wanted use case, since it guarantees a performance boost for common infrastructure processes like Data Transformation, Search, etc.

Less well suited use cases

The components for Data Transfer described in this section are meant to be used to schedule and optimize transfers of heterogeneous data types. The emphasis is not on optimization in terms of data transfer rate, since that would require optimization at the level of Data Sources and Data Storages as well as at the network level (e.g. infrastructure nodes are not connected through dedicated networks).

In addition, when dealing with transfer scheduling, VO/VRE managers may need to consider scheduling overheads (e.g. transfer queues or transfer plan creation time), which advise against scheduling transfers of small quantities of data.