Data Transfer 2

Overview

The implementation of reliable data transfer mechanisms between the nodes of a gCube-based Hybrid Data Infrastructure is one of the main objectives when dealing with large sets of multi-type datasets distributed across different repositories.

To promote efficient and optimized consumption of these data resources, a number of components have been designed to meet the data transfer requirements.

This document outlines the design rationale, key features, and high-level architecture of these components, along with their deployment options and some use cases.

Key features

The components belonging to this class are responsible for:

  • point-to-point transfer: direct transfer invocation to a gCube node;
  • reliable data transfer between Infrastructure Data Sources and Data Storages, by exploiting the uniform access interfaces provided by gCube and standard transfer protocols;
  • automatic transfer optimization, by exploiting the best available transfer options between the invoker and target nodes;
  • advanced and extensible post-transfer processing, with a plugin-oriented implementation to serve advanced use cases.

Design

Philosophy

In a Hybrid Data e-Infrastructure, transferring data between nodes is a crucial feature. Given the heterogeneous nature of datasets, applications, network capabilities, and restrictions, efficient data transfers can be problematic. Negotiating the most efficient available transfer method is essential to achieve global efficiency, while an extensible set of post-transfer operations lets applications serve specific needs alongside common ones. We wanted to provide a lightweight solution that could fit the evolving nature of the infrastructure and the needs of its communities.

Architecture

The architecture of Data Transfer is a simple client-server model, implemented by data-transfer-library (client) and data-transfer-service (server).

The service is a simple REST server, implemented as a gCube SmartGear web application. It is designed around the plugin design pattern, so that it can be easily extended without introducing hard dependencies between the architecture's components. It also exposes interfaces to:

  • report the node's capabilities in terms of available plugins and supported transfer methods (a sketch of such a query follows this list);
  • submit and monitor transfer requests.
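For illustration, the snippet below queries a node's capabilities over plain HTTP with the standard Java HttpClient. The host, port, and endpoint path are assumptions made for the sake of the example; the actual paths exposed by data-transfer-service may differ.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CapabilitiesQuery {

        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint: the actual path exposed by
            // data-transfer-service may differ.
            URI capabilities = URI.create(
                    "http://node.example.org:8080/data-transfer-service/capabilities");

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(capabilities)
                    .header("Accept", "application/json")
                    .GET()
                    .build();

            // The response is expected to describe the node's available
            // plugins and supported transfer methods.
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }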

The client is a Java library that:

  • contacts the server, asking for its capabilities, in order to select the most efficient transfer method between the two sides;
  • submits a transfer request through the selected transfer channel;
  • monitors its status and outcome.

Optionally, a transfer request can specify an ordered set of plugin invocations, which are executed after the transfer has completed. These executions are monitored by the client as part of the atomic transfer process, as sketched below.
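As a minimal sketch of this flow, the snippet below shows how a caller might negotiate capabilities, submit a transfer with a chained plugin invocation, and monitor the outcome. All class and method names here (DataTransferClient, TransferRequest, TransferTicket, the DECOMPRESS plugin id) are hypothetical and do not reflect the actual data-transfer-library API.

    // Hypothetical API, for illustration only.
    DataTransferClient client = DataTransferClient.connect(
            "http://target-node.example.org:8080/data-transfer-service");

    // 1. Ask the target node for its capabilities and let the library
    //    select the most efficient transfer method both sides support.
    Capabilities caps = client.getCapabilities();

    // 2. Submit the transfer, chaining an ordered set of post-transfer
    //    plugin invocations to run on the target node.
    TransferRequest request = TransferRequest.builder()
            .source("http://source-node.example.org/data/dataset.nc")
            .destinationFileName("dataset.nc")
            .addPluginInvocation("DECOMPRESS")  // hypothetical plugin id
            .build();
    TransferTicket ticket = client.submit(request);

    // 3. Poll until both the transfer and its post-processing plugins
    //    have completed: they form one atomic transfer process.
    while (!ticket.isDone()) {
        Thread.sleep(1000);
        ticket.refresh();
    }
    System.out.println("Outcome: " + ticket.getStatus());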

The picture below summarizes the described architecture.

Data Transfer Architecture

Deployment

As described in the Architecture section, the Data Transfer suite is composed of three main components:

  • data-transfer-library is a Java library that can be used by any other Java application; the application and the library must share the same classpath.
  • data-transfer-service is a gCube SmartGear REST service, so it needs to be hosted in a SmartGear installation.
  • plugin implementations are discovered at runtime by the service, so they need to be deployed alongside the data-transfer-service itself (see the sketch after this list).
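The source does not specify how plugins are discovered; the sketch below uses Java's standard ServiceLoader mechanism to illustrate the general idea of runtime discovery from the classpath. The TransferPlugin interface is hypothetical, not the service's actual SPI.

    import java.io.File;
    import java.util.ServiceLoader;

    // Hypothetical plugin contract; the real SPI defined by
    // data-transfer-service may differ.
    interface TransferPlugin {
        String id();
        void run(File transferredFile) throws Exception;
    }

    public class PluginDiscovery {
        public static void main(String[] args) {
            // Any plugin jar deployed alongside the service contributes
            // implementations that are found on the classpath at runtime,
            // without compile-time dependencies on them.
            for (TransferPlugin plugin : ServiceLoader.load(TransferPlugin.class)) {
                System.out.println("Found plugin: " + plugin.id());
            }
        }
    }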

Large Deployment

Small Deployment

Use Cases

For more information on how to interact with the Data Transfer service, see How to use Data Transfer 2.

Well suited use cases

Data Transfer is designed mainly for point-to-point transfers, so it is best suited to moving datasets to where they need to be stored and processed. The automatic negotiation of the transfer means mostly helps in situations where the remote nodes are in heterogeneous networks with their own restrictions (e.g. cloud computing). The implemented plugins can help to deploy and prepare data locally for storage technologies that have particular needs (e.g. Thredds).


Less well suited use cases

More advanced use cases, such as scheduled tasks and parallel transfers with strong requirements on data coherence between nodes, are beyond the scope of the current implementation and need to be managed by the user application.