Data Transformation Service Specification


Overview

The gCube Data Transformation Service (gDTS) is a fundamental part of the gCube Data Manipulation Facilities and benefits many areas of gCube. The presentation layer benefits from the production of alternative representations of multimedia documents: generating thumbnails, transforming objects to the specific formats required by some presentation applications, and presenting multimedia files at variable quality or bitrate are just some examples of useful transformations over multimedia documents. As a conversion tool for textual documents, it can offer online presentation of documents in HTML format as well as other downloadable formats such as PDF or PS. The Annotation UI can be implemented more straightforwardly on selected logical groups of content types (e.g. images), without caring about the details of the content or the support offered by browsers. Finally, by utilizing the functionality of the Metadata Broker, metadata with varying schemas can be homogenized.

Transformation processes are described by transformation programs, which are XML documents. In order to realize complex transformations, each transformation program can reference other transformation programs and use them as “black-box” components in the transformation process it defines.
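The black-box composition described above can be sketched as follows. This is a minimal illustration, not the gDTS program format: real transformation programs are XML documents, while the `ProgramRegistry` class, the program names, and the string-based steps here are hypothetical stand-ins chosen only to show one program reusing others by name.

```python
from typing import Callable, Dict, List, Tuple, Union

# A step is either a plain function or the name of another registered program.
Step = Union[str, Callable[[str], str]]


class ProgramRegistry:
    """Hypothetical in-memory stand-in for a set of transformation programs."""

    def __init__(self) -> None:
        self._programs: Dict[str, List[Step]] = {}

    def register(self, name: str, steps: List[Step]) -> None:
        self._programs[name] = steps

    def run(self, name: str, data: str, _active: Tuple[str, ...] = ()) -> str:
        # Guard against programs that reference each other in a cycle.
        if name in _active:
            raise ValueError(f"circular program reference: {name}")
        for step in self._programs[name]:
            if isinstance(step, str):
                # A referenced program is executed as an opaque unit.
                data = self.run(step, data, _active + (name,))
            else:
                data = step(data)
        return data


registry = ProgramRegistry()
registry.register("strip", [str.strip])
registry.register("upper", [str.upper])
# "normalize" reuses the two programs above without knowing their internals.
registry.register("normalize", ["strip", "upper", lambda s: s.replace(" ", "_")])

result = registry.run("normalize", "  hello world ")
```

The caller of `normalize` only sees its input and output; the referenced programs can be replaced without changing the program that uses them.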

Key features

Automatic transformation path identification
It provides automatic transformation discovery. Given the content type of a source object and the target content type, gDTS finds the appropriate transformation to use. In addition, gDTS can dynamically form a path of several transformation steps to produce the final format.
Fine-grained subtyping of formats
It offers extensive freedom in the supported types and in their parameters (e.g. resolution, fps).
Pluggable algorithms for content transformation
A wide variety of pluggable converters can transform digital objects between arbitrary content types. The available transformations are provided by the transformation programs plugged into the framework.
Exploitation of PE2ng Infrastructure
Integration with the PE2ng engine gives access to vast amounts of processing power and makes it possible to handle virtually any transformation task, making gDTS the standard Data Manipulation facility for gCube applications.
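To illustrate automatic transformation path identification, the sketch below runs a breadth-first search over a small graph of converters. It is only an assumption about how such a search could work, not the gDTS algorithm; the `CONVERTERS` table, the converter names, and the `find_path` function are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical converters: (source type, target type) -> converter name.
CONVERTERS: Dict[Tuple[str, str], str] = {
    ("application/msword", "text/html"): "doc2html",
    ("text/html", "application/pdf"): "html2pdf",
    ("text/html", "text/plain"): "html2text",
    ("image/tiff", "image/png"): "tiff2png",
}


def find_path(source: str, target: str) -> Optional[List[str]]:
    """Return the shortest chain of converter names from source to target.

    Returns an empty list when no conversion is needed and None when the
    target cannot be reached from the source.
    """
    if source == target:
        return []
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        current, path = queue.popleft()
        for (src, dst), name in CONVERTERS.items():
            if src != current or dst in visited:
                continue
            if dst == target:
                return path + [name]
            visited.add(dst)
            queue.append((dst, path + [name]))
    return None


word_to_pdf = find_path("application/msword", "application/pdf")
unreachable = find_path("image/png", "text/plain")
```

Here no single converter turns a Word document into PDF, so the search chains `doc2html` and `html2pdf`; a request with no connecting converters yields `None`.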

Design

Philosophy

gDTS proposes a transformation model that:

  • describes how to identify the format of documents
  • formalizes the transformations that can be performed by the available converters
  • assists in automatic selection of appropriate conversion elements applicable to each transformation
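The three aspects of the model listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `ContentType` and `Converter` classes, the `identify` and `select_converters` functions, and the converter names are hypothetical and not part of the gDTS API. Format identification is reduced to checking well-known file signatures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ContentType:
    """A MIME type plus free-form parameters (e.g. width, fps)."""
    mime: str
    params: Dict[str, str] = field(default_factory=dict)

    def satisfies(self, required: "ContentType") -> bool:
        """True if the MIME types match and every required parameter matches."""
        if self.mime != required.mime:
            return False
        return all(self.params.get(k) == v for k, v in required.params.items())


# Well-known leading byte signatures used to identify a format.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"\xff\xd8\xff": "image/jpeg",
}


def identify(data: bytes) -> Optional[str]:
    """Identify the format of a document from its leading bytes."""
    for signature, mime in SIGNATURES.items():
        if data.startswith(signature):
            return mime
    return None


@dataclass
class Converter:
    """Formal description of what a converter accepts and produces."""
    name: str
    accepts: ContentType
    produces: ContentType


def select_converters(converters: List[Converter],
                      source: ContentType,
                      target: ContentType) -> List[str]:
    """Select the converters applicable to a source/target pair."""
    return [c.name for c in converters
            if source.satisfies(c.accepts) and c.produces.satisfies(target)]


converters = [
    Converter("png2jpeg_thumb", ContentType("image/png"),
              ContentType("image/jpeg", {"width": "128"})),
    Converter("png2jpeg_full", ContentType("image/png"),
              ContentType("image/jpeg", {"width": "1024"})),
]
detected = identify(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
source = ContentType(detected, {"width": "2048"})
target = ContentType("image/jpeg", {"width": "128"})
selected = select_converters(converters, source, target)
```

Because parameters take part in the match, two converters producing the same MIME type remain distinguishable, which is the point of fine-grained subtyping.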

Architecture

The main components of the gDTS are the Data Transformation Service, Operators, Handlers, Programs, Workflow Adaptor and the distributed Execution engine. The architecture is shown in the following figure:

Figure: Data Transformation Architecture

The primary endpoint of gDTS is the Data Transformation Service, which acts as a planner: it creates execution plans by consulting the transformation graph and forwards the outcome of the transformation to the caller. The worker nodes that handle the actual transformation consist of the Data Transformation Operators. Whenever the planner encounters a new data type, it constructs a new execution plan corresponding to a transformation sequence for this type.

The DTS Operators use the DTS Programs and DTS Handlers to carry out the actual transformation process. All data transfer is done through gRS2.
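The planner behaviour described above, where a plan is constructed when a new data type is encountered, can be sketched as follows. Reusing the plan for later objects of the same type is an assumption of this sketch, as are the `Planner` class, the `build_plan` callback, and the operator names; none of them are taken from the gDTS code base.

```python
import re
from typing import Callable, Dict, List, Tuple

Plan = List[str]  # ordered operator names


class Planner:
    """Builds a plan the first time a (source, target) type pair is seen."""

    def __init__(self, build_plan: Callable[[str, str], Plan]) -> None:
        self._build_plan = build_plan
        self._plans: Dict[Tuple[str, str], Plan] = {}
        self.plans_built = 0

    def plan_for(self, source_type: str, target_type: str) -> Plan:
        key = (source_type, target_type)
        if key not in self._plans:
            # New data type encountered: construct a plan for it.
            self._plans[key] = self._build_plan(source_type, target_type)
            self.plans_built += 1
        return self._plans[key]


def transform_all(planner: Planner,
                  items: List[Tuple[str, str]],
                  target_type: str,
                  operators: Dict[str, Callable[[str], str]]) -> List[str]:
    """Apply the plan for each item's type, one operator after another."""
    results = []
    for source_type, payload in items:
        for step in planner.plan_for(source_type, target_type):
            payload = operators[step](payload)
        results.append(payload)
    return results


# Hypothetical plan builder and operator, for illustration only.
def build_plan(source_type: str, target_type: str) -> Plan:
    return {"text/html": ["strip_tags"], "text/plain": []}[source_type]


operators = {"strip_tags": lambda s: re.sub(r"<[^>]+>", "", s)}
planner = Planner(build_plan)
items = [("text/html", "<b>a</b>"), ("text/plain", "b"), ("text/html", "<i>c</i>")]
outputs = transform_all(planner, items, "text/plain", operators)
```

Three objects are transformed, but only two plans are built, because the second HTML object reuses the plan created for the first.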

Deployment

The deployment scheme must be chosen based on the nature of the transformation task.

Large Deployment

Transformation components are deployed over gCore containers. The gRS2 pipelining mechanism must also be present on each node. When transformation tasks arrive at a high frequency, multiple Execution Engine instances have to be deployed. Such instances can be co-deployed to minimize different types of overhead. A Planner instance can use only its co-deployed Execution Engine instance to generate execution plans. When a complex transformation yields a large execution plan, the plan can exploit the computational resources of all nodes where an Execution Engine is deployed. The following figure depicts a deployment configuration with multiple transformation paths deployed, where execution can be performed in a distributed manner, with each node exploiting its own computational capabilities as well as those of all other nodes. Deployment is usually performed at the VO level.

Figure: Data Transformation Large Deployment

Small Deployment

At a smaller scale, with a low frequency of incoming requests, a single node hosting all the corresponding components is preferable.

Figure: Data Transformation Small Deployment

Use Cases

The suitability of the gCube Data Transformation Service is based primarily on the characteristics of the underlying environment and on the transformation task that has to be performed.

Well suited Use Cases

The Search System provides an indexing mechanism for the documents it maintains. This may require transforming data from their source format to plain text so that they can be processed by the appropriate indexing components. The Data Transformation Service completes this task, offering high performance and scalability.

Less well suited Use Cases

Less well suited cases are those where the transformation task involves only a few data elements. In such cases the payoff from the service is small relative to its overhead, in contrast to heavy tasks.