WP9 iMarine Data Management Software Consolidated Specifications

Revision as of 12:26, 8 January 2014

Overview

gCube is a software suite equipped with a rich array of services capable of interfacing with data sources of different characteristics, both in terms of the data types these sources offer (e.g. document, statistical, biodiversity, and semantic data - see Data Access and Storage Facilities) and the heterogeneity of data belonging to the same type.

The goal of these specifications is to deal with this heterogeneity and provide unified views over diverse data items through a number of dedicated services. To meet this goal, a number of components have been designed.

This page outlines the design rationale and high-level architecture of such components.

Key Features

The components of this subsystem provide the following key features:

uniform model and access API over structured data

   dynamically pluggable architecture of model and API transformations to and from internal and external data sources; 
   plugins for document, biodiversity, statistical, and semantic data sources, including sources with custom APIs and standards-based APIs; 
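
The dynamically pluggable architecture above can be pictured, in simplified form, as a registry of source-specific transformation plugins. The sketch below is plain Java with purely illustrative names, not the actual gCube plugin API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative: each plugin adapts an external source's native
// model to a common internal representation (here, a String).
interface SourcePlugin {
    String sourceType();                       // e.g. "oai", "sparql"
    String toInternalModel(String nativeRecord);
}

class PluginRegistry {
    private final Map<String, SourcePlugin> plugins = new HashMap<>();

    // Plugins can be registered dynamically, at any point in the service lifetime.
    void register(SourcePlugin p) { plugins.put(p.sourceType(), p); }

    Optional<SourcePlugin> lookup(String sourceType) {
        return Optional.ofNullable(plugins.get(sourceType));
    }
}

public class PluggableAccessDemo {
    // Route a native record through the plugin registered for its source type.
    static String adapt(PluginRegistry r, String type, String record) {
        return r.lookup(type)
                .map(p -> p.toInternalModel(record))
                .orElseThrow(() -> new IllegalArgumentException("no plugin for " + type));
    }

    public static void main(String[] args) {
        PluginRegistry registry = new PluginRegistry();
        registry.register(new SourcePlugin() {
            public String sourceType() { return "oai"; }
            public String toInternalModel(String rec) { return "tree[" + rec + "]"; }
        });
        System.out.println(adapt(registry, "oai", "record-1")); // prints tree[record-1]
    }
}
```

The point of the registry indirection is that new source types can be supported without touching the access service itself, which is what "dynamically pluggable" implies.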

fine-grained access to structured data

   horizontal and vertical filtering based on pattern matching; 
   URI-based resolution; 
   in-place remote updates; 
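
Horizontal filtering selects among sibling nodes, while vertical filtering restricts which root-to-leaf paths survive. The sketch below illustrates the distinction over a generic tree; the `Node` type and method names are hypothetical, not the gCube tree model:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal tree node: a label plus children (illustrative only).
class Node {
    final String label;
    final List<Node> children = new ArrayList<>();
    Node(String label) { this.label = label; }
    Node add(Node c) { children.add(c); return this; }
}

public class TreeFilterDemo {
    // Horizontal filtering: keep only children whose label matches the pattern.
    static List<Node> horizontal(Node parent, Predicate<String> pattern) {
        List<Node> out = new ArrayList<>();
        for (Node c : parent.children)
            if (pattern.test(c.label)) out.add(c);
        return out;
    }

    // Vertical filtering: project the tree to root-to-leaf paths whose
    // labels satisfy the pattern at every level.
    static List<String> vertical(Node node, Predicate<String> pattern, String path) {
        List<String> out = new ArrayList<>();
        if (!pattern.test(node.label)) return out;
        String p = path.isEmpty() ? node.label : path + "/" + node.label;
        if (node.children.isEmpty()) { out.add(p); return out; }
        for (Node c : node.children) out.addAll(vertical(c, pattern, p));
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node("doc")
            .add(new Node("title"))
            .add(new Node("metadata").add(new Node("author")));
        System.out.println(horizontal(root, l -> l.startsWith("m")).size()); // prints 1
    }
}
```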

scalable access to structured data

   autonomic service replication with infrastructure-wide load balancing; 

efficient and scalable storage of structured data

   based on graph database technology; 

rich tooling for client and plugin development

   high-level Java APIs for service access; 
   DSLs for pattern construction and stream manipulations; 

remote viewing mechanisms over structured data

   “passive” views based on arbitrary access filters; 
   dynamically pluggable architecture of custom view management schemes; 

uniform modelling and access API over document data

   rich descriptions of document content, metadata, annotations, parts, and alternatives; 
   transformations from model and API of key document sources, including OAI providers; 
   high-level client APIs for model construction and remote access; 

uniform modelling and access API over semantic data

   tree-views over RDF graph data; 
   transformations from model and API of key semantic data sources, including SPARQL endpoints; 

uniform modelling and access over biodiversity data

   access API tailored to biodiversity data sources; 
   dynamically pluggable architecture of transformations from external sources of biodiversity data; 
   plugins for key biodiversity data sources, including OBIS, GBIF and Catalogue of Life; 

efficient and scalable storage of files

   unstructured storage back-end based on MongoDB, providing replication and high availability, automatic balancing under changes in load and data distribution, data sharding, and no single point of failure; 
   no intrinsic upper bound on file size; 

standards-based and structured storage of files

   POSIX-like client API; 
   support for hierarchical folder structures; 
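
As a loose analogy, a POSIX-like client API with hierarchical folders behaves much like Java's own `java.nio.file` facilities. The sketch below uses only the standard library, not the gCube storage API, to illustrate the kind of operations implied: create nested folders, write a file, read it back:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PosixLikeDemo {
    // Create a nested folder, write a file, and read it back --
    // POSIX-like operations analogous to those the storage facility offers.
    static String roundTrip(String content) {
        try {
            Path root = Files.createTempDirectory("storage-demo");
            Path folder = Files.createDirectories(root.resolve("projects/imarine"));
            Path file = folder.resolve("notes.txt");
            Files.writeString(file, content);
            return Files.readString(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("structured file storage")); // prints the content back
    }
}
```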

The components of this subsystem provide the following key features:

Point-to-point transfer
one writer, one reader as the core functionality
Produce only what is requested
a producer-consumer model that blocks when needed and reduces unnecessary data transfers
Intuitive stream and iterator based interface
simplified usage with reasonable default behavior for common use cases and a variety of features for increased usability and flexibility
Multiple protocols support
data transfer currently supports the following protocols: TCP and HTTP
HTTP Broker Servlet
transfer results are exposed as an HTTP endpoint
Reliable data transfer between Infrastructure Data Sources and Data Storages
by exploiting the uniform access interfaces provided by gCube
Structured and unstructured data transfer
both Tree-based and File-based transfer, covering all possible use cases
Transfers to local nodes for data staging
data staging for particular use cases can be enabled on each node of the infrastructure
Advanced transfer scheduling and transfer optimization
a dedicated gCube service responsible for data transfer scheduling and transfer optimization
Transfer statistics availability
transfers are logged by the system and made available to interested consumers.
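
The produce-only-what-is-requested behaviour can be sketched with a bounded blocking queue: the single writer blocks once a small buffer is full, so production is paced by the single reader's demand. This is plain Java, not the actual gCube streams library:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PointToPointDemo {
    // One writer, one reader over a capacity-2 channel: the writer blocks
    // when two unread items are queued, so no unrequested data accumulates.
    static int transfer(int n) {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(2);
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++)
                    channel.put("record-" + i);   // blocks when the queue is full
                channel.put("EOF");               // end-of-stream marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        int count = 0;
        try {
            // Iterator-style consumption: take() blocks until data arrives.
            for (String item = channel.take(); !item.equals("EOF"); item = channel.take())
                count++;
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(transfer(5)); // prints 5
    }
}
```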
workflow-oriented tabular data manipulation
user-defined specification and execution of workflows of data manipulation steps
rich array of data manipulation facilities offered 'as-a-Service'
rich array of data mining facilities offered 'as-a-Service'
rich array of data visualisation facilities offered 'as-a-Service'
reference-data management support
uniform model for reference-data representation including versioning and provenance
data curation and enrichment support
species occurrence data enrichment with environmental data dynamically acquired by data providers
data provenance recording
standard-based data presentation
OGC standard-based Geospatial data presentation
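
The workflow-oriented tabular data manipulation above can be modelled, in miniature, as a composition of table-to-table functions, so user-defined workflows are just chains of steps. The sketch below is plain Java with illustrative column names ("catch", "bycatch"), not the actual tabular data service:

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TabularWorkflowDemo {
    // A "table" is a list of rows (column name -> value); each workflow
    // step is a table-to-table function, so steps compose with andThen().
    static Function<List<Map<String, Integer>>, List<Map<String, Integer>>>
            filterStep(String col, int min) {
        return table -> table.stream()
                .filter(r -> r.get(col) >= min)
                .collect(Collectors.toList());
    }

    static Function<List<Map<String, Integer>>, List<Map<String, Integer>>>
            deriveStep(String newCol, String a, String b) {
        return table -> table.stream().map(r -> {
            Map<String, Integer> out = new HashMap<>(r);
            out.put(newCol, r.get(a) + r.get(b));  // derived column: a + b
            return out;
        }).collect(Collectors.toList());
    }

    static List<Map<String, Integer>> runWorkflow(List<Map<String, Integer>> table) {
        // A user-defined workflow: filter rows, then derive a new column.
        return filterStep("catch", 10)
                .andThen(deriveStep("total", "catch", "bycatch"))
                .apply(table);
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> table = List.of(
            Map.of("catch", 12, "bycatch", 3),
            Map.of("catch", 5,  "bycatch", 1));
        System.out.println(runWorkflow(table)); // one row survives, with total = 15
    }
}
```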

Main Components

Tabular Data
this family of components provides:
Time Series
this family of components provides:
  • Time Series: a service for performing assessment and harmonization on time series.
  • Codelist Manager: a library for performing import, harmonization and curation of code lists.
Biodiversity Data
this family of components provides: