Tree-Based Access

From Gcube Wiki
Revision as of 22:16, 23 February 2012 by Fabio.simeoni (Talk | contribs) (Overview)

A cluster of components within the system focuses on uniform access to, and storage of, structured data of arbitrary semantics, origin, and size. These components form a distinct subset of the Data Access and Storage subsystem.

This document outlines their design rationale, key features, and high-level architecture as well as the options for their deployment.

Overview

Access and storage of structured data can be provided under a uniform model of labelled trees and through a remote API of read and write operations over such trees.
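To make the model concrete, the following is a minimal sketch of an edge-labelled, node-attributed tree in Java. All class and method names here (`TreeModelSketch`, `Node`, `Edge`, `add`, `children`) are hypothetical and chosen for illustration only; they are not taken from the actual libraries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: these names are hypothetical and are not the API
// of the 'trees' library described in this document.
class TreeModelSketch {

    /** A node carries attributes; a leaf additionally carries a value. */
    static class Node {
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<Edge> edges = new ArrayList<>();
        final String value; // null for inner nodes

        private Node(String value) { this.value = value; }

        static Node inner() { return new Node(null); }
        static Node leaf(String value) { return new Node(value); }

        Node attribute(String name, String v) { attributes.put(name, v); return this; }

        /** Adds an outgoing edge with the given label; labels may repeat. */
        Node add(String label, Node target) { edges.add(new Edge(label, target)); return this; }

        /** Returns the targets of all outgoing edges with the given label. */
        List<Node> children(String label) {
            List<Node> result = new ArrayList<>();
            for (Edge e : edges) {
                if (e.label.equals(label)) result.add(e.target);
            }
            return result;
        }
    }

    /** The label lives on the edge, not on the node it points to. */
    static class Edge {
        final String label;
        final Node target;
        Edge(String label, Node target) { this.label = label; this.target = target; }
    }

    /** Builds a small sample tree describing a document with two authors. */
    static Node sample() {
        return Node.inner().attribute("id", "doc-1")
            .add("title", Node.leaf("Tree-Based Access"))
            .add("author", Node.leaf("A. Author"))
            .add("author", Node.leaf("B. Author"));
    }

    public static void main(String[] args) {
        Node doc = sample();
        System.out.println(doc.children("author").size() + " author edges");
    }
}
```

Placing labels on edges rather than on nodes lets the same label repeat under one parent, which is how the sketch represents multi-valued fields such as `author`.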

A tree-oriented interface is ideally suited to clients that abstract over the domain semantics of the data. As such, it is primarily intended for a wide range of data management processes within the system. A generic interface may also be used by domain-specific clients, if its flexibility and the completeness of the associated tools avoid the limitations of more specific interfaces, particularly those that do not align with standard protocols and data models.

The tree interface is collectively provided by a set of system components in the Data Access and Storage subsystem. Some services expose the tree API and use a variety of mechanisms to optimise the transfer of trees in read and write operations (pattern languages, URI resolution schemes, in-place updates, and streamed processing). Other services build on the tree API to publish and maintain passive views over the data.
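As an illustration of how a pattern can reduce what is transferred, the sketch below filters and prunes a tree against a very small pattern: the pattern names the edge labels that must be present, and everything else is dropped. This is an assumption-laden simplification. The names `PatternSketch`, `Pattern`, `with`, and `prune` are hypothetical, and the actual pattern language is richer than what is shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; a simplified stand-in for tree patterns.
class PatternSketch {

    static class Node {
        final String value; // null for inner nodes
        final Map<String, List<Node>> children = new LinkedHashMap<>();
        Node(String value) { this.value = value; }
        Node add(String label, Node child) {
            children.computeIfAbsent(label, k -> new ArrayList<>()).add(child);
            return this;
        }
    }

    /** Either matches anything, or requires a sub-pattern under each listed label. */
    static class Pattern {
        final boolean any;
        final Map<String, Pattern> required = new LinkedHashMap<>();
        private Pattern(boolean any) { this.any = any; }
        static Pattern any() { return new Pattern(true); }
        static Pattern tree() { return new Pattern(false); }
        Pattern with(String label, Pattern sub) { required.put(label, sub); return this; }
    }

    /**
     * Filtering: returns null when the node does not match the pattern.
     * Pruning: otherwise returns a copy that keeps only the matched edges.
     */
    static Node prune(Node node, Pattern pattern) {
        if (pattern.any) return node; // keep the whole subtree as it is
        Node copy = new Node(node.value);
        for (Map.Entry<String, Pattern> e : pattern.required.entrySet()) {
            List<Node> candidates = node.children.get(e.getKey());
            if (candidates == null) return null; // required label is absent
            int kept = 0;
            for (Node candidate : candidates) {
                Node pruned = prune(candidate, e.getValue());
                if (pruned != null) { copy.add(e.getKey(), pruned); kept++; }
            }
            if (kept == 0) return null; // no child under this label matched
        }
        return copy;
    }

    public static void main(String[] args) {
        Node doc = new Node(null)
            .add("title", new Node("Tree-Based Access"))
            .add("notes", new Node("internal"));
        Node result = prune(doc, Pattern.tree().with("title", Pattern.any()));
        System.out.println("labels kept: " + result.children.keySet());
    }
}
```

In this sketch a single pattern does both jobs: a tree that lacks a required label is filtered out entirely, while a matching tree is returned without the edges the pattern does not mention.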

The services have dynamically extensible architectures, i.e. they rely on independently developed plugins to adapt their APIs to a variety of back-ends both within and outside the system. They can also be widely replicated within the system, and their replicas know how to leverage the Enabling Services to scale horizontally to the capacity of the remote back-ends. The same plugin mechanism yields a storage solution, in that a distinguished plugin can be used to store and access trees locally to individual replicas.
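The plugin idea can be sketched in a few lines of Java. The example below is not the plugin framework itself: `PluginSketch`, `SourcePlugin`, `Service`, and the other names are hypothetical, trees are reduced to plain strings for brevity, and the registry is an ordinary in-process map. It only illustrates three points made above: plugins adapt one API to different back-ends, they can be added while the service is running, and one of them can act as local storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; trees are plain strings here to keep it short.
class PluginSketch {

    /** The service API as seen by one kind of data source. */
    interface SourcePlugin {
        String read(String id);

        /** Plugins may cover the API only partially; writing is optional. */
        default void write(String id, String tree) {
            throw new UnsupportedOperationException("read-only source");
        }
    }

    /** A plugin that stores trees locally, covering the full API. */
    static class InMemoryPlugin implements SourcePlugin {
        private final Map<String, String> store = new HashMap<>();
        @Override public String read(String id) { return store.get(id); }
        @Override public void write(String id, String tree) { store.put(id, tree); }
    }

    /** A plugin for a source that can only be read. */
    static class ReadOnlyPlugin implements SourcePlugin {
        @Override public String read(String id) { return "<tree id='" + id + "'/>"; }
    }

    /** Dispatches each call to the plugin registered for the source type. */
    static class Service {
        private final Map<String, SourcePlugin> plugins = new ConcurrentHashMap<>();

        /** May be called at any time, also after the service has started. */
        void register(String sourceType, SourcePlugin plugin) { plugins.put(sourceType, plugin); }

        private SourcePlugin pluginFor(String sourceType) {
            SourcePlugin plugin = plugins.get(sourceType);
            if (plugin == null) throw new IllegalArgumentException("no plugin for " + sourceType);
            return plugin;
        }

        String read(String sourceType, String id) { return pluginFor(sourceType).read(id); }
        void write(String sourceType, String id, String tree) { pluginFor(sourceType).write(id, tree); }
    }

    public static void main(String[] args) {
        Service service = new Service();
        service.register("memory", new InMemoryPlugin());
        service.write("memory", "t1", "<tree/>");
        System.out.println(service.read("memory", "t1"));
    }
}
```

Clients of `Service` never see which plugin answers a call, so a new back-end only requires a new `SourcePlugin` implementation and a call to `register`.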

Finally, a rich set of libraries implements the models (trees, patterns, streams) and provides high-level facades to the remote service APIs.

Key features

Tree-based access and storage components provide the following key features:

uniform model and access API over structured data
The model is based on edge-labelled and node-attributed trees, and the API is based on a suite of CRUD operations exposed by the Tree Manager service.
fine-grained access to structured data
The read operations can filter and prune trees based on a sophisticated pattern language, and they can resolve whole trees or arbitrary nodes from a URI scheme derived from local node identifiers. The write operations can perform updates in place, applying the tree model to the changes themselves (delta trees). Both read and write operations can work on individual trees as well as on arbitrarily large tree streams.
dynamically pluggable architecture of model and API transformations
The Tree Manager implements the interface against an open-ended number of data sources, from local sources to remote sources, including those that are managed outside its boundaries. The implementation relies on two-way transformations between the tree model and API of the service and those of individual sources. Transformations are implemented in plugins, i.e. libraries developed independently of the service so as to extend its capabilities at runtime (hot deployment). Plugins may implement the API partially (e.g. for read-only access) and employ best-effort strategies in adapting individual operations to ad-hoc data sources or to an open-ended class of data sources that align with model and API standards.
scalable access to remote data sources
The Tree Manager may be replicated within the system, and its replicas can autonomically balance their state by publishing records of their local activity (activation records) in the Directory Services, and by subscribing with those services for notifications of such publications. The Directory Services then act as infrastructure-wide load balancers for the replicas, pointing clients to least-loaded replicas first.
efficient and scalable storage of structured data
A distinguished plugin of the Tree Manager, the Tree Repository, offers tree storage at the service endpoints, using graph database technology (Neo4j) to avoid model impedance mismatches and to offer full coverage and efficient implementations of the service API.
flexible viewing mechanisms over structured data
The View Manager service uses the tree pattern language to maintain “passive views” of data sources that are accessible through the Tree Manager service. Views can be published and maintained under generic regimes, but the service can be extended with plugins for custom management of specific classes of views.
rich tooling for client and plugin development
A rich set of libraries supports the development of Java clients and plugins, offering embedded DSLs for the manipulation of trees, patterns, and streams. They also simplify access to remote Tree Manager and View Manager endpoints, through high-level facades for their remote APIs.
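The replica-selection behaviour listed under "scalable access to remote data sources" can be illustrated with a small sketch. It assumes, purely for illustration, that an activation record reduces to an endpoint and a count of activations, and that load is measured by that count alone. The names `ReplicaSelectionSketch`, `ActivationRecord`, `Directory`, `publish`, and `endpointsByLoad` are hypothetical and do not describe the Directory Services API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: a directory that orders replicas by published load.
class ReplicaSelectionSketch {

    /** Assumed shape of a record: where the replica is and how busy it is. */
    static class ActivationRecord {
        final String endpoint;
        final int activations;
        ActivationRecord(String endpoint, int activations) {
            this.endpoint = endpoint;
            this.activations = activations;
        }
    }

    static class Directory {
        private final List<ActivationRecord> records = new ArrayList<>();

        /** A newer record for an endpoint replaces the older one. */
        void publish(ActivationRecord record) {
            records.removeIf(r -> r.endpoint.equals(record.endpoint));
            records.add(record);
        }

        /** Endpoints ordered so that the least-loaded replica comes first. */
        List<String> endpointsByLoad() {
            List<ActivationRecord> sorted = new ArrayList<>(records);
            sorted.sort(Comparator.comparingInt((ActivationRecord r) -> r.activations));
            List<String> endpoints = new ArrayList<>();
            for (ActivationRecord r : sorted) endpoints.add(r.endpoint);
            return endpoints;
        }
    }

    public static void main(String[] args) {
        Directory directory = new Directory();
        directory.publish(new ActivationRecord("replica-a", 3));
        directory.publish(new ActivationRecord("replica-b", 1));
        System.out.println(directory.endpointsByLoad());
    }
}
```

A client that walks the returned list from the front contacts the least-loaded replica first and falls back to busier ones only if needed.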

Design

Philosophy

Discovery, indexing, transformation, transfer, and presentation are key examples of data management subsystems that abstract in principle over the domain-specific semantics of the data. Other, equally generic system functions are based in turn on those subsystems, most notably search over indexing and process execution over data transfer. The main value proposition of the system as an enabler of data e-Infrastructures lies precisely in this generality.

Directly or indirectly, all the processes mentioned above require access to the data. As in small-scale systems, it is a requirement of good design that they do so against a uniform interface that aligns with their generality and insulates them from the variety of network locations, data models, and access APIs that characterise individual data sources.

Providing this interface is essentially an interoperability requirement. For consistency and uniform growth, the requirement is addressed in a dedicated place of the system’s architecture, i.e. for an open-ended number of subsystems rather than within individual subsystems.

This place is of course the Data Access and Storage subsystem and the components described below provide the required interface and the associated tools.

Architecture

Tree-based access and storage are collectively provided by the following components:

  • trees: a library that contains the implementation of the tree model and associated DSL, the tree pattern language and associated DSL, the URI protocol scheme, and tree bindings to and from XML and XML-related technologies;
  • streams: a library that contains the implementation of a DSL for stream conversion, filtering, and publication;
  • tree-manager: a suite of stateful Web Services that expose a tree-oriented API of read and write operations and implement it by delegation to dynamically deployable plugins for target data sources within and outside the system;
  • tree-manager-framework: a framework of local classes and interfaces for third-party plugin development;
  • tree-manager-library: a client library that implements a high-level facade to the remote API of the Tree Manager service;
  • tree-repository: a plugin of the Tree Manager service for local tree storage in graph databases (Neo4j);
  • view-manager: a suite of stateful Web Services that use tree patterns to define and maintain passive views over data sources that can be accessed through the Tree Manager service;
  • view-manager-library: a client library that implements a high-level facade to the remote API of the View Manager service.
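The streamed processing that the streams component targets (conversion and filtering over potentially very large sequences of trees) can be approximated with the JDK's own `java.util.stream` API. The sketch below uses that API as a stand-in and is not the DSL of the streams library; the names `StreamSketch`, `source`, and `firstMatching` are hypothetical, and trees are reduced to strings. It shows that elements are pulled one at a time and that an unbounded source is consumed only as far as the consumer needs.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Stand-in built on java.util.stream; not the DSL of the streams library.
class StreamSketch {

    /**
     * An unbounded source standing in for a remote stream of trees.
     * The counter records how many elements were actually produced.
     */
    static Stream<String> source(AtomicInteger produced) {
        return Stream.iterate(1, i -> i + 1)
            .peek(i -> produced.incrementAndGet())
            .map(i -> "tree-" + i);
    }

    /** Filters and converts lazily, stopping after 'limit' results. */
    static List<String> firstMatching(Stream<String> trees, int limit) {
        return trees
            .filter(t -> t.endsWith("0"))   // filtering: keep every tenth tree
            .map(String::toUpperCase)       // conversion: transform each element
            .limit(limit)                   // stop pulling once enough were found
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        AtomicInteger produced = new AtomicInteger();
        List<String> result = firstMatching(source(produced), 3);
        System.out.println(result + " after producing " + produced.get() + " elements");
    }
}
```

Because the pipeline is lazy, the source is never materialised as a whole, which is the property that makes processing of arbitrarily large streams feasible.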

Deployment

Usually, a subsystem consists of a number of components. This section describes the setting governing component deployment, e.g. the hardware components where software components are expected to be deployed. In particular, two deployment scenarios should be discussed, i.e. Large deployment and Small deployment, if appropriate. If it is not appropriate, one deployment diagram has to be produced.

Large deployment

A deployment diagram suggesting the deployment schema that maximizes scalability should be described here.

Small deployment

A deployment diagram suggesting the "minimal" deployment schema should be described here.

Use Cases

The subsystem has been conceived to support a number of use cases, and it will be used to serve a number of scenarios. This area will collect these "success stories".

Well suited Use Cases

Describe here scenarios where the subsystem proves to outperform other approaches.

Less well suited Use Cases

Describe here scenarios where the subsystem only partially satisfies the expectations.