Tree-Based Access

A cluster of components within the system focuses on uniform access and storage of structured data of arbitrary semantics, origin, and size. These components form a distinguished subset of the Data Access and Storage subsystem dedicated to data access and storage.

This document outlines their design rationale, key features, and high-level architecture as well as the options for their deployment.

Overview

Access and storage of structured data can be provided under a uniform model of labelled trees and through a remote API of read and write operations over such trees.
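To fix ideas, the following is a minimal sketch, in Java, of what an edge-labelled, node-attributed tree might look like. The class and method names are purely illustrative assumptions and do not reflect the actual API of the trees library described below.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: these names do not reflect the actual 'trees' library API.
// A node carries named attributes and reaches its children through labelled edges.
class TreeNode {

    private final Map<String, String> attributes = new LinkedHashMap<>();
    private final List<Edge> edges = new ArrayList<>();

    TreeNode attribute(String name, String value) {
        attributes.put(name, value);
        return this;
    }

    TreeNode child(String label, TreeNode child) {
        edges.add(new Edge(label, child));
        return this;
    }

    List<Edge> edges() { return edges; }

    // An edge pairs a label with the child node it leads to.
    record Edge(String label, TreeNode target) {}

    public static void main(String[] args) {
        // A small tree describing, say, a bibliographic record.
        TreeNode rec = new TreeNode()
            .attribute("id", "rec-42")
            .child("title", new TreeNode().attribute("value", "On Labelled Trees"))
            .child("author", new TreeNode().attribute("value", "A. Writer"));

        System.out.println(rec.edges().size() + " outgoing edges"); // prints: 2 outgoing edges
    }
}

In such a model, leaves typically carry their values as attributes, while inner nodes group children under labels such as 'title' or 'author'.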

A tree-oriented interface is ideally suited to clients that abstract over the domain semantics of the data. As such, it is primarily intended for a wide range of data management processes within the system. A generic interface may also appeal to domain-specific clients, when its flexibility and the completeness of the associated tools help them avoid the limitations of more specific interfaces, particularly those that do not align with standard protocols and data models.

The tree interface is collectively provided by a set of system components in the Data Access and Storage subsystem. Some services expose the tree API and use a variety of mechanisms to optimise the transfer of trees in read and write operations (pattern languages, URI resolution schemes, in-place updates, and streamed processing). Other services build on the tree API to publish and maintain passive views over the data.

The services have dynamically extensible architectures, i.e. they rely on independently developed plugins to adapt their APIs to a variety of back-ends both within and outside the system. They can also be widely replicated within the system, and their replicas leverage the Enabling Services to scale horizontally to the capacity of the remote back-ends. The same plugin mechanism also yields a storage solution, in that a distinguished plugin can be used to store and access trees local to individual replicas.

Finally, a rich set of libraries implement the models (trees, patterns, streams) and provide high-level facades to remote service APIs.

Key features

Tree-based access and storage components provide the following key features:

uniform model and access API over structured data
The model is based on edge-labelled and node-attributed trees, and the API is based on a suite of CRUD operations exposed by the Tree Manager service.
fine-grained access to structured data
The read operations can filter and prune trees based on a sophisticated pattern language, and they can resolve whole trees or arbitrary nodes from a URI scheme derived from local node identifiers. The write operations can perform updates in place, applying the tree model to the changes themselves (delta trees). Both read and write operations can work on individual trees as well as on arbitrarily large tree streams (see the sketch after this list).
dynamically pluggable architecture of model and API transformations
The Tree Manager implements the interface against an open-ended number of data sources, from local sources to remote sources, including those that are managed outside its boundaries. The implementation relies on two-way transformations between the tree model and API of the service and those of individual sources. Transformations are implemented in plugins, libraries developed independently of the service so as to extend its capabilities at runtime (hot deployment). Plugins may implement the API partially (e.g. for read-only access) and employ best-effort strategies in adapting individual operations to ad hoc data sources or to an open-ended class of data sources that align with model and API standards.
scalable access to remote data sources
The Tree Manager may be replicated within the system, and its replicas can autonomically balance their state by publishing records of their local activity (activation records) in the Directory Services, and by subscribing with those services for notifications of such publications. The Directory Services then act as infrastructure-wide load balancers for the replicas, pointing clients to the least-loaded replicas first.
efficient and scalable storage of structured data
A distinguished plugin of the Tree Manager, the Tree Repository, offers tree storage at the service endpoints, using graph database technology (Neo4j) to avoid model impedance mismatches and to offer full coverage and efficient implementations of the service API.
flexible viewing mechanisms over structured data
The View Manager service uses the tree pattern language to maintain “passive views” of data sources that are accessible through the Tree Manager service. Views can be published and maintained under generic regimes, but the service can be extended with plugins for custom management of specific classes of views.
rich tooling for client and plugin development
A rich set of libraries supports the development of Java clients and plugins, offering embedded DSLs for the manipulation of trees, patterns, and streams. The libraries also simplify access to remote Tree Manager and View Manager endpoints, through high-level facades for their remote APIs.
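As a sketch of how the fine-grained read and write features above might be exercised by a client, consider the following Java fragment. The TreeSource interface, the textual pattern, and the method names are hypothetical stand-ins, not the actual API of the tree-manager-library (whose pattern language is richer and comes with an embedded Java DSL); they only illustrate pattern-based pruning, URI resolution, and delta-style updates.

import java.util.List;
import java.util.Map;

// Hypothetical client-side view of the read/write API; the names below are
// illustrative stand-ins, not the actual tree-manager-library facade.
interface TreeSource {

    // Pattern-based read: return only the trees, pruned to the matching parts,
    // whose structure satisfies the pattern.
    List<Map<String, Object>> read(String pattern);

    // In-place write: the delta describes changes as a tree itself, mirroring
    // the model of the data it modifies.
    void update(String nodeId, Map<String, Object> delta);

    // URI resolution: fetch a whole tree or a single node from its identifier.
    Map<String, Object> resolve(String nodeUri);
}

class AccessSketch {

    // Illustrative usage: read the records that have a 'title' child, then
    // patch the first match in place with a small delta.
    static void demo(TreeSource source) {
        List<Map<String, Object>> matches = source.read("tree(child('title'))");
        if (!matches.isEmpty()) {
            String id = (String) matches.get(0).get("id");
            source.update(id, Map.of("title", "A Corrected Title"));
        }
    }

    public static void main(String[] args) {
        // Tiny in-memory stub so the sketch runs end to end without a service.
        TreeSource stub = new TreeSource() {
            public List<Map<String, Object>> read(String pattern) {
                return List.of(Map.of("id", "rec-42", "title", "An Old Title"));
            }
            public void update(String nodeId, Map<String, Object> delta) {
                System.out.println("patched " + nodeId + " with " + delta);
            }
            public Map<String, Object> resolve(String nodeUri) {
                return Map.of("id", nodeUri);
            }
        };
        demo(stub);
    }
}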

Design

Philosophy

Discovery, indexing, transformation, transfer, and presentation are key examples of data management subsystems that abstract in principle over the domain-specific semantics of the data. Other, equally generic system functions are built in turn on those subsystems, most notably search over indexing and process execution over data transfer. It is precisely in this generality that the main value proposition of the system as an enabler of data e-Infrastructures lies.

Directly or indirectly, all the processes mentioned above require access to the data. As in small-scale systems, it is a requirement of good design that they do so against a uniform interface, one that aligns with their generality and shields them from the variety of network locations, data models, and access APIs that characterise individual data sources.

Providing this interface is essentially an interoperability requirement. For consistency and uniform growth, the requirement is addressed in a dedicated place of the system’s architecture, i.e. once for an open-ended number of subsystems rather than within each individual subsystem.

This place is of course the Data Access and Storage subsystem, and the components described below provide the required interface and the associated tools.

Architecture

Tree-based access and storage are collectively provided by the following components:

  • trees: a library that contains the implementation of the tree model and associated DSL, the tree pattern language and associated DSL, the URI protocol scheme, and tree bindings to and from XML and XML-related technologies;
  • streams: a library that contains the implementation of a DSL for stream conversion, filtering, and publication;
  • tree-manager: a suite of stateful Web Services that expose a tree-oriented API of read and write operations and implement it by delegation to dynamically deployable plugins for target data sources within and outside the system;
  • tree-manager-framework: a framework of local classes and interfaces for third-party plugin development (the plugin idea is sketched after this list);
  • tree-manager-library: a client library that implements a high-level facade to the remote API of the Tree Manager service;
  • tree-repository: a plugin of the Tree Manager service for local tree storage in graph databases (Neo4j);
  • view-manager: a suite of stateful Web Services that use tree patterns to define and maintain passive views over data sources that can be accessed through the Tree Manager service;
  • view-manager-library: a client library that implements a high-level facade to the remote API of the View Manager service;
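To illustrate the plugin idea behind tree-manager and tree-manager-framework, the sketch below shows the rough shape a read-only plugin might take. The interfaces and names are hypothetical and do not reproduce the actual framework contracts; the point is only that a plugin translates between the tree model and a specific class of sources, and may cover the API partially.

import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical plugin contracts: these names are illustrative and do not
// reproduce the actual tree-manager-framework interfaces.
interface SourceReader {
    // Stream the trees of the bound data source that match a pattern.
    Iterator<Map<String, Object>> get(String pattern);
}

interface SourceWriter {
    void add(Map<String, Object> tree);
}

// A plugin binds the service to one class of data sources. It may implement
// the API only partially: this one adapts a read-only remote source, so it
// offers a reader but declares writes unsupported (best-effort coverage).
class ReadOnlyPluginSketch {

    // Two-way transformation, here reduced to wrapping the raw record.
    private Map<String, Object> toTree(String rawRecord) {
        return Map.of("payload", rawRecord);
    }

    SourceReader reader(List<String> rawRecords) {
        return pattern -> rawRecords.stream().map(this::toTree).iterator();
    }

    SourceWriter writer() {
        return tree -> { throw new UnsupportedOperationException("read-only source"); };
    }

    public static void main(String[] args) {
        ReadOnlyPluginSketch plugin = new ReadOnlyPluginSketch();
        Iterator<Map<String, Object>> trees =
            plugin.reader(List.of("<record>example</record>")).get("any-pattern");
        System.out.println(trees.next()); // prints: {payload=<record>example</record>}
    }
}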

Deployment

All the components of the subsystem can be deployed on multiple hosts, and deployment can be dynamic at each host. Specifically:

  • services can have multiple endpoints;
  • an endpoint may be co-deployed with multiple plugins;
  • the same plugin can be co-deployed with multiple endpoints;

There are no temporal constraints on the co-deployment of services and plugins. A plugin may be deployed at any given host before or, more meaningfully, after the service.

The only absolute constraint on deployment is that the Tree Manager services be co-deployed. The same holds for the View Manager services.

Since the services are stateful, a single endpoint may generate and maintain a number of service instances (sketched after the list below):

  • service instances of the Tree Manager services model accessible data sources. There are separate instances for read access and/or write access to sources;
  • service instances of the View Manager services model view definitions over readable data sources, hence over read-only instances of the Tree Manager services;
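A minimal sketch of this instance-per-source arrangement, with names that are illustrative assumptions rather than the actual service implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a single stateful endpoint holding one service instance
// per accessible data source, kept separate by access mode.
class EndpointSketch {

    enum Mode { READ, WRITE }

    // One instance per (source, mode) pair, created on demand and retained
    // by the endpoint for the lifetime of the binding.
    private final Map<String, Object> instances = new ConcurrentHashMap<>();

    Object instanceFor(String sourceId, Mode mode) {
        return instances.computeIfAbsent(sourceId + "/" + mode,
                key -> new Object() /* stand-in for per-source instance state */);
    }

    public static void main(String[] args) {
        EndpointSketch endpoint = new EndpointSketch();
        endpoint.instanceFor("source-A", Mode.READ);
        endpoint.instanceFor("source-A", Mode.WRITE);
        endpoint.instanceFor("source-B", Mode.READ);
        System.out.println(endpoint.instances.size() + " instances at this endpoint"); // prints: 3
    }
}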

Service deployment schemes that seek to maximise capacity will take into account the usual implications of stateful services, whereby the capacity of a service instance:

  • decreases as the number of instances increases;
  • decreases as the load of co-hosted instances and endpoints increases;

Endpoints of the Tree Manager services deserve their own considerations, however:

  • all service instances may execute streamed operations that rely on memory buffers for increased performance. Depending on the frequency of such operations and the average stream size, memory requirements may be higher than average;
  • all service instances manage subscriptions and emit notifications, which add to the workload generated by requests for data access and storage;
  • service instances that target remote data sources (mediator instances) will hold on to system resources for unpredictable lengths of time, depending on the throughput of the target sources. Significant variability is expected across sources.
  • service instances that store data locally (anchored instances) will require local storage accordingly. Furthermore, the performance of the graph database technology used by such instances is proportional to the memory that can be allocated to it.

Overall, service instances are expected to consume more resources than average (anchored instances), or to retain resources for longer than average (mediator instances). As such, they are expected to reduce the capacity of other instances and co-deployed services more than average, and to be similarly affected by them.

A deployment scheme that maximises capacity will tend to create fewer instances at any given endpoint and will reduce the co-deployment of other services. Furthermore:

  • to increase the performance of anchored instances, the scheme will particularly seek isolation for high-demand instances that store datasets locally;
  • to increase data scale for anchored instances, the scheme may consider sharding data across a cluster, though deployment then becomes static;
  • to increase capacity for mediator instances, as well as to increase resource sharing, the scheme may replicate endpoints and rely on the state-balancing mechanisms built into the service. The upper bound in this case remains the capacity of the remote data sources.

Large deployment

A deployment diagram suggesting the deployment schema that maximizes scalability should be described here.

Small deployment

A deployment diagram suggesting the "minimal" deployment schema should be described here.

Use Cases

The subsystem has been conceived to support a number of use cases; moreover, it will be used to serve a number of scenarios. This area will collect these "success stories".

Well suited Use Cases

Describe here scenarios where the subsystem proves to outperform other approaches.

Less well suited Use Cases

Describe here scenarios where the subsystem only partially satisfies the expectations.