Data Access and Storage Facilities
Accessing data sources for retrieval or storage purposes is a fundamental requirement for a wide range of system processes, including indexing, transfer, transformation, and presentation. It is equally a main driver for clients that interface with the resources managed by the system.
A large number of system components are dedicated to meeting data access requirements, including services, service plugins, client-side libraries, server-side libraries, and front-end interfaces.
This document outlines the rationale and high-level architecture of such components.
Overview
Collectively, data access components provide three key facilities:
- the ability to store data in resources managed by the system;
- the ability to access data that is stored in resources managed by the system;
- the ability to access data that is stored in resources managed externally to the system.
The facilities are provided over data with heterogeneous structure, size, and semantics:
- from unstructured data to structured data and semi-structured data;
- from small data sets to large and very large data sets;
- from document data to statistical, biodiversity, and semantic data;
and in compliance with the following non-functional requirements:
- the requirement of secure access;
- the requirement of scalable and efficient access;
- the requirement of standards-based access.
In summary, the data access components provide secure, scalable, efficient, standards-based storage and retrieval of data, where the data may be maintained by the system or outside the system and may vary in structure, size, and semantics.
Key Features
- uniform model and access API over structured data:
  - dynamically pluggable architecture of transformations to and from internal and external data sources;
  - standards-based plugins for document, biodiversity, statistical, and semantic data sources;
- fine-grained access to structured data:
  - horizontal and vertical filtering based on pattern matching;
  - URI-based resolution;
  - in-place remote updates;
- scalable access to structured data:
  - autonomic service replication with infrastructure-wide load balancing;
- efficient and scalable storage of structured data:
  - based on graph database technology;
- rich tooling for client and plugin development:
  - high-level Java APIs for service access;
  - DSLs for pattern construction and stream manipulations;
- remote viewing mechanisms over structured data:
  - “passive” views based on arbitrary access filters;
  - dynamically pluggable architecture of custom view management schemes;
- uniform modelling and access over document data:
  - rich descriptions of document content, metadata, annotations, parts, and alternatives;
  - high-level client APIs for model construction and remote access;
- uniform modelling and access over biodiversity data:
  - dynamically pluggable architecture of transformations from internal and external data sources of biodiversity data (including species data and occurrence points);
  - standards-based plugins for biodiversity data sources (including OBIS, GBIF, and Catalogue of Life);
- search across different biodiversity data repositories:
  - domain-specific query language tailored to biodiversity data sources;
  - unified and integrated view over heterogeneous and complementary data sources;
- features for file access and storage: to be documented.
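To make the "horizontal and vertical filtering based on pattern matching" feature more concrete, the following is a minimal sketch and not the system's actual API. It assumes that horizontal filtering decides which trees are returned at all, and that vertical filtering prunes each returned tree down to the parts matched by the pattern. The names `TreeFilterSketch`, `Node`, `TreePattern`, `qualifies`, `prune`, and `filter` are hypothetical and introduced only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical sketch of pattern-based tree filtering; not the system's API. */
final class TreeFilterSketch {

    /** A labelled node: leaves carry a value, inner nodes carry children. */
    static final class Node {
        final String label;
        final String value;          // null for inner nodes
        final List<Node> children;   // empty for leaves

        Node(String label, String value, List<Node> children) {
            this.label = label;
            this.value = value;
            this.children = children;
        }

        static Node leaf(String label, String value) {
            return new Node(label, value, List.of());
        }

        static Node inner(String label, Node... children) {
            return new Node(label, null, Arrays.asList(children));
        }
    }

    /** A constraint on one child of the root: its label plus a test on its leaf value. */
    static final class TreePattern {
        final String label;
        final Predicate<String> valueTest;

        TreePattern(String label, Predicate<String> valueTest) {
            this.label = label;
            this.valueTest = valueTest;
        }

        boolean matches(Node child) {
            return child.label.equals(label)
                    && child.value != null
                    && valueTest.test(child.value);
        }
    }

    /** Horizontal filtering (assumed meaning): every pattern must match some child. */
    static boolean qualifies(Node tree, List<TreePattern> patterns) {
        return patterns.stream()
                .allMatch(p -> tree.children.stream().anyMatch(p::matches));
    }

    /** Vertical filtering (assumed meaning): keep only children matched by a pattern. */
    static Node prune(Node tree, List<TreePattern> patterns) {
        List<Node> kept = tree.children.stream()
                .filter(c -> patterns.stream().anyMatch(p -> p.matches(c)))
                .collect(Collectors.toList());
        return new Node(tree.label, tree.value, kept);
    }

    /** Applies both filters: drop non-qualifying trees, then prune the survivors. */
    static List<Node> filter(List<Node> trees, List<TreePattern> patterns) {
        return trees.stream()
                .filter(t -> qualifies(t, patterns))
                .map(t -> prune(t, patterns))
                .collect(Collectors.toList());
    }
}
```

Under these assumptions, a client that asks for the `title` and `lang` of English-language records would receive only the trees whose `lang` leaf equals `en`, each reduced to those two children. Doing both steps at the source rather than at the client is what would keep transfers small for large data sets.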
Subsystems
Data access components are grouped into the following subsystems, each of which specialises in data of a particular structure or semantics:
- the Tree-Based Access subsystem
  - groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size;
  - The subsystem focuses on uniform access to structured data for cross-domain processes, including but not limited to system processes.
- the Document Access subsystem
  - groups components that implement access and storage facilities for structured data with document semantics and arbitrary origin and size;
  - The subsystem builds on the Tree-Based Access subsystem to provide uniform access to document data for domain-specific processes.
- the Biodiversity Access subsystem
  - groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size;
  - The subsystem focuses on uniform access to biodiversity data for domain-specific processes and feeds into the Tree-Based Access subsystem.
- the File-Based Access subsystem
  - groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size;
  - The subsystem focuses on storage and retrieval of bytestreams for arbitrary processes, including but not limited to system processes.
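The layering of the Document Access subsystem over the Tree-Based Access subsystem can be pictured with a small sketch. It is illustrative only: the classes `Tree` and `Document`, the methods `toTree` and `fromTree`, and the node labels `document`, `metadata`, and `part` are all hypothetical, chosen to show how a domain-specific model might be mapped onto a generic tree and back.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a document model layered over a generic tree model. */
final class DocumentOverTreeSketch {

    /** Generic tree: a label, an optional value, and ordered children. */
    static final class Tree {
        final String label;
        final String value;                       // may be null for grouping nodes
        final List<Tree> children = new ArrayList<>();

        Tree(String label, String value) {
            this.label = label;
            this.value = value;
        }

        Tree add(Tree child) {
            children.add(child);
            return this;
        }
    }

    /** Domain view: typed fields instead of raw tree navigation. */
    static final class Document {
        final String id;
        final Map<String, String> metadata = new LinkedHashMap<>();
        final List<String> partIds = new ArrayList<>();

        Document(String id) {
            this.id = id;
        }
    }

    /** Writes a document as a tree so that generic, cross-domain tooling can handle it. */
    static Tree toTree(Document doc) {
        Tree root = new Tree("document", doc.id);
        Tree meta = new Tree("metadata", null);
        doc.metadata.forEach((key, value) -> meta.add(new Tree(key, value)));
        root.add(meta);
        for (String partId : doc.partIds) {
            root.add(new Tree("part", partId));
        }
        return root;
    }

    /** Reads a document back from a tree, ignoring children it does not recognise. */
    static Document fromTree(Tree tree) {
        Document doc = new Document(tree.value);
        for (Tree child : tree.children) {
            if (child.label.equals("metadata")) {
                for (Tree field : child.children) {
                    doc.metadata.put(field.label, field.value);
                }
            } else if (child.label.equals("part")) {
                doc.partIds.add(child.value);
            }
        }
        return doc;
    }
}
```

In a design along these lines, domain-specific clients would work only with `Document`, while storage, filtering, and replication would operate on `Tree` and need no knowledge of document semantics.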