Data Access and Storage Facilities
Accessing data sources for retrieval or storage purposes is a fundamental requirement for a wide range of system processes, including indexing, transfer, transformation, and presentation. Equally, it is a main driver for clients that interface the resources managed by the system or accessible through facilities available within the system.
A large number of system components are dedicated to meet data access requirements, including services, service plugins, client-side libraries, server-side libraries, and front-end interfaces.
This document outlines the design rationale and high-level architecture of such components.
Overview
Collectively, data access and storage components provide three key facilities:
- the ability to store data in computational and storage resources managed by the system;
- the ability to retrieve data available in computational and storage resources managed by the system;
- the ability to retrieve data available in computational and storage resources outside the management regime of the system;
The facilities are provided over data with heterogeneous structure, size, and semantics:
- from unstructured data to structured data and semi-structured data;
- from small data sets to large and very large data sets;
- from document data, to statistical, biodiversity, and semantic data;
and in compliance with the following non-functional requirements:
- the requirement of secure access;
- the requirement of scalable and efficient access;
- the requirement of standards-based access;
In summary, the data access and storage components provide secure, scalable, efficient, standards-based storage and retrieval of data, where the data may be maintained by the system or outside the system and may vary in structure, size, and semantics.
Key Features
- uniform model and access API over structured data
- dynamically pluggable architecture of model and API transformations to and from internal and external data sources;
- plugins for document, biodiversity, statistical, and semantic data sources, including sources with custom APIs and standards-based APIs;
- fine-grained access to structured data
- horizontal and vertical filtering based on pattern matching;
- URI-based resolution;
- in-place remote updates;
- scalable access to structured data
- autonomic service replication with infrastructure-wide load balancing;
- efficient and scalable storage of structured data
- based on graph database technology;
- rich tooling for client and plugin development
- high-level Java APIs for service access;
- DSLs for pattern construction and stream manipulations;
- remote viewing mechanisms over structured data
- “passive” views based on arbitrary access filters;
- dynamically pluggable architecture of custom view management schemes;
- uniform modelling and access API over document data
- rich descriptions of document content, metadata, annotations, parts, and alternatives
- transformations from model and API of key document sources, including OAI providers;
- high-level client APIs for model construction and remote access;
- uniform modelling and access API over semantic data
- tree-views over RDF graph data;
- transformations from model and API of key document sources, including SPARQL endpoints;
- uniform modelling and access over biodiversity data
- access API tailored to biodiversity data sources;
- dynamically pluggable architecture of transformations from external sources of biodiversity data;
- plugins for key biodiversity data sources, including OBIS, GBIF and Catalogue of Life;
- efficient and scalable storage of files
- unstructured storage back-end based on MongoDB for replication and high availability, automatic balancing for changes in load and data distribution, no single points of failure, data sharding;
- no intrinsic upper bound on file size;
- standards-based and structured storage of files
- POSIX-like client API;
- support for hierarchical folder structures;
Subsystems
Data access and storage components cluster within the following subsystems, where each subsystem specialises along the structure or the semantics of the data:
- the Tree-Based Access subsystem
- groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size.
- The subsystem focuses on uniform access to structured data for cross-domain processes, including but not limited to system processes.
- A subset of the components build on the generic interface and specialise it to data with domain-specific semantics, including document data and semantic data.
- the Biodiversity Access subsystem
- groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size;
- The subsystem focuses on uniform access to biodiversity data for domain-specific processes and feeds into the Tree-Based Access subsystem.
- the File-Based Access subsystem
- groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size;
- The subsystem focuses on storage and retrieval of bytestreams for arbitrary processes, including but not limited to system processes.