Spatial Data Discovery and Access


Discovering geospatial information about points in the ocean can be fundamental to analysing species behaviour and preferences. It is also useful to scientists from different communities who want to share their data or obtain a complete picture of the environmental conditions in a given zone. Geospatial Data Discovery includes functionality for retrieving the environmental information associated with points or zones, expressed as physical and chemical properties.

This document outlines the design rationale and high-level architecture of such components.

Overview

Geospatial Data Discovery provides the following facilities:

  • the ability to retrieve information that is already stored and available in the infrastructure;
  • the ability to retrieve information that is stored on remote sites which collect geospatial data;
  • the ability to generate information for some points by using kriging when necessary.

The entire system is therefore based on the following kinds of datasets:

  • Stored datasets: layers of physical or chemical features, containing information at a certain resolution and for a certain time period;
  • Remote datasets: layers of physical or chemical features stored on a remote site, to be discovered by means of extraction techniques. They refer to a certain time period and resolution;
  • Potential datasets: physical or chemical features that are not stored for some points but can be produced by means of geospatial functions such as kriging or resampling.

In summary, the Geospatial Data Discovery system provides environmental features associated with geographical points or areas, whether those features are stored in the infrastructure, discoverable on remote sites, or yet to be generated.
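
The three kinds of datasets suggest a fallback order when a value is requested for a point: look in stored data first, then ask a remote site, and only then generate an estimate. The following Python sketch illustrates that order under several assumptions. The names `discover`, `idw_estimate` and `remote_lookup` are hypothetical and are not part of the gCube API. The generation step uses inverse distance weighting as a simple stand-in for kriging, which would additionally require fitting a variogram model to the data.

```python
import math
from typing import Callable, Dict, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def idw_estimate(samples: Dict[Point, float], target: Point, power: float = 2.0) -> float:
    """Inverse-distance-weighted estimate at `target` from known samples.

    This is a stand-in for kriging, not kriging itself.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in samples.items():
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # the target coincides with a known sample
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no samples available to interpolate from")
    return weighted_sum / weight_total


def discover(
    point: Point,
    stored: Dict[Point, float],
    remote_lookup: Callable[[Point], Optional[float]],
) -> Tuple[float, str]:
    """Return (value, origin), where origin names the kind of dataset used."""
    # 1. Stored dataset: the value is already in the infrastructure.
    if point in stored:
        return stored[point], "stored"
    # 2. Remote dataset: ask a (hypothetical) remote site for the value.
    remote_value = remote_lookup(point)
    if remote_value is not None:
        return remote_value, "remote"
    # 3. Potential dataset: generate the value from the stored samples.
    return idw_estimate(stored, point), "potential"


if __name__ == "__main__":
    temperature = {(0.0, 0.0): 10.0, (2.0, 0.0): 20.0}
    no_remote = lambda p: None  # a remote site that knows nothing
    print(discover((0.0, 0.0), temperature, no_remote))
    print(discover((1.0, 0.0), temperature, no_remote))
```

Returning the origin alongside the value lets a caller distinguish measured values from generated ones, which matters when interpolated data feed further analysis.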

Key Features

  • uniform model and access API over structured data:
      • dynamically pluggable architecture of model and API transformations to and from internal and external data sources;
      • plugins for document, biodiversity, statistical, and semantic data sources, including sources with custom APIs and standards-based APIs;
  • fine-grained access to structured data:
      • horizontal and vertical filtering based on pattern matching;
      • URI-based resolution;
      • in-place remote updates;
  • scalable access to structured data:
      • autonomic service replication with infrastructure-wide load balancing;
  • efficient and scalable storage of structured data:
      • based on graph database technology;
  • rich tooling for client and plugin development:
      • high-level Java APIs for service access;
      • DSLs for pattern construction and stream manipulations;
  • remote viewing mechanisms over structured data:
      • “passive” views based on arbitrary access filters;
      • dynamically pluggable architecture of custom view management schemes;
  • uniform modelling and access API over document data:
      • rich descriptions of document content, metadata, annotations, parts, and alternatives;
      • transformations from model and API of key document sources, including OAI providers;
      • high-level client APIs for model construction and remote access;
  • uniform modelling and access API over semantic data:
      • tree-views over RDF graph data;
      • transformations from model and API of key document sources, including SPARQL endpoints;
  • uniform modelling and access over biodiversity data:
      • access API tailored to biodiversity data sources;
      • dynamically pluggable architecture of transformations from external sources of biodiversity data;
      • plugins for key biodiversity data sources, including OBIS, GBIF and Catalogue of Life;
  • efficient and scalable storage of files:
      • unstructured storage back-end based on MongoDB for replication and high availability, automatic balancing for changes in load and data distribution, no single points of failure, data sharding;
      • no intrinsic upper bound on file size;
  • standards-based and structured storage of files:
      • POSIX-like client API;
      • support for hierarchical folder structures.
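
Filtering tree-shaped data by pattern matching can be illustrated with a short Python sketch. It is not the gCube pattern DSL or API. It assumes trees are nested dictionaries, and it reads "horizontal" filtering as dropping whole trees that do not match a pattern and "vertical" filtering as pruning each matching tree down to the branches the pattern names. The functions `prune` and `filter_trees` are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional

Tree = Dict[str, Any]  # nested dictionaries; leaves are plain values


def prune(tree: Tree, pattern: Dict[str, Any]) -> Optional[Tree]:
    """Vertical filtering: keep only the branches named in `pattern`.

    A pattern value may be a nested pattern (dict), a predicate (callable),
    or a literal that the leaf must equal. Returns None if the tree does
    not match the pattern.
    """
    result: Tree = {}
    for key, expected in pattern.items():
        if key not in tree:
            return None
        node = tree[key]
        if isinstance(expected, dict):
            if not isinstance(node, dict):
                return None
            child = prune(node, expected)
            if child is None:
                return None
            result[key] = child
        elif callable(expected):
            if not expected(node):
                return None
            result[key] = node
        else:
            if node != expected:
                return None
            result[key] = node
    return result


def filter_trees(trees: Iterable[Tree], pattern: Dict[str, Any]) -> List[Tree]:
    """Horizontal filtering: drop non-matching trees, prune the rest."""
    pruned = (prune(tree, pattern) for tree in trees)
    return [tree for tree in pruned if tree is not None]


if __name__ == "__main__":
    records = [
        {"name": "Thunnus thynnus", "rank": "species",
         "occurrence": {"depth": 120, "temperature": 14.2}},
        {"name": "Gadus morhua", "rank": "species",
         "occurrence": {"depth": 40, "temperature": 6.0}},
    ]
    deep = {"name": lambda v: True, "occurrence": {"depth": lambda d: d > 100}}
    print(filter_trees(records, deep))
```

Because the pattern both selects and prunes, a client receives only the branches it asked for, which keeps transfers small when trees are large.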

Subsystems

Data access and storage components cluster within the following subsystems, where each subsystem specialises along the structure or the semantics of the data:

  • The Tree-Based Access subsystem groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size. The subsystem focuses on uniform access to structured data for cross-domain processes, including but not limited to system processes. A subset of the components build on the generic interface and specialise it to data with domain-specific semantics, including document data and semantic data.
  • The Biodiversity Access subsystem groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size. The subsystem focuses on uniform access to biodiversity data for domain-specific processes and feeds into the Tree-Based Access subsystem.
  • The File-Based Access subsystem groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size. The subsystem focuses on storage and retrieval of bytestreams for arbitrary processes, including but not limited to system processes.

