Occurrence Data Reconciliation
A service for assessing and harmonizing species occurrence points. It aims to provide users with an interface and methods for assessing whether occurrence points are repeated or anomalous, and for processing and aggregating such data. This document outlines the design rationale, key features, and high-level architecture, as well as the deployment options.
Overview
The goal of this service is to offer a single entry point for processing, assessing and harmonizing occurrence points belonging to species observations. Data can come from the Species Discovery Service or be uploaded by a user through a web interface.
The service can interface with other infrastructural services in order to extend the range of functionalities and applications available for the data under analysis.
Design
Philosophy
The service represents an endpoint for users who want to process species observations in order to explore their coherence and to extract hidden properties from data collected from different sources. It is meant to complement other services for the analysis of species and occurrence points.
Architecture
The subsystem comprises the following components:
- Inputs Managers: a set of internal processors which manage the variety of inputs that could come from users or from other services;
- Occurrence Point Processors: a set of internal objects which can invoke external systems in order to process data or extract hidden properties from them. These include Clustering, Anomaly Points Detection etc.;
- Occurrence Points Enrichment: a connector to another d4Science service dealing with the enrichment of occurrence points with associated information about the chemical and physical characteristics of the sea or the earth;
- Occurrence Points Operations: a connector to another d4Science interface which is able to operate on tabular data, performing visualization, aggregation and transformations;
- Processing Orchestrator: an internal process which manages the interaction and the usage of the other components. It accepts and dispatches requests coming from outside the service.
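The dispatching role described for the Processing Orchestrator can be illustrated with a minimal Python sketch. It is not the service's actual code: the class and method names (ProcessingOrchestrator, register, dispatch), the Request type, the "deduplicate" operation name and the representation of a point as a (longitude, latitude) pair are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed representation: a point is a (longitude, latitude) pair.
Point = Tuple[float, float]
Processor = Callable[[List[Point]], List[Point]]


@dataclass
class Request:
    """Hypothetical request: an operation name plus the points to process."""
    operation: str
    points: List[Point]


class ProcessingOrchestrator:
    """Accepts requests and dispatches them to the registered processors."""

    def __init__(self) -> None:
        self._processors: Dict[str, Processor] = {}

    def register(self, operation: str, processor: Processor) -> None:
        # Associate an operation name with the component that handles it.
        self._processors[operation] = processor

    def dispatch(self, request: Request) -> List[Point]:
        # Look up the processor for the requested operation and run it.
        try:
            processor = self._processors[request.operation]
        except KeyError:
            raise ValueError(f"unsupported operation: {request.operation}") from None
        return processor(request.points)


def drop_exact_repeats(points: List[Point]) -> List[Point]:
    # dict.fromkeys keeps the first occurrence of each point, in order.
    return list(dict.fromkeys(points))


orchestrator = ProcessingOrchestrator()
orchestrator.register("deduplicate", drop_exact_repeats)
result = orchestrator.dispatch(
    Request("deduplicate", [(12.5, 41.9), (12.5, 41.9), (2.3, 48.8)])
)
```

In this sketch each processor is a plain function, so a connector to an external system could be registered in the same way as an internal one.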
A diagram of the relationships between these components is reported in the following figure:
Deployment
All the components of the service must be deployed together on a single node. The subsystem can be replicated on multiple hosts and in multiple scopes, but this does not guarantee a performance improvement, because it is a management system for a single input dataset.
Small deployment
The deployment follows the schema below, as the service requires the presence of other complementary services.
Use Cases
Well suited Use Cases
The subsystem is particularly suited when experiments have to be performed on occurrence points referring to a certain species or family. The operations which can be applied, although relying on state-of-the-art algorithms, have been studied and developed specifically for managing this kind of information.
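As an example of one such operation, the following Python sketch flags occurrence points that repeat an earlier point of the same dataset within a distance tolerance. It only illustrates the idea and is not the algorithm used by the service: the function names, the 1 km default tolerance, the (longitude, latitude) layout and the sample coordinates are all assumptions.

```python
import math
from typing import List, Tuple

# Assumed layout: (longitude, latitude) in decimal degrees.
Point = Tuple[float, float]

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in kilometres between two points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def flag_repeats(points: List[Point], tolerance_km: float = 1.0) -> List[bool]:
    """Return True for each point within tolerance_km of an earlier kept point."""
    kept: List[Point] = []
    flags: List[bool] = []
    for p in points:
        is_repeat = any(haversine_km(p, q) <= tolerance_km for q in kept)
        flags.append(is_repeat)
        if not is_repeat:
            kept.append(p)  # only non-repeated points become references
    return flags


# Hypothetical sample: two nearly identical points and one distant point.
sample = [(12.4964, 41.9028), (12.4965, 41.9029), (2.3522, 48.8566)]
flags = flag_repeats(sample)
```

The pairwise comparison is quadratic in the number of points, which is acceptable for a sketch; a spatial index would be the usual choice for large datasets.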
Subsystems
Data access and storage components cluster within the following subsystems, where each subsystem specializes along the structure or the semantics of the data: