Data Mining Facilities

Overview

The Data Mining facilities comprise a set of features, services and methods for processing and mining data sets. They address several aspects of data processing, ranging from modeling to clustering and from the identification of anomalies to the detection of hidden series. The D4Science e-Infrastructure uses this set of services and libraries to manage data mining problems, including their computational complexity. Algorithms are executed in parallel and, possibly, in a distributed fashion, using the D4Science nodes themselves as worker nodes. Furthermore, the services that perform Data Mining operations are deployed according to a distributed architecture in order to balance the load of procedures that require local resources.

By means of these features, the Data Mining facilities aim to address problems such as (i) predicting the impact of climate change on biodiversity, (ii) preventing the spread of invasive species, (iii) identifying geographical and ecological aspects of disease transmission, (iv) conservation planning, and (v) predicting suitable habitats for marine species. By using the computational facilities of the D4Science e-Infrastructure, algorithms can run cost-effectively, letting scientists perform more experiments and combine different techniques.

Key Features

The components of this subsystem provide the following key features:

parallel processing
    parallelization of statistical algorithms using a map-reduce approach
    cloud computing approach in a seamless way to the users
pre-cooked state-of-the-art data mining algorithms
    algorithms oriented to biological-related problems supplied as-a-service
    general purpose algorithms (e.g. Clustering, Principal Component Analysis, Artificial Neural Networks) supplied as-a-service
data trends generation and analysis
    extraction of trends for biodiversity data
    inspection of time series of observations on biological species
    basic signal processing techniques to explore periodicities in trends
ecological niche modelling
    algorithms to perform ecological niche modelling using either mechanistic or correlative approaches
    species distribution maps generation
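
The map-reduce approach to parallelizing statistical algorithms mentioned above can be illustrated with a minimal Python sketch. It is not DataMiner code: the function names (map_chunk, combine, mean_and_variance) and the choice of a local thread pool in place of distributed worker nodes are assumptions made purely for illustration. Each chunk of data is summarised independently (the map step), and the partial summaries are then merged (the reduce step) to obtain the mean and the population variance.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: summarise one chunk as (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(values, n_chunks=4):
    """Population mean and variance computed from per-chunk summaries.

    Hypothetical helper: a thread pool stands in for the worker nodes.
    """
    if not values:
        raise ValueError("values must not be empty")
    size = -(-len(values) // n_chunks)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    n, s, ss = reduce(combine, partials, (0, 0.0, 0.0))
    mean = s / n
    return mean, ss / n - mean * mean
```

Because the partial summaries are merged with an associative operation, the chunks can be processed in any order and on any worker, which is the property that makes this kind of statistic suitable for a map-reduce formulation.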

Specifications

DataMiner
    a service allowing the management of statistical data and multi-user requests for computation
DataMiner Algorithms
    the complete list of algorithms supported by the DataMiner
How-to Implement Algorithms for DataMiner
    a guide to implementing algorithms for DataMiner
Statistical Algorithms Importer
    a tool to import processes on DataMiner
DataMiner Installation
    the installation guide for DataMiner
How to Interact with the DataMiner by client
    a guide to interacting with DataMiner from a thin client
Ecological Modeling
    a set of methods for performing Data Mining operations, including experiments and a categorization of techniques
Signal Processing
    a set of methods to perform digital signal processing
Statistical Manager
    the previous gCube system for cloud computing
DataMiner Pool Manager
    an automatic installer system for algorithms
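
As an illustration of the kind of basic signal processing used to explore periodicities in a time series, the following Python sketch finds the dominant period of a series with a naive discrete Fourier transform. It does not reproduce the gCube Signal Processing methods: the function name dominant_period and the plain O(n^2) transform are assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest DFT component.

    Hypothetical helper for illustration; uses a naive O(n^2) transform.
    """
    n = len(series)
    if n < 4:
        raise ValueError("need at least 4 samples")
    mean = sum(series) / n
    centred = [x - mean for x in series]  # remove the constant offset
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):  # positive frequencies only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        return None  # no component with positive power (e.g. all zeros)
    return n / best_k
```

For example, a series of monthly observations with a yearly cycle, sampled over four years, would be expected to yield a dominant period of about 12 samples.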

Current usage statistics

Data extracted using the gCube accounting system.

Overall number of process requests per month
    ~64,300
Most used process
    BiOnym (~30,000 requests per month)
Overall number of users
    ~2,000
Availability
    99.7%


Last Update: Apr. 2017