Tabular Data Service

From Gcube Wiki
Revision as of 15:11, 12 December 2013 by Leonardo.candela (Talk | contribs)

Jump to: navigation, search

Overview

Tabular Data is a system that allows to manage the lifecycle of statistical data. It's main characteristics are:

  • Exposes its operation to the end user through a gCube Portlet;
  • Its operations can be invoked programmatically through a gCube Web Service (smart gears);
  • Provides several operations for importing/exporting from/to serveral sources/sinks;
  • Provides data validations and transformations capabilities;
  • Provides means for the management of data management processes automation.

Architecture

Tabular Data system architecture

Tabular data system is made of three main subsystems:

  • Tabular Data Portlet: human centric web application hosted on a gCube web portal. Allows management of the Tabular Data Service system on a per-user basis, allowing invocation of tabular data service methods and additional functionalities with external apps.
  • Tabular Data Service: main component of the tabular data architecture. It exposes several remote interfaces covering different areas of functionalities.
  • Data/Metadata backend: This is where raw tables data and metadata are stored and where the service keeps its management data.

The diagram shows also some of the main components that build up the service, which mainly relates to the operation management area:

  • Operation orchestrator: The operation orchestrator is a component that receives incoming call requests from the service interface and unwinds them into a sequence of operation call. The orchestrator may enforce policies and command automatic operations/validations according to its configuration.
  • Operation modules: Operation modules are classes that brings functionalities to the tabular data service. These functionalities can be reached directly with invocations on the Operation Interface. Operation modules can work directly with data on the Data backend or leverage the cube manager in order to create/clone tables or modify table metadata.
  • Cube manager: The cube manager is the lowest level component of the service stack. Its main responsibilites are managing the creation/modification of tables (and their metadata) and acting as a registry for all the created tables, allowing retrieval of tables metadata.

Model

Tables in the Tabular Data system are entities made of two separate elements:

  • Raw data: This can be imagined as the data contained in user provided CSV files
  • Metadata
    • Data structure: this metadata describes how data is structured (e.g.: columns number or column data type) and how raw data can be reached
    • Enriching Metadata: This metadata adds information on top of raw data and provides some context or additional information on top of it.

Raw data is managed directly by leveraging relational database services (PostgreSQL with Postgis extension). Metadata is managed and represented through a metadata model library called tabular-model. Tabular Model provides

  • a description for tables entities covering the minimum table structure description requrements
  • elements that helps in enriching tables with additional metadata (column labels, descriptions, table version, etc.)

Tabular Model is GWT friendly, which means that it can be used in GWT Web application on client side, since it's java beans are translatable into javascript code.

Expressions

Tabular-model provides a simple model for describing conditions on table data. Conditions can be expressed as a set of minimal constructs chained toghether with logical connectors. The data maodel used to compose expression relies on the composite design pattern which allows to build tree of expressions. Logical connectors are OR and AND and can take any number of child expression. Leaf expression are expression that describe a particular conditions. Each Leaf Expression can take an arbitrary set of parameters in order to be defined.

Starting from release 3.0.0 of tabular model a set of leaf expressions have been defined:

  • IsNull, Value is null;
  • ValueIsIn: Value contained in another column of another table;
  • Conditions on comparable values: Equals, Greater than, Lesser than, Not Equals;
  • Conditions on text values: contains text, text equals, text mathes SQL regexp.

An expression can be evaluated by an Evaluator, which is simply an object that, by processing a given expression, performs some action and returns a custom result. Two evaluators have been provided:

A client can obtain an instance of a Evaluator using its related EvaluatorFactory. Usage of expressions and evaluators is explained in the following code snippet taken from a test case of the evaluator-description project:

	DescriptionExpressionEvaluatorFactory evaluatorFactory = getDescriptionEvaluatorFactory();
 
	ColumnReference targetColumnReference = createColumnReference();
	ColumnReference anotherTargetColumnReference = createColumnReference();
 
	// Other expression
	Expression isNull = new IsNull(targetColumnReference);
	Expression columnIsIn = new ValueIsIn(targetColumnReference, anotherTargetColumnReference);
 
	// Text expression
	Expression textContains = new TextContains(createColumnReference(), new TDText("test"));
	Expression textEquals = new TextEquals(createColumnReference(), new TDText("test"));
	Expression textMatchRegexp = new TextMatchSQLRegexp(targetColumnReference, new TDText("[a-b]*"));
 
	// Comparable
	Expression equals = new Equals(targetColumnReference, new TDDate(new Date()));
	Expression greaterThan = new Equals(targetColumnReference, new TDInteger(5));
	Expression lessThan = new LessThan(targetColumnReference, new TDNumeric(5.1f));
	Expression notEquals = new NotEquals(targetColumnReference, new TDBoolean(false));
 
	// Composite
	Expression and = new And(textContains, isNull, columnIsIn, textContains);
	Expression or = new Or(and, textEquals, textMatchRegexp, equals, greaterThan, lessThan, notEquals);
 
	String description = evaluatorFactory.getEvaluator(or).evaluate();

LeafExpression are usually parameterized in terms of column references or typed values. Column references are objects that allow to reference a column of a tabular data table. Typed values must comply with allowed data types for table columns, therefore boxed custom types are used for the representation of those values.

Service

Operations

Operation modules are a group of Java classes that provide, each one, a single functionality to the system. Functionalities provided by operation modules may fall under one of these categories:

  • Import
  • Export
  • Transformation
  • Validation

Each Operation takes an input, which is a set of parameters. These parameters may include a tabular data table or a column of a tabular data table or none of them (like in the import case). Along with additional parameters, each operation must belong to one of these categories:

  • Void scoped: does not require a table to compute
  • Table scoped: requires a target table to compute
  • Column scope: requires a target table column to compute

Each operation produce, as a result of its computation, a table and zero or more collateral tables. The create tables are always a new table probably created by first cloning the input table, if any is provided.

Operation modules leverages cube manager capabilities in order to create new tables, clone existing ones or modify the structure or additional metadata of tables. Operation modules can work with raw data directly on the DB, therefore data experts can rely on their SQL knowledge.


Orchestrator

Cube Manager

Main Cube Manager link