Legacy applications integration

Context

The Geospatial Cluster aims to provide:

Data discovery of internal/external geospatial data repositories
Data access to discovered data
Data processing of discovered/accessed data
Data visualization of discovered/accessed/processed data

This page focuses on the geospatial data processing of discovered/accessed data with Legacy Applications.

Objectives

The main objectives for the geospatial data processing are:

Defining the enrichment needs of bio-ecological or activity occurrences with environmental data: OBIS Ocean Physics, VTI, VME
Designing and planning the implementation of the enrichment capacity

Advanced geospatial analytical and modelling features (e.g. R geospatial, reallocation, aggregation)

Defining the advanced geospatial processes required for reallocation, aggregation and interpolation
Designing and planning the implementation of the geospatial processes capacity

What are Legacy Applications

Legacy Applications are existing software applications written in third-party languages such as R, IDL, MatLab or Python. Rewriting them in Java is not practical because:

Legacy applications come from a very specific knowledge domain that is hard to transfer to coders
Rewriting is time and resource consuming
Converted applications would have poor maintainability
Time-to-market would be too long
The number of applications that could be supported would be limited

Typical legacy applications are written in R, IDL or MatLab, which are software packages commonly used to develop scientific applications.

The computing resources and interfaces for Legacy Applications

The OGC Web Processing Service (WPS) allows processing services over geospatial data to be exposed. 52North has implemented a WPS Java framework in which processing algorithms (e.g. spatial resampling, temporal aggregation) are implemented as WPS processes that can be invoked by clients. This implementation does not provide underlying computing resources beyond the server hosting the WPS implementation, so scalability is not ensured and QoS/SLA cannot be guaranteed.

The Hadoop Map/Reduce model is used to provide the processing resources where:

Processes can be pure Map/Reduce implementations using Java libraries packed at runtime and deployed by Hadoop Map/Reduce. This approach is not applicable to Legacy Applications
Processes can be written in third-party or other languages (bash, Python, etc.) and run through Map/Reduce Streaming (pipes)

Coupling the two exposes geospatial processing services through the OGC WPS interface while exploiting scalable processing resources.
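As an illustration of the client side of this coupling, the sketch below builds a WPS Execute request in key-value-pair form. It assumes the WPS 1.0.0 KVP encoding; the endpoint URL is hypothetical, and the process identifier "align" and the parameter names are borrowed from the example used later on this page.

```python
from urllib.parse import urlencode


def build_wps_execute_url(endpoint, process_id, inputs):
    """Build a WPS 1.0.0 key-value-pair Execute URL (sketch)."""
    # DataInputs are encoded as name=value pairs separated by semicolons.
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    query = urlencode({
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": process_id,
        "datainputs": data_inputs,  # urlencode percent-encodes '=' and ';'
    })
    return f"{endpoint}?{query}"


# Hypothetical endpoint; replace with the address of the actual WPS server.
url = build_wps_execute_url(
    "http://wps.example.org/wps/WebProcessingService",
    "align",
    {"param1": 2, "param2": 4},
)
print(url)
```

The client only sees the WPS interface; whether the process runs on the WPS host or on a Hadoop cluster is hidden behind the service.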

Legacy Applications thus exploit Hadoop Map/Reduce Streaming, a utility that allows users to create and run jobs with any executable (e.g. shell utilities) as the mapper and/or the reducer.
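To make the streaming contract concrete, here is a minimal sketch of a mapper: it reads records line by line from standard input and writes tab-separated key/value lines to standard output. The comma-separated input layout and the function names are assumptions made for illustration, not part of WPS-hadoop.

```python
import io
import sys


def map_lines(lines):
    """Yield one tab-separated key/value record per non-empty input line."""
    for line in lines:
        record = line.strip()
        if not record:
            continue
        # Assumed input layout: "key,value". Hadoop Streaming treats the
        # text before the first tab of each output line as the key.
        key, _, value = record.partition(",")
        yield f"{key}\t{value}"


def run_mapper(stdin, stdout):
    """Copy mapped records from stdin to stdout, one per line."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")


# Inside a streaming task this would be: run_mapper(sys.stdin, sys.stdout)
# Here in-memory streams are used so the sketch runs without a cluster.
demo_in = io.StringIO("a,1\n\nb,2\n")
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
sys.stdout.write(demo_out.getvalue())
```

An R, IDL or MatLab program can play the same role as long as it honours the same stdin/stdout convention.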

Types of input parameters in WPS-Hadoop

In order

The structure we propose is as depicted below (for a processing step named "align", provided as an example)

The application directory follows a set of best practices

for its folder and file structure
for its descriptive metadata

so as to ease the subsequent deployment of the application to the WPS-hadoop environment.

The application.xml file has two main blocks:

the job template section
and the workflow template section.

The first block defines the job templates in the workflow XML application definition file. The second block is not currently used; it is there only to pave the way, if needed, for supporting workflows with Oozie.

The single processing block of our workflow needs a job template.

A proposed example contains the XML lines below:

<jobTemplate id="align">
	<streamingExecutable>/application/align/run</streamingExecutable> <!-- processing trigger -->
	<defaultParameters> <!-- default parameters of the job -->
		<!-- Default values are specified here, for testing purposes only! -->
		<parameter id="param1">2</parameter>	<!-- no default value -->
		<parameter id="param2">4</parameter>
	</defaultParameters>
	<defaultJobconf>
		<property id="app.job.max.tasks">1</property>	<!-- Maximum number of parallel tasks -->
	</defaultJobconf>
</jobTemplate>
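The following sketch shows how such a job template could be read programmatically, for instance by a validation script. It uses only the Python standard library; the function name read_job_template and the returned dictionary layout are illustrative assumptions, not part of WPS-hadoop.

```python
import xml.etree.ElementTree as ET

# The job template from the example above, embedded so the sketch is runnable.
JOB_TEMPLATE = """
<jobTemplate id="align">
    <streamingExecutable>/application/align/run</streamingExecutable>
    <defaultParameters>
        <parameter id="param1">2</parameter>
        <parameter id="param2">4</parameter>
    </defaultParameters>
    <defaultJobconf>
        <property id="app.job.max.tasks">1</property>
    </defaultJobconf>
</jobTemplate>
"""


def read_job_template(xml_text):
    """Return the job id, executable, parameters and jobconf as a dict."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "executable": root.findtext("streamingExecutable"),
        # Values are kept as strings, exactly as written in the XML.
        "parameters": {p.get("id"): p.text for p in root.iter("parameter")},
        "jobconf": {p.get("id"): p.text for p in root.iter("property")},
    }


template = read_job_template(JOB_TEMPLATE)
print(template)
```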

We could provide tools to test a job on the local workstation.
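Such a local test tool might look like the sketch below, which feeds a few input lines to an executable on standard input and collects what it writes to standard output, as Hadoop Streaming would. The helper name run_locally is hypothetical, and a Python one-liner stands in for the real streaming executable (e.g. /application/align/run) so that the example runs anywhere.

```python
import subprocess
import sys


def run_locally(command, input_lines):
    """Pipe input_lines into command and return its stdout as a list of lines."""
    result = subprocess.run(
        command,
        input="\n".join(input_lines) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the executable exits with a non-zero status
    )
    return result.stdout.splitlines()


# Stand-in for the legacy executable: upper-cases everything it reads.
stand_in = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

out = run_locally(stand_in, ["a\t1", "b\t2"])
print(out)
```

Because the executable is exercised through pipes only, the same check applies whatever language the legacy application is written in.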

Once done, the application is packaged as a jar file and stored in a repository accessible from the WPS-hadoop server. When a processing request is triggered, WPS-hadoop deploys that jar file via Hadoop streaming and invokes the legacy application.
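How WPS-hadoop assembles the streaming job internally is not described here. Purely as an illustration, the sketch below builds a plausible Hadoop Streaming command line for a map-only job. The jar locations, the HDFS paths, the "application" link name and the choice of passing parameters as environment variables through -cmdenv are all assumptions; only the option names themselves are standard Hadoop Streaming options.

```python
def streaming_command(streaming_jar, app_archive, executable,
                      input_path, output_path, parameters):
    """Return the argv list of a map-only Hadoop Streaming job (sketch)."""
    argv = [
        "hadoop", "jar", streaming_jar,
        # Generic options must precede streaming options. The archive is
        # unpacked on each node and linked under the name after '#'.
        "-archives", f"{app_archive}#application",
        "-input", input_path,
        "-output", output_path,
        "-mapper", executable,
        "-numReduceTasks", "0",  # map-only: no reduce phase
    ]
    # Assumption: job parameters reach the executable as environment variables.
    for name, value in parameters.items():
        argv += ["-cmdenv", f"{name}={value}"]
    return argv


# All paths below are hypothetical placeholders.
argv = streaming_command(
    "/opt/hadoop/hadoop-streaming.jar",
    "hdfs:///repository/align.jar",
    "application/align/run",
    "/data/in",
    "/data/out",
    {"param1": 2, "param2": 4},
)
print(" ".join(argv))
```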

Hadoop clusters serving legacy applications only need to have R (and/or IDL, MatLab, Octave, etc.) installed.

The procedure is applied and validated for the iMarine partners through:

SimpleTestNono application (IRD, Norbert Billet)
...
