Legacy applications integration users feedback

From Gcube Wiki
Revision as of 18:09, 16 May 2014 by Julien.barde (Talk | contribs) (Analysis of needs)

In this document, we clarify our (IRD + FAO) current approach in iMarine to using a common WPS server with legacy applications. The point of view is that of data managers working in marine science laboratories who would like to deploy processes on a WPS server and take the opportunity to have these processes used by more people with various datasets.

This feedback and these guidelines apply regardless of the server location: inside or outside the iMarine infrastructure. Indeed, beyond the iMarine framework, we think that the issues we are currently facing and the solutions we are currently building remain of interest to other people sharing the same goal.

Main objective

Our initial goal is to enable some IRD processes:

  • 1) to be described so that users can discover existing processes and get the source code,
  • 2) to become more generic in order to be reused by additional applications, people and datasets,
  • 3) to be executed on a common WPS server. Indeed, raw processes, even on a WPS, can be useless for the community when they are not described and / or updated.
  • 4) to be deployed by data managers without having to open & read the source code (by using only metadata: describeProcess.xml).

Analysis of needs

More than delivering as many processes as possible, the blocking point for a community aiming to share processes is setting up generic methods that solve the main challenges listed below:

  • challenges for developers / data managers:
    • to enable the processes of their colleagues to accept various data formats as inputs (including complex data formats like WFS & GML). This often requires modifying the native data input reading function (used to create an R object / R data in our case): we fixed this by adding a data format abstraction function using libraries like gdal/ogr. A similar function can be added to deliver multiple formats when writing the data output.
    • to enable these processes to be executed with various datasets having a compliant data structure even if they come from different data sources (and not only datasets from the source the script was created for). This can be achieved by adding a data structure mapping function (e.g. to manage the semantics of the labels given to some columns or codelists, as well as data types). This makes it possible to manage the mapping between the dataset used as input (a priori unknown) and the data structure expected by the process (known). By doing so we aim to enable, for example, FAO datasets to be used as inputs of IRD processes.
    • to keep track of each execution of a process by writing the metadata describing the inputs, the process itself and the outputs. To create such metadata, we suggest a function writing RDF metadata (by using Jena in R with the rrdf package). Of course, depending on the needs, it could be interesting to do it differently (in Java with OGC metadata). With RDF, the goal is to enable the outputs of the WPS server to be delivered as Linked Open Data.
    • to enable the WPS server to be used within other applications. This is facilitated by connecting our data sources (like databases managed with Postgres / PostGIS) and related views / subsets to spatial data servers (like MapServer, Constellation) with proper standards for interoperability (for both formats and access protocols). This part has to be specific to the data sources and related tools.
    • to use WPS processes for the pre-calculation of large sets of indicators (to generate the content of applications like Atlas, fact sheets...)
  • challenges for users:
    • find and describe a set of relevant metadata elements required to understand how to use each process properly,
    • once the processes are on the WPS server, it is not obvious that users (colleagues and partners) will be able to execute them. As previously said, to make each process usable by everybody, various data formats and access protocols must be accepted for the data inputs. Once the data format abstraction and data structure mapping functions described above are in place, the WPS metadata have to be written accordingly,
    • being able to use the WPS server through friendly clients like the Terradue Web Client (http://wps01.i-marine.d4science.org/client.html) or GIS desktop applications. Indeed, most users won't be able to write a proper WPS URL by themselves.
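
To illustrate why hand-writing a WPS URL is error-prone, here is a minimal Python sketch that assembles a WPS 1.0.0 key-value-pair Execute request. The endpoint `http://example.org/wps`, the process identifier `demo.process` and the input names are hypothetical placeholders, not real iMarine resources, and the sketch assumes input values that contain no `;` or `=` characters.

```python
from urllib.parse import urlencode


def build_wps_execute_url(endpoint, identifier, inputs):
    """Build a WPS 1.0.0 key-value-pair Execute request URL."""
    # DataInputs is a semicolon-separated list of name=value pairs.
    # Assumption: values contain no ';' or '=' of their own.
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": identifier,
        "DataInputs": data_inputs,
    }
    # urlencode percent-encodes the whole DataInputs string once.
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint, process identifier and inputs, for illustration only.
url = build_wps_execute_url(
    "http://example.org/wps",
    "demo.process",
    {"species": "YFT", "year": 2010},
)
print(url)
```

A friendly client hides exactly this kind of string assembly from the user.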

Regarding the last point (the skills of users in marine ecology labs), we think we will better reach our goal if we solve the following blocking points:

    • data formats: enabling users to provide the input dataset of the process in a familiar data format (csv, Excel, shapefiles...) is crucial, because some of them need to be able to work "as usual"; otherwise some people won't use the server. As already said, this is possible by using a data abstraction library like ogr (through rgdal in R).
    • access protocols: in addition to data formats, if we want the WPS server to be used with these formats, we think the user should be able to provide the dataset by:
      • uploading their dataset in any usual data format (cf. previous item). Upload is simple and everybody understands it. We want to signal that upload is possible through a specific data input parameter in the OGC WPS metadata. Upload can use basic http/ftp protocols, which is a good option for everybody because 1) it's simple and 2) not all data sources are accessible with the more sophisticated access protocols described below.
      • using WFS. WFS is just an access protocol which is required for an SDI (like WCS, OPeNDAP...) but, for now, it is ignored by most colleagues in our domain, who won't be able to deliver their datasets this way. Moreover, not all research units have spatial data servers (MapServer, GeoServer...). Even where such tools exist, not all implementations of WFS allow the "outputformat" option to be used with data formats like CSV or shp. Indeed, it seems that the only official format for WFS is GML (others are used as extensions of the specification, see http://mapserver.org/fr/ogc/wfs_server.html) and, anyway, none of our colleagues work with the WFS access protocol, whatever the underlying data format.
    • data structures: most users are interested in the processes set up by their colleagues. However, processes are usually created to fit specific needs (a specific data structure coming from a specific data source) and written in a specific programming language (R, Matlab, Java, IDL...). The code needs to be enriched to become more generic (being on a WPS server is not enough). Most users only know a single programming language and won't be able to modify the code of the original process. This can be fixed by embedding in the process a function that maps between data structures. This has been done in R for iMarine.
    • metadata: we want to use the metadata to be able to re-execute an indicator ("replicable science") with the same input dataset (by adding new criteria to filter the input dataset), and we want the outputs to be usable by applications like x-Search or the Fact sheet generator.
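
The data structure mapping idea can be sketched as follows. This is a Python illustration under stated assumptions, not the R function written for iMarine: the function name `map_structure`, the column names and the sample row are all hypothetical.

```python
def map_structure(records, column_map, converters=None):
    """Rename source columns to the names a process expects and cast values.

    records:    list of dicts, one per row of the (a priori unknown) input
    column_map: {source column name: column name expected by the process}
    converters: optional {expected column name: function casting the value}
    """
    converters = converters or {}
    mapped = []
    for row in records:
        missing = [src for src in column_map if src not in row]
        if missing:
            raise KeyError(f"input dataset lacks columns: {missing}")
        new_row = {}
        for src, target in column_map.items():
            cast = converters.get(target, lambda value: value)
            new_row[target] = cast(row[src])
        mapped.append(new_row)
    return mapped


# Hypothetical input whose labels differ from what the process expects.
source_rows = [{"SPECIES_CODE": "YFT", "YR": "2010", "CATCH_T": "12.5"}]
process_rows = map_structure(
    source_rows,
    column_map={"SPECIES_CODE": "species", "YR": "year", "CATCH_T": "catch"},
    converters={"year": int, "catch": float},
)
print(process_rows)
```

The process code itself stays untouched: only the mapping passed as a parameter changes from one data source to another.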

Regarding the use of WFS: the value of WFS for data interoperability is not in question; it is an obvious need for an infrastructure like iMarine. We want to support both approaches:

  • WFS for machines or people who know about it
  • usual protocols (http or upload) for the others (the majority)
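
As a rough illustration of the "usual protocols" branch, the Python sketch below reads a dataset either from an uploaded local file or from a plain URL, and guesses the format from the file extension. It uses only the standard library as a stand-in for a real abstraction layer like ogr; the function names are hypothetical, only CSV and GeoJSON are handled, and the WFS branch is deliberately left out.

```python
import csv
import io
import json
from pathlib import PurePosixPath
from urllib.parse import urlsplit
from urllib.request import urlopen

SUPPORTED = (".csv", ".json", ".geojson")


def _read_text(source):
    """Return the content of a URL or of a local (uploaded) file as text."""
    if urlsplit(source).scheme in ("http", "https", "ftp", "file"):
        with urlopen(source) as response:
            return response.read().decode("utf-8")
    with open(source, encoding="utf-8") as handle:
        return handle.read()


def load_records(source):
    """Load a CSV or GeoJSON dataset into a list of plain dictionaries."""
    # The format is guessed from the extension of the path or URL.
    suffix = PurePosixPath(urlsplit(source).path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported data format: {suffix or 'unknown'}")
    text = _read_text(source)
    if suffix == ".csv":
        return list(csv.DictReader(io.StringIO(text)))
    # GeoJSON: keep only the attribute table, one dict per feature.
    features = json.loads(text)["features"]
    return [dict(feature.get("properties") or {}) for feature in features]
```

Whatever the access path, the process receives the same list of records, so the choice between upload and URL stays invisible to the process code.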

To summarize, according to our discussions and beyond the technical aspects, the current challenge is to find a method that facilitates the deployment of some existing processes (using examples from IRD or FAO) and their use by a wider community (machines and people), by delivering a set of R functions that make them easier to run on a WPS server:

  • abstraction of data format and access protocols
  • abstraction of the data structure (mapping function)
  • writing of metadata
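
For the third function, writing of metadata, a minimal Python sketch is given below. It is a stand-in for the R / rrdf approach mentioned earlier: it emits N-Triples by hand and uses W3C PROV-O terms, which is our assumption and not a vocabulary prescribed by this page. All `example.org` URIs are hypothetical, and treating the process script as an entity "used" by the run is a simplification.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def execution_triples(run_uri, process_uri, input_uris, output_uris, started_at):
    """Describe one process execution as N-Triples text.

    started_at is an ISO 8601 timestamp string; all URIs are assumed to be
    valid absolute IRIs that need no escaping.
    """
    lines = [
        f"<{run_uri}> <{RDF_TYPE}> <{PROV}Activity> .",
        # Simplification: the process script is an entity used by the run.
        f"<{run_uri}> <{PROV}used> <{process_uri}> .",
        f'<{run_uri}> <{PROV}startedAtTime> "{started_at}"^^<{XSD_DATETIME}> .',
    ]
    lines += [f"<{run_uri}> <{PROV}used> <{uri}> ." for uri in input_uris]
    lines += [f"<{uri}> <{PROV}wasGeneratedBy> <{run_uri}> ." for uri in output_uris]
    return "\n".join(lines) + "\n"


# Hypothetical URIs, for illustration only.
metadata = execution_triples(
    run_uri="http://example.org/run/1",
    process_uri="http://example.org/process/demo",
    input_uris=["http://example.org/data/in.csv"],
    output_uris=["http://example.org/data/out.csv"],
    started_at="2014-05-16T18:09:00Z",
)
print(metadata)
```

Because each output is linked back to the run, and the run to its inputs and script, such a record is one possible basis for re-executing an indicator later with the same inputs.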

We are convinced of the value of having a WPS server in an infrastructure like iMarine, but the worst scenario would be to spend a lot of time collecting and deploying the processes of our partners, only to tell them that they can use them only through WFS and the GML format and not in the way they usually use them.