Legacy applications integration users feedback

From Gcube Wiki
Revision as of 19:00, 16 May 2014 by Julien.barde (Talk | contribs) (Example of R functions)

In this document, we clarify our (IRD + FAO) current approach in iMarine to using a common WPS server with legacy applications. The point of view is that of data managers working in marine science laboratories who would like to deploy processes on a WPS server and use it as an opportunity to have these processes used by more people, with various datasets.

This feedback and these guidelines apply regardless of where the server is located, inside or outside the iMarine infrastructure. Indeed, beyond the iMarine framework, we think that the issues we are facing and the solutions we are building are of interest to other people sharing the same goal.

Main objective

Our initial goal is to use the iMarine WPS server to enable some IRD processes:

  • 1) to be described so that users can discover existing processes and get the source code,
  • 2) to become more generic in order to be reused by additional applications, people and datasets,
  • 3) to be executed on any common WPS server, by suggesting some good practices to the scientists. Indeed, raw processes, even on a WPS server, can be useless for the community when they are not described and/or enriched,
  • 4) to be deployed by data managers without having to open & read the source code (by using only metadata: describeProcess.xml).

Analysis of needs

Here is a summary of the main needs:

  • setting up a catalog of processes (through proper metadata)
  • enabling a search engine to discover processes
  • enabling each process to become more generic than the original version created for a given data source and related datasets
  • enabling each process to be executed online, through a WPS server, without having to deal with the underlying language.


Challenges

More than delivering as many processes as possible, the blocking point for a community aiming to share processes is to set up generic methods that solve the main challenges listed below:

  • challenges for developers / data managers:
    • to enable the processes of their colleagues to accept various data formats as inputs (including complex data formats like GML delivered through WFS, even if very few scientists use them). This often requires modifying the native data input reading function (used to create an R object in our case): we solved this by adding a data format abstraction function based on libraries like GDAL/OGR. A similar function can be added to write the data outputs in multiple formats.
    • to enable these processes to be executed with any dataset whose data structure complies with the one expected by the process on the WPS server, even if the datasets come from different data sources with different semantics (and not only datasets from the source the script was created for). This can be achieved by adding a data structure mapping function (e.g. to manage the semantics of the labels given to columns or code lists, as well as data types). This function maps the dataset used as input (a priori unknown) to the data structure expected by the process (known). By doing so we aim to enable, for example, FAO datasets to be used as inputs of IRD processes.
    • lineage: to keep track of each execution of a process by writing the metadata describing the inputs, the process itself and the outputs. To create such metadata, we suggest a function writing RDF metadata (using Jena in R through the rrdf package). Of course, depending on the needs, it could be interesting to do it differently (in Java with OGC metadata). With RDF, the goal is to enable the outputs of the iMarine WPS server to be delivered as Linked Open Data.
    • to manage interoperability issues so that the WPS server can be used within other applications. This is facilitated by connecting the data sources (like databases managed with Postgres / PostGIS, or netCDF-CF files) and the related views / subsets to spatial data servers (like MapServer, Constellation) implementing the proper standards for interoperability between SDIs (for both formats and access protocols: GML/WFS, netCDF/OPeNDAP). This has to be done by data managers and is specific to each data source and its related spatial servers.
    • to use WPS processes for the pre-calculation of large sets of indicators (to generate the content of applications like atlases, fact sheets...)
    • to set up a search engine to enable process discovery. We suggest using either OGC or RDF metadata in order to reuse existing search engines (GeoNetwork, XSearch...)
  • challenges for users:
    • to identify the set of relevant metadata elements needed to describe each process properly and to understand how to use it,
    • once processes are on the WPS server, it is not obvious that users (colleagues and partners) will be able to execute them. Indeed, as said above, making each process usable by everybody requires various data formats and access protocols to be accepted for the data inputs. Once the data format abstraction and data structure mapping functions are in place, the WPS metadata and the related input parameters have to be written accordingly.
    • to be able to use the WPS server through friendly clients like the Terradue Web Client ([1]) or GIS desktop applications. Indeed, most users won't be able to write a proper WPS URL by themselves.
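
To give an idea of what writing a WPS URL by hand involves, here is a minimal sketch in R. It assumes a server that accepts WPS 1.0.0 key-value-pair (KVP) GET requests; the function name buildWpsExecuteUrl, the endpoint, the process identifier and the input names are all hypothetical, and the exact encoding rules accepted can differ between server implementations.

```r
# Build a WPS 1.0.0 KVP Execute URL from a named list of literal inputs.
# Assumption: 'endpoint', 'identifier' and the input names are placeholders,
# they do not refer to a real iMarine process.
buildWpsExecuteUrl <- function(endpoint, identifier, inputs) {
  stopifnot(is.list(inputs), length(inputs) > 0, !is.null(names(inputs)))
  # Percent-encode each value so that ';', '=' or '&' inside a value
  # cannot be confused with the KVP separators.
  encoded <- vapply(inputs,
                    function(v) utils::URLencode(as.character(v), reserved = TRUE),
                    character(1))
  data_inputs <- paste(names(inputs), encoded, sep = "=", collapse = ";")
  paste0(endpoint,
         "?service=WPS&version=1.0.0&request=Execute",
         "&identifier=", utils::URLencode(identifier, reserved = TRUE),
         "&DataInputs=", data_inputs)
}

url <- buildWpsExecuteUrl("http://example.org/wps",
                          "catches_by_species",
                          list(species = "YFT", year = 2010))
```

Even for two literal inputs the user has to know the service parameters, the process identifier, the input names and the separators, which is why a graphical client is preferable for most colleagues.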


Regarding the last point (the skills of users in marine ecology laboratories), we think we will better reach our goal if we solve the following blocking points:

  • data formats: enabling users to provide the input dataset of a process in a friendly data format (csv, Excel, shapefiles...) is crucial, because being able to work "as usual" matters to some of them; otherwise some people won't use the server. As already said, this is possible by using a data abstraction library like OGR (through rgdal in R).
  • access protocols: in addition to data formats, if we want the WPS server to be used with these formats, we think the user should be able to provide the dataset by:
    • uploading the dataset in any usual data format (cf. previous item). Upload is simple and everybody understands it. We want to indicate that upload is possible through a specific data input parameter in the OGC WPS metadata. Upload can rely on basic http/ftp protocols, which is a good option for everybody because 1) it is simple and 2) not all data sources are accessible through the more sophisticated access protocols described below.
    • WFS: it is just an access protocol required for an SDI (like WCS, OPeNDAP...) but it is, for now, ignored by most colleagues in our domain, who won't be able to deliver their datasets this way. Moreover, not all research units have spatial data servers (MapServer, GeoServer...). Even where such tools exist, not all WFS implementations allow the "outputformat" option to be used with data formats like CSV or shapefile. Indeed, it seems that the only official format for WFS is GML (others are provided as extensions of the specification, see http://mapserver.org/fr/ogc/wfs_server.html) and, in any case, none of our colleagues work with the WFS access protocol, whatever the underlying data format.
  • data structures: most users are interested in the processes set up by their colleagues. However, processes are usually created to fit specific needs (a specific data structure coming from a specific data source) and in a specific programming language (R, Matlab, Java, IDL...). The code needs to be enriched to become more generic (being on a WPS server is not enough). Most users only know a single programming language and won't be able to modify the code of the original process. This can be fixed by embedding in the process a function that maps between data structures. This has been done in R for iMarine.
  • metadata: we want to use the metadata to be able to re-execute an indicator ("replicable science") with the same input dataset (e.g. by adding new criteria to filter the input dataset), and we want the outputs to be usable by applications like x-Search or the Fact sheet generator.
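
The data format point can be illustrated with a minimal sketch of a reading function that hides the format from the process. This is not the function of the IRD package: the name readDataset is hypothetical, and the spatial branch relies on the sf package (which also reads through GDAL/OGR) as an assumption, whereas the work described here used rgdal.

```r
# Read a dataset into a data.frame whatever its (supported) format.
# readDataset() is a hypothetical name used for illustration only.
readDataset <- function(path, format = tolower(tools::file_ext(path))) {
  switch(format,
    csv = read.csv(path, stringsAsFactors = FALSE),
    tsv = ,
    txt = read.delim(path, stringsAsFactors = FALSE),
    shp = ,
    gml = {
      # Spatial formats are delegated to GDAL/OGR.
      # Assumption: the 'sf' package is installed.
      if (!requireNamespace("sf", quietly = TRUE)) {
        stop("Package 'sf' is required to read '", format, "' files")
      }
      as.data.frame(sf::st_read(path, quiet = TRUE))
    },
    stop("Unsupported format: '", format, "'")
  )
}

# Usage with a small csv file written to a temporary location.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2, b = c("x", "y")), tmp, row.names = FALSE)
dataset <- readDataset(tmp)
```

Since read.csv also accepts an http URL instead of a local path, the same function can serve datasets that are uploaded and datasets that are published on a simple web server.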


Current blocking points in the framework of iMarine

  • Updating the metadata (describeProcess.xml) on the WPS server is too complicated:
    • either the administrator of the iMarine WPS has to understand the R code to write a Java Class to enable it on the grid / Hadoop,
    • or we, IRD or FAO, have to deploy it directly and write the Java code: this is not sustainable and very few organizations will have the programming skills.
  • Regarding the use of WFS: the value of WFS for data interoperability is not in question; it is an obvious need for an infrastructure like iMarine. We want to manage both approaches:
    • WFS for machines or people who know about it
    • Usual protocols (http or upload) for the others (the majority)
    • We are convinced of the value of having a WPS server in an infrastructure like iMarine, but the worst scenario would be to spend a lot of time collecting and deploying the processes of our partners and then to tell them that they can only use them through WFS and the GML format, not in the way they usually use them.

Schema and current R package

To summarize our discussions: beyond the technical aspects, the current challenge is to find a method that facilitates the deployment of existing processes (using examples from IRD or FAO) and their use by a wider community (machines and people), by delivering a set of R functions that facilitate their use through a WPS server:

  • abstraction of data format and access protocols: Data reading function & Data writing function
  • abstraction of the data structure (Data Structure Mapping function)
  • Lineage (for replicable science): Metadata writing function
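
As an illustration of the second item, a data structure mapping function could look like the following minimal sketch. The function name mapDataStructure, its arguments and the column names in the usage example are hypothetical; they only show the principle of renaming and converting the columns of an unknown input dataset to the structure a process expects.

```r
# Map an input dataset to the data structure expected by a process.
# 'mapping'    : named character vector, names = expected column names,
#                values = column names found in the input dataset.
# 'converters' : optional named list of functions, one per expected column
#                whose type must be converted (e.g. as.integer).
mapDataStructure <- function(data, mapping, converters = list()) {
  missing_cols <- setdiff(mapping, names(data))
  if (length(missing_cols) > 0) {
    stop("Input dataset lacks column(s): ", paste(missing_cols, collapse = ", "))
  }
  out <- data[, unname(mapping), drop = FALSE]  # keep and order mapped columns
  names(out) <- names(mapping)                  # switch to the expected labels
  for (col in names(converters)) {
    out[[col]] <- converters[[col]](out[[col]])
  }
  out
}

# Usage with hypothetical column names for the source dataset.
source_data <- data.frame(SPECIES_CODE = c("YFT", "SKJ"),
                          YR = c("2010", "2011"),
                          QTY = c("1.5", "2"),
                          stringsAsFactors = FALSE)
mapped <- mapDataStructure(source_data,
                           mapping = c(species = "SPECIES_CODE",
                                       year = "YR",
                                       catch = "QTY"),
                           converters = list(year = as.integer,
                                             catch = as.numeric))
```

Keeping the mapping as data rather than code means it could be supplied as an input parameter of the process, so the person providing the dataset does not need to edit the R source.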


The current IRD R package, with the related functions, is available online [2]. For now this package focuses on fisheries activity indicators (with datasets related to the Tuna Atlas in the case of IRD). The same approach will be followed with netCDF-CF data sources to enrich biological observations with environmental parameters through an R process.

The following schema illustrates which functions we suggest adding to enrich each process on a WPS server.

[Figure: Structure code R WPS.png]


Example of R functions

  • SPARQL query:
 # Return the Ecoscope URI matching a FAO identifier, or NA when none is found.
 FAO2URIFromEcoscope <- function(FAOId) {
   if (! require(rrdf)) {
     stop("Missing rrdf library")
   }

   if (missing(FAOId) || is.na(FAOId) || nchar(FAOId) == 0) {
     stop("Missing FAOId parameter")
   }

   # Query the remote Ecoscope SPARQL endpoint for the resource with this faoId
   sparqlResult <- sparql.remote("http://ecoscopebc.mpl.ird.fr/joseki/ecoscope",
                                 paste("PREFIX ecosystems_def: <http://www.ecoscope.org/ontologies/ecosystems_def/> ",
                                       "SELECT * WHERE { ?uri ecosystems_def:faoId '", FAOId, "'}", sep=""))
   if (length(sparqlResult) > 0) {
     return(as.character(sparqlResult[1, "uri"]))
   }

   return(NA)
 }


  • Writing RDF statements:
 buildRdf <- function(rdf_file_path, rdf_subject, titles=c(), descriptions=c(),
                      subjects=c(), processes=c(), data_output_identifier=c(),
                      start=NA, end=NA, spatial=NA) {
 #data_input=c(), 
 if (! require(rrdf)) {
   stop("Missing rrdf library")
 }
 
 store = new.rdf(ontology=FALSE)
 
 add.prefix(store,
            prefix="resources_def",
            namespace="http://www.ecoscope.org/ontologies/resources_def/")
 
 add.prefix(store,
            prefix="ical",
            namespace="http://www.w3.org/2002/12/cal/ical/")
 
 add.prefix(store,
            prefix="dct",
            namespace="http://purl.org/dc/terms/")
 
 #type
 add.triple(store,
            subject=rdf_subject,
            predicate="http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
            object="http://www.ecoscope.org/ontologies/resources_def/indicator")
 #process
 add.triple(store,
            subject=rdf_subject,
            predicate="http://www.ecoscope.org/ontologies/resources_def/usesProcess",
            object=processes)
 
 #has_data_input
 #add.data.triple(store,
 #            subject=rdf_subject,
 #           predicate="http://www.ecoscope.org/ontologies/resources_def/has_data_input",
 #           data=data_input)
 
 #identifier of the data output
 add.data.triple(store,
                 subject=rdf_subject,
                 predicate="http://purl.org/dc/elements/1.1/identifier",
                 data=data_output_identifier)
 
 
 #title
 for (title.current in titles) {
   if (length(title.current) == 2) {
     #here we know the language attribute
     add.data.triple(store,
                     subject=rdf_subject,
                     predicate="http://purl.org/dc/elements/1.1/title",
                     lang=title.current[1],
                     data=title.current[2])
   } else {
     #here we dont know 
     add.data.triple(store,
                     subject=rdf_subject,
                     predicate="http://purl.org/dc/elements/1.1/title",
                     data=title.current)
   }
 }
 #description
 for (description.current in descriptions) {
   add.data.triple(store,
                   subject=rdf_subject,
                   predicate="http://purl.org/dc/elements/1.1/description",
                   data=description.current)
 }
 
 if (! is.na(start)) {
   add.data.triple(store,
                   subject=rdf_subject,
                   predicate="http://www.w3.org/2002/12/cal/ical/dtstart",
                   data=start)
 }
 
 if (! is.na(end)) {
   add.data.triple(store,
                   subject=rdf_subject,
                   predicate="http://www.w3.org/2002/12/cal/ical/dtend",
                   data=end)
 }
 
 for (subject.current in subjects) {
   URI <- FAO2URIFromEcoscope(subject.current)
   if (! is.na(URI)) {
     add.triple(store,
                subject=rdf_subject,
                predicate="http://purl.org/dc/elements/1.1/subject",
                object=URI)
   } else {
     add.data.triple(store,
                     subject=rdf_subject,
                     predicate="http://purl.org/dc/elements/1.1/subject",
                     data=subject.current)
   }
 }
 
 
 
 if (! is.na(spatial)) {
   add.data.triple(store,
                   subject=rdf_subject,
                   predicate="http://purl.org/dc/terms/spatial",
                   data=spatial)
 }
 
 save.rdf(store=store, filename=rdf_file_path)
 }
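
The following sketch shows how buildRdf could be called for one execution of an indicator. All values (URIs, file name, FAO code, dates, geometry) are hypothetical. Since the title loop in buildRdf tests whether each element has length 2, the titles are passed here as a list of c(language, text) pairs; with c() the pairs would be flattened into single strings and the language would be lost.

```r
# Hypothetical lineage description for one execution of an indicator process.
lineage <- list(
  rdf_file_path = file.path(tempdir(), "indicator_lineage.rdf"),
  rdf_subject   = "http://example.org/indicators/run-001",
  # A list keeps each c(language, text) pair as one element.
  titles        = list(c("en", "Catches per year"),
                       c("fr", "Captures par an")),
  descriptions  = c("Annual catches aggregated by species"),
  subjects      = c("YFT"),
  processes     = "http://example.org/processes/catches_per_year",
  data_output_identifier = "catches_per_year_2010.csv",
  start         = "2010-01-01",
  end           = "2010-12-31",
  spatial       = "POLYGON((-20 -10, 20 -10, 20 10, -20 10, -20 -10))"
)

# Call buildRdf() only when the function and its rrdf dependency are available.
if (exists("buildRdf") && requireNamespace("rrdf", quietly = TRUE)) {
  do.call(buildRdf, lineage)
}
```

Note that each entry of subjects triggers a remote SPARQL query through FAO2URIFromEcoscope, so the call needs network access to the Ecoscope endpoint.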