Legacy applications integration users feedback

From Gcube Wiki
Revision as of 12:30, 22 May 2014 by Julien.barde (Talk | contribs) (Analysis of needs for scientists)


In this document, we clarify our (IRD + FAO) current approach in iMarine regarding the use of a common WPS server with legacy applications. The following point of view can be considered as that of data managers working in the marine domain and related laboratories who would like to deploy processes on a processing server (WPS or not) and use the server as an opportunity to make their processes usable by additional people (scientists, statisticians...) as well as with different datasets. Indeed, once deployed, it takes additional work to make the original "raw processes" really usable by anybody, considering the various skills of potential users and the various structures of potential input datasets.

This feedback and these guidelines apply regardless of the server location, in or out of the iMarine infrastructure. Indeed, beyond the iMarine framework, we think that the issues we are currently facing and the solutions we are building remain relevant for other people sharing the same goal.

Main Goals

As legacy application (process) providers, our initial goals consist in using the iMarine WPS server to enable data processes to be:

  • 1) well described, in a standard way, so that users can easily discover and execute them, and even access the source code
  • 2) enriched to be as generic as possible, in order to spread their use to a wider user community, with other datasets and applications
  • 3) deployable autonomously either by data managers (IRD, FAO...) or developers, in abstraction of the process business logic, i.e. without having to understand the details of the source code, relying only on the process metadata/description (inputs/outputs) handled in WPS by the DescribeProcess document
  • 4) executable in a standard way, making their execution possible with any WPS client, thus ensuring their usability for target applications
  • 5) usable in an ergonomic way, ensuring friendliness to the end user (from input data selection to output recovery) through a user-friendly WPS client

Actors and Roles

The above objectives have to be clearly aligned with a definition of the actors/roles in the overall Legacy Application Integration business process:

  • users (scientists, statisticians)
    • provide the actual data process / business logic specific to their domain of expertise, and
    • work in interaction with data managers to ensure good practices (goals 1 and 2)
  • data managers
    • liaise with users (suggesting good practices, improving process genericity) to ensure goals 1 and 2 are fulfilled, and
    • be able to deploy the process in relative abstraction of the process business logic (goal 3), relying essentially on the process description (DescribeProcess)
    • act as mediator between users and developers to fulfill goal 5
  • iMarine developers
    • deploy the process in complete abstraction of the process business logic (goal 3), relying essentially on the process description (DescribeProcess)
    • in case data managers are able to proceed with the deployment, ensure goal 3 can be fulfilled by them (i) in abstraction of any knowledge of the core WPS components, without any need for data managers to intervene on core IT aspects, and (ii) ensuring that any core technology limitations are well identified and known by data managers
    • ensure goal 4 is fulfilled
    • implement requirements highlighted by users/data managers to fulfill goal 5
[Image: Roles]

Analysis of needs for scientists

Here is a summary of the main needs:

  • friendly tools for data processing (same GUIs whatever the programming language),
  • setting up a catalog of processes (through proper metadata),
  • setting up a search engine to enable process discovery,
  • enabling each process to become more generic than the original version created for a given data source and related datasets (i.e. specific data structures and data formats),
  • enabling each process to be executed online, through a WPS server, without having to deal with the underlying programming language.
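On the last point, a process published on a WPS server can be invoked with a plain HTTP request, independently of the underlying programming language. The sketch below builds a KVP Execute request in R; the server URL, process identifier and input names are hypothetical placeholders, not actual iMarine endpoints:

```r
# Hypothetical example of invoking a process through a WPS KVP Execute request.
# The server URL, process identifier and input names below are placeholders.
wps.base <- "http://wps.example.org/wps"
request <- paste0(wps.base,
                  "?service=WPS&version=1.0.0&request=Execute",
                  "&identifier=SpeciesDistribution",
                  "&datainputs=", URLencode("species=Thunnus albacares;year=2010"))
# The ExecuteResponse XML document could then be fetched with, e.g.:
# response <- readLines(url(request), warn = FALSE)
```

This is exactly what a friendly WPS client hides from the end user: the client builds such requests from the DescribeProcess document instead of asking the user to write them.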


Indeed, it is key to take into account the heterogeneity of users' skills in marine ecology labs, and we think that the iMarine WPS will better reach its goal if we solve the following blocking points:

  • metadata:
    • metadata are required for processes catalogs and discovery
    • we want to use the metadata to be able to re-execute an algorithm or indicator ("replicable science") with the same input dataset (when necessary, adding new criteria to filter the input dataset), and to allow some of the outputs to be used as inputs of other processes, or by applications like the Tabular Data Manager, x-Search or the fact sheet generator.
  • data formats: enabling the users to provide the input dataset of the process in a friendly way is crucial. This includes:
    • make processes also compatible with simple formats well known by end users (being able to work "as usual" with their favorite format is crucial for some of them): CSV, Excel, shapefiles... If not, there is a high risk that some people won't use them. As an example, the R COST package for DCF data mainly uses CSV files.
    • Complex formats like GML (and, more generally, having the knowledge to build WFS requests) or SDMX are too exclusive, so they should not be seen as "the protocols/formats to use" to the exclusion of other "simple" formats
      • Currently, reading such data formats is possible by using generic functions offered by data abstraction libraries such as GDAL/OGR (through rgdal in R), and by packages such as RFigisGeo and rsdmx, used in IRD and FAO processes.
      • Using complex formats (mostly associated with services like OGC WFS / SDMX web services) is fine only if "data browsing" can be facilitated for the user in WPS clients, to avoid users having to come with their WFS or SDMX URLs, etc. (which data managers would probably have to provide to them!) to execute a process (not user-friendly; users will not use it). This notion of data browsing is crucial and should be enabled by the iMarine WPS client, in close link with process metadata interpretation (DescribeProcess).
  • access protocols: in addition to data formats, if we want the WPS server to be used with simple formats, we think the user should be able to provide the dataset by:
    • uploading their dataset in any kind of usual data format (cf. previous item). Upload is simple; everybody understands it. We want to indicate that upload is possible by using (as an ad hoc convention) a specific data input parameter in the OGC WPS metadata. Upload can use basic HTTP/FTP protocols, which is a good option for everybody because 1) it is simple and 2) not all data sources are accessible through the sophisticated access protocols described below.
    • WFS is just another access protocol which is required for a SDI (like WCS, OPeNDAP...), but:
      • for now, it is unknown to most end users, who won't be able to deliver the dataset in this way,
      • moreover, not all research units have spatial data servers (MapServer, GeoServer...),
      • even if some of them have such tools, not all implementations of WFS enable the "outputFormat" option (GeoServer does) to be used with data formats like CSV, shapefiles, etc. Indeed, it seems that the only official OGC format for WFS is GML (others are used as extensions of the specification, see [1]) and, anyway, none of our colleagues work with the WFS access protocol, whatever the underlying data format,
    • SDMX for accessing statistical data
      • most end users do not have the knowledge to build their own SDMX URLs. As for WFS, there is a benefit only if "data browsing" is facilitated and the building of the data URL is hidden from the user
      • which institutions use SDMX data? We can cite international and European institutions that disseminate statistics. On the scientific side, laboratories do not use SDMX. This, however, reinforces the need to make processes as generic as possible, satisfying different user communities by allowing them to work with their favourite formats.
  • data structures: most users are interested in using the processes set up by their colleagues with other datasets. However, processes are usually created to fit specific needs (a specific data structure coming from a specific data source) and with a specific programming language (R, Matlab, Java, IDL...). Code needs to be enriched to become more generic (being on a WPS server is not enough) and thus be executable with similar data structures. Most users only know a single programming language and won't be able to modify the code of the original process. This can be fixed by embedding in the process a function enabling the mapping between data structures. This has been done in R for iMarine.
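The data structure mapping mentioned in the last point can be sketched as a small R helper that renames the columns of an arbitrary input dataset to the structure expected by the process. This is only a minimal sketch under assumed conventions; the function and column names are hypothetical, not those of the iMarine package:

```r
# Minimal sketch of a data structure mapping function (hypothetical names).
# 'mapping' associates each column name expected by the process with the
# column actually present in the user-supplied dataset.
mapDataStructure <- function(data, mapping) {
  missing.cols <- setdiff(unlist(mapping), names(data))
  if (length(missing.cols) > 0) {
    stop(paste("Columns not found in input dataset:",
               paste(missing.cols, collapse = ", ")))
  }
  out <- data[, unlist(mapping), drop = FALSE]
  names(out) <- names(mapping)
  return(out)
}

# Hypothetical usage: expose the user's 'lat'/'lon' columns under the names
# expected by the process:
# input  <- read.csv("catches.csv")
# mapped <- mapDataStructure(input, list(latitude = "lat", longitude = "lon"))
```

With such a helper, the user only provides the mapping as process input parameters; the original process code is left untouched.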

Enrichment of Legacy applications: challenges for processes providers (scientists and related IT teams)

Indeed, more than delivering as many processes as possible, the blocking point for a community aiming at sharing processes consists in setting up generic methods to solve the main challenges listed below:

  • challenges for developers / data managers:
    • to enrich the processes of their colleagues so that they can accept additional data formats:
      • Data input reading (including complex data formats like WFS & GML, and SDMX). This often requires a modification of the native data input reading function (used to create an R object / R data in our case). For WFS/GML, we fixed this by relying on a data format abstraction function using libraries like GDAL/OGR (see the readWFS function in RFigisGeo, jointly developed by IRD and FAO). For SDMX, FAO relies on the rsdmx R package.
      • Data output writing: similar functions can be added to deliver process outputs in multiple output formats.
    • to enable these processes to be executed on the WPS server with any dataset whose data structure complies with the one expected by the process, even if the datasets come from different data sources with different semantics (and not only datasets from the source the script was created for). This can be achieved by adding a data structure mapping function (e.g. to manage the semantics of the labels given to some columns or codelists, as well as data types). This makes it possible to manage the mapping between the dataset used as input (a priori unknown) and the data structure expected by the process (known). By doing so, we aim to enable, for example, FAO datasets to be used as inputs of IRD processes (tests with IRD Tuna Atlas indicators), and reciprocally (reallocation experiments with IRD tuna data)
    • Metadata: we suggest reusing existing WPS metadata elements and creating additional metadata elements by setting up dedicated metadata-writing functions (RDF using the Jena API in Java, or in R with the rrdf package, to enable the outputs of the iMarine WPS server to be delivered as Linked Open Data; OGC metadata with Geotoolkit/Apache SIS for INSPIRE...). We distinguish three kinds of metadata:
      • 1) about the process itself: this is achieved by writing a compliant OGC describeProcess.xml file corresponding to the legacy application's characteristics and potential enrichments (cf. above). This is key, as it will enable setting up a search engine for process discovery. Native OGC WPS metadata could be transformed into RDF to reuse existing search engines (GeoNetwork, X-Search...)
      • 2) about the WPS outputs: in this case the goal is to annotate the output files with the proper topics (species, fishing gears, ...) so that outputs can be made available in metadata catalogs and browsed through various search engines (e.g. GeoNetwork, xSearch),
      • 3) about the lineage: keep track of each execution of a process by writing metadata describing the input parameters, the process itself and the outputs (based on the WPS "execution report").
    • to manage interoperability issues to enable the WPS server to be used within other applications. This is facilitated by connecting the data sources (like databases managed with PostgreSQL/PostGIS, or netCDF-CF files) and the related views/subsets to spatial data servers (like MapServer or Constellation) implementing the proper standards for interoperability between SDIs (for both formats and access protocols: GML/WFS, netCDF/OPeNDAP). In this case, this has to be done by data managers and is specific to each data source and related spatial server.
    • use WPS processes for the pre-calculation of large sets of indicators (to generate the content of applications like websites, e.g. atlases, fact sheets...).
  • challenges for users:
    • find a set of relevant metadata elements for processes, required to describe and understand them, as well as to figure out how to properly use the discovered processes,
    • once their processes are on the WPS server, it is not obvious that users (colleagues and partners) will be able to execute them online. Indeed, as previously said, to facilitate the use of each process by everybody, there is a need to enable various data formats and access protocols, either simple or complex, to be used for the data inputs. Of course, once the data format abstraction and data structure mapping functions described above are in place, the WPS metadata and related input parameters have to be written accordingly.
    • being able to use the WPS server through friendly clients like the iMarine Statistical Manager, the Terradue Web Client ([2]), GIS desktop applications... Indeed, most users won't be able to write a proper WPS URL by themselves.
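The data format abstraction described in the first challenge above can be sketched as a single reading function that dispatches on the format of the source, so that the process itself always receives the same kind of R object. This is a hedged sketch, not the actual RFigisGeo implementation; the GML branch assumes the rgdal package (GDAL/OGR bindings), and the function name is hypothetical:

```r
# Minimal sketch of a data format abstraction function: whatever the input
# format (CSV or GML/WFS here), the process receives a single data object.
readInputData <- function(source, format = c("csv", "gml")) {
  format <- match.arg(format)
  if (format == "csv") {
    # Plain CSV file or URL, the format most end users are familiar with
    return(read.csv(source, stringsAsFactors = FALSE))
  }
  if (format == "gml") {
    # Requires the rgdal package; 'source' may be a local GML file or a
    # WFS GetFeature URL, both handled by the OGR drivers.
    if (!requireNamespace("rgdal", quietly = TRUE)) {
      stop("Missing rgdal package")
    }
    return(rgdal::readOGR(dsn = source,
                          layer = rgdal::ogrListLayers(source)[1]))
  }
}
```

A symmetric writing function (CSV, shapefile, GML...) would cover the "data output writing" challenge in the same spirit.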

Current blocking points with the WPS Server in the framework of iMarine

  • Even if we know what we want to do, we currently face a set of issues that slow down our work and prevent us from fulfilling the above objectives.
  • This section intends to list all the blocking points that we have to remain aware of, in order to:
  1. consider the current state of the WPS technologies available through iMarine and find pragmatic solutions to fulfill the above objectives, with the aim of satisfying the user community (short term)
  2. put the legacy applications in a continuous improvement process, inventorying the technology blocking points that should be tackled in the future (medium/long term)
  3. feed the sustainability plan and business model, and delineate the actual scope of the available technologies for a better and more rational promotion to other users/institutions

WPS general issues

  • the different WPS server implementations (52°North, GeoServer, Geotoolkit...) do not implement the OGC specifications (WPS, WFS...) in the same way. It can sometimes be difficult to distinguish what is strictly OGC-compliant from what is ad hoc / implementation-specific. This is very confusing for users providing their processes,
  • with the 52°North server and the related Java API: the current WPS server does not make it possible to write any kind of describeProcess.xml from R code: the Java methods should be enriched to better manage Literal or Complex Input/Output parameters and other WPS metadata elements,
  • Regarding the use of WFS as a WPS input data format: there is no debate about the interest of WFS for data interoperability; it is an obvious need for an infrastructure like iMarine. However, we want to manage both approaches:
    • WFS for machines or people who know about it. However, depending on the version, sharing WFS among existing implementations (GeoServer, MapServer, 52°North...) is still a big challenge,
    • usual protocols (HTTP or upload) for the others (the majority),

iMarine WPS server issues

  • adding processes to the WPS server, or maintaining existing ones, is currently too complicated:
    • When deployment is performed by developers
      • the administrators/developers of the iMarine WPS have to understand the R code to write a Java class enabling it on the infrastructure:
      • this approach is not sustainable, taking into account that processes may emanate from different scientific domains and expertises
      • through the legacy application integration, the integrity of the process description is not preserved while it must be, i.e. (1) once deployed on the WPS server, the DescribeProcess document differs from the original one; (2) the DescribeProcess of the same process differs depending on whether it is deployed on a normal WPS server or a WPS-Hadoop server, while it should be the same from a user perspective. Clarifications should be given on this, reporting here any limits/constraints (limitations inherited from the 52°North WPS? WPS-Hadoop-specific constraints?) and investigation tracks for the future.
    • Considering that deployment may be performed by / delegated to data managers:
      • the current approach is too heavy (e.g. the need for an additional Java class, and for access to Terradue servers) and not sustainable, especially because very few organizations will have the required skills to take over the deployment. Hence the need to move towards facilitating the deployment, keeping in mind the final aim of delegating the deployment of processes to data managers (direct interaction with users in their laboratory/unit, responsiveness and prompt support to users, process maintenance/change tracking). To achieve this, ongoing and future developments should be aligned with the goal of a process "uploader" (taking as example material the 52°North WPS R process uploader)
      • a status update is needed on the porting of the WPS components from Terradue to CNR, and on the potential impacts for data managers on their path to deploying processes on their own (data managers who have already spent time understanding how to access the servers for autonomous deployment testing)
  • Generally, we are convinced of the interest of having a WPS server in an infrastructure like iMarine, but the worst scenario would be to spend a lot of time collecting and deploying the processes of our partners and, in addition, to have to tell them that they can only use them through WFS and the GML format, but not in the way they usually use them.
  • At this stage, writing and updating describeProcess.xml should also be facilitated.

WPS Client issues

  • As highlighted in goal 4, there is a need to ensure that the iMarine WPS can be exploited by any WPS client that data managers/institutions may want to use, especially in connection with other applications or process chains (e.g. using the Geotoolkit Java WPS client)
  • In order to achieve goal 5, there is a need to clarify which WPS client application is considered the reference for iMarine, where efforts should be oriented (as far as we know, two clients are currently in use: the Statistical Manager and the Terradue WPS client).
  • This last clarification is all the more important in that it should drive the real WPS testing by users and data managers, the collection of requirements from them and, finally, the requested implementations/enhancements by developers. Beyond the business process of legacy application integration (stricto sensu), there is a need to boost the testing of client applications, and the interaction between users/data managers and developers, to improve the user-friendliness of the iMarine WPS client application by enabling widgets for data input selection and process output recovery.

Sustainability aspects

  • who is in charge of deploying the process ?
    • Scientists can't...
    • Data managers should be able to...
      • Close interaction with end users (scientists, statisticians)
      • interaction that facilitates improving process quality between data managers and users (e.g. improving genericity) without entering into the details of the source code, which remains under the expertise of the end user (scientist, statistician)
      • better responsiveness, prompt support to users, and maintenance/change tracking of deployed processes
    • Developers have to (at least when data managers could not do it, which is the current situation)
      • need for prompt support to data managers (hence to users)
      • need for complete abstraction from the process business logic (which may be clearly outside the expertise of developers)
  • Underlying questions:
    • in case developers are the deployers: how long should it take to deploy a process when a proper "describeProcess.xml" is provided?
    • what are the gaps to move towards further delegation to data managers for process deployment ?
    • infrastructure considerations: who is going to manage/maintain the WPS server? (related to the discussions to move WPS from T2 to CNR)
    • "Bodies" that could help with a continuous improvement process:
      • a small working group of data managers (exchanges on issues, good practices, synergies, common data processes) ~ a group reflecting the current common work undertaken by IRD and FAO
      • small working group of developers (technology issues, improvements, technology survey, connection to OGC)

Summary of the guidelines and related schema (current implementation in R package)

To summarize, according to our discussions and beyond purely technical aspects, the current challenge is to produce a set of generic methods (by delivering a set of R functions) facilitating both the deployment of existing processes on a WPS server (using IRD or FAO examples) and their use by a wider community (machines and people):

  • abstraction of data formats and access protocols: data reading function & data writing function,
  • abstraction of the data structure: data structure mapping function. There is an example as well in the Statistical Manager with the "Geo Processing" / "Occurrence Enrichment" process, where a set of input parameters enables the data structure mapping (e.g. "LongitudeColumn", "LatitudeColumn", "ScientificNameColumn", "TimeColumn"...)
  • Lineage (for replicable science): Metadata writing function which describes:
    • the execution of the process,
    • the output files with relevant metadata (RDF, OGC...) with proper tags (species, fishing gears...).


The IRD R package with the related functions is made available online [3]. For now, this package focuses on fisheries activity indicators (with datasets related to the Tuna Atlas in the case of IRD). The same approach will be followed with netCDF-CF data sources to enrich biological observations with environmental parameters through an R process.

The following schema illustrates which functions we suggest adding to enrich each process on a WPS server. [Image: Structure code R WPS.png]

Example of R functions

  • Metadata writing function:
    • SPARQL Query:

   FAO2URIFromEcoscope <- function(FAOId) {
 if (! require(rrdf)) {
   stop("Missing rrdf library")
 }
 
 if (missing(FAOId) || is.na(FAOId) || nchar(FAOId) == 0) {
   stop("Missing FAOId parameter")
 }
 
 # Query the Ecoscope SPARQL endpoint for the resource matching the FAO identifier
 sparqlResult <- sparql.remote("http://ecoscopebc.mpl.ird.fr/joseki/ecoscope",
                               paste("PREFIX ecosystems_def: <http://www.ecoscope.org/ontologies/ecosystems_def/> ",
                                     "SELECT * WHERE { ?uri ecosystems_def:faoId '", FAOId, "'}", sep=""))
 if (length(sparqlResult) > 0) {
   return(as.character(sparqlResult[1, "uri"]))
 }
 
 return(NA)
   }
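For illustration, the function above could be called with an FAO 3-alpha species code. The snippet below is a hypothetical usage example; it requires network access to the Ecoscope SPARQL endpoint, so the result is not guaranteed:

```r
# Hypothetical usage (requires network access to the Ecoscope endpoint);
# "YFT" is the FAO 3-alpha code for yellowfin tuna.
uri <- FAO2URIFromEcoscope("YFT")
if (!is.na(uri)) {
  print(uri)
}
```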

    • Writing RDF statements:

   buildRdf <- function(rdf_file_path, rdf_subject, titles=c(), descriptions=c(), subjects=c(), processes=c(), data_output_identifier=c(), start=NA, end=NA, spatial=NA) {
 #data_input=c(),
 if (! require(rrdf)) {
   stop("Missing rrdf library")
 }
 
 store = new.rdf(ontology=FALSE)
 
 add.prefix(store,
            prefix="resources_def",
            namespace="http://www.ecoscope.org/ontologies/resources_def/")
 
 add.prefix(store,
            prefix="ical",
            namespace="http://www.w3.org/2002/12/cal/ical/")
 
 add.prefix(store,
            prefix="dct",
            namespace="http://purl.org/dc/terms/")
 
 #type
 add.triple(store,
            subject=rdf_subject,
            predicate="http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
            object="http://www.ecoscope.org/ontologies/resources_def/indicator")
 #process
 add.triple(store,
            subject=rdf_subject,
            predicate="http://www.ecoscope.org/ontologies/resources_def/usesProcess",
            object=processes)
 
 #has_data_input
 #add.data.triple(store,
 #            subject=rdf_subject,
 #           predicate="http://www.ecoscope.org/ontologies/resources_def/has_data_input",
 #           data=data_input)
 
 #data output identifier
 add.data.triple(store,
                 subject=rdf_subject,
                 predicate="http://purl.org/dc/elements/1.1/identifier",
                 data=data_output_identifier)
 
 
 #title
 for (title.current in titles) {
   if (length(title.current) == 2) {
     #here we know the language attribute
     add.data.triple(store,
                     subject=rdf_subject,
                     predicate="http://purl.org/dc/elements/1.1/title",
                     lang=title.current[1],
                     data=title.current[2])
   } else {
     #here we don't know the language attribute
     add.data.triple(store,
                     subject=rdf_subject,
                     predicate="http://purl.org/dc/elements/1.1/title",
                     data=title.current)
   }
 }
 #description
 for (description.current in descriptions) {
   add.data.triple(store,
                   subject=rdf_subject,
                   predicate="http://purl.org/dc/elements/1.1/description",
                   data=description.current)
 }
 
 if (! is.na(start)) {
   add.data.triple(store,
                   subject=rdf_subject,
                   predicate="http://www.w3.org/2002/12/cal/ical/dtstart",
                   data=start)
 }
 
 if (! is.na(end)) {
   add.data.triple(store,
                   subject=rdf_subject,
                   predicate="http://www.w3.org/2002/12/cal/ical/dtend",
                   data=end)
 }
 
 for (subject.current in subjects) {
   URI <- FAO2URIFromEcoscope(subject.current)
   if (! is.na(URI)) {
     add.triple(store,
                subject=rdf_subject,
                predicate="http://purl.org/dc/elements/1.1/subject",
                object=URI)
   } else {
     add.data.triple(store,
                     subject=rdf_subject,
                     predicate="http://purl.org/dc/elements/1.1/subject",
                     data=subject.current)
   }
 }
 
 
 
 if (! is.na(spatial)) {
   add.data.triple(store,
                   subject=rdf_subject,
                   predicate="http://purl.org/dc/terms/spatial",
                   data=spatial)
 }
 
 save.rdf(store=store, filename=rdf_file_path)
 }
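A usage example tying the two functions together may help: buildRdf writes one execution report as RDF, resolving FAO species codes to URIs through FAO2URIFromEcoscope. All values below (file path, subject URI, process URI, dates) are purely illustrative, and the call requires network access for the subject resolution:

```r
# Hypothetical call: record the lineage of one process execution as RDF.
# Every value below is illustrative, not an actual iMarine identifier.
buildRdf(
  rdf_file_path = "/tmp/execution_report.rdf",
  rdf_subject = "http://www.ecoscope.org/indicators/example_run",
  titles = c("Example indicator run"),
  descriptions = c("Catches by species, 2000-2010"),
  subjects = c("YFT"),   # FAO code, resolved via FAO2URIFromEcoscope
  processes = "http://www.ecoscope.org/processes/example_process",
  data_output_identifier = "example_output.csv",
  start = "2000-01-01",
  end = "2010-12-31"
)
```

The resulting RDF file could then be harvested by catalogs (GeoNetwork, xSearch) to make the execution discoverable, as described in the lineage metadata section above.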