Archive Import Service

== Aim and Scope of Component ==

The Archive Import Service (AIS) is dedicated to the '''"batch" import of resources''' that are external to the gCube infrastructure into the infrastructure itself. The term ''"importing"'' refers to the description of such resources and of their logical relationships (e.g. the association between an image and a file containing metadata referring to it) inside the [[GCube_Information_Organisation_Services|Information Organization stack of services]]. While the AIS is not strictly necessary for the creation and management of collections of resources in gCube, in practice it makes the creation of large collections, and their automated maintenance, feasible.

== Logical Architecture ==

The task of importing a set of external resources is performed by building a description of the resources to import, in the form of a graph labelled on nodes and arcs with key-value pairs, called a '''''graph of resources (GoR)''''', and then processing this description, accomplishing all the steps needed to import the resources represented therein.

The GoR is based on a custom data model, similar to the Information Object Model, and is built by following a procedural description expressed in a scripting language, called the '''Archive Import Service Language (AISL)'''. Full details about this language and about how to write an import script are given in the [[Content_Import|Administrator's guide]]. The actual import task is performed by a chain of pluggable software modules, called "importers", which in turn process the graph and can manipulate it, annotating it with additional knowledge produced during the import. Each importer is dedicated to importing resources by interacting with a specific subpart of the Information Organization set of services. For instance, the ''metadata importer'' is responsible for the import of metadata, which it handles by interacting with the Metadata Management Service. The precise way in which importers handle the import task, and in particular how they define the specific description of the resources they need to consume inside a GoR, is left to the single importers. This, together with the pluggable nature of importers, makes it possible to support the import of new kinds of content resources that might be defined in the infrastructure in the future.

The logical architecture of the AIS can then be depicted as follows:

[[Image:AISDiagram.jpeg]]

=== Import state and incremental import ===

During the import of some resources, the corresponding GoR is kept updated with information regarding the actual resources created, such as their OIDs. The graph of resources is stored persistently by the service, so that a subsequent execution of the same import script is aware of the status of the import and can perform only the differential operations needed to keep the status of the resources up to date. While this solution involves a partial duplication of information inside the infrastructure, it has been chosen because it introduces a complete decoupling between the AIS and the other gCube services, which are thus not forced to expose in their interfaces the additional information needed for incremental import.

== Service Implementation ==

'''Note:''' this section describes the architecture of the AIS as a stateful gCube service. The component is, however, currently released as a standalone tool; please see the section on current limitations and known issues below for further details.

The AIS is a '''''stateful gCube service''''' that follows the factory approach. A stateless factory porttype, the '''<tt>ArchiveImportService</tt> porttype''', allows clients to create stateful instances of the '''<tt>ArchiveImportTask</tt> porttype'''. Each import task is responsible for the execution of a single import script: it performs the import, maintains internally the status of the import (in the form of an annotated graph of resources), and provides notifications about the status of the task. The resources it is responsible for are kept up to date by re-executing the import script at suitable time intervals. Besides acting as a factory for import tasks, the ArchiveImportService porttype also offers additional functionality related to the management of import scripts. Import scripts are generic resources inside the infrastructure; the porttype allows clients to publish new scripts, to list and edit existing scripts, and to validate them from a syntactic point of view. The semantics of the methods offered by the two porttypes are described in the following.

=== Archive Import Service Porttype ===

* '''<tt>String[] list()</tt>''': returns the identifiers of the import scripts currently available in the VRE (as generic resources);
* '''<tt>Script load(String scriptId)</tt>''': returns the import script corresponding to the given identifier;
* '''<tt>void save(Script script)</tt>''': saves an import script, by creating or updating a corresponding generic resource;
* '''<tt>void delete(String scriptId)</tt>''': deletes the import script corresponding to the given identifier;
* '''<tt>ValidationReport validate(String scriptSpecification)</tt>''': validates the given input string, which is treated as an AISL script and undergoes parsing and other syntactic validation steps;
* '''<tt>EndpointReferenceType getTask(String importScriptIdentifier)</tt>''': returns a reference to an instance of the ArchiveImportTask service dedicated to the given script.

The complex types '''<tt>Script</tt>''' and '''<tt>ValidationReport</tt>''' used in the methods above represent, respectively, an AISL script (characterized by a script identifier, a description and a content, i.e. the script itself) and a syntactic validation report (containing a boolean validity flag, a message and further details such as the error row and column).

=== Archive Import Task Porttype ===

* '''<tt>void start(ImportOptions options)</tt>''': starts the import task with the given options. Options include a run mode (validate, build, simulation import, import);
* '''<tt>void stop()</tt>''': stops the import task;
* '''<tt>TaskStatus getStatus()</tt>''': returns detailed information about the progress of an import task, including information on errors that occurred during its execution. The complex type TaskStatus reports, among other things, the execution state (new, running, stopped, failed), the execution phase (parsing, building, importing), possibly the import phase (document, metadata, ...), and the number and types of graph objects currently created, imported and failed; for the failed ones, detailed error information is also available.
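
As an illustration of the intended interaction pattern, the following sketch strings the two porttypes together: validate and publish a script through the factory, obtain the task dedicated to it, start it, and poll its status. <tt>AISFactoryStub</tt>, <tt>AISTaskStub</tt> and their lookup methods are hypothetical stand-ins for the generated service stubs; only the operation names and their semantics come from the porttype descriptions above.

<pre>
// Hypothetical client-side sketch: AISFactoryStub/AISTaskStub stand in for the
// generated stubs and simply mirror the porttype operations described above.
public class AISClientSketch {
    public static void main(String[] args) throws Exception {
        AISFactoryStub factory = AISFactoryStub.locate();                // assumed stub lookup

        // Syntactic validation of the AISL script before publishing it.
        String aisl = new String(java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get(args[0])), "UTF-8");
        ValidationReport report = factory.validate(aisl);
        if (!report.isValid()) {
            System.err.println("Invalid script: " + report.getMessage());
            return;
        }

        // Publish the script as a generic resource and obtain its dedicated task.
        factory.save(new Script("my-import", "example import", aisl));
        AISTaskStub task = AISTaskStub.at(factory.getTask("my-import")); // resolve the EPR (assumed)

        // Start the import and poll its status until it leaves the running state.
        task.start(new ImportOptions(ImportOptions.Mode.IMPORT));
        TaskStatus status;
        do {
            Thread.sleep(5000);
            status = task.getStatus();
            System.out.println(status.getPhase() + ": " + status.getImportedCount()
                    + " imported, " + status.getFailedCount() + " failed");
        } while (status.isRunning());
    }
}
</pre>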
  

=== Import Task Status Notification ===

The ArchiveImportTask service maintains part of its state as a WS-resource. The properties of this resource are used to notify interested services about:

* the current phase of the import (e.g. graph building, importing, etc.);
* the current number of objects created in the GoR;
* the current number of objects imported successfully;
* the current number of objects whose import failed.

=== Implementation Details ===

The following UML diagram describes concisely how the AIS is organized from an implementation point of view. The service itself depends on a library whose packages are each dedicated to a specific functionality.

[[Image:AISUML.jpg]]

In particular:

* '''ais''' contains the classes related to the main functionality of the service, such as defining the logical flow of a task execution;
* '''language''' contains the classes related to parsing and executing AISL scripts;
* '''datamodel''' contains the classes used in the description of graphs of resources;
* '''importing''' contains the classes used for the actual import, e.g. the so-called importers;
* '''remotefile''' contains all the classes used to access remote resources;
* '''util''' collects several different utility classes, such as those used to manage the persistence of the GoR, caching, and the manipulation of XML and HTML.

Each of these packages is further divided into multiple subpackages dedicated to specific functionality; these are not shown here to avoid visual clutter.
<property name="collectionName" expr="'TestMetadataCollection'"/>
+
<property name="collectionDescription" expr="'A test collection'"/>
+
[...]
+
<property name="relatedContentCollection" expr="$contentCollection"/>
+
</createCollection>
+
  
<define name=”file” expr=”first(file(http:some.web.site/docs/metadata.xml))”/>
+
== Extensibility Features ==
<foreach name="a" in="xpath(dom($file),'//record')">
+
The functionality of the Archive Import Service can be extended in three main ways. It is possible to define new functions for the AISL, and to plug in software modules to interact with external resources through additional network protocols and to interact with new gCube content-related components.
<createObject type="content" name="contentObject" collections="contentCollection">
+
<property name="documentName" expr="'testDocument'"/>
+
<property name="content"
+
            expr="file(string(first(xpath:($a,'//uri/text()'))))"/>
+
<property name="hasMaterializedContent" expr="false"/>
+
<property name="isVirtualImport" expr="false"/>
+
</createObject>
+
  
 +
=== Defining Functions for the AISL ===
 +
AISL is a scripting language intended to create graphs of resources for subsequent import. Its type system and grammar, together with usage examples, are fully described in the [[Content_Import|Administrator's guide]]. Its main features are a tight integration with the AIS, in the sense that the creation of model objects are first citizens in the language, and the ability to treat in a way as much as possible transparent to the user some tasks which are frequent during import, like accessing files at remote locations. Beside avoiding the complexity of full fledged programming languages like Java, its limited expressivity - tailored to import tasks only - prevents security issues related to the execution of code in remote systems. The language does not allow the definition of new functions/types of objects inside a program, but can be easily extended with new functionality by defining new functions as plugin modules. Adding a new function amounts to two steps:
 +
# Creating a new java class implementing the AISLFunction class. This is more easily done by subclassing the <tt>AbstractAISLFunction</tt> class. See below for further details.
 +
# Registering the function in the "Functions" class. This step will be removed in later released, which will implement automatic plugin-like registration
  
<createObject type="metadata" name="metadataObject"
+
The AISLFunction interface provides a way to specify a number of signatures (number and type of arguments and return type) for an AISL function. It is design to allow for overloaded functions. The number and types of the parameters are used to perform a number of static checks on the invocation of the function. The method <tt>evaluate</tt> provides the main code to evaluated during an invocation of the function in an AISL script. In the case of overloaded functions, this method should redirect to appropriate methods based on the number and types of the arguments.
collections="metadataCollection">  
+
<pre>
<property name="content" expr="string($a)"/>
+
public interface AISLFunction {
[...]
+
public String getName();
</createObject>
+
<createRelationship type="metadata" from="metadataObject"
+
public void setFunctionDefinitions(FunctionDefinition ... defs);
to="contentObject" name="rel"/>
+
public FunctionDefinition[] getFunctionDefinitions();
</foreach>  
+
 
</program>
+
public  Object evaluate(Object[] args) throws Exception;
 +
 +
public interface FunctionDefinition{
 +
Class<?>[] getArgTypes();
 +
Class<?> getReturnType();
 +
}
 +
 +
}
 
</pre>
 
</pre>
 +
A partial implementation of the <tt>AISLFunction</tt> interface is provided by the <tt>AbstractAISLFunction</tt> class. A developer can simply extend this class and then provide an appropriate constructor and implement the appropriate <tt>evaluate</tt> method. An example is given below. The function <tt>match</tt> returns a boolean value according to the match of a string value with a given regular expression pattern. Its signature is thus:
  

<pre>
boolean match(string str, string pattern)
</pre>

The class <tt>Match</tt> below is an implementation of this function. In the constructor, the function declares its name and its parameters. The method <tt>evaluate(Object[] args)</tt>, which must be implemented to comply with the <tt>AISLFunction</tt> interface, performs some casting of the parameters and then redirects the evaluation to another evaluate method (as the function is not overloaded, there is no actual need for a separate evaluate method in this case; it has been added here for clarity).

<pre>
public class Match extends AbstractAISLFunction{

    public Match(){
        setName("match");
        setFunctionDefinitions(
            new FunctionDefinitionImpl(Boolean.class, String.class, String.class)
        );
    }

    public Object evaluate(Object[] args) throws Exception{
        return evaluate((String)args[0], (String)args[1]);
    }

    private Boolean evaluate(String str, String pattern){
        return str.matches(pattern);
    }

}
</pre>

=== Writing RemoteFile Adapters ===

When writing AISL scripts, the details of the interaction with remote resources available on the network are hidden from the user and encapsulated in the facilities related to a native data type of the language, the file type. The intention is to shield the user (almost) completely from such details, and to present resources available through heterogeneous protocols via a homogeneous access mechanism.

A network resource is made available as a file by invoking the <tt>getFile()</tt> function of the language. The function takes as argument a ''locator'', which is a string (plus, optionally, some parameters needed for authentication), and resolves, based on the form of the locator, which protocol to use and how to access the resource. To allow for extensibility, the format of the locator is not fixed in advance, but depends on the specific remote file type, which must be able to recognize such a format (see below). To avoid excessive resource consumption, remote resources are not downloaded straight away: a file object acts as a placeholder, and the content is made available on demand. Other properties of the resource, such as its length, last modification date or hash signature, are instead gathered (and possibly cached so as to limit network usage). Of course, the availability of this information depends on the capabilities offered by the network protocol at hand. Once downloaded, content is also cached.

In order to '''''make a new protocol available to the AIS''''', it is sufficient to implement the <tt>RemoteFile</tt> interface and register the class with the <tt>RemoteFileFactory</tt> class. The class implementing the type must be able to "recognize" the format of the locators it can deal with. More in detail, the class must have a single-argument constructor taking a locator as parameter, and the constructor must throw an exception if the format of the locator is not recognized. The <tt>getFile()</tt> function passes the locator to the <tt>RemoteFileFactory</tt>, which tries to instantiate a remote file of the most specific type by trying all the registered remote file types (''late type binding'').

Some network protocols allow for a hierarchical, directory-style structuring of resources; this basically means that it is possible, from a given resource, to get the list of its children. For hierarchical resources, it is possible to implement the <tt>HierarchicalRemoteFile</tt> interface. If basic caching capabilities are acceptable, it is possible (but not mandatory) to extend instead the <tt>AbstractRemoteFile</tt> and <tt>AbstractHierarchicalRemoteFile</tt> classes, which already provide a standard implementation of a number of the methods defined by the corresponding interfaces.
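
As an illustration, the sketch below outlines a remote file type for a hypothetical <tt>myproto://</tt> locator scheme. Only the single-argument constructor contract and the registration with <tt>RemoteFileFactory</tt> are taken from the description above; the <tt>openStream()</tt> method, the <tt>register()</tt> call and the protocol itself are assumptions made for the sake of the example.

<pre>
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

// Sketch of a custom remote file type for a fictional "myproto://" scheme.
// AbstractRemoteFile and RemoteFileFactory are the framework classes named in
// this section; openStream() and register() are assumed, not documented, APIs.
public class MyProtoRemoteFile extends AbstractRemoteFile {

    private final URL target;

    // The constructor must recognize the locator format and reject anything else.
    public MyProtoRemoteFile(String locator) throws MalformedURLException {
        if (locator == null || !locator.startsWith("myproto://"))
            throw new MalformedURLException("Not a myproto locator: " + locator);
        // For the sake of the example, myproto simply tunnels over plain HTTP.
        this.target = new URL("http://" + locator.substring("myproto://".length()));
    }

    // Assumed content-access method: the content is fetched only when requested,
    // in line with the on-demand download policy described above.
    public InputStream openStream() throws IOException {
        return target.openStream();
    }
}

// Registration with the factory (method name assumed), after which getFile()
// can resolve myproto locators through late type binding:
// RemoteFileFactory.register(MyProtoRemoteFile.class);
</pre>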
  

=== Writing Importers ===

Importers are software modules that process the graph of resources and decide about import actions, interfacing with some gCube component for content management, such as the [[Collection_Management|Collection Management Service]], the [[Content_Management|Content Management Service]] and the [[Metadata_Management|Metadata Management Service]]. Each importer is responsible for treating a specific kind of resource (e.g. metadata), and is essentially the bridge between the AIS and the services of the Information Organization stack responsible for managing that kind of resource. The precise way in which an importer performs the import thus depends on the specific subsystem it interacts with. Similarly, different importers need to obtain different information about the resources to import. For instance, to import a document it is necessary to have its content, or a URL at which the content can be accessed; to create a metadata collection, it is necessary to specify properties such as the location of the schema of the objects contained in the collection.

The Archive Import Service already includes importers dedicated to the creation of content and metadata collections and to the creation of complex documents and metadata objects. Thus, creating a new importer is only needed if a new kind of content model is defined over the InfoObjectModel (see [[Storage_Management|Storage Management]]) and facilities for its manipulation are offered by some new gCube component.

==== Defining Importer-Specific Types ====

Writing a new importer requires knowing how to interact with such a component, and how to manipulate a graph of resources. The data model handled by AISL features three main types of constructs:

* Resource
* Relationship
* Collection

A graph of resources is a graph composed of nodes (resources) and edges (relationships). Furthermore, nodes (resources) can be organized into sets (collections), which can in turn be connected using relationships. All the constructs of the model can be annotated with properties, which are name/value pairs. The constructs above correspond internally to the three classes <tt>Resource</tt>, <tt>ResourceCollection</tt> and <tt>ResourceRelationship</tt>. In order to constrain the kind of properties that the model objects it manipulates must have, an importer must define a set of subtypes of the model object types. This can be done by subclassing the above-mentioned classes. Which subtypes to implement, and the precise semantics of their properties, depend on the specific importer. For instance, the MetadataCollection importer declares only one new type, the collection::metadata type, which specializes the collection type to allow for specific metadata-collection-related properties. Notice that importers can also manipulate objects belonging to subclasses defined by other importers: for instance, the MetadataCollection importer needs to access properties of the ContentCollection subtype, defined by the ContentCollection importer, in order to be able to create metadata collections.

In order to define their own subtypes, importers must:

# Subclass the basic types as needed;
# Register the classes in the <tt>GraphOfResources</tt> class; this automatically extends the language with the new types;
# Publish the properties allowed for the new subtypes.

Regarding the last point, notice that the types defined by an importer, and their properties, must be publicly documented, as AISL script developers must know which properties are available and what their semantics are. Furthermore, subsequent importers in the chain may also need to access some properties: for example, an importer for metadata also needs to access the model objects representing content, in order to get their object ids (internal identifiers). Notice that, in general, an AISL script will not necessarily assign all the properties defined by a subtype. Some of these properties may be conditionally needed, while some will only be written by an importer at import time; for example, when a new content object is imported, the importer must record into the GoR object the OID of the newly created object. For this reason, the specification of the subtypes defined by an importer must also state which properties are mandatory (i.e. they must be assigned during the creation of a GoR) and which properties are private (i.e. they should NOT be assigned during the creation of the GoR).
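
To make the subtyping step concrete, here is a minimal sketch of a hypothetical importer-specific resource type. The class name, the property names and the <tt>setProperty</tt>/<tt>getProperty</tt> accessors are illustrative assumptions; only the practice of subclassing the model classes and exposing mandatory and private properties through getters and setters is taken from the description above.

<pre>
// Hypothetical subtype of the Resource model class for a fictional "timeseries"
// content kind. The setProperty/getProperty accessors are assumptions standing in
// for however the AIS data model actually stores its name/value pairs.
public class TimeSeriesResource extends Resource {

    // Mandatory property: must be assigned by the AISL script at GoR-creation time.
    public void setSamplingRate(Double rate) {
        setProperty("samplingRate", rate);
    }

    public Double getSamplingRate() {
        return (Double) getProperty("samplingRate");
    }

    // Private property: written by the importer once the object has been created
    // remotely, and read by later (incremental) runs and by subsequent importers.
    public void setOID(String oid) {
        setProperty("OID", oid);
    }

    public String getOID() {
        return (String) getProperty("OID");
    }
}

// Registration with the data model (method name assumed), which also extends
// AISL with the new type:
// GraphOfResources.registerType("resource::timeseries", TimeSeriesResource.class);
</pre>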

==== Defining Importer Logics ====

The actual logic of the import for a new importer is contained in a class that must simply implement the <tt>Importer</tt> interface, which is as follows:

<pre>
public interface Importer{

    public String getName();

    public void importRepresentationGraph(GraphOfResources graph) throws RemoteException, ExecutionInterruptedException;

}
</pre>

The first method must provide a human-readable name for the importer (used for logging and status notification purposes). The second method is passed, during operation, a GraphOfResources object, and must contain the logic needed to traverse the objects in the graph, select the ones of interest and perform the actual import tasks.
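
A skeleton importer following that contract might look like the sketch below. The <tt>Importer</tt> interface is the one shown above; the graph traversal (the <tt>getResources()</tt> iteration) and the property accessors are assumptions about the <tt>GraphOfResources</tt> API, used here only to illustrate the select-then-import-then-annotate pattern.

<pre>
// Sketch of an importer for the hypothetical "timeseries" type sketched earlier.
// Only the Importer interface comes from the documentation; the graph traversal
// is assumed, and the remote call is a placeholder for the real service interaction.
public class TimeSeriesImporter implements Importer {

    public String getName() {
        return "timeseries-importer";   // human-readable name for logs and status
    }

    public void importRepresentationGraph(GraphOfResources graph)
            throws RemoteException, ExecutionInterruptedException {
        for (Resource resource : graph.getResources()) {            // assumed accessor
            // Select only the resources this importer is responsible for.
            if (!(resource instanceof TimeSeriesResource)) continue;
            TimeSeriesResource ts = (TimeSeriesResource) resource;

            // Skip objects already imported by a previous run (incremental import).
            if (ts.getOID() != null) continue;

            // Placeholder for the interaction with the target content-related service.
            String oid = createRemoteObject(ts);

            // Annotate the GoR so that later runs and later importers can use the OID.
            ts.setOID(oid);
        }
    }

    private String createRemoteObject(TimeSeriesResource ts) throws RemoteException {
        // The actual call to the gCube service managing this kind of content goes here.
        return "oid-" + System.nanoTime();
    }
}
</pre>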
  

== Current Limitations and Known Issues ==

The AIS is currently released as a '''''standalone client'''''. The class <tt>org.gcube.contentmanagement.contentlayer.archiveimportservice.impl.AISLClient</tt> contains a client that performs the steps needed for the import: parsing and execution of the script, generation of the graph of resources, and import of the graph of resources. It accepts one or two arguments. The first one is the location (on the local file system) of a file containing an AISL script. The second argument is a boolean value: if it is set to true, the client creates the graph of resources but does not start the import, which eases debugging.

After the graph of resources is created, the client generates a dump of the graph in a file named resourcegraph.dump. The graph is serialized in an XML-like format. This is only for visualization and debugging purposes, and the format is not currently guaranteed to be valid (or even well-formed) XML.
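
For instance, assuming the client is launched as a plain Java main class with the AIS library on the classpath (the jar names, paths and launch command below are indicative only), a dry run that builds and dumps the graph without importing could look like:

<pre>
# Hypothetical invocation: jar names and script path are placeholders.
java -cp ais-client.jar:lib/* \
  org.gcube.contentmanagement.contentlayer.archiveimportservice.impl.AISLClient \
  /home/user/scripts/myimport.aisl true
</pre>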
Latest revision as of 12:41, 18 June 2009

Aim and Scope of Component

The Archive Import Service(AIS) is dedicated to "batch" import of resources which are external to the gCube infrastructure into the infrastructure itself. The term "importing" refers to the description of such resources and their logical relationships (e.g. the association between an image and a file containing metadata referring to it) inside the Information Organization stack of services. While the AIS is not strictly necessary for the creation and management of collections of resources in gCube, it makes possible in practice the creation of large collections, and their automated maintenance.

Logical Architecture

The task of importing a set of external resources is performed by building a description of the resources to import, in the form of a graph labelled on nodes and arcs with key-value pairs, called a graph of resources (GoR), and then processing this description accomplishing all steps needed to import the resources represented therein. The GoR is based on a custom data model, similar to the Information Object Model, and is built following a procedural description of how to build it expressed in a scripting language, called the Archive Import Service Language (AISL). Full details about this language and how to write an import script are given in the Administrator's guide. The actual import task is performed by a chain of pluggable software modules, called "importers", that in turn process the graph and can manipulate it, annotating it with additional knowledge produced during the import. Each importer is dedicated to import resources interacting with a specific subpart of the Information Organization set of Services. For instance, the metadata importer is responsible for the import of metadata, and it handles their import by interacting with the Metadata Management Service. The precise way in which importers handle the import task, and in particular how they define a specific description of the resources they need to consume inside a GoR is left to the single importers. This, and the pluggable nature of importers, makes possible to enable the import of new kind of content resources which might be defined in the infrastructure in the future.

The logical architecture of the AIS can be then depicted as follows:

AISDiagram.jpeg

Import state and incremental import

During the import of some resources, the corresponding GoR is kept updated with information regarding the actual resources created, such as their OIDs. The Graph of Resources is stored persistently by the service, so that a subsequent execution of the same import script is aware of the status of the import and can perform only the differential operations needed to maintain the status of the resources up-to-date. While this solution involves a partial duplication of information inside the infrastructure, it has been chosen because it introduces a complete decoupling between the AIS and other gCube services, which are thus not forced to offer additional information needed for incremental import in their interfaces.

Service Implementation

Note: this section refers to the Architecture of the AIS as a stateful gCube service. The status of the compoenent is however currently that of a standalone tool. Please see the section current limitations and known issues for further details.

The AIS is a stateful gCube service, following the factory approach. A stateless factory porttype, the ArchiveImportService porttype, allows to create stateful instances of the ArchiveImportTask porttype. Each import task is responsible for the execution of a single import script. It performs the import, maintains internally the status of the import (under the form of an annotated graph of resources), and provides notification about the status of the task. The resources it is responsible for are kept up to date by re-executing the import script at suitable time intervals. Beside acting as a factory for import tasks, the ArchiveImportService porttype also offers additional functionality related to the management of import scripts. Import scripts are generic resources inside the infrastructure. The porttype allows to publish new scripts, list and edit existing scripts, and validate them from a syntactic point of view. The semantics of the methods offered by the two porttypes are described in the following.

Archive Import Service Porttype

  • String[] list(): returns a list of import scripts identifiers for scripts currently available in the VRE (as generic resources);
  • Script load(String scriptId): returns the import script corresponding to the given identifier;
  • void save(Script script): saves an import script, by creating or updating a corresponding generic resource;
  • void delete(String scriptId): deletes the import script corresponding to the given identifier;
  • ValidationReport validate(String scriptSpecification): this method performs validation of the given input string, which is treated as an AISL script and undergoes parsing and other syntactic validation steps. ;
  • EndpointReferenceType getTask(String importScriptIdentifier): this method gets an instance of the ArchiveImportTask service dedicated to a given script.

The Complex Types Script and ValidationReport used in the methods above represent respectively an AISL Script (characterized by a scriptId, a description and a content, i.e. the script itslef) and a syntactic validation report (containing a boolean validity flag, a message and other details like error row and column).

Archive Import Task Porttype

  • void start(ImportOptions options): this method starts the import task with the given options. Options include a run mode (validate, build, simulation import, import);
  • void stop(): this method stops the import task;
  • TaskStatus getStatus(): this method is included to get detailed information about the progress of an import task, including some information on errors happened during the import task execution. The Complex Type TaskStatus contains a number of informations about the task progress, e.g. execution state (new, running, stopped, failed), execution phase (parsing, building, importing) and eventually import phase (document, metadata...) the number and types of graph objects currently created, imported and failed. For these last ones, it is also available detailed error information.

Import Task Status Notification

The ArchiveImportTask service maintains part of its state as a WS-resource. The properties of this resource are used to notify interested service about:

  • the current phase of the import (e.g. graph building, importing etc.);
  • the current number of objects created in the GoR;
  • the current number of objects imported successfully;
  • the current number of objects whose import failed.

Implementation Details

The following UML diagram describes concisely how the AIS is organized from an implementation point of view. The service itself depends on a library whose packages are dedicated each to a specific functionality.

AISUML.jpg

In particular:

  • ais contains the classes related to the main functionality of the service, like defining the logical flow of a task execution
  • language contains the classes related to parsing and executing AISL scripts
  • datamodel contains the classes used in the description of graph of resources.
  • importing contains the classes used for the actual import, e.g. the so called importers.
  • remotefile contains all classes used to access remote resources.
  • util is a package collecting several different utility classes, like those used to manage the persistence of the GoR, caching, manipulating XML and HTML etc.

Each of these packages is further divided into multiple subpackages dedicated to specific functionality. These are not shown here to avoid visual cluttering.

Extensibility Features

The functionality of the Archive Import Service can be extended in three main ways. It is possible to define new functions for the AISL, and to plug in software modules to interact with external resources through additional network protocols and to interact with new gCube content-related components.

Defining Functions for the AISL

AISL is a scripting language intended to create graphs of resources for subsequent import. Its type system and grammar, together with usage examples, are fully described in the Administrator's guide. Its main features are a tight integration with the AIS, in the sense that the creation of model objects are first citizens in the language, and the ability to treat in a way as much as possible transparent to the user some tasks which are frequent during import, like accessing files at remote locations. Beside avoiding the complexity of full fledged programming languages like Java, its limited expressivity - tailored to import tasks only - prevents security issues related to the execution of code in remote systems. The language does not allow the definition of new functions/types of objects inside a program, but can be easily extended with new functionality by defining new functions as plugin modules. Adding a new function amounts to two steps:

  1. Creating a new java class implementing the AISLFunction class. This is more easily done by subclassing the AbstractAISLFunction class. See below for further details.
  2. Registering the function in the "Functions" class. This step will be removed in later released, which will implement automatic plugin-like registration

The AISLFunction interface provides a way to specify a number of signatures (number and type of arguments and return type) for an AISL function. It is design to allow for overloaded functions. The number and types of the parameters are used to perform a number of static checks on the invocation of the function. The method evaluate provides the main code to evaluated during an invocation of the function in an AISL script. In the case of overloaded functions, this method should redirect to appropriate methods based on the number and types of the arguments.

public interface AISLFunction {
	public String getName();
	
	public void setFunctionDefinitions(FunctionDefinition ... defs);
	public FunctionDefinition[] getFunctionDefinitions();

	public  Object evaluate(Object[] args) throws Exception;
	
	public interface FunctionDefinition{
		Class<?>[] getArgTypes();
		Class<?> getReturnType();
	}
	
}

A partial implementation of the AISLFunction interface is provided by the AbstractAISLFunction class. A developer can simply extend this class and then provide an appropriate constructor and implement the appropriate evaluate method. An example is given below. The function match returns a boolean value according to the match of a string value with a given regular expression pattern. Its signature is thus:

boolean match(string str, string pattern)

the class Match.class below is an implementation of this function. In the constructor, the function declares its name and its parameters. The method evaluate(Object[] args), which must be implemented to comply with the interface AISLFunction, performs some casting of the parameters and then redirects the evaluation to another evaluate function (Note, in this case, as the function is not overloaded, there is no actual need for a separate evaluate method, here it has been added for clarity).

public class Match extends AbstractAISLFunction{

	public Match(){
		// declare the function name used in AISL scripts
		setName("match");
		// declare a single signature: boolean match(string, string);
		// Boolean.class is the return type, the two String.class entries are the argument types
		setFunctionDefinitions(
			new FunctionDefinitionImpl(Boolean.class, String.class, String.class)
		);
	}

	// entry point required by the AISLFunction interface
	public Object evaluate(Object[] args) throws Exception{
		return evaluate((String)args[0], (String)args[1]);
	}

	// actual implementation: matches the string against the regular expression
	private Boolean evaluate(String str, String pattern){
		return str.matches(pattern);
	}

}

=== Writing RemoteFile Adapters ===

When writing AISL scripts, the details of interacting with remote resources available on the network are hidden from the user and encapsulated in the facilities related to a native data type of the language, the file type. The intention is to shield the user (almost) completely from such details and to present resources available through heterogeneous protocols via a homogeneous access mechanism.

A network resource is made available as a file by invoking the getFile() function of the language. The function takes as argument a locator, which is a string (and optionally some parameters needed for authentication), and resolves, based on the form of the locator, which protocol to use and how to access the resource. To allow for extensibility, the format of the locator is not fixed in advance, but depends on the specific remote file type, which must be able to recognize such a format (see below). To avoid excessive resource consumption, remote resources are not downloaded straight away. Instead, a file object acts as a placeholder, and content is made available on demand. Other properties of the resource, such as its length, last modification date or hash signature, are instead gathered (and possibly cached so as to limit network usage). Of course, the availability of this information depends on the capabilities offered by the network protocol at hand. Once downloaded, content is also cached.

In order to make a new protocol available to the AIS, it is sufficient to implement the RemoteFile interface and register the class with the RemoteFileFactory class. The class implementing the type must be able to "recognize" the format of the locators it can deal with. More in detail, the class must have a single-argument constructor taking a locator as parameter, and the constructor must throw an exception if the format of the locator is not recognized. The function getFile() passes the locator to the RemoteFileFactory, which tries to instantiate a remote file of the most specific type by trying all registered remote file types (late type binding).

Some network protocols allow for a hierarchical, directory-style structuring of resources. This basically means that it is possible, from a given resource, to get a list of its children. For hierarchical resources, it is possible to implement the HierarchicalRemoteFile interface. If basic caching capabilities are acceptable, it is possible (but not mandatory) to extend instead the AbstractRemoteFile and AbstractHierarchicalRemoteFile classes, which already provide a standard implementation for a number of methods defined by the corresponding interfaces. A sketch of a new remote file type is given below.
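
As an illustration only, the following sketch outlines a remote file type for a hypothetical myproto:// protocol. Only the locator-recognition contract is taken from the description above; the remaining methods required by the RemoteFile interface (or by AbstractRemoteFile) are not listed in this document and are therefore omitted, and the RemoteFileFactory registration call is likewise an assumption to be checked against the actual classes.

// Minimal sketch, not a complete implementation: only the single-argument
// constructor contract is taken from the documentation; the other methods
// required by RemoteFile/AbstractRemoteFile are omitted here.
public class MyProtoRemoteFile extends AbstractRemoteFile {

	private final String locator;

	// The constructor must reject locators it does not understand,
	// so that the factory can fall back to other registered types.
	public MyProtoRemoteFile(String locator) {
		if (locator == null || !locator.startsWith("myproto://")) {
			throw new IllegalArgumentException("Not a myproto locator: " + locator);
		}
		this.locator = locator;
	}

	// ... content access, length, last modification date, etc. go here,
	// implementing the methods declared by the RemoteFile interface ...
}

// Registration with the factory (hypothetical API), so that getFile()
// can try this type when resolving locators:
// RemoteFileFactory.register(MyProtoRemoteFile.class);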

=== Writing Importers ===

Importers are software modules that process the graph of resources and decide about import actions, interfacing with a gCube component for content management such as the Collection Management Service, the Content Management Service or the Metadata Management Service. Each importer is responsible for treating a specific kind of resource (e.g. metadata), and is essentially the bridge between the AIS and the services of the Information Organization stack responsible for managing that kind of resource. The precise way in which the importer performs the import is thus dependent on the specific subsystem the importer interacts with. Similarly, different importers need to obtain different information about the resources to import. For instance, to import a document it is necessary to have its content or a URL at which the content can be accessed. To create a metadata collection, it is necessary to specify some properties, like the location of the schema of the objects contained in the collection. The Archive Import Service already includes importers dedicated to the creation of content and metadata collections and to the creation of complex documents and metadata objects. Thus, the creation of a new importer is only needed if a new kind of content model is defined over the InfoObjectModel (see Storage Management) and facilities for its manipulation are offered by some new gCube component.

==== Defining Importer-Specific Types ====

Writing a new importer requires knowing how to interact with such a component, and how to manipulate a Graph of Resources. The data model handled by AISL features three main types of constructs:

* Resource
* Relationship
* Collection

A graph of resources is a graph composed of nodes (resources) and edges (relationships). Furthermore, nodes (resources) can be organized into sets (collections), which can in turn be connected using relationships. All constructs of the model can be annotated with properties, which are name/value pairs. The constructs above correspond internally to the three classes Resource, ResourceCollection and ResourceRelationship. In order to constrain the kind of properties that the model objects it manipulates must have, an importer must define a set of subtypes of the model object types. This can be done by subclassing the above-mentioned classes. Which subtypes to implement, and the precise semantics of their properties, depend on the specific importer. For instance, the MetadataCollection importer declares only one new type, the collection::metadata type, which specializes the collection type to allow for specific metadata collection-related properties. Notice that importers can also manipulate objects belonging to subclasses defined by other importers. For instance, the MetadataCollection importer needs to access properties of the ContentCollection subtype, defined by the ContentCollection importer, in order to be able to create metadata collections.

In order to define their own subtypes, importers must:

# Subclass the basic types as needed (see the sketch at the end of this section);
# Register the classes in the GraphOfResources class; this automatically extends the language with the new types;
# Publish the properties allowed for the new subtypes.

Regarding the last point, notice that the types defined by an importer and their properties must be publicly documented, as AISL script developers must know which properties are available and what their semantics are. Furthermore, subsequent importers in the chain may also need to access some properties. For example, an importer for metadata also needs to access model objects representing content in order to obtain their object ID (internal identifier). Notice that in general an AISL script will not necessarily assign all properties defined by a subtype. Some of these properties may be conditionally needed, while some will only be written by an importer at import time. For example, when a new content object is imported, the importer must record the OID of the newly created object into the GoR object. For this reason, the specification of the subtypes defined by importers must also state which properties are mandatory (i.e. they must be assigned during the creation of a GoR) and which properties are private (i.e. they should NOT be assigned during the creation of the GoR).
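
As an illustration only, the following sketch shows how an importer-specific subtype of ResourceCollection might be defined. The property name, the setProperty/getProperty accessors and the GraphOfResources registration call are hypothetical and must be checked against the actual model classes.

// Minimal sketch: the property name, the accessors and the registration call
// are hypothetical; only the idea of subclassing a model class is taken from
// the documentation.
public class MySpecialCollection extends ResourceCollection {

	// illustrative property name, to be published so that AISL script writers can assign it
	public static final String SCHEMA_URI = "schemaURI";

	public void setSchemaURI(String uri) {
		setProperty(SCHEMA_URI, uri);            // assumed accessor of the base class
	}

	public String getSchemaURI() {
		return (String) getProperty(SCHEMA_URI); // assumed accessor of the base class
	}
}

// Registration of the new subtype (hypothetical API), which extends the language with it:
// GraphOfResources.registerType(MySpecialCollection.class);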

==== Defining Importer Logics ====

The actual logic of the import for a new importer is contained in a class that must simply implement the Importer interface, which is as follows:

public interface Importer{
	public String getName();
	public void importRepresentationGraph(GraphOfResources graph) throws RemoteException, ExecutionInterruptedException;
}

The first method must provide a human-readable name for the importer (for logging and status-notification purposes). The second method is passed, during operation, a GraphOfResources object, and must contain the logic needed to manipulate the objects in the graph, select the ones of interest and perform the actual import tasks. A skeletal implementation is sketched below.
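
As an illustration only, the following skeleton implements the Importer interface. The graph traversal and the calls to the target gCube service are left as comments, since the actual GraphOfResources API is not listed in this document.

// Skeleton only: the graph traversal and the service interaction are sketched
// as comments and must be adapted to the real GraphOfResources API and to the
// target gCube component.
public class MyResourceImporter implements Importer {

	public String getName() {
		// human-readable name, used for logging and status notification
		return "my-resource importer";
	}

	public void importRepresentationGraph(GraphOfResources graph)
			throws RemoteException, ExecutionInterruptedException {
		// 1. select the model objects this importer is responsible for
		//    (using the traversal facilities of GraphOfResources)
		// 2. call the target gCube service to create/import each selected resource
		// 3. annotate the graph with information produced during the import,
		//    e.g. the identifiers of the newly created objects
	}
}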

== Current Limitations and Known Issues ==

The AIS is currently released as a standalone client. The class org.gcube.contentmanagement.contentlayer.archiveimportservice.impl.AISLClient contains a client that performs the steps needed for the import: parsing and execution of the script, generation of the graph of resources, and import of the graph of resources. It accepts one or two arguments. The first one is the location (on the local file system) of a file containing an AISL script. The second argument is a boolean value: if it is set to true, the client creates the graph of resources but does not start the import. This is to ease debugging.
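
For example, an invocation might look like the following (illustrative only: the classpath placeholder and the script file name are assumptions; true as second argument stops the client after the graph has been created):

java -cp <AIS and dependency jars> \
  org.gcube.contentmanagement.contentlayer.archiveimportservice.impl.AISLClient \
  /path/to/import-script.aisl true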

After the graph of resources is created, the client generates a dump of the graph in a file named resourcegraph.dump. The graph is serialized in an XML-like format. This is only for visualization and debugging purposes, and the format is not currently guaranteed to be valid (or even well-formed) XML.