The Tree Manager Framework

From Gcube Wiki
Revision as of 17:41, 20 June 2012 by Fabio.simeoni (Talk | contribs) (SourceBinder)


The Tree Manager service may be called to persist or retrieve edge-labelled trees, either one at a time or many at once. Either way, the data is not necessarily stored locally to the service, or as trees. Instead, the data is most often held remotely, is autonomously managed, and is exposed by other services in a variety of forms and through different APIs.

The main value proposition of the service is that, in many cases and for many purposes, this variety of data sources may be ignored and the data uniformly accessed under a sufficiently general API and data model. This uniformity defines a basis for interoperability between service clients and data sources. It enables generic clients to implement cross-domain functions - including data indexing, transformation, discovery, transfer, browsing, viewing, etc. - over a single data model, against a single API, and with a consistent set of tools. Similar advantages can be granted to less generic clients that implement domain-specific or application-specific functions, provided that consensus is achieved around conventional uses of the tree model.

Within the service, uniformity is achieved with two-way transformations from the API and tree model of the service to those of the underlying data sources. Transformations are implemented in plugins, libraries developed in autonomy from the service so as to extend its capabilities at runtime. Service and plugins interact through a protocol defined by a set of local interfaces which, collectively, serve as a framework for plugin development. The framework is packaged as a stand-alone library, the tree-manager-framework, and is a dependency for both service and plugins.


Tree-manager-framework-overview.png


The library and all its transitive dependencies are available in our Maven repositories. Plugins that are managed with Maven can resolve them with a single dependency:

<dependency>
  <groupId>org.gcube.data.access</groupId>
  <artifactId>tree-manager-framework</artifactId>
  <version>...</version>
  <scope>compile</scope>
</dependency>

In what follows, we address the plugin developer and describe the framework in detail, illustrating also design options and best practices for plugin development.

Overview

We start by overviewing the key components of the framework, their role in the design of a plugin, and their relationships.

The service and the plugin interact in order to notify each other of the occurrence of certain events. The service observes events that relate to its clients, first and foremost their requests; these translate into actions which the plugin must perform on data sources. Vice versa, the plugin may observe events that relate to data sources, first and foremost changes to their state; these need to be reported to the service. The framework defines the interfaces through which all these events may be notified.

There are three types of client requests that the service may relay to the plugin:

  • bind requests, where a client asks the plugin to connect to given sources. This type of client knows about the plugin and has in fact included in the request all the information that the plugin needs in order to establish a binding. The service delivers the request to a SourceBinder provided by the plugin and will expect back one Source for each bound source. The plugin configures Sources with information extracted or derived from the request, and injects them with other components that the service needs to access for later requests. Thereafter, the service manages Sources on the plugin's behalf.


Tree-manager-framework-bind-requests.png

  • read requests, where a client wants to retrieve data from a source previously bound to the plugin, and specifies either one or more identifiers to resolve (lookup requests), or patterns to match (query requests). This type of client may not know about the plugin at all, having simply discovered that the service gives read access to a source of interest. The service will resolve the plugin's Sources from requests and then deliver them to the SourceReaders provided by the Sources, expecting trees back. It is the plugin's job to translate requests for the API of the bound sources and to transform the results returned by the sources into trees.

Tree-manager-framework-read-requests.png


  • write requests, where a client wants to store data in a source previously bound to the plugin, either new data (add requests) or changes to existing data, including removals (update requests). This type of client typically knows about the plugin and what it expects from the data. The service resolves the plugin's Sources from requests and then delivers them to the SourceWriters provided by the Sources, passing trees to them. Again, it is the plugin's job to translate requests for the API of the bound sources.

Tree-manager-framework-write-requests.png


Note that the plugin must provide a SourceBinder and at least one SourceReader or SourceWriter.

Besides relaying client requests, the service also notifies the plugin of key events in the lifetime of its source bindings. It does so by invoking event-specific callbacks of the SourceLifecycle associated with the plugin's Sources. As we shall see, lifetime events include binding initialisation, reconfiguration, passivation, resumption, and termination.


Tree-manager-framework-service-events.png

These are all the events that the service observes and passes on to the plugin. Other events, however, may be observed by the plugin, such as changes in properties or status of bound sources. These events are predefined as SourceEvents and the plugin reports them to the SourceNotifiers that the service injects in the plugin's Sources. The service has its own SourceConsumers that receive the notifications of the plugin. If useful to connect its components, the plugin can implement its own SourceEvents and SourceConsumers.


Tree-manager-framework-plugin-events.png

All the key components of the plugin are introduced to the service through an implementation of the Plugin interface. From it, the service gets the plugin's SourceBinder and from the binder it obtains the plugin's Sources, their SourceLifecycles, their SourceReaders and SourceWriters. In addition, the Plugin exposes descriptive information about the plugin that the service publishes in the infrastructure and uses in order to manage the plugin. Optionally, the plugin may implement PluginLifecycle, which extends Plugin with callbacks invoked by the service when the plugin is loaded and unloaded. This gives the plugin more control over its lifecycle.

To bootstrap the process of component discovery and find Plugin implementations, the service adopts the standard Java mechanism based on a ServiceLoader. Accordingly, the plugin includes a file META-INF/services/org.gcube.data.tmf.api.Plugin in its Jar. The file contains a single line with the qualified name of its Plugin implementation.

Tree-manager-framework-pliugin-discovery.png
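The discovery step can be sketched with the JDK's ServiceLoader. The sketch below is illustrative only: the Plugin interface is a local stand-in reduced to a single method, and MyPlugin and PluginDiscovery are hypothetical names. It is not the service's actual loading code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Stand-in for org.gcube.data.tmf.api.Plugin, reduced to one method (assumption).
interface Plugin {
    String name();
}

// Hypothetical plugin implementation. In a real Jar this class would be public,
// with a public no-argument constructor, and the file
// META-INF/services/org.gcube.data.tmf.api.Plugin would contain one line
// with its qualified name.
class MyPlugin implements Plugin {
    @Override
    public String name() {
        return "my-plugin";
    }
}

final class PluginDiscovery {

    // Collects every Plugin implementation registered on the classpath.
    static List<Plugin> discover() {
        List<Plugin> found = new ArrayList<>();
        for (Plugin plugin : ServiceLoader.load(Plugin.class)) {
            found.add(plugin);
        }
        return found;
    }

    public static void main(String[] args) {
        // Only implementations listed in a registration file are reported.
        for (Plugin plugin : discover()) {
            System.out.println("found plugin: " + plugin.name());
        }
    }
}
```

Since ServiceLoader only reports implementations that are listed in a registration file, a plugin Jar that omits the file is silently ignored.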

This completes our quick overview of the main interfaces and classes provided by the framework. Note at the outset that, besides implementing the interfaces that define the interaction protocol with the service, the plugin is free to design and develop against the framework using any technology that seems appropriate to the task.

In the rest of this guide we look at the key components of the framework in more detail.

Key Design Issues

The framework has been designed to support a wide range of plugins. There are indeed many degrees of freedom in plugin design:

  • what sources can it bind to? The plugin may be dedicated to specific data sources (source-specific plugin), or it may target open-ended classes of data sources which publish data through standard APIs and in standard models (source-generic plugin);
  • what kind of trees does it accept and/or return? All plugins transform to trees and/or from trees, but what is the structure and intended semantics of those trees? Depending on the bound sources and the design of the plugin:
    • the plugin may be fully generic, i.e. transform a data model which is as general-purpose as the tree model (type-generic plugin). A plugin for data sources that expose arbitrary instances of XML data or RDF data through some standard API, for example, falls within this category. In this case, the meaning and shape of the trees may be unconstrained in principle, or it may be constrained only at the point of binding to specific data sources.
    • alternatively, the plugin may be extremely specific and transform a concrete data model into trees with well-defined structures, i.e. abiding by a set of constraints on edge labels and leaf values which is statically defined (type-specific plugin). In this case, the tree model of the service serves as a general-purpose carrier for the original data model. The plugin documentation will include the definition of its tree type, anything ranging from narrative to formal XML Schema definitions. The definition may be specific to the plugin, or it may reflect a wider consensus towards which the plugin and many others may converge, regardless of the variety of their bound sources.
    • most plugins will always work with a single tree type, but this is not a constraint imposed by the framework. The plugin may support transformations into a number of tree types and allow binding clients to indicate in their requests the type they desire sources to be bound with (multi-type plugin). The plugin may then embed multiple transformations, or take a more dynamic approach, define a framework for transformers, and discover the transformers available on the classpath.
    • finally, the plugin may support a single transformation which outputs trees that can be assigned multiple types, from more generic to more specific types, e.g. a generic RDF type as well as a more specific type associated with some RDF schema.
  • what requests does it support? All plugins must accept at least one form of bind request, but a plugin may support many so as to cater for different types of bindings, or to support reconfiguration of a previous binding. Further, most plugins bind a single source per bind request, but some may bind many at once for some requests. Most plugins also support read requests but do not support write requests, typically because the bound sources are static, or grant write access only to privileged clients. In principle at least, the converse may apply and a plugin may grant only write access to the sources. Overall, a plugin may support one of the following access modes: read-only, write-only, or read-write.
  • what functional and QoS limitations does it have? Rarely will the API and tree model of the service prove functionally equivalent to those of the bound sources. Even if the plugin restricts its support to a particular access mode, e.g. read-only, it may not be able to support all the requests associated with that mode, or to support them all efficiently. Its bound sources, for example, may offer no lookup API because they do not mandate or regulate the existence of identifiers in the data. Alternatively, they may offer no query API, or else support (the equivalent of) a subset of the patterns that clients may indicate in query requests. Again, the bound sources may not allow the plugin to retrieve, add, or update many trees at once. In some cases, the plugin may be able to compensate for differences, typically at the cost of a reduction in QoS. For example, the plugin may be configured at binding time with queries that model lookups, differently for different bound sources. Similarly, it may partially transform patterns and then do local matches on the results returned by sources (2-phase match). Coming to write requests, the bound sources may not support partial updates, forcing the plugin to fetch the data and apply them locally. Or they may not support updates at all, or they may not support deletions, leaving the plugin with no obvious option but to fail update requests.

Answering these questions fixes some of the free variables in plugin design and helps to characterise it ahead of implementation. Collectively, the answers define a profile for the plugin and should serve as a key element of its documentation.

Plugin, PluginLifecycle, and Environment

A plugin implements the following methods of the Plugin interface:

  • String name(): returns the name of the plugin. The service will publish it and its clients may use it to discover instances of the service which have been extended with the plugin.
  • String description(): returns a brief description of the plugin. The service will publish it so that it can be inspected and displayed by a range of clients;
  • List<Property> properties(): returns triples (name, value, description), all String-valued. The service will publish them and its clients may use them to identify instances of the service which have been extended with the plugin. The plugin decides what properties may be useful to clients for discovery, inspection, or display. For example, if the plugin is multi-type, it will probably list the types that it supports here. The implementation returns null or an empty list if it has no properties to publish;
  • SourceBinder binder(): returns the plugin's implementation of the SourceBinder interface. The service will relay bind requests to it.
  • List<String> requestSchemas(): returns the schemas of the bind requests that the plugin can process. These will be published by the service to instruct binding clients on how to formulate their bind requests. There are GUIs within the system that use the schemas to generate forms for interactive formulation of bind requests. The implementation may return null and decide to document its expectations elsewhere. If it does not return null, it is free to use any schema language of choice, though the existing GUIs expect XML Schemas, which is thus the recommended language. Note that, in the common case in which the plugin models requests with Java classes and uses JAXB as standard data binding solution, it can easily generate schemas directly in the implementation of the method, using JAXBContext.generateSchema().
  • boolean isAnchored(): returns an indication of whether the plugin is anchored, i.e. stores data locally to the service. If true, the service will inhibit its internal replication schemes. In the common case in which the plugin targets remote data sources, the implementation will simply return false.
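As an illustration, a minimal Plugin implementation might look as follows. This is a sketch under assumptions: Property, SourceBinder, and Plugin are local stand-ins that only mirror the methods listed above (the real types live in the framework and may differ in detail), and the names and values returned by MyPlugin are invented for the example.

```java
import java.util.Collections;
import java.util.List;

// Stand-in for the framework's Property: a (name, value, description) triple.
final class Property {
    final String name;
    final String value;
    final String description;

    Property(String name, String value, String description) {
        this.name = name;
        this.value = value;
        this.description = description;
    }
}

// Stand-in for the framework's SourceBinder; bind(...) is omitted in this sketch.
interface SourceBinder {
}

// Stand-in for the framework's Plugin interface, with the methods listed above.
interface Plugin {
    String name();
    String description();
    List<Property> properties();
    SourceBinder binder();
    List<String> requestSchemas();
    boolean isAnchored();
}

// Hypothetical plugin for remote sources with a single tree type.
class MyPlugin implements Plugin {

    private final SourceBinder binder = new SourceBinder() { };

    @Override public String name() { return "my-plugin"; }

    @Override public String description() { return "Binds hypothetical example sources"; }

    @Override public List<Property> properties() {
        // a single illustrative property; null or an empty list is also allowed
        return Collections.singletonList(
            new Property("tree-type", "my-type", "the tree type produced by the plugin"));
    }

    @Override public SourceBinder binder() { return binder; }

    // null: request expectations are documented elsewhere
    @Override public List<String> requestSchemas() { return null; }

    // false: the data is held remotely, not locally to the service
    @Override public boolean isAnchored() { return false; }
}

final class PluginDemo {
    public static void main(String[] args) {
        Plugin plugin = new MyPlugin();
        System.out.println(plugin.name() + ": " + plugin.description());
    }
}
```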

As mentioned above, a plugin that needs more control over its own lifetime can implement PluginLifecycle, which extends Plugin with the following callback methods:

  • void init(Environment) is invoked when the plugin is first loaded;
  • void stop(Environment) is invoked when the plugin is unloaded;

For example, the plugin may implement init() to start up a DI container such as Spring, Guice, or CDI.

Environment is implemented by the service to encapsulate access to the environment in which the plugin is deployed or undeployed. At the time of writing, it serves solely as a sandbox over the location of the file system which may be accessed by the plugin. Accordingly, it exposes only the following method:

  • File file(path), which returns a file with a given path relative to the storage location.
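The following sketch shows how the lifecycle callbacks and the Environment might be used together. It rests on assumptions: Environment and PluginLifecycle are local stand-ins (the stand-in PluginLifecycle omits the Plugin methods it would inherit), DirectoryEnvironment imitates the service's implementation over a plain directory, and the state file is a made-up example of plugin-private data.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Stand-in for the framework's Environment.
interface Environment {
    File file(String path);
}

// Stand-in for the framework's PluginLifecycle (inherited Plugin methods omitted).
interface PluginLifecycle {
    void init(Environment environment);
    void stop(Environment environment);
}

// Hypothetical Environment rooted at a directory, imitating the service's own.
final class DirectoryEnvironment implements Environment {
    private final File root;

    DirectoryEnvironment(File root) {
        this.root = root;
    }

    @Override
    public File file(String path) {
        return new File(root, path); // path is resolved relative to the storage location
    }
}

// Hypothetical lifecycle: writes a state file when loaded, removes it when unloaded.
class MyLifecycle implements PluginLifecycle {

    static final String STATE_FILE = "my-plugin.state";

    @Override
    public void init(Environment environment) {
        File state = environment.file(STATE_FILE);
        try {
            Files.write(state.toPath(), "started".getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException("cannot write " + state, e);
        }
    }

    @Override
    public void stop(Environment environment) {
        try {
            Files.deleteIfExists(environment.file(STATE_FILE).toPath());
        } catch (IOException e) {
            throw new UncheckedIOException("cannot clean up plugin state", e);
        }
    }
}

final class LifecycleDemo {
    public static void main(String[] args) {
        Environment env = new DirectoryEnvironment(new File(System.getProperty("java.io.tmpdir")));
        MyLifecycle lifecycle = new MyLifecycle();
        lifecycle.init(env);
        System.out.println("state file exists after init: " + env.file(MyLifecycle.STATE_FILE).exists());
        lifecycle.stop(env);
    }
}
```

Going through Environment.file() rather than absolute paths keeps the plugin inside the sandbox that the service sets up for it.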

SourceBinder

Whenever clients request bindings of data sources, the service consults the Plugin implementation discussed above and obtains a SourceBinder. It then invokes its single method:

List<Source> bind(Element)

The SourceBinder attempts the binding on the basis of the information found in the client request, and it returns a list of corresponding Sources. Note that:

  • the service ignores the particular shape of the request and passes it to bind() as a DOM Element. The plugin may inspect the request with the DOM API, any other XML API, or by binding the request to some Java class, e.g. using JAXB;
  • the plugin may accept a single type of request or many alternative types;
  • the plugin must throw an InvalidRequestException if the request is unrecognised or otherwise invalid, and a generic Exception for any other problem that it may encounter in the execution of bind();
  • in most cases, the request will result in the binding of a single data source, providing precise coordinates to identify it (e.g. an endpoint address). In some cases, however, the request may provide less pinpoint information, and the plugin may identify and bind at once many data sources from it. This explains the List type for the return value.

The actual binding process may vary significantly across plugins. For many, it may be as simple as extracting the endpoint of some remote data access service from the request (and checking its availability). For others, it may require discovering such an endpoint through some registry. Yet for others it may be a complex process comprised of a number of local and remote actions.
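A simple binder of the first kind can be sketched as follows. The sketch makes several assumptions: Source, SourceBinder, and InvalidRequestException are local stand-ins for the framework types, the request format (a bind element with an endpoint attribute) is invented for the example, and tryBind() is a hypothetical helper that plays the role of the service.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Stand-in for the framework's exception for unrecognised or invalid requests.
class InvalidRequestException extends Exception {
    InvalidRequestException(String message) {
        super(message);
    }
}

// Stand-in for the framework's Source, reduced to its identifier.
final class Source {
    final String id;

    Source(String id) {
        this.id = id;
    }
}

// Stand-in for the framework's SourceBinder.
interface SourceBinder {
    List<Source> bind(Element request) throws Exception;
}

// Hypothetical binder for requests of the (invented) form <bind endpoint="..."/>.
class MyBinder implements SourceBinder {

    @Override
    public List<Source> bind(Element request) throws InvalidRequestException {
        // getAttribute() returns an empty string when the attribute is missing
        String endpoint = request.getAttribute("endpoint").trim();
        if (!"bind".equals(request.getTagName()) || endpoint.isEmpty()) {
            throw new InvalidRequestException("expected <bind endpoint='...'/>");
        }
        // no network interaction here: only identify the source
        return Collections.singletonList(new Source(endpoint));
    }
}

final class BinderDemo {

    // Parses the request and invokes the binder, as the service would.
    static String tryBind(String xml) {
        try {
            Element request = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            return new MyBinder().bind(request).get(0).id;
        } catch (InvalidRequestException e) {
            return "invalid";
        } catch (Exception e) {
            return "error";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryBind("<bind endpoint='http://example.org/data'/>"));
        System.out.println(tryBind("<bind/>"));
    }
}
```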

Finally, it should be noted that:

  • the service may not use all the Sources returned by the plugin. In particular, it will discard Sources that the plugin has already bound in previous invocations of bind() (this may occur if two bind requests target overlapping sets of data sources, or because they are identical requests issued from two autonomous clients, or because one request is aimed explicitly at the re-configuration of sources already bound by the other). Whenever possible, the plugin should avoid side-effects or expensive work in bind(), e.g. engaging in network interactions. Rather, it should defer expensive work to SourceLifecycle.init(), as the service will make this callback only for Sources that it effectively retains. The minimal amount of work that the plugin must do in bind() is really to identify sources and set their SourceLifecycle. We discuss SourceLifecycle below.
  • the service sets SourceNotifiers and Environments on Sources when the SourceBinder returns them from bind(). Accordingly, if the plugin needs to access the file system or notify an event at binding time, it should do so in SourceLifecycle.init() rather than in bind(). This is a corollary of the recommendation made above, i.e. avoid actions with side-effects in bind().
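The recommendation can be illustrated with a small simulation. Everything in it is an assumption made for the example: SourceLifecycle is a stand-in that shows only init(), MySource, MyLifecycle, and MyBinder are hypothetical, the binder takes a plain endpoint string instead of a request, and SimulatedService merely imitates the behaviour described above (discard Sources already bound, call init() only on those retained).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the framework's SourceLifecycle; only init() is shown.
interface SourceLifecycle {
    void init();
}

// Hypothetical source that counts how often the expensive work has run.
class MySource {
    final String id;
    final SourceLifecycle lifecycle;
    int connections;

    MySource(String id) {
        this.id = id;
        this.lifecycle = new MyLifecycle(this);
    }
}

class MyLifecycle implements SourceLifecycle {
    private final MySource source;

    MyLifecycle(MySource source) {
        this.source = source;
    }

    @Override
    public void init() {
        // placeholder for expensive work, e.g. contacting the remote source
        source.connections++;
    }
}

// Cheap binding: identify the source and set its lifecycle, nothing else.
final class MyBinder {
    MySource bind(String endpoint) {
        return new MySource(endpoint);
    }
}

// Imitates the service: Sources that are already bound are discarded.
final class SimulatedService {
    final Map<String, MySource> retained = new LinkedHashMap<>();
    private final MyBinder binder = new MyBinder();

    void bind(String endpoint) {
        MySource candidate = binder.bind(endpoint);
        if (retained.containsKey(candidate.id)) {
            return; // duplicate: discarded without ever calling init()
        }
        retained.put(candidate.id, candidate);
        candidate.lifecycle.init();
    }

    public static void main(String[] args) {
        SimulatedService service = new SimulatedService();
        service.bind("http://example.org/data");
        service.bind("http://example.org/data"); // identical request from another client
        MySource source = service.retained.get("http://example.org/data");
        System.out.println("sources: " + service.retained.size()
            + ", connections: " + source.connections);
    }
}
```

Had the binder connected to the source inside bind(), the second request would have paid for a connection that is thrown away immediately.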

Source

If Plugin implementations provide the service with information about the plugin, Sources provide it with information about the data sources that become bound to the plugin. They do so by implementing the following methods:

  • String id(): returns the identifier of the source, which the service uses to tell sources apart;
  • String name(): returns the descriptive name of the source, which the service publishes and clients may use to discover the source;
  • String description(): returns a brief description of the source, which the service publishes for reporting purposes;
  • List<Property> properties(): returns arbitrary properties of the bound source as triples (name, value, description), all String-valued. The service publishes them and its clients may use them to discover the source. Implementations must return null or an empty list if they have no properties to publish;
  • List<QName> types(): returns all the tree types produced and/or accepted by the bound source, as discussed above. These are qualified names that characterise the edge labels and leaf values that the SourceReader produces and that the SourceWriter consumes. The service publishes the types and its clients may use them to discover sources that produce or consume data with expected properties;
  • Calendar creationTime(): returns the time in which the source was created (the source, not the Source object). The service publishes this information, but implementations can return null if the source does not expose this information;
  • boolean isUser(): indicates whether the source ought to be marked as a user-level source or a system-level source. This is not a security option as such, and it does not imply any form of authorisation or query filtering. It’s rather a marker that may be used by certain clients to exclude system sources from their processes. In the vast majority of cases, plugins will bind user-level sources. If appropriate, they may be configured by binding clients to bind system-level sources;

The Source properties exposed through the methods above are static in nature, in that the plugin sets them at source binding time. Others are instead dynamic, in that the plugin may update them during the lifetime of the binding:

  • Calendar lastUpdate(): returns the time in which the source was last updated (the source, not the Source object). Implementations can return null if the source does not expose this information;
  • Long cardinality(): returns the number of elements in the source. Implementations can return null if the data source does not expose this information and implementations cannot derive it;

The service publishes dynamic properties along with static properties, but it also associates them with topics for notification. Clients can subscribe for changes to the source and be notified when these changes occur. The plugin is responsible for changing these properties and for firing the corresponding event to the service, which then takes over and does the rest. We discuss later how the plugin can fire events.

Besides descriptive information, Sources must provide the service with other components that are logically associated with them:

  • SourceLifecycle lifecycle(): returns the lifecycle of the source. The service invokes its methods to notify the occurrence of certain events in the source’s lifetime;
  • SourceReader reader(): returns the SourceReader of the source. The service invokes its methods to relay read requests to the plugin. Implementations can return null if the plugin does not support read requests. Note that in this case the plugin must support write requests;
  • SourceWriter writer(): returns the SourceWriter of the source. The service invokes its methods to relay write requests to the plugin. Implementations can return null if the plugin does not support write requests. Note that in this case the plugin must support read requests;

If the plugin extends the default implementations of SourceLifecycle, SourceReader, or SourceWriter, the methods above can be overridden to restrict their output to more specific classes. This avoids casts in components that access the implementations through Sources, e.g.:

@Override
public MyReader reader() {
  return (MyReader) super.reader();
}
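The override relies on Java's covariant return types. The self-contained sketch below shows the effect on calling code. SourceReader and AbstractSource are minimal stand-ins here, and the setReader() method and the endpoint() accessor are assumptions made for the example rather than documented framework API.

```java
// Stand-in for the framework's SourceReader.
interface SourceReader {
}

// Hypothetical reader with a plugin-specific method.
class MyReader implements SourceReader {
    String endpoint() {
        return "http://example.org/data";
    }
}

// Minimal stand-in for the framework's AbstractSource (setReader is assumed).
abstract class AbstractSource {
    private SourceReader reader;

    public SourceReader reader() {
        return reader;
    }

    public void setReader(SourceReader reader) {
        this.reader = reader;
    }
}

class MySource extends AbstractSource {

    // Covariant override: narrows the return type from SourceReader to MyReader.
    @Override
    public MyReader reader() {
        return (MyReader) super.reader();
    }
}

final class CovariantDemo {
    public static void main(String[] args) {
        MySource source = new MySource();
        source.setReader(new MyReader());
        // no cast needed at the call site
        System.out.println(source.reader().endpoint());
    }
}
```

The single cast inside the override fails with a ClassCastException if a different reader type is ever set on the Source, so the plugin should set the reader in one place only.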

Next, Sources allow the service to set and then access its implementations of Environment and SourceNotifier:

Environment environment();
void setEnvironment(Environment);
SourceNotifier notifier();
void setNotifier(SourceNotifier);

We have discussed above how plugins can use the Environment to access the deployment context of the plugin. We discuss later how they can use the SourceNotifier to notify the service of events that relate to the source.

Note also that Sources may be passivated to disk by the service, as we discuss in more detail below. Source is indeed a Serializable interface, and the final requirement is that implementations honour that interface.
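In practice, honouring Serializable means that every non-transient field of a Source must itself be serialisable, and that anything else must be rebuilt after resumption. The sketch below illustrates this with plain Java serialisation. MySource and Connection are hypothetical, and the in-memory round trip only imitates what passivation to disk would do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical non-serialisable resource, e.g. a client for the remote source.
class Connection {
    final String target;

    Connection(String target) {
        this.target = target;
    }
}

// Hypothetical Source-like class that survives passivation.
class MySource implements Serializable {

    private static final long serialVersionUID = 1L;

    private final String id;
    private final String endpoint;

    // transient: skipped during serialisation, null after deserialisation
    private transient Connection connection;

    MySource(String id, String endpoint) {
        this.id = id;
        this.endpoint = endpoint;
    }

    String id() {
        return id;
    }

    // Rebuilds the connection lazily, also after the Source has been resumed.
    Connection connection() {
        if (connection == null) {
            connection = new Connection(endpoint);
        }
        return connection;
    }
}

final class PassivationDemo {

    // Serialises and deserialises in memory, imitating passivation and resumption.
    static <T extends Serializable> T roundTrip(T object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("source is not serialisable", e);
        }
    }

    public static void main(String[] args) {
        MySource source = new MySource("s1", "http://example.org/data");
        source.connection(); // open the connection before passivation
        MySource resumed = roundTrip(source);
        System.out.println(resumed.id() + " -> " + resumed.connection().target);
    }
}
```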

The framework provides an AbstractSource class that implements the interface partially. Source implementations can and should extend it to avoid plenty of boilerplate code (state variables, accessor methods, default values, implementations of equals(), hashCode(), and toString(), shutdown hooks, correct serialisation, etc.). AbstractSource also simplifies the management of dynamic properties, in that it automatically fires a change event whenever the plugin changes the time of last update of Sources.

At its simplest, a Source implementation may take the following form:

import java.util.Collections;
import java.util.List;
import javax.xml.namespace.QName;
//plus the framework imports for AbstractSource and Property

public class MySource extends AbstractSource {

	private static final long serialVersionUID = 1L;

	//your additional fields, if any

	public MySource(String id) {
		super(id);
	}

	@Override
	public List<QName> types() {
		//a single tree type, factored out in a constant because fixed
		return Collections.singletonList(MyConstants.TYPE);
	}

	@Override
	public List<Property> properties() {
		//factored out in a constant because fixed
		return MyConstants.PROPERTIES;
	}

	//your additional methods, if any
}

SourceLifecycle

SourceEvent, SourceNotifier, and SourceConsumer

SourceReader

SourceWriter