The Tree Manager Framework

From Gcube Wiki
Revision as of 00:00, 27 June 2012 by Fabio.simeoni (Talk | contribs) (SourceReader)

The Tree Manager service may be called to persist or retrieve edge-labelled trees, either one at a time or many at once. Either way, the data is not necessarily stored locally to the service, or as trees. Instead, the data is most often held remotely, it is autonomously managed, and it is exposed by other services in a variety of forms and through different APIs.

The main value proposition of the service is that, in many cases and for many purposes, this variety of data sources may be ignored and the data uniformly accessed under a sufficiently general API and data model. This uniformity defines a basis for interoperability between service clients and data sources. It enables generic clients to implement cross-domain functions - including data indexing, transformation, discovery, transfer, browsing, viewing, etc. - over a single data model, against a single API, and with a consistent set of tools. Similar advantages can be granted to less generic clients that implement domain-specific or application-specific functions, provided that consensus is achieved around conventional uses of the tree model.

Within the service, uniformity is achieved with two-way transformations from the API and tree model of the service to those of the underlying data sources. Transformations are implemented in plugins, libraries developed in autonomy from the service so as to extend its capabilities at runtime. Service and plugins interact through a protocol defined by a set of local interfaces which, collectively, serve as a framework for plugin development.

The framework is packaged and distributed as a stand-alone library, the tree-manager-framework, and serves as a dependency for both service and plugins.


Tree-manager-framework-overview.png


The library and all its transitive dependencies are available in our Maven repositories. Plugins that are managed with Maven can resolve them with a single dependency declaration:

<dependency>
  <groupId>org.gcube.data.access</groupId>
  <artifactId>tree-manager-framework</artifactId>
  <version>...</version>
  <scope>compile</scope>
</dependency>

In what follows, we address the plugin developer and describe the framework in detail, illustrating also design options and best practices for plugin development.

Overview

We start by overviewing the key components of the framework, their role in the design of a plugin, and their relationships.

The service and the plugin interact in order to notify each other of the occurrence of certain events:

  • the service observes events that relate to its clients, first and foremost their requests. These events translate into actions which the plugin must perform on data sources;
  • the plugin may observe events that relate to the data sources, first and foremost changes to their state. These events need to be reported to the service.

In essence, the framework defines the interfaces through which all the relevant events may be notified.

The most important events are client requests, which can be of one of the following types:

  • bind requests, where clients ask the plugin to "connect" to given sources. These clients know about the plugin and have included in the request all the information that the plugin needs in order to establish a binding. The service delivers the request to a SourceBinder provided by the plugin and expects back one Source instance for each bound source. The plugin configures Sources with information extracted or derived from the request, and with the other components that the service needs to access in future read and write requests. Thereafter, the service manages the Sources on behalf of the plugin.


Tree-manager-framework-bind-requests.png

  • read requests, where clients want to retrieve data from sources that have been previously bound to the plugin, specifying either one or more identifiers to resolve (lookup requests) or a pattern to match (query requests). These clients may not be aware of the plugin, having only discovered that the service gives read access to sources of interest. The service resolves Sources from requests and then relays them to associated SourceReaders, expecting trees back. It is the job of the plugin to translate the requests for the API of the bound sources and to transform the results returned by the sources into trees.

Tree-manager-framework-read-requests.png


  • write requests, where clients want to store data in sources that have been previously bound to the plugin, either new data (add requests) or changes to existing data, including deletions (update requests). These clients typically know about the plugin and what type of trees it expects to add or update. The service resolves Sources from requests and then relays them to associated SourceWriters. Again, it is the job of the plugin to translate the requests for the API of the bound sources.

Tree-manager-framework-write-requests.png


Besides relaying client requests, the service also notifies the plugin of key events in the lifetime of its source bindings. It does so by invoking event-specific callbacks of a SourceLifecycle associated with the Sources. As we shall see, lifetime events include binding initialisation, reconfiguration, passivation, resumption, and termination.


Tree-manager-framework-service-events.png

These are all the events that the service observes and passes on to the plugin. Other events may be observed directly by the plugin, including changes in properties or status of bound sources. These events are predefined as SourceEvent instances and the plugin reports them to SourceNotifiers that the service itself configures on Sources. The service also registers its own SourceConsumers with SourceNotifiers to receive event notifications. If useful for its own design, the plugin may also implement its own SourceEvents and SourceConsumers.


Tree-manager-framework-plugin-events.png

All the key components of the plugin are introduced to the service through an implementation of the Plugin interface. From it, the service obtains the SourceBinder and, from the binder, the Sources, their SourceLifecycles, their SourceReaders and their SourceWriters. In addition, the Plugin implementation exposes descriptive information about the plugin that the service publishes in the infrastructure and uses in order to manage the plugin. For increased control over its own lifecycle, the plugin may implement the PluginLifecycle interface, which extends Plugin with callbacks invoked by the service when it loads and unloads the plugin.

To bootstrap the process of component discovery and find Plugin implementations, the service uses the ServiceLoader mechanism defined by the language. Accordingly, the plugin includes a file META-INF/services/org.gcube.data.tmf.api.Plugin in its Jar, where the file contains a single line with the qualified name of its Plugin or PluginLifecycle implementation.

Tree-manager-framework-pliugin-discovery.png
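
The discovery step can be tried out in isolation with the following sketch. It is not the service's code: Plugin here is a one-method stand-in for org.gcube.data.tmf.api.Plugin, MyPlugin is a hypothetical implementation, and a temporary directory plays the role of the plugin Jar. What it does show is the standard contract: a provider-configuration file named after the interface, listing the implementation, which ServiceLoader then instantiates through a public no-argument constructor.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class PluginDiscoverySketch {

    /** Stand-in for org.gcube.data.tmf.api.Plugin, reduced to one method. */
    public interface Plugin {
        String name();
    }

    /** Hypothetical plugin: public, with a public no-argument constructor. */
    public static final class MyPlugin implements Plugin {
        public MyPlugin() {}

        @Override
        public String name() {
            return "my-plugin";
        }
    }

    /** Registers MyPlugin under a temporary "Jar root" and discovers it from there. */
    static List<String> discover() {
        try {
            Path jarRoot = Files.createTempDirectory("plugin-jar");
            Path services = Files.createDirectories(jarRoot.resolve("META-INF").resolve("services"));
            // The file is named after the interface and contains the implementation name.
            Files.write(services.resolve(Plugin.class.getName()),
                    MyPlugin.class.getName().getBytes(StandardCharsets.UTF_8));

            List<String> names = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] { jarRoot.toUri().toURL() },
                    PluginDiscoverySketch.class.getClassLoader())) {
                for (Plugin plugin : ServiceLoader.load(Plugin.class, loader)) {
                    names.add(plugin.name());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

In a real plugin the file is simply packaged under META-INF/services in the Jar, and no class loader needs to be created by hand.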

This completes our quick overview of the main interfaces and classes provided by the framework. Note that, besides implementing the interfaces that define the interaction protocol with the service, the plugin is free to design and develop against the framework using any technology that seems appropriate to the task.

In the rest of this guide we look at the key components of the framework in more detail.

Key Design Issues

The framework has been designed to support a wide range of plugins. There are indeed many degrees of freedom in plugin design:

  • what sources can it bind to? The plugin may:
      • be dedicated to specific data sources (source-specific plugin);
      • target open-ended classes of data sources which publish data through standard APIs and in standard models (source-generic plugin).
  • what kind of trees does it accept and/or return? All plugins transform to trees and/or from trees, but the structure and intended semantics of those trees may vary substantially across plugins:
      • the plugin may be fully general and transform a data model which is as general-purpose as the tree model (type-generic plugin). A plugin for data sources that expose arbitrary instances of XML data or RDF data through some standard API, for example, falls within this category. In this case, the meaning and shape of the trees may be unconstrained in principle, or it may be constrained only at the point of binding to specific data sources.
      • the plugin may be extremely specific and transform a concrete data model into trees with well-defined structures, i.e. abiding by a statically defined set of constraints on edge labels and leaf values (type-specific plugin). In this case, the tree model of the service is used as a general-purpose carrier for the original data model. The plugin documentation will include the definition of its tree type, anything ranging from narrative to formal XML Schema definitions. The definition may be specific to the plugin or else reflect wider consensus.
      • most plugins will always work with a single tree type, but this is not a constraint imposed by the framework. The plugin may support transformations into a number of tree types and allow binding clients to indicate in their bind requests the type they desire sources to be bound with (multi-type plugin). The plugin may embed the transformations, or else take a more dynamic approach and define a framework for transformers which are discoverable on the classpath.
      • the plugin may support a single transformation which outputs trees that can be assigned multiple types, from more generic to more specific types, e.g. a generic RDF type as well as a more specific type associated with some RDF schema.
  • what requests does it support?
      • all plugins must accept at least one form of bind request, but a plugin may support many so as to cater for different types of bindings, or to support reconfiguration of a previous binding;
      • most plugins bind a single source per bind request, but some may bind many at once for some requests;
      • most plugins support read requests but do not support write requests, typically because the bound sources are static, or grant write access only to privileged clients. In principle at least, the converse may apply and a plugin may grant only write access to the sources. Overall, a plugin may support one of the following access modes: read-only, write-only, or read-write.
  • what functional and QoS limitations does it have? Rarely will the API and tree model of the service prove functionally equivalent to those of the bound sources. Even if the plugin restricts its support to a particular access mode, e.g. read-only, it may not be able to support all the requests associated with that mode, or to support them all efficiently. For example, its bound sources may:
      • offer no lookup API because they do not mandate or regulate the existence of identifiers in the data;
      • offer no query API, or else support (the equivalent of) a subset of the patterns that clients may indicate in query requests;
      • not allow the plugin to retrieve, add, or update many trees at once;
      • not support updates, or not support partial updates, or not support deletions.
    In some cases, the plugin may be able to compensate for differences, typically at the cost of a reduction in QoS. For example:
      • it may be configured at binding time with queries that model lookups, differently for different bound sources;
      • it may partially transform patterns and then do local matches on the results returned by sources (2-phase match);
      • if the bound sources do not support partial updates, it may fetch the data first and then apply the updates locally.
    In other cases, such as when bound sources do not support deletions, the plugin has no other obvious option but to fail requests.
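
As an illustration of the 2-phase match mentioned above, consider the following sketch. Everything in it is hypothetical: PrefixOnlySource models a bound source whose query API can only filter on a title prefix, and records are plain string pairs rather than trees. The plugin pushes the part of the query that the source understands to the source, and evaluates the residual condition locally on the candidates returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class TwoPhaseMatchSketch {

    /** Hypothetical source API: it can only filter records by title prefix. */
    interface PrefixOnlySource {
        List<String[]> findByTitlePrefix(String prefix); // each record is {title, year}
    }

    /** Phase 1 delegates the supported filter to the source, phase 2 matches the rest locally. */
    static List<String[]> query(PrefixOnlySource source, String titlePrefix, Predicate<String[]> residual) {
        List<String[]> matches = new ArrayList<>();
        for (String[] candidate : source.findByTitlePrefix(titlePrefix)) {
            if (residual.test(candidate)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    /** Runs a query for titles starting with "Trees" and published in 2012. */
    static List<String> demo() {
        final List<String[]> records = Arrays.asList(
                new String[] { "Trees of Europe", "2010" },
                new String[] { "Trees of Asia", "2012" },
                new String[] { "Rivers", "2012" });

        // In-memory stand-in for the remote source.
        PrefixOnlySource source = prefix -> {
            List<String[]> found = new ArrayList<>();
            for (String[] record : records) {
                if (record[0].startsWith(prefix)) {
                    found.add(record);
                }
            }
            return found;
        };

        List<String> titles = new ArrayList<>();
        for (String[] match : query(source, "Trees", record -> "2012".equals(record[1]))) {
            titles.add(match[0]);
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The cost in QoS is visible in the sketch: the source returns every candidate for the coarse filter, including those that the local match then discards.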

Answering the questions above fixes some of the free variables in plugin design and helps to characterise it ahead of implementation. Collectively, the answers define a "profile" for the plugin and the presentation of this profile should have a central role in its documentation.

Plugin, PluginLifecycle, and Environment

A plugin implements the following methods of the Plugin interface:

  • String name(): returns the name of the plugin. The service will publish it and its clients may use it to discover instances of the service which have been extended with the plugin.
  • String description(): returns a brief description of the plugin. The service will publish it so that it can be inspected and displayed by a range of clients;
  • List<Property> properties(): returns triples (name, value, description), all String-valued. The service will publish them and its clients may use them to identify instances of the service which have been extended with the plugin. The plugin decides what properties may be useful to clients for discovery, inspection, or display. For example, if the plugin is multi-type, it will probably list the types that it supports here. The implementation returns null or an empty list if it has no properties to publish;
  • SourceBinder binder(): returns the plugin's implementation of the SourceBinder interface. The service will relay bind requests to it;
  • List<String> requestSchemas(): returns the schemas of the bind requests that the plugin can process. The service publishes them to instruct binding clients on how to formulate their bind requests. There are GUIs within the system that use the schemas to generate forms for interactive formulation of bind requests. The implementation may return null and decide to document its expectations elsewhere. If it does not return null, it is free to use any schema language of choice, though the existing GUIs expect XML Schema, which is thus the recommended language. Note that, in the common case in which the plugin models requests with Java classes and uses JAXB as its data binding solution, it can generate schemas directly in the implementation of the method, using JAXBContext.generateSchema();
  • boolean isAnchored(): returns an indication of whether the plugin is anchored, i.e. stores data locally to the service. If true, the service will inhibit its internal replication schemes. In the common case in which the plugin targets remote data sources, the implementation will simply return false.
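
To give a feel for the amount of code involved, here is a minimal sketch of a Plugin implementation. The Plugin, Property, and SourceBinder types below are stand-ins declared only to keep the example self-contained; they mirror the methods listed above but are not the framework's own classes, and the names and values returned by MyPlugin are invented.

```java
import java.util.Collections;
import java.util.List;

final class PluginSketch {

    /** Stand-in for the framework's Property: a (name, value, description) triple. */
    static final class Property {
        final String name;
        final String value;
        final String description;

        Property(String name, String value, String description) {
            this.name = name;
            this.value = value;
            this.description = description;
        }
    }

    /** Stand-in for SourceBinder; bind(Element) is left out of this sketch. */
    interface SourceBinder {}

    /** Stand-in for Plugin, with the six methods described above. */
    interface Plugin {
        String name();
        String description();
        List<Property> properties();
        SourceBinder binder();
        List<String> requestSchemas();
        boolean isAnchored();
    }

    /** Hypothetical read-only plugin for remote sources. */
    static final class MyPlugin implements Plugin {

        private final SourceBinder binder = new SourceBinder() {};

        @Override
        public String name() {
            return "my-plugin";
        }

        @Override
        public String description() {
            return "Read-only access to hypothetical remote catalogues";
        }

        @Override
        public List<Property> properties() {
            return Collections.singletonList(
                    new Property("access-mode", "read-only", "the access mode supported by the plugin"));
        }

        @Override
        public SourceBinder binder() {
            return binder;
        }

        @Override
        public List<String> requestSchemas() {
            return null; // this sketch documents its bind requests elsewhere
        }

        @Override
        public boolean isAnchored() {
            return false; // the data stays in remote sources
        }
    }
}
```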

As mentioned above, a plugin that needs more control over its own lifetime can implement PluginLifecycle, which extends Plugin with the following callback methods:

  • void init(Environment) is invoked when the plugin is first loaded;
  • void stop(Environment) is invoked when the plugin is unloaded;

For example, the plugin may implement init() to start up a DI container of the likes of Spring, Guice, or CDI.

Environment is implemented by the service to encapsulate access to the environment in which the plugin is deployed or undeployed. At the time of writing, it serves solely as a sandbox over the location of the file system which may be accessed by the plugin. Accordingly, it exposes only the following method:

  • File file(path), which returns a file with a given path relative to the storage location.
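
The sketch below shows how the lifecycle callbacks and the Environment might work together. It rests on assumptions: Environment and PluginLifecycle are stand-ins reduced to the methods discussed in this section, the parameter of file() is taken to be a String, and temporaryEnvironment() replaces the implementation that the service would provide. The marker file path is invented for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class PluginLifecycleSketch {

    /** Stand-in for Environment; the String parameter is an assumption. */
    interface Environment {
        File file(String path);
    }

    /** Stand-in for PluginLifecycle, without the methods inherited from Plugin. */
    interface PluginLifecycle {
        void init(Environment environment);
        void stop(Environment environment);
    }

    /** Hypothetical plugin that keeps a marker file in its storage location while loaded. */
    static final class MyPlugin implements PluginLifecycle {

        static final String MARKER = "my-plugin/loaded.marker";

        @Override
        public void init(Environment environment) {
            File marker = environment.file(MARKER);
            try {
                Files.createDirectories(marker.getParentFile().toPath());
                Files.write(marker.toPath(), "loaded".getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public void stop(Environment environment) {
            environment.file(MARKER).delete();
        }
    }

    /** Stand-in for the service's Environment: a sandbox rooted at a temporary directory. */
    static Environment temporaryEnvironment() {
        try {
            final File root = Files.createTempDirectory("tm-environment").toFile();
            return path -> new File(root, path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because all paths are resolved by the Environment, the plugin never needs to know where the service keeps its storage.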

SourceBinder

Whenever clients request bindings of data sources, the service consults the Plugin implementation discussed above and obtains a SourceBinder. It then invokes its single method:

List<Source> bind(Element)

The SourceBinder attempts the binding on the basis of the information found in the client request, and it returns a list of corresponding Sources. Note that:

  • the service ignores the particular shape of the request and passes it to bind() as a DOM’s Element. The plugin may inspect the request with the DOM API, any other XML API, or by binding the request to some Java class, e.g. using JAXB;
  • the plugin may accept a single type of request or many alternative types;
  • the plugin must throw an InvalidRequestException if the request is unrecognised or otherwise invalid, and a generic Exception for any other problem that it may encounter in the execution of bind();
  • in most cases, the request will result in the binding of a single data source, providing precise coordinates to identify it (e.g. an endpoint address). In some cases, however, the request may provide less pinpoint information, and the plugin may identify and bind at once many data sources from it. This explains the List type for the return value.

The actual binding process may vary significantly across plugins. For many, it may be as simple as extracting the endpoint of some remote data access service from the request (and checking its availability). For others, it may require discovering such an endpoint through some registry. Yet for others it may be a complex process comprised of a number of local and remote actions.

Finally, it should be noted that:

  • the service may not use all the Sources returned by the plugin. In particular, it will discard Sources that the plugin has already bound in previous invocations of bind() (this may occur if two bind requests target overlapping sets of data sources, because they are identical requests issued from two autonomous clients, or because one request is aimed explicitly at the reconfiguration of sources already bound by the other). Whenever possible, the plugin should avoid side-effects or expensive work in bind(), such as network interactions. Rather, it should defer expensive work to SourceLifecycle.init(), as the service will make this callback only for Sources that it effectively retains. The minimal amount of work that the plugin must do in bind() is really to identify sources and set their SourceLifecycle. We discuss SourceLifecycle below.
  • the service sets SourceNotifiers and Environments on Sources only after the SourceBinder returns them from bind(). Accordingly, if the plugin needs to access the file system or notify an event at binding time, it should do so in SourceLifecycle.init() rather than in bind(). This is a corollary of the recommendation made above, i.e. avoid actions with side-effects in bind().
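
A binder that follows these recommendations can be very small. The sketch below assumes a made-up request format, a bind element with an endpoint attribute, and uses stand-ins for Source, SourceBinder, and InvalidRequestException so that it compiles on its own. It inspects the request with the DOM API, rejects what it does not recognise, and does no network work.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

final class BinderSketch {

    /** Stand-in for the framework's InvalidRequestException. */
    static final class InvalidRequestException extends Exception {
        private static final long serialVersionUID = 1L;

        InvalidRequestException(String message) {
            super(message);
        }
    }

    /** Stand-in for Source, reduced to its identifier. */
    interface Source {
        String id();
    }

    /** Stand-in for SourceBinder. */
    interface SourceBinder {
        List<Source> bind(Element request) throws Exception;
    }

    static final class MySource implements Source {
        private final String endpoint;

        MySource(String endpoint) {
            this.endpoint = endpoint;
        }

        @Override
        public String id() {
            return endpoint; // hypothetical choice: the endpoint identifies the source
        }
    }

    /** Hypothetical binder for requests of the form <bind endpoint="..."/>. */
    static final class MyBinder implements SourceBinder {
        @Override
        public List<Source> bind(Element request) throws InvalidRequestException {
            String endpoint = request.getAttribute("endpoint").trim();
            if (!"bind".equals(request.getTagName()) || endpoint.isEmpty()) {
                throw new InvalidRequestException("expected <bind endpoint=\"...\"/>");
            }
            // No remote calls here: checking the endpoint is left to SourceLifecycle.init().
            return Collections.<Source>singletonList(new MySource(endpoint));
        }
    }

    /** Parses a request from a string, as a test client might. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Returns the identifier of the bound source, or null if the request is rejected. */
    static String bindOne(String xml) {
        try {
            return new MyBinder().bind(parse(xml)).get(0).id();
        } catch (InvalidRequestException e) {
            return null;
        }
    }
}
```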

Source

If Plugin implementations provide the service with information about the plugin, Sources provide it with information about the data sources that become bound to the plugin. They do so by implementing the following methods:

  • String id(): returns the identifier of the source, which the service uses to tell sources apart;
  • String name(): returns the descriptive name of the source, which the service publishes and clients may use to discover the source;
  • String description(): returns a brief description of the source, which the service publishes for reporting purposes;
  • List<Property> properties(): returns arbitrary properties of the bound source as triples (name, value, description), all String-valued. The service publishes them and its clients may use them to discover the source. Implementations must return null or an empty list if they have no properties to publish;
  • List<QName> types(): returns all the tree types produced and/or accepted by the bound source, as discussed above. These are qualified names that characterise the edge labels and leaf values that the SourceReader produces and that the SourceWriter consumes. The service publishes the types and its clients may use them to discover sources that produce or consume data with expected properties;
  • Calendar creationTime(): returns the time at which the source was created (the source, not the Source object). The service publishes this information, but implementations can return null if the source does not expose this information;
  • boolean isUser(): indicates whether the source ought to be marked as a user-level source or a system-level source. This is not a security option as such, and it does not imply any form of authorisation or query filtering. It is rather a marker that may be used by certain clients to exclude system sources from their processes. In the vast majority of cases, plugins will bind user-level sources. If appropriate, they may be configured by binding clients to bind system-level sources.

The Source properties exposed through methods above are static in nature, in that the plugin sets them at source binding time. Others are instead dynamic, in that the plugin may update them during the lifetime of the binding:

  • Calendar lastUpdate(): returns the time at which the source was last updated (the source, not the Source object). Implementations can return null if the source does not expose this information;
  • Long cardinality(): returns the number of elements in the source. Implementations can return null if the data source does not expose this information and implementations cannot derive it;

The service publishes dynamic properties along with static properties, but it also associates them with topics for notification. Clients can subscribe for changes to the source and be notified when these changes occur. The plugin is responsible for changing these properties and for firing the corresponding event to the service, which then takes over and does the rest. We discuss later on how plugins can fire events.

Besides descriptive information, Sources must provide the service with other components that are logically associated with them:

  • SourceLifecycle lifecycle(): returns the lifecycle of the source. The service invokes its methods to notify the occurrence of certain events in the source’s lifetime;
  • SourceReader reader(): returns the SourceReader of the source. The service invokes its methods to relay read requests to the plugin. Implementations can return null if the plugin does not support read requests. Note that in this case the plugin must support write requests;
  • SourceWriter writer(): returns the SourceWriter of the source. The service invokes its methods to relay write requests to the plugin. Implementations can return null if the plugin does not support write requests. Note that in this case the plugin must support read requests;

If the plugin extends the default implementations of SourceLifecycle, SourceReader, or SourceWriter, the methods above can be overridden to restrict their output to more specific classes. This avoids casts in components that access the implementations through Sources, e.g.:

@Override
public MyReader reader() {
  return (MyReader) super.reader();
}

Next, Sources allow the service to set and then access its implementations of Environment and SourceNotifier:

Environment environment();
void setEnvironment(Environment);
SourceNotifier notifier();
void setNotifier(SourceNotifier);

We have discussed above how plugins can use the Environment to access the deployment context of the plugin. We discuss later how they can use the SourceNotifier to notify the service of events that relate to the source.

Note also that Sources may be passivated to disk by the service, as we discuss in more detail below. Source is indeed a Serializable interface, and the final requirement is that implementations honour that interface.

The framework provides an AbstractSource class that implements the interface partially. Source implementations can and should extend it to avoid plenty of boilerplate code (state variables, accessor methods, default values, implementations of equals(), hashCode(), and toString(), shutdown hooks, correct serialisation, etc.). AbstractSource also simplifies the management of dynamic properties, in that it automatically fires a change event whenever the plugin changes the time of last update of Sources.
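
The behaviour just described for dynamic properties can be pictured with the following sketch. It does not reproduce AbstractSource: SketchSource, its setters, and the SourceEvent and SourceNotifier types are stand-ins written for this example, and SourceEvent is modelled as an enum only for brevity. The point is the ordering it suggests: update the other dynamic properties first, then the time of last update, so that a single change event covers them all.

```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

final class DynamicPropertiesSketch {

    /** Stand-in for the predefined events. */
    enum SourceEvent { CHANGE, REMOVE }

    /** Stand-in for SourceNotifier, reduced to notification. */
    interface SourceNotifier {
        void notify(SourceEvent event);
    }

    /** Hypothetical source that fires CHANGE when its time of last update changes. */
    static final class SketchSource {

        private Long cardinality;
        private Calendar lastUpdate;
        private SourceNotifier notifier;

        void setNotifier(SourceNotifier notifier) {
            this.notifier = notifier;
        }

        Long cardinality() {
            return cardinality;
        }

        Calendar lastUpdate() {
            return lastUpdate;
        }

        /** Changes the cardinality silently; the event follows with the time of last update. */
        void setCardinality(Long cardinality) {
            this.cardinality = cardinality;
        }

        /** Changes the time of last update and reports the change. */
        void setLastUpdate(Calendar lastUpdate) {
            this.lastUpdate = lastUpdate;
            if (notifier != null) {
                notifier.notify(SourceEvent.CHANGE);
            }
        }
    }

    /** Updates both dynamic properties and returns the events that were fired. */
    static List<SourceEvent> demo() {
        final List<SourceEvent> fired = new ArrayList<>();
        SketchSource source = new SketchSource();
        source.setNotifier(event -> fired.add(event));
        source.setCardinality(42L);
        source.setLastUpdate(Calendar.getInstance());
        return fired;
    }
}
```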

At its simplest, a Source implementation may take the following form:

public class MySource extends AbstractSource {
 
	private static final long serialVersionUID = 1L;
 
	//your additional fields, if any
 
	public MySource(String id) {
		super(id);
	}
 
	@Override
	public List<QName> types() {
		//here factored-out in constants because fixed
		return Collections.singletonList(MyConstants.TYPE);
	}
	@Override
	public List<Property> properties() {
		//here factored-out in constants because fixed
		return MyConstants.PROPERTIES;
	}
 
	//your additional methods, if any
}

SourceLifecycle

The SourceLifecycle interface defines the following callbacks:

  • void init(): called by the service during bind requests to initialise the Sources previously bound by the SourceBinder. As discussed above, this is the place to perform actions that are expensive or generate side-effects. If the plugin needs to perform remote interactions or has some tasks to schedule, this is where it should do so. It should also report any failure it encounters, so that the service can relay it to the binding client as the outcome of the request;
  • void reconfigure(Element): called by the service during bind requests to reconfigure Sources previously bound by the plugin. As discussed above, this occurs when the SourceBinder returns a Source that it had already produced in previous bind requests. In this case, the service will use the old Source to relay the request and simply discard the one returned last by the SourceBinder. If the plugin does not support reconfiguration, it must throw an InvalidRequestException. If instead reconfiguration fails, the plugin must instead throw a generic Exception;
  • void stop(): called by the service if it is shutting down, or if it is passivating Sources to storage to release some memory. If the plugin has scheduled tasks for the management of the Sources, this is a good time to gracefully stop them;
  • void resume(): called by the service when Sources are revived from storage, either because the service has been restarted after a shutdown, or because the Sources had been passivated to release memory resources but are now needed by service clients. If the plugin has scheduled tasks for the management of the Sources, this is a good time to restart them. If the attempt fails, the plugin should throw the failure so that the service can relay it to clients;
  • void terminate(): called by the service to signal that its clients no longer need to access Sources. If the plugin has some resources to release, this is the time to do it, typically after invoking stop() to gracefully stop any scheduled tasks that may be running.

Plugins that need to implement only a subset of the callbacks above can extend LifecycleAdapter and override only the callbacks of interest. Note also that, like Source, SourceLifecycle is a Serializable interface. The implementation must honour that interface.

Finally, note that all the callbacks assume that SourceLifecycles have access to the associated Sources. Typically, implementations adopt the following pattern:

 public class MyLifeCycle extends LifecycleAdapter {
 
	private static final long serialVersionUID = 1L;
 
	private final MySource source;
 
	//additional fields, if any...
 
	public MyLifeCycle(MySource source) {
		this.source = source;
	}
 
	//callbacks and additional methods, if any...
}
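
For plugins that schedule tasks, the callbacks might be filled in along the following lines. The sketch makes several assumptions: LifecycleAdapter is a stand-in with no-op callbacks, the throws clauses on init() and resume() are guesses, and poll() is a placeholder for whatever the plugin does periodically. Note that the executor is transient, since it cannot be serialised when the service passivates the Source, and is rebuilt in both init() and resume().

```java
import java.io.Serializable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class LifecycleSketch {

    /** Stand-in for the framework's LifecycleAdapter: all callbacks do nothing. */
    static class LifecycleAdapter implements Serializable {
        private static final long serialVersionUID = 1L;

        public void init() throws Exception {}
        public void stop() {}
        public void resume() throws Exception {}
        public void terminate() {}
    }

    /** Hypothetical lifecycle that polls the bound source at fixed intervals. */
    static final class PollingLifecycle extends LifecycleAdapter {

        private static final long serialVersionUID = 1L;

        private final long periodMillis;

        // Not serialisable: dropped at passivation and rebuilt in init() and resume().
        private transient ScheduledExecutorService scheduler;

        PollingLifecycle(long periodMillis) {
            this.periodMillis = periodMillis;
        }

        @Override
        public void init() {
            startPolling();
        }

        @Override
        public void resume() {
            startPolling();
        }

        @Override
        public void stop() {
            if (scheduler != null) {
                scheduler.shutdown(); // lets a running poll finish, cancels future ones
                scheduler = null;
            }
        }

        @Override
        public void terminate() {
            stop();
            // release any other resource held for the source here
        }

        boolean isPolling() {
            return scheduler != null && !scheduler.isShutdown();
        }

        private void startPolling() {
            scheduler = Executors.newSingleThreadScheduledExecutor(task -> {
                Thread thread = new Thread(task, "source-poller");
                thread.setDaemon(true);
                return thread;
            });
            scheduler.scheduleAtFixedRate(this::poll, 0, periodMillis, TimeUnit.MILLISECONDS);
        }

        private void poll() {
            // placeholder: e.g. ask the remote source for its cardinality
        }
    }
}
```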


SourceEvent, SourceNotifier, and SourceConsumer

SourceEvent is a tagging interface for objects that represent events that relate to data sources and that may only be observed by the plugin. In the interface, two such events are pre-defined as constants:

  • SourceEvent.CHANGE: this event occurs in correspondence with a change to the dynamic properties of a Source, such as its cardinality or the time of its last update;
  • SourceEvent.REMOVE: this event occurs when a data source is no longer available. Note that this is different from the event that occurs when clients indicate that access to the source is no longer needed (cf. SourceLifecycle.terminate()).

The plugin may have the means to observe these events, e.g. because the data source offers subscription mechanisms, or because it exposes its cardinality and the plugin polls it, or even because the plugin offers write-access to the source and thus observes directly when the source and its cardinality change.

In all these cases, the plugin should report events to the SourceNotifier that the service has set on the Sources, invoking its method:

void notify(SourceEvent);

Note again that, when Sources extend AbstractSource, changing their time of last update automatically fires SourceEvent.CHANGE events. Unless there are other reasons to notify events to the service, the plugin may never have to invoke notify() explicitly.

As already noted, the service will inject a SourceNotifier in the plugin's Sources only after these are returned to it by SourceBinder.bind(). Any attempt to notify events prior to that moment will fail. For this reason, if the plugin needs to change dynamic properties at binding time, it should do so in SourceLifecycle.init().

SourceNotifier has a second method that can be invoked to subscribe consumers for SourceEvent notifications:

void subscribe(SourceConsumer,SourceEvent...)

This method subscribes a SourceConsumer to one or more SourceEvents. Normally, plugins will not have to invoke it, as the service will subscribe its own SourceConsumers with the SourceNotifiers.

However, the plugin is free to use the available support for event notification within its own codebase. In this case, the plugin can define its own SourceEvents and implement and subscribe its own SourceConsumers, which must implement the single method:

void onEvent(SourceEvent...)

which is invoked by the SourceNotifier with one or more SourceEvents. Normally, the subscriber will receive single event notifications, but the first notification after subscription will carry the history of all the events previously notified by the SourceNotifier.
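
The notification protocol can be sketched as follows. SimpleNotifier is not the service's implementation: it is a hypothetical in-memory notifier, and SourceEvent, BuiltIn, and SourceConsumer are stand-ins. It also takes one possible reading of the history rule, namely that past events of the subscribed kinds are delivered in a single batch at subscription time; the actual SourceNotifier may deliver them differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NotifierSketch {

    /** Stand-in for the SourceEvent tagging interface. */
    interface SourceEvent {}

    /** Stand-ins for the predefined SourceEvent.CHANGE and SourceEvent.REMOVE. */
    enum BuiltIn implements SourceEvent { CHANGE, REMOVE }

    /** Stand-in for SourceConsumer. */
    interface SourceConsumer {
        void onEvent(SourceEvent... events);
    }

    /** Hypothetical notifier that remembers past events for late subscribers. */
    static final class SimpleNotifier {

        private final List<SourceEvent> history = new ArrayList<>();
        private final Map<SourceConsumer, Set<SourceEvent>> subscriptions = new LinkedHashMap<>();

        void subscribe(SourceConsumer consumer, SourceEvent... events) {
            Set<SourceEvent> wanted = new HashSet<>(Arrays.asList(events));
            subscriptions.put(consumer, wanted);
            // First notification: the history of the events the consumer subscribed to.
            List<SourceEvent> past = new ArrayList<>();
            for (SourceEvent event : history) {
                if (wanted.contains(event)) {
                    past.add(event);
                }
            }
            if (!past.isEmpty()) {
                consumer.onEvent(past.toArray(new SourceEvent[0]));
            }
        }

        void notify(SourceEvent event) {
            history.add(event);
            for (Map.Entry<SourceConsumer, Set<SourceEvent>> entry : subscriptions.entrySet()) {
                if (entry.getValue().contains(event)) {
                    entry.getKey().onEvent(event);
                }
            }
        }
    }

    /** Returns the sizes of the notifications received by a late subscriber to CHANGE. */
    static List<Integer> demo() {
        final List<Integer> sizes = new ArrayList<>();
        SimpleNotifier notifier = new SimpleNotifier();
        notifier.notify(BuiltIn.CHANGE);
        notifier.notify(BuiltIn.REMOVE);
        notifier.notify(BuiltIn.CHANGE);
        notifier.subscribe(events -> sizes.add(events.length), BuiltIn.CHANGE);
        notifier.notify(BuiltIn.REMOVE); // not subscribed, not delivered
        notifier.notify(BuiltIn.CHANGE);
        return sizes;
    }
}
```

Either way, consumers should be written to handle more than one event per invocation, which is why onEvent() takes a variable number of arguments.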

Auxiliary APIs

All the previous interfaces provide a skeleton around the core functionality of the plugin, which is to transform the API and the tree model of the service to those of the bound sources. The task requires familiarity with three APIs defined outside the framework:

  • the tree API, with which the plugin constructs and deconstructs the edge-labelled trees that it accepts in write requests and/or returns in read requests;
  • the pattern API, with which the plugin constructs and deconstructs the patterns that characterise the trees returned by read requests. If the plugin supports such requests, it must ensure that it returns only trees that match given patterns, and in fact only the matching portions of those trees;
  • the streams API, with which the plugin models the data streams that flow in and out of the plugin. Streams are used in read requests and write requests that take or return many data items at once, such as trees, tree identifiers, or even paths to tree nodes. The streams API models such data streams as instances of the Stream interface, a generalisation of the standard Java Iterator interface which reflects the remote nature of the data. Not all plugins need to implement stream-based operations from scratch, as the framework offers synthetic implementations for them. These implementations, however, are derived from those that work with one data item at a time, hence have very poor performance when the data source is remote. Plugins should use them only when native implementations are not an option because the bound sources do not offer any stream-based or paged bulk operation. When the sources do offer such operations, the plugin should feed their transformed outputs into Streams. In a few cases, the plugin may need advanced facilities provided by the streams API, such as fluent idioms to convert, pre-process or post-process data streams.

Documentation on working with trees, tree patterns, and streams is available elsewhere, and we do not replicate it here. The tree API and the pattern API are packaged together in a trees library available in our Maven repositories. The streams API is packaged in a streams library available in the same repositories. If the plugin uses Maven for build purposes, these libraries are already available on its classpath as indirect dependencies of the framework.

SourceReader

A plugin implements the SourceReader interface to provide a tree view of the data in a bound source. To begin with, a SourceReader implements the following "lookup" methods:

  • Tree get(String,Pattern): returns a tree with a given identifier and pruned with a given Pattern.
The implementation must throw an UnknownTreeException if the identifier does not identify a tree in the source, and an InvalidTreeException if a tree can be identified but does not match the Pattern;
  • Stream<Tree> get(Stream<String>,Pattern): returns trees with given identifiers and pruned with a given Pattern.
The implementation must throw a generic Exception if it cannot produce the stream at all, though it must simply not add to it whenever trees cannot be identified or do not match the Pattern.

In addition, a SourceReader implements the following "query" method:

  • Stream<Tree> get(Pattern): returns trees pruned with a given Pattern.
Again, the implementation must throw a generic Exception if it cannot produce the stream at all, though it must simply not add to it whenever trees do not match the Pattern.
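
The difference in failure handling between the single-tree lookup and the stream-based lookup is easy to get wrong, so we sketch it here. None of the types below come from the tree, pattern, or streams APIs: Tree is a flat map of labels to values, TreePattern is a hypothetical pruning function, a plain Iterator replaces Stream, and an in-memory map replaces the bound source. The stream lookup is derived from the single lookup and collects its results eagerly for brevity, which a real reader for a remote source should avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class ReaderSketch {

    static final class UnknownTreeException extends Exception {
        private static final long serialVersionUID = 1L;
        UnknownTreeException(String id) { super(id); }
    }

    static final class InvalidTreeException extends Exception {
        private static final long serialVersionUID = 1L;
        InvalidTreeException(String id) { super(id); }
    }

    /** Flat stand-in for a tree: an identifier plus edge labels mapped to leaf values. */
    static final class Tree {
        final String id;
        final Map<String, String> leaves;

        Tree(String id, Map<String, String> leaves) {
            this.id = id;
            this.leaves = leaves;
        }
    }

    /** Hypothetical pattern: returns the matching portion of a tree or fails. */
    interface TreePattern {
        Tree prune(Tree tree) throws InvalidTreeException;
    }

    /** Pattern that keeps only the leaf with the given label, which must exist. */
    static TreePattern requireLeaf(final String label) {
        return tree -> {
            String value = tree.leaves.get(label);
            if (value == null) {
                throw new InvalidTreeException(tree.id);
            }
            return new Tree(tree.id, Collections.singletonMap(label, value));
        };
    }

    static final class MapReader {
        private final Map<String, Tree> store;

        MapReader(Map<String, Tree> store) {
            this.store = store;
        }

        /** Single lookup: unknown or non-matching trees are reported as failures. */
        Tree get(String id, TreePattern pattern) throws UnknownTreeException, InvalidTreeException {
            Tree tree = store.get(id);
            if (tree == null) {
                throw new UnknownTreeException(id);
            }
            return pattern.prune(tree);
        }

        /** Stream lookup: unknown or non-matching trees are silently left out. */
        Iterator<Tree> get(Iterator<String> ids, TreePattern pattern) {
            List<Tree> found = new ArrayList<>();
            while (ids.hasNext()) {
                try {
                    found.add(get(ids.next(), pattern));
                } catch (UnknownTreeException | InvalidTreeException skipped) {
                    // by contract, nothing is added to the output for this identifier
                }
            }
            return found.iterator();
        }
    }

    /** Looks up one matching, one non-matching, and one unknown identifier. */
    static List<String> demo() {
        Map<String, String> full = new HashMap<>();
        full.put("title", "A");
        full.put("year", "2012");
        Map<String, Tree> store = new HashMap<>();
        store.put("t1", new Tree("t1", full));
        store.put("t2", new Tree("t2", Collections.singletonMap("year", "2011")));

        Iterator<Tree> trees = new MapReader(store)
                .get(Arrays.asList("t1", "t2", "t3").iterator(), requireLeaf("title"));
        List<String> summary = new ArrayList<>();
        while (trees.hasNext()) {
            Tree tree = trees.next();
            summary.add(tree.id + ":" + tree.leaves.keySet());
        }
        return summary;
    }
}
```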

SourceWriter