GCube Document Library (2.0)

From Gcube Wiki
Revision as of 20:59, 10 February 2011 by Fabio.simeoni (Talk | contribs) (Simple Projections)

The gCube Document Library (gDL) is a client library for storing, updating, deleting and retrieving document descriptions in a gCube infrastructure.

The gDL is a high-level component of the subsystem of gCube Information Services and it interacts with lower-level components of the subsystem to support document management processes within the infrastructure:

  • the gCube Document Model (gDM) defines the basic notion of document and the gCube Model Library (gML) implements that notion into objects;
  • the objects of the gML can be exchanged in the infrastructure as edge-labelled trees, and the Content Manager Library (CML) can model such trees as objects and dispatch them to the read and write operations of the Content Manager (CM) service;
  • the CM implements these operations by translating trees to and from the content models of diverse repository back-ends.

The gDL builds on the gML and the CML to implement a local interface of CRUD operations that lift those of the CM to the domain of documents, efficiently and effectively.

Preliminaries

The core functionality of the gDL lies in its operations to read and write documents. The operations trigger interactions with remote services and the movement of potentially large volumes of data across the infrastructure. This may have a non-trivial and combined impact on the responsiveness of clients and the overall load of the infrastructure. The operations have been designed to minimise this impact. In particular:

  • when reading, clients can qualify the documents that are relevant to their queries, and indeed what properties of relevant documents should be actually retrieved. These retrieval directives are captured in the gDL by the notion of document projections.
  • when reading and writing, clients can move large numbers of documents across the infrastructure. The gDL streams these I/O movements so as to make efficient use of local and remote resources. It then defines facilities with which clients can conveniently consume input streams, produce output streams, and more generally filter one stream into another regardless of their origin. The facilities are collected into the stream DSL, an embedded domain-specific language for stream processing.
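The idea of filtering one stream into another without materialising it can be pictured with a minimal sketch. It is not the stream DSL of the gDL: the class FilteredIterator and its constructor are hypothetical names, and the sketch assumes only that a stream can be consumed as a plain java.util.Iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical sketch of a lazy stream filter, not the gDL stream DSL.
final class FilteredIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> keep;
    private T next;
    private boolean ready;

    FilteredIterator(Iterator<T> source, Predicate<T> keep) {
        this.source = source;
        this.keep = keep;
    }

    // Pulls from the source only as far as needed to find the next kept element.
    @Override
    public boolean hasNext() {
        while (!ready && source.hasNext()) {
            T candidate = source.next();
            if (keep.test(candidate)) {
                next = candidate;
                ready = true;
            }
        }
        return ready;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        ready = false;
        return next;
    }
}
```

Because elements are pulled on demand, the same filter works whether the source iterates over a local collection or over results that arrive from a remote service.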

Understanding document projections and the stream DSL is key to reading and writing documents effectively. We discuss these preliminary concepts first, and then consider their use as inputs and outputs of the operations of the gDL.

Projections

A projection is a set of constraints over the properties of documents in the gDM. It can be used to match documents, i.e. identify documents whose properties satisfy the constraints of the projection.
Projections and matching are used in the read operations of the gDL:

  • as a means to characterise relevant documents (projections as types);
  • as a means to specify what parts of relevant documents should be retrieved (projections as retrieval directives).

Accordingly, the constraints of a projection take two forms:

  • include constraints apply to properties that must be matched and retrieved;
  • filter constraints apply to properties that must be matched but not retrieved.

Note: in both cases, the constraints take the form of predicates of the Content Manager Library (CML). The projection itself converts into a complex predicate that is amenable to processing by the Content Manager service in the execution of retrieval operations. In this sense, projections are a key part of the document-oriented layer that the gDL defines over lower-level components of the gCube Content Management architecture.

As a simple example of the implications, a projection may define an include constraint over the name of metadata elements and a filter constraint over the time of their last update.
It may then be used to:

  • characterise documents with metadata elements that match both constraints;
  • retrieve, from those documents, only the names of matching metadata elements, excluding any other document property, including other inner elements and their properties.

All projections in the gDL implement the Projection interface, which can be used in element-generic computations to access their constraints. To build projections, however, clients deal with one of the following implementations of the interface:

  • DocumentProjection
  • MetadataProjection
  • AnnotationProjection
  • PartProjection
  • AlternativeProjection

A further implementation of the interface:

  • PropertyProjection

allows clients to express constraints on the generic properties of any of the elements of the gDM.

Simple Projections

Clients create projections with the factory methods of the Projections companion class (a static import improves legibility and is recommended):

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
...
DocumentProjection dp = document();
 
MetadataProjection mp = metadata();
 
AnnotationProjection annp = annotation();
 
PartProjection pp = part();
 
AlternativeProjection altp = alternative();

The projections above do not specify any include or filter constraints on the elements of the corresponding type. For example, dp matches all documents, regardless of their properties, inner elements, and properties of their inner elements. Similarly, mp matches all metadata elements of any document, regardless of their properties, and pp matches all the parts of any document, regardless of their properties. Thus the factory methods of the Projections class return empty projections.

Clients may specify include constraints with the method with() on projections. For document projections, for example:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
...
DocumentProjection dp = document().with(NAME);

With the above, the client specifies the simplest form of constraint, which requires the element to have given properties, here a name. Since this is an include constraint, the client is effectively expressing an interest only in this property, regardless of the existence and values of other properties. Used as a parameter in the read operations of the gDL, this projection is translated into a directive to retrieve only the names of documents that have one.

Note: properties are conveniently represented by constants in the Projections class. The constants are not strings, however, but dedicated Property objects that are specific to the type of projection. In particular, trying to use properties that are undefined for the type of elements targeted by the projection is illegal and the error is detected statically.
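How a property constant can be tied to a type of projection so that misuse is caught at compile time can be sketched with generics. This is only an illustration under assumptions: the types DocumentKind, MetadataKind, TypedProperty, TypedProjection and TypedProperties, and the constants DOC_NAME and META_SCHEMA, are all hypothetical and are not taken from the gDL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the gDL API: marker types stand for the kinds of element.
interface DocumentKind {}
interface MetadataKind {}

// A property constant carries the kind of element it is defined for.
final class TypedProperty<K> {
    final String name;
    TypedProperty(String name) { this.name = name; }
}

// A projection accepts only properties defined for its own kind of element.
final class TypedProjection<K> {
    final List<String> included = new ArrayList<>();

    TypedProjection<K> with(TypedProperty<K> property) {
        included.add(property.name);
        return this;
    }
}

final class TypedProperties {
    // Hypothetical constants, one per kind of element.
    static final TypedProperty<DocumentKind> DOC_NAME = new TypedProperty<>("name");
    static final TypedProperty<MetadataKind> META_SCHEMA = new TypedProperty<>("schema");

    static TypedProjection<DocumentKind> document() { return new TypedProjection<>(); }

    public static void main(String[] args) {
        TypedProjection<DocumentKind> dp = document().with(DOC_NAME);
        System.out.println(dp.included);
        // document().with(META_SCHEMA); // rejected by the compiler: wrong kind of property
    }
}
```

The commented-out line shows the kind of mistake that the type parameter turns into a compilation error rather than a runtime failure.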

Inclusion constraints of this form may be expressed at once on multiple properties, e.g.:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
...
DocumentProjection dp = document().with(NAME,LANGUAGE,BYTESTREAM);


Besides inclusion constraints, clients may specify filter constraints with the method where() on projections, e.g.:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
...
DocumentProjection dp = document().where(NAME,LANGUAGE);

Now, the client still requires documents to have a name and a language, but retains an interest in the other properties of matching documents. Used as a parameter in the read operations of the gDL, this projection is translated into a directive to retrieve all the properties of documents that have both a name and a language.


Include and filter constraints can be combined, and the projection classes follow a builder pattern to make the combinations more readable. In particular, with() and where() return the very projection on which they are invoked and may be used as follows:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
...
DocumentProjection dp = document().with(NAME,SCHEMA_URI)
                                  .where(BYTESTREAM);

Here, the client requires documents to have a name and an embedded bytestream that conforms to a schema, but is interested in processing only document names and schema URIs (e.g. for display purposes). Used as a parameter in the read operations of the gDL, this projection retrieves the requested information but avoids the transmission of bytestreams.
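The combined effect of include and filter constraints can be pictured with a small toy model. It is a sketch under assumptions, not the gDL implementation: the class ToyProjection is hypothetical, documents are reduced to flat maps from property names to values, and properties are plain strings rather than Property objects. Following the examples above, the model assumes that a projection without include constraints retrieves all the properties of matching documents.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model, not the gDL implementation.
// A document is reduced to a flat map from property names to values.
final class ToyProjection {
    private final Set<String> included = new LinkedHashSet<>(); // must exist, retrieved
    private final Set<String> filtered = new LinkedHashSet<>(); // must exist, not retrieved

    // Both methods return the receiver so that calls can be chained.
    ToyProjection with(String... properties) {
        for (String p : properties) included.add(p);
        return this;
    }

    ToyProjection where(String... properties) {
        for (String p : properties) filtered.add(p);
        return this;
    }

    // A document matches when it has every constrained property.
    boolean matches(Map<String, String> document) {
        return document.keySet().containsAll(included)
            && document.keySet().containsAll(filtered);
    }

    // Returns what a read would retrieve, or null when the document does not match.
    // With no include constraints, the whole document is retrieved.
    Map<String, String> retrieve(Map<String, String> document) {
        if (!matches(document)) return null;
        if (included.isEmpty()) return new LinkedHashMap<>(document);
        Map<String, String> result = new LinkedHashMap<>();
        for (String p : included) result.put(p, document.get(p));
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", "report");
        doc.put("schemaURI", "http://example.org/schema");
        doc.put("bytestream", "...");
        ToyProjection p = new ToyProjection().with("name", "schemaURI").where("bytestream");
        System.out.println(p.retrieve(doc).keySet());
    }
}
```

In the model, the bytestream takes part in matching but is never copied into the result, which mirrors the way the projection above avoids the transmission of bytestreams.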

Optional Modifiers

Moving now beyond the simple existence of properties, another common requirement is to indicate the optionality of properties. Clients may wish to include certain properties, or equivalently filter by certain properties, if and only if these actually exist. In this case, clients can use the opt() method of the Projections class as a constraint modifier, as this example illustrates:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
...
DocumentProjection dp = document().with(NAME,opt(SCHEMA_URI))
                                  .where(BYTESTREAM);

This projection differs from the previous one only in the optionality constraint on the existence of a schema for the document's bytestream. Used as a parameter in the read operations of the gDL, this projection retrieves the name of all documents that include a bytestream, but also their schema URI if they happen to have one.

A common use of the optional modifier is with bytestreams, which clients may wish either to find included in the document or else referred to with a URL:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
...
DocumentProjection dp = document().with(opt(BYTESTREAM),opt(URL));

Used as a parameter in the read operations of the gDL, this projection retrieves at most the bytestream and its URL for those documents that have both, only one of the two if the other is missing, and nothing at all if they are both missing.
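The behaviour of optional include constraints can be pictured with another toy model. As before, this is a sketch under assumptions rather than the gDL implementation: the class OptionalIncludes and its method withOpt() are hypothetical stand-ins for with(opt(...)), and documents are reduced to flat maps from property names to values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of optional include constraints, not the gDL implementation.
final class OptionalIncludes {
    // Maps each included property to true when it is optional.
    private final Map<String, Boolean> included = new LinkedHashMap<>();

    OptionalIncludes with(String property) { included.put(property, false); return this; }
    OptionalIncludes withOpt(String property) { included.put(property, true); return this; }

    // Returns the retrieved properties, or null when a required property is missing.
    Map<String, String> retrieve(Map<String, String> document) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> e : included.entrySet()) {
            String property = e.getKey();
            boolean optional = e.getValue();
            if (document.containsKey(property)) {
                result.put(property, document.get(property));
            } else if (!optional) {
                return null; // a required property is missing: no match
            }
        }
        return result;
    }
}
```

With two optional properties, the model still matches a document that has neither of them, but the result is then empty.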

Note: The API allows optional modifiers in filter constraints too, but their application is rather pointless in this context (they will never exclude elements from retrieval).

Advanced Projections

In more advanced forms of projections, clients may wish to specify constraints on properties other than mere existence. In these cases, they can use overloads of with() and where() that take as parameters Predicates that capture the desired constraints. As mentioned above, predicates are defined in the CML, and gDL clients need to become acquainted with the range of available predicates and how to build them.


In one common case, clients need to constrain the value of an element property, as in the following example:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
import static org.gcube.contentmanagement.contentmanager.stubs.model.constraints.Constraints.*;
import static org.gcube.contentmanagement.contentmanager.stubs.model.predicates.Predicates.*;
...
DocumentProjection p = document().with(LANGUAGE,text(is("it")));

The client uses here the predicate text(is("it")) to constrain the language of documents to match the ISO 639 code for Italian. As documented in the CML, the client does so after importing the embedded predicate language defined by the static methods in its Predicates and Constraints classes.
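The difference between constraining the existence of a property and constraining its value can be pictured with a toy model built on java.util.function.Predicate. It does not use the predicate language of the CML: the class ValueConstraints and its methods are hypothetical, and is() is only a stand-in for a value predicate such as text(is("it")).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical toy model, not the CML predicate language.
final class ValueConstraints {
    private final Map<String, Predicate<String>> constraints = new HashMap<>();

    // Stand-in for a value predicate such as text(is("it")).
    static Predicate<String> is(String expected) { return expected::equals; }

    // Constrains the value of a property, not just its existence.
    ValueConstraints with(String property, Predicate<String> constraint) {
        constraints.put(property, constraint);
        return this;
    }

    // A document matches when every constrained property exists and its value passes.
    boolean matches(Map<String, String> document) {
        for (Map.Entry<String, Predicate<String>> e : constraints.entrySet()) {
            String value = document.get(e.getKey());
            if (value == null || !e.getValue().test(value)) return false;
        }
        return true;
    }
}
```

In this model, new ValueConstraints().with("language", ValueConstraints.is("it")) matches a document whose language is "it" and rejects documents with another language or with no language at all.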

Clients may also wish to constrain the cardinality of properties, as in the following example:

Streams

Local and Remote Iterators

Stream Language

Pipes and Filters

Grouping and Unfolding

Operations

Reading Documents

Adding Documents

Updating Documents

Deleting Documents

Views

Transient Views

Persistent Views

Creating Views

Discovering Views

Using Views

Advanced Topics

Caches

Buffers