GCube Information Organisation Services (LEGACY)

From Gcube Wiki

The gCube Information Organisation Services are the family of subsystems implementing the services that support the management (storage, organisation, description and annotation) of information. These services implement the notion of Information Objects, i.e. logical units of information potentially consisting of, and linked to, other Information Objects so as to form compound objects. The services are organised in three main functional areas: (i) the storage and organisation of such Information Objects and their constituents (Content and Storage Management); (ii) the management of the metadata objects (actually implemented as a kind of Information Object) potentially equipping each Information Object (Metadata Management); and (iii) the management of the annotation objects (actually implemented as a kind of Information Object) potentially enriching each Information Object (Annotation Management).

The gCube Content Model

While other infrastructures for the manipulation of content in Grid-based environments, like gLite, provide basic file-system-like functionality for content manipulation, the Information Organization services aim to provide higher-level functionality, built on top of gLite or other storage facilities. Content is stored and organized following a graph-based data model, the Information Object Model, which allows finer control of content than a file-based view by making it possible to annotate content with arbitrary properties and to relate different content units via arbitrary relationships. Building on this basic data model, other services in the Information Organization family offer other gCube services more sophisticated data models to manage complex documents, document collections, metadata and annotations.

Information Object Model

The elementary constructs of the model are Information Objects (the nodes of the graph) and object references (the arcs). The ER Diagram in Figure 1 describes the model.

Figure 1. Information Object ER Model
  • An Information Object (IO) represents an elementary unit of information. It is uniquely identified by an Object Identifier (OID), is labelled with a name and a type, and is optionally annotated with a number of properties. These properties are simple key-type-value associations. Finally, it can be associated with raw content, which may be content of any kind. The model hides the actual storage details of an object's content, which can for instance be stored as a file in gLite or as a BLOB field in a database, or be maintained in storage facilities not under the direct control of the Information Organization Services, e.g. as a file stored on a remote server and accessible through a protocol such as HTTP, FTP or GridFTP.

  • An object reference “links” two Information Objects. Each object might (i) reference many other objects and (ii) be referenced by many objects (m-n relationship). A reference is directed. It is labelled with a type attribute, called the primary role; a secondary role, which may optionally further specify the function of the primary role; and a position attribute, which makes it possible to build ordered graph structures. It can also be associated with a number of other properties.

The Information Object model introduced above is exposed to higher-level Information Organization Services by a component called the Storage Management Service (cf. Section 6.3). The generality of this simple information model makes it possible to build complex data structures. The services within the Information Organization stack build on top of this model to offer an organized, high-level view of content. This is done by attaching specialised semantics to the labels used to annotate Information Objects and references.
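
The graph model described above can be illustrated with a minimal in-memory sketch. It is not the gCube API: the class names, the `ObjectGraph` container, the `has-part` role and the OIDs are all assumptions made for illustration. It only shows how objects, typed properties, directed references and the position attribute fit together.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class InformationObject:
    """A graph node: identified by an OID, labelled with a name and a type."""
    oid: str
    name: str
    type: str
    # Properties are key -> (type, value) associations.
    properties: dict[str, tuple[str, Any]] = field(default_factory=dict)
    # Raw content is optional; where it is physically stored is hidden.
    raw_content: Optional[bytes] = None


@dataclass
class Reference:
    """A directed arc from one Information Object to another."""
    source: str                      # OID of the referencing object
    target: str                      # OID of the referenced object
    primary_role: str
    secondary_role: Optional[str] = None
    position: Optional[int] = None
    properties: dict[str, tuple[str, Any]] = field(default_factory=dict)


class ObjectGraph:
    """Hypothetical in-memory container for objects and their references."""

    def __init__(self) -> None:
        self.objects: dict[str, InformationObject] = {}
        self.references: list[Reference] = []

    def add(self, obj: InformationObject) -> None:
        self.objects[obj.oid] = obj

    def link(self, ref: Reference) -> None:
        # Both endpoints must already exist in the graph.
        if ref.source not in self.objects or ref.target not in self.objects:
            raise KeyError("both endpoints must be known objects")
        self.references.append(ref)

    def outgoing(self, oid: str, primary_role: str) -> list[InformationObject]:
        """Targets referenced by `oid` under a role, ordered by position."""
        refs = [r for r in self.references
                if r.source == oid and r.primary_role == primary_role]
        # References without a position sort after the positioned ones.
        refs.sort(key=lambda r: (r.position is None, r.position or 0))
        return [self.objects[r.target] for r in refs]


graph = ObjectGraph()
graph.add(InformationObject("oid-1", "page", "text/html"))
graph.add(InformationObject("oid-2", "logo", "image/png", raw_content=b"\x89PNG"))
graph.add(InformationObject("oid-3", "photo", "image/jpeg"))
graph.link(Reference("oid-1", "oid-3", "has-part", position=2))
graph.link(Reference("oid-1", "oid-2", "has-part", position=1))
print([o.name for o in graph.outgoing("oid-1", "has-part")])  # ['logo', 'photo']
```

Because a reference carries its own position, the same object can appear at different places in different aggregates without being changed itself.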

Document and Collection Models

Starting from this model, it is easy to build a document model in which complex documents, composed of various, possibly nested subparts, are represented as chains of Information Objects linked via appropriate relationships. For instance, an HTML document that includes a number of images may be modelled as a complex object that provides references to the Information Objects containing the images. The position attribute present in the Information Object model helps in representing an aggregate object made up of parts that have to be fitted together in a certain order. A dedicated component in the Information Organization family, the Content Management Service (cf. Section 6.4), exposes the document model to other services.

In a similar way, specific, complex metadata (like indexes or multimedia features) can be represented as separate Information Objects that are associated with the object they describe via appropriate relationships. For instance, a reference type may be “indexes” with a role name that gives additional information, like “full-text index”.

The same representation mechanisms are also used to instantiate the concept of a collection. Collections are the basic data structure used to organize information inside the Information Organization Services. Each collection is characterized by a collection identifier, is labelled with a number of specific properties, and contains a number of documents. More specifically, a document can only exist as part of a given collection. Collections can in turn be nested, i.e. a collection can appear as a member of another collection. A collection can be static (or materialized), that is, contain a statically defined number of objects that are explicitly added to it or deleted from it, or be virtual.
The content of a virtual collection is not determined statically, but rather specified through declarative membership predicates that define which objects currently present in the gCube information-objects space are part of the collection. Its contents are thus determined dynamically at the moment when the collection is accessed by evaluating the membership predicates. For example, it is possible to define the collection of all objects having a certain MIME type (e.g. pdf).
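
The difference between static and virtual collections can be sketched as follows. This is an illustrative approximation only: the `object_space` dictionary, the two classes and the use of Python callables as membership predicates are assumptions, and the text does not specify how gCube expresses predicates.

```python
from __future__ import annotations

from typing import Callable, Iterable

# Hypothetical object space: OID -> properties of each Information Object.
object_space = {
    "oid-1": {"mime": "application/pdf", "name": "report"},
    "oid-2": {"mime": "image/png", "name": "logo"},
    "oid-3": {"mime": "application/pdf", "name": "manual"},
}


class StaticCollection:
    """Members are added and removed explicitly."""

    def __init__(self, members: Iterable[str] = ()) -> None:
        self._members = set(members)

    def add(self, oid: str) -> None:
        self._members.add(oid)

    def remove(self, oid: str) -> None:
        self._members.discard(oid)

    def members(self) -> set[str]:
        return set(self._members)


class VirtualCollection:
    """Members are whatever currently satisfies every membership predicate."""

    def __init__(self, space: dict, predicates: list[Callable[[dict], bool]]) -> None:
        self._space = space
        self._predicates = predicates

    def members(self) -> set[str]:
        # Evaluated on every access, so later changes to the space show up.
        return {oid for oid, props in self._space.items()
                if all(p(props) for p in self._predicates)}


pdfs = VirtualCollection(object_space, [lambda p: p.get("mime") == "application/pdf"])
print(sorted(pdfs.members()))   # ['oid-1', 'oid-3']

# A new PDF joins the virtual collection without any explicit add.
object_space["oid-4"] = {"mime": "application/pdf", "name": "thesis"}
print(sorted(pdfs.members()))   # ['oid-1', 'oid-3', 'oid-4']
```

The static collection stores its member set, while the virtual one stores only the predicates and recomputes membership on access.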

Metadata and Annotation Models

The metadata and annotation models are based upon a number of characterisations of the primitives which define the gCube storage model, namely Information Objects and directed, binary relationships between such objects. Based on such characterisations, we can define the metadata and annotation models so as to follow closely the intuition and yet preserve a degree of semi-formal rigour.

  • The primary role of a relationship is a characterisation of its intended semantics. We assume that the secondary role of a relationship is an optional specialisation of the primary role. Whenever convenient, we say that a relationship with (primary or secondary) role R is an R-relationship and that its source and target are an R-source and an R-target. If R1 and R2 are, respectively, the primary and secondary role of a relationship, then an R2-relationship is also an R1-relationship.
Figure 1. Relationship Model
  • A relationship R is a dependency for its source (target) if the existence of the latter depends on the existence of at least one R-target (R-source). In this case, we also speak of a dependent source (dependent target). As a pragmatic corollary, an R-source (R-target) is deleted when its last R-target (R-source) is deleted.
  • We say that a relationship is exclusive for its source (target) if it cannot relate the latter to more than one target (source). Otherwise, we say that it is repeatable on its source (target).
Figure 2. Exclusive Relations Model
  • Relationships with primary role is-member-of (IMO) give the semantics of collections to their targets and that of collection members to their sources.
  • We assume that IMO-relationships are repeatable on members, so that an object can be a member of more than one collection. We also assume that IMO-relationships are dependencies for their members, so that a member is deleted when its last collection is deleted. Finally, we assume that collections cannot be members of other collections in turn.
  • We expect the members of a collection to share enough similarities to be homogeneously processed, such as content formats and/or relationships. In particular, we speak of an R-collection to characterise a collection whose members are all sources (targets) of R-relationships.
Figure 3. Membership Model
  • We say that a relationship R preserves membership if it relates members of some collection C to members of the same or a different collection C'. If C is an R-collection, then we say that R is an R-mapping from C to C'.
  • Since objects can be members of multiple collections, we assume that relationships on members may hold in the scope of some but not all of those collections. Scope is most usefully and tractably modelled when it is lifted to entire collections. In particular, we say that a collection C is in the scope of a collection C' if there is an R-mapping from C to C' and there is an R-relationship between C and C'.
Figure 4. Membership Preservation Model
  • We say that an object is a document if it models intellectual content. We then say that a collection is a document collection if all its members are documents.
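
The dependency and exclusivity characterisations above can be made concrete with a small sketch. It assumes a toy in-memory graph: the `Graph` class, its method names and the object identifiers are illustrative and not part of gCube. Only the source-side variants are shown; the target-side ones would be symmetric.

```python
from collections import namedtuple

Rel = namedtuple("Rel", "source role target")


class Graph:
    """Toy relationship store with per-role constraints on sources."""

    def __init__(self, exclusive_for_source=(), dependency_for_source=()):
        self.objects = set()
        self.rels = set()
        self.exclusive_for_source = set(exclusive_for_source)
        self.dependency_for_source = set(dependency_for_source)

    def relate(self, source, role, target):
        # Exclusive for the source: at most one target per source and role.
        if role in self.exclusive_for_source and any(
                r.source == source and r.role == role for r in self.rels):
            raise ValueError(f"{role} is exclusive for its source {source}")
        self.objects.update((source, target))
        self.rels.add(Rel(source, role, target))

    def delete(self, oid):
        """Delete an object, then any dependent source that lost its last target."""
        if oid not in self.objects:
            return
        self.objects.discard(oid)
        touching = {r for r in self.rels if oid in (r.source, r.target)}
        self.rels -= touching
        for r in touching:
            lost_last_target = not any(
                x.source == r.source and x.role == r.role for x in self.rels)
            if (r.target == oid and r.role in self.dependency_for_source
                    and lost_last_target):
                self.delete(r.source)


# is-member-of is repeatable on members and a dependency for them.
g = Graph(dependency_for_source={"is-member-of"})
g.relate("doc-1", "is-member-of", "coll-A")
g.relate("doc-1", "is-member-of", "coll-B")
g.relate("doc-2", "is-member-of", "coll-A")

g.delete("coll-A")
print(sorted(g.objects))   # ['coll-B', 'doc-1']
```

Here `doc-2` disappears with its only collection, while `doc-1` survives because it is still a member of `coll-B`.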

The Metadata Model

  • We identify a type of relationship with primary role is-Described-by (IDB), which gives to targets the semantics of metadata about the corresponding sources. In particular, we say that an IDB-target is a metadata object, or simply metadata, for the corresponding IDB-source. We expect IDB-sources to be documents, but do not require it.
  • We constrain IDB-relationships to be exclusive for their targets but repeatable for their sources. Accordingly, a metadata object can describe one and only one object but an object can have an arbitrary number of metadata objects. We also constrain IDB-relationships to preserve membership, so that a metadata object describes a member of some collection if and only if it is a member of some other collection in turn.
  • We say that a collection M is a metadata collection of type R for a collection C if M is an R-collection in the scope of C for some secondary role R of IDB. Specifically, (i) all the members of M are metadata for members of C, and (ii) M is metadata for C.
Figure 1. The Metadata Model
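
The cardinality constraints of the metadata model (a metadata object describes exactly one object, while an object may carry any number of metadata objects) can be sketched as a small index. The `MetadataIndex` class, its methods and the secondary role names `descriptive` and `technical` are hypothetical and serve only to illustrate the constraints and the role specialisation.

```python
from collections import defaultdict

IDB = "is-described-by"


class MetadataIndex:
    """Tracks is-described-by links between described objects and metadata."""

    def __init__(self):
        # metadata OID -> (described OID, secondary role)
        self._described_by = {}
        # described OID -> list of metadata OIDs
        self._metadata_of = defaultdict(list)

    def describe(self, described, metadata, secondary_role=None):
        # A metadata object may describe one and only one object.
        if metadata in self._described_by:
            raise ValueError(f"{metadata} already describes an object")
        self._described_by[metadata] = (described, secondary_role)
        # An object may carry any number of metadata objects.
        self._metadata_of[described].append(metadata)

    def metadata_for(self, described, role=IDB):
        """Metadata of `described`; `role` is IDB or one of its secondary roles."""
        return [m for m in self._metadata_of.get(described, [])
                if role == IDB or self._described_by[m][1] == role]


index = MetadataIndex()
index.describe("doc-1", "md-1", secondary_role="descriptive")
index.describe("doc-1", "md-2", secondary_role="technical")
print(index.metadata_for("doc-1"))               # ['md-1', 'md-2']
print(index.metadata_for("doc-1", "technical"))  # ['md-2']
```

Querying by the primary role returns every metadata object, which mirrors the rule that a relationship with a secondary role is also a relationship of its primary role.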

The Annotation Model

  • We identify a secondary role is-Annotated-by (IAB) for the IDB relationship, which gives to targets the semantics of annotations of the corresponding sources. In particular, we say that an IAB-target is an annotation object, or simply annotation, for the corresponding IAB-source. We expect IAB-sources to be documents, but do not require it.
  • We say that a collection M is an annotation collection if M is a metadata collection of type IAB.

Overall Architecture

The architecture of the Information Organization services is articulated over three fundamental layers, as illustrated in Figure 2: the Base Layer, the Storage Management Layer, and the Content, Metadata and Annotation Management Layer. Additional information about these layers and their functionality is provided below, while the technical details of how to interact with the services at each layer are provided in the next sections.

Figure 2. Data Management in gCube

The Base Layer is an abstraction from basic Grid storage facilities, as found in existing Grid infrastructures. Concretely, the Base Layer is a storage container responsible for storing the information needed to maintain the Information Object model in a variety of storage facilities, like a hierarchical file system, a relational database and/or Grid storage facilities based on gLite, and for providing uniform access to them to the layer directly above it. Though distinct from the upper layer from a logical and architectural point of view, the Base Layer exposes functionality only to the Storage Management Layer. For this reason, its functionality is only available as a Java API, used internally and not directly accessible through a service interface.
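
The idea of uniform access over heterogeneous storage facilities can be sketched as a common backend interface with interchangeable implementations. The actual Base Layer is a Java API whose signatures are not given here, so everything below (`StorageBackend`, `BaseLayer`, `store`, `load`, the backend names) is a hypothetical illustration written in Python.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Common interface every storage facility is wrapped in."""

    @abstractmethod
    def put(self, oid: str, content: bytes) -> None: ...

    @abstractmethod
    def get(self, oid: str) -> bytes: ...


class InMemoryBackend(StorageBackend):
    def __init__(self) -> None:
        self._blobs = {}

    def put(self, oid: str, content: bytes) -> None:
        self._blobs[oid] = content

    def get(self, oid: str) -> bytes:
        return self._blobs[oid]


class FileSystemBackend(StorageBackend):
    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def put(self, oid: str, content: bytes) -> None:
        (self._root / oid).write_bytes(content)

    def get(self, oid: str) -> bytes:
        return (self._root / oid).read_bytes()


class BaseLayer:
    """Routes each object to a backend and remembers where it went."""

    def __init__(self, backends: dict) -> None:
        self._backends = backends
        self._location = {}   # OID -> backend name

    def store(self, oid: str, content: bytes, backend: str) -> None:
        self._backends[backend].put(oid, content)
        self._location[oid] = backend

    def load(self, oid: str) -> bytes:
        # Callers never need to know which facility holds the content.
        return self._backends[self._location[oid]].get(oid)


with tempfile.TemporaryDirectory() as root:
    layer = BaseLayer({"memory": InMemoryBackend(),
                       "files": FileSystemBackend(root)})
    layer.store("oid-1", b"small note", backend="memory")
    layer.store("oid-2", b"large payload", backend="files")
    print(layer.load("oid-1"), layer.load("oid-2"))
```

A gLite or database backend would, under the same assumption, be one more subclass of `StorageBackend`, leaving callers of `load` unchanged.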

The Storage Management Layer is wrapped around the Base Layer and has the responsibility to (i) introduce Information Objects as an abstraction for manifold content and associate storage properties with them; (ii) introduce generalized relationships between Information Objects and manage their properties; (iii) preserve bindings to file-based content; (iv) maintain consistency of object relationships through appropriate constraints; and (v) provide a service interface for querying and updating the content, its associated storage properties and relationship information. This interface is exposed by a single service, the Storage Management Service.

The Content, Metadata and Annotation Management area contains the services the higher-level gCube services interact with for content, metadata and annotation manipulation. They are composed of a number of services that, building on top of the basic Information Object model exposed by the Storage Management Layer, provide high-level information-organization models/views. It is at this level that the actual semantics of entities used by gCube services (like collections, documents and metadata) are attached to the Information Objects and their relationships. In particular, the services that are part of this layer provide:

  • A document-oriented view of content, based on the document model described above. The functionality offered to manipulate documents includes basic document storage, lookup and update.
  • Collection management functionality, as sketched above, used to organize documents into complex data sets.
  • Fundamental metadata association.

Regarding notification facilities, services at a higher level in the gCube infrastructure must be able to monitor changes at various levels in the content management layer. In particular, documents (as managed by the Content Management Service) and collections (as managed by the Collection Management Service) should be monitored. The services in charge of maintaining content resources must then propagate appropriate events corresponding to changes. Event notification in gCube is based on WS-Notification. Furthermore, the gCube infrastructure provides facilities for brokering registrations to notifications. However, in the case of content management services it does not make sense to employ such facilities: because these services manipulate an extremely large number of objects, it is not possible to publish each of these objects individually as a topic in the Information System. Thus, the services must directly expose methods that allow other services to register for specific kinds of events on specific objects.

Another service is also present at this level, which provides the means to manage efficiently the functionality exposed by this layer. It takes care of archive import, i.e. the creation and update of collections of documents starting from content available outside the infrastructure.
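
The direct registration described above, where a consumer subscribes to one kind of event on one specific object rather than to a published topic, can be sketched as a plain in-process registry. This is not WS-Notification and not the gCube interface: the `ContentEventRegistry` class, its method names, the `member-added` event kind and the identifiers are assumptions for illustration.

```python
from collections import defaultdict


class ContentEventRegistry:
    """Lets consumers subscribe to one kind of event on one object."""

    def __init__(self):
        # (object identifier, event kind) -> list of callbacks
        self._subscribers = defaultdict(list)

    def register(self, oid, kind, callback):
        self._subscribers[(oid, kind)].append(callback)

    def unregister(self, oid, kind, callback):
        self._subscribers[(oid, kind)].remove(callback)

    def notify(self, oid, kind, payload=None):
        """Deliver an event to the subscribers of (oid, kind); return the count."""
        # Copy the list so callbacks may unregister themselves safely.
        callbacks = list(self._subscribers.get((oid, kind), []))
        for callback in callbacks:
            callback(oid, kind, payload)
        return len(callbacks)


received = []
registry = ContentEventRegistry()
registry.register("coll-A", "member-added",
                  lambda oid, kind, payload: received.append((oid, kind, payload)))

registry.notify("coll-A", "member-added", "doc-7")
registry.notify("coll-B", "member-added", "doc-8")   # nobody registered
print(received)   # [('coll-A', 'member-added', 'doc-7')]
```

Keying subscriptions by object and event kind means nothing is stored for objects nobody watches, which is the point of avoiding one published topic per object.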