Difference between revisions of "Storage Management"

From Gcube Wiki
 
 
 
Whenever data needs to be transferred, e.g. to download a document, it is necessary to define how the raw file content should be made available. For this purpose, several transfer protocols are supported. The preferred protocol is GridFTP (locations starting with gsiftp://). Download of files is also supported from FTP (ftp://) and HTTP (http://) sites. Small files can be sent as part of a SOAP message (inmessage://) in base64-encoded form in the optional rawContent field. This kind of transfer is limited by the GT4 container, which supports SOAP messages only up to a size of approximately 2 to 8 megabytes. Sending content as SOAP attachments is currently not possible, because GT4 will provide attachment support only starting with version 4.2. If the actual raw file content is not needed, e.g. because only the name of a document is required, the special location “/dev/null” can be specified. In this case, Storage Management does not perform any transfer at all, which significantly reduces the cost of the operation.
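The location conventions described above can be illustrated with a minimal Python sketch. It is not the service's implementation: the function names and the strategy labels are assumptions made for illustration, and only the URI schemes, the base64 encoding and the “/dev/null” location come from the text.

```python
import base64
from urllib.parse import urlparse

# Schemes named in the text; the labels on the right are illustrative only.
TRANSFER_STRATEGIES = {
    "gsiftp": "gridftp",        # preferred protocol
    "ftp": "ftp",
    "http": "http",
    "inmessage": "inline-soap",  # content travels inside the SOAP message
}

def choose_transfer(location: str) -> str:
    """Return a label describing how raw content at `location` would be moved."""
    if location == "/dev/null":
        return "skip"  # content is not needed, so no transfer is performed
    scheme = urlparse(location).scheme
    try:
        return TRANSFER_STRATEGIES[scheme]
    except KeyError:
        raise ValueError(f"unsupported location: {location!r}")

def encode_inline(raw: bytes) -> str:
    """Base64-encode small content, as expected in the optional rawContent field."""
    return base64.b64encode(raw).decode("ascii")
```

A caller would pick “/dev/null” when only object metadata is required, and reserve the inline form for content well below the container's message size limit.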
 
In order to provide a flexible, yet stable API, many operations of the service accept optional parameters in the form of StorageHints. These parameters do not directly affect the result of an operation, but only the way it is executed internally (for example, by selecting the best strategy for content transfer where more than one option is supported or available). Storage hints are simply key-value pairs. The names of these hints and their allowed values are specified in the API of the service. The rationale behind hints is to allow future extensions and modifications to the service while keeping its interface stable. Hints which are unknown to the service are simply ignored. Conversely, omitting a hint is never harmful, as the service adopts a default behaviour if not instructed otherwise. When an operation accepts hints, it also provides a way to retrieve a list of consumed hints, specifying which hints were actually used internally.
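The hint-handling contract described above (unknown hints ignored, defaults applied when a hint is omitted, consumed hints reported back) can be sketched as follows. The hint names and default values here are hypothetical; the real ones are defined by the service API and are not given in this page.

```python
# Hypothetical hint names and defaults, chosen only for illustration.
KNOWN_HINTS = {
    "transfer.strategy": "auto",
    "storage.location": "internal",
}

def apply_hints(hints: dict) -> tuple:
    """Return (effective settings, consumed hints) for one operation."""
    effective = dict(KNOWN_HINTS)   # defaults apply when a hint is omitted
    consumed = {}
    for name, value in hints.items():
        if name in KNOWN_HINTS:     # hints unknown to the service are ignored
            effective[name] = value
            consumed[name] = value
    return effective, consumed
```

Because unknown keys are dropped rather than rejected, a client written against a newer version of the service can send additional hints to an older one without breaking the call.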
 
 
 
=== Detailed Service Description ===
 
 
 
 
The Base and Storage Management Layers are exposed through a single service, the Storage Management Service. Internally, this service relies on the Content Management Library. Figure 1 shows the architecture of the Base and Storage Layers.
 
[[Image:F28gCubeBSLA.png|frame|center|Figure 1. gCube Base and Storage Layers Architecture]]
 
 
 

Revision as of 10:57, 26 June 2009

Storage Management

gCube is strongly data-oriented. In contrast to other grid environments, gCube aims at fine-grained control not only over files, but also over the metadata referring to them and over their relationships. Resources available in the infrastructure are described using a custom data model, the so-called Information-Object model. This makes it possible to store, even at the lowest level of interaction, not only the raw content files, but also a wealth of additional metadata, such as properties and inter-relationships. On the basis of this simple but flexible model, it is also possible to build more abstract and richer data models.

The Storage Management Service is a fundamental piece of the gCube architecture. Its role is to take care of the storage of data and to provide, through its interface, an abstraction over storage based on the info-object model. Building on this basic data model, other services in the Information Organization family offer gCube services more sophisticated data models to manage complex documents, document collections, metadata and annotations.

Reference Model

The elementary constructs of the Information Object Model are information-objects and object references. They can be visualized respectively as nodes and arcs in a graph. An ER model clarifying how these constructs fit together is shown in Figure 1.

Figure 1. ER Schema Used to Instantiate the Information Object Model

An information object (IO) represents an elementary unit of information. It is uniquely identified by an Object Identifier (OID), is labelled with a name and a type, and is optionally annotated with a number of properties. These properties are simple key-type-value associations. Finally, it can be associated with a raw content. The raw content of an object is content of any kind. The model hides the actual storage details of the content of an object, which can, for instance, be stored as a file in gLite or as a BLOB field in a database, or be maintained in storage facilities not under the direct control of the Information Organization Services, e.g. as a file stored on a remote server and accessible through a protocol such as HTTP, FTP or GridFTP.

An object reference “links” two information objects. Each object might (i) reference many other objects and (ii) be referenced by many objects (m-n relationship). A reference is directed. It is labelled with a type attribute, called the primary role, with a secondary role that may optionally further specify the function of the primary role, and with a position attribute that makes it possible to build ordered graph structures. It can also be associated with a number of other properties.

The generality of this simple information model makes it possible to build complex data structures. Services within the Information Organization stack build on top of this model to offer a more structured and specific view of data. In particular, they can specialize the semantics attached to the labels used to annotate information objects and references, and thus create new connections and properties used to construct custom data structures.
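As a rough illustration of the reference model, the constructs above can be written down as plain Python data classes. This is only a sketch: the class and field names are assumptions derived from the description (OID, name, type, key-type-value properties, primary and secondary role, position), not the service's actual types.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Property:
    """A key-type-value association attached to an object."""
    name: str
    type: str
    value: str

@dataclass
class InfoObject:
    """A node of the graph: identified by an OID, labelled with name and type."""
    oid: str
    name: str
    type: str
    properties: list = field(default_factory=list)
    raw_content_location: Optional[str] = None  # storage details stay hidden

@dataclass(frozen=True)
class Reference:
    """A directed arc between two information objects."""
    source: str
    target: str
    primary_role: str
    secondary_role: Optional[str] = None
    position: Optional[int] = None

def ordered_targets(refs, source, primary_role):
    """Targets referenced by `source` in the given role, ordered by position."""
    matching = [r for r in refs
                if r.source == source and r.primary_role == primary_role]
    # References without a position are placed after the positioned ones.
    matching.sort(key=lambda r: (r.position is None, r.position or 0))
    return [r.target for r in matching]
```

The position attribute is what lets a higher-level service model, for example, the ordered parts of a complex document on top of an otherwise unordered graph.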


Detailed Service Description

The Storage Management Service provides the operations defined on Information Objects by the Information-Object model introduced above. These include the assignment of storage properties, the set-up of inter-object relationships, and the connection of an information object with a raw content (i.e. a file). Part of this information is maintained internally by the Storage Management Service in a relational DBMS, whose logical schema is depicted in Figure 2. The raw content of information objects can be stored internally, using a file-system-based mechanism, or in a distributed way on the grid (for example, residing at an FTP site).

Figure 2. ER Schema Used to Instantiate the Information Object Model

This schema models the essential features of the information-object model introduced in the previous section. An Information Object is characterized by an OID, comprises a number of storage properties, can be seen as an abstraction over different types of content and can reference other Information Objects via relationships. Note that the type and the name of the Information Object are not modelled explicitly in this schema. These attributes are managed as if they were ordinary storage properties, and are constrained (when necessary) only at the application level. The other properties attached to an Information Object (and their intended semantics) are in general not defined in advance. However, storage management itself attaches semantics to a number of properties which are used to handle or optimize the storage of information objects and the access to them. These are:

  • Owner identifies a gCube user who owns that Information Object. Typically, the user who has created an Information Object becomes the owner.
  • Permission is a plain access/update/removal directive for non-owning users. It states whether other users may access (read), update or remove this Information Object. It has been introduced to support a simple security mechanism.
  • URL is an access pattern for external documents that physically reside in archives and are only reflected by a placeholder Information Object in storage management. It is essential for archive import, in particular, if those archives host content which is not imported into gCube storage but resides in the archive.

It is also possible to attach general-purpose properties to relationships, in the form of type and role. The Storage Management Service also supports attaching a delete-propagation rule to a relationship (consulted upon removal of the referring object to determine whether to automatically delete the referred object). This facility provides efficient support for integrity constraints for models built on top of the info-object model that use references to represent complex information. Several propagation rules have been defined. As mentioned above, all references are directed. Hence it is necessary to specify not only that deletion cascades, but also the direction in which it cascades. The following rules are defined:

  • delete-target-propagate – indicates that whenever the source object is deleted, the referenced object (the target object of the reference) and the reference itself will also be deleted, and the deletion cascades. Similar behaviour is also triggered if not the source object, but just the reference is removed. It should be used carefully;
  • delete-target-if-single-appearance – indicates that whenever the source object is deleted and the target object does not appear in the same role in another reference, then this target object is deleted, thus minimizing the danger of accidental object deletions;
  • delete-source-propagate – indicates that whenever the target object is deleted, the source object of the reference (and the reference itself) will also be deleted, and the deletion cascades. Similar behaviour is also triggered if not the source object, but just the reference is removed. It should be used with extreme caution;
  • delete-source-if-single-appearance – indicates that whenever the target object is deleted, the source object will only be deleted if it does not appear in the same role in another reference. For instance, if a document is a member of several collections, it will not be deleted until the last collection which it references has been deleted;
  • no-delete-propagate – indicates that no additional deletion is triggered, whenever this reference is deleted.
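The propagation rules listed above can be made concrete with a small in-memory sketch. It is an interpretation under stated assumptions, not the service's algorithm: objects are kept in a plain set, references in a list, only object deletion is modelled (not the removal of a single reference), and “single appearance” is read as “no other remaining reference has the same object at the same end with the same role”.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    source: str
    target: str
    role: str
    rule: str = "no-delete-propagate"

def delete_object(oid, objects, refs):
    """Delete `oid` from the set `objects` and cascade along the list `refs`."""
    pending = [oid]
    while pending:
        current = pending.pop()
        if current not in objects:
            continue  # already deleted earlier in the cascade
        objects.discard(current)
        touching = [r for r in refs if current in (r.source, r.target)]
        # References from or to a deleted object disappear with it.
        refs[:] = [r for r in refs if r not in touching]
        for r in touching:
            if r.source == current:
                if r.rule == "delete-target-propagate":
                    pending.append(r.target)
                elif r.rule == "delete-target-if-single-appearance":
                    if not any(o.target == r.target and o.role == r.role
                               for o in refs):
                        pending.append(r.target)
            if r.target == current:
                if r.rule == "delete-source-propagate":
                    pending.append(r.source)
                elif r.rule == "delete-source-if-single-appearance":
                    if not any(o.source == r.source and o.role == r.role
                               for o in refs):
                        pending.append(r.source)
```

With a document that references two collections under delete-source-if-single-appearance, deleting the first collection leaves the document in place, and deleting the second one removes it as well, which matches the collection example given in the list.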

It is important to notice, however, that the storage manager is not responsible for maintaining the consistency of references among objects: the service ignores the semantics of roles introduced by higher-level services. This is the responsibility of the services that manage specific kinds of objects (like complex documents). The Storage Management Service considers roles only in relation to the propagation of deletions. Whenever data needs to be transferred, e.g. to download a document, it is necessary to define how the raw file content should be made available.

Resources and Properties

The Storage Management Service is implemented as a WSRF-compliant stateless Web Service. While this service clearly maintains an internal state, this state is not related to the interaction between the service and a specific client, but coincides with the entire information space of the gCube infrastructure (i.e. all Information Objects stored in it). If the state of the service changes, the changes should be visible to all clients interacting with it. Furthermore, the mechanisms for maintaining state provided by the WSRF framework are not suited to handling efficiently the amount and complexity of the information the service has to maintain. For the same reason, the service does not publish any resource on the IS, as publishing a resource for each Information Object would impose a dramatic overhead on the system.

Functions

The main functions supported by the Storage Management Service are:

  • createInfoObject() – which takes as input parameter a message containing the name and the type of the new Information Object to be created and returns the Information Object ID assigned to it;
  • removeInfoObject() – which takes as input parameter a message containing the Information Object ID and removes it and its related relations from the Information Objects space managed by this service (that is, the information space of the whole gCube infrastructure);
  • removeInfoObjects() – which takes as input parameter a message containing the list of Information Object IDs to be removed and returns the list of Information Object IDs of the objects that have not been removed, together with the error that occurred;
  • hasRawContent() – which takes as input parameter a message containing the Information Object ID and returns a Boolean value indicating whether the Information Object contains raw content associated to it or not;
  • getInfoObject() – which takes as input parameter a message containing the Information Object ID, the URI representing the location where the object will be stored and some storage hints, and returns the corresponding Information Object;
  • getInfoObjects() – which takes as input parameter a message containing a list of Information Object IDs, the URI representing the location where each object will be stored and some storage hints, and returns the corresponding Information Objects;
  • associateRawDocument() – which takes as input parameter a message containing the Information Object ID, either the URI from which the raw content can be downloaded or the raw content itself, and some storage hints, and returns the storage hints consumed during the storage of the raw content;
  • updateRawDocument() – which takes as input parameter a message containing the Information Object ID, either the URI from which the raw content can be downloaded or the raw content itself, and some storage hints, and returns the storage hints consumed during the update of the raw content;
  • removeRawDocument() – which takes as input parameter a message containing the Information Object ID and removes the raw content associated to it;
  • addReference() – which takes as input parameter a message containing a specification of the reference in terms of the reference source and target Information Object (their IDs), the role of the referencing Information Object, the secondary role of the referencing Information Object (optional), the position this reference occupies in the referencing Information Object (optional) and the criteria governing the removal, and creates this new reference between existing Information Objects;
  • addReferences() – which takes as input parameter a message containing a list of reference specifications in terms of the reference source and target Information Object (their IDs), the role of the referencing Information Object, the secondary role of the referencing Information Object (optional), the position this reference occupies in the referencing Information Object (optional) and the criteria governing the removal; it returns a report on the outcome of the requested references in the form of a list equal to the input parameter, in which each entry is enriched with a Boolean value indicating whether the operation was successful or not, the error message (if any) and the sequence number of the entry in the list;
  • removeReference() – which takes as input parameter a message containing a specification of the reference in terms of the reference source and target Information Object (their IDs), the role of the referencing Information Object and, optionally, the secondary role of the referencing Information Object and removes the so identified reference from the existing references;
  • removeReferences() – which takes as input parameter a message containing a list of reference specifications in terms of the reference source and target Information Object (their IDs), the role of the referencing Information Object and, optionally, the secondary role of the referencing Information Object, and returns a report on the outcome of the requested reference removals in the form of a list equal to the input parameter, in which each entry is enriched with a Boolean value indicating whether the operation was successful or not, the error message (if any) and the sequence number of the entry in the list;
  • retrieveReferences() – which takes as input parameter a message containing the source Information Object ID, the role of the reference and, optionally, the secondary role of the reference and returns the list of references matching these characteristics by specifying the source and target Information Object IDs, the role, the secondary role and the criteria governing the removal;
  • retrieveReferencesBulk() – which takes as input parameter a message containing a list of reference specifications expressed in terms of the source Information Object ID, the role of the reference and, optionally, the secondary role of the reference, and returns the list of references matching these characteristics. As this is a bulk operation, it might fail for only some of the elements requested. For this reason, each entry in the result list specifies, besides the source and target Information Object IDs, the role, the secondary role, the criteria governing the operation and the sequence number of the entry in the list, also a Boolean flag denoting whether the operation was successful for the corresponding element and a field reporting any related error message. For each entry, a client should first check the success flag and handle the entry according to its value;
  • retrieveReferenceTargetOIDs() – which takes as input parameter a message containing a reference specification expressed in terms of the source Information Object ID, the role of the reference and, optionally, the secondary role of the reference and returns the list of Information Object IDs of those objects referenced by the specified reference;
  • retrieveReferenceTargetOIDsBulk() – which takes as input parameter a message containing a list of reference specifications expressed in terms of the source Information Object ID, the role of the reference and, optionally, the secondary role of the reference, and returns the list of reference specifications expressed in terms of source Information Object ID, role of the reference, secondary role of the reference, target Information Object ID or error message, and sequence number of the entry in the list of those objects matching the specified criteria;
  • retrieveReferredAll() – which takes as input parameter a message containing a reference specification expressed in terms of the target Information Object ID, the role of the reference and, optionally, the secondary role of the reference, and returns the list of all the currently existing references that have the specified object as target, specified in terms of the source and target Information Object IDs, the role, the secondary role and the criteria governing the removal;
  • retrieveReferredAllBulk() – which takes as input parameter a message containing a reference specification expressed in terms of the target Information Object ID, the role of the reference and, optionally, the secondary role of the reference (these last two parameters are used to filter the result), and returns the list of all the currently existing references that have the specified object as target. As this is a bulk operation, it might fail for only some of the elements requested. For this reason, each entry in the result list specifies, besides the source and target Information Object IDs, the role, the secondary role, the criteria governing the removal and the sequence number of the entry in the list, also a Boolean flag denoting whether the operation was successful for the corresponding element and a field reporting any related error message. For each entry, a client should first check the success flag and handle the entry according to its value;
  • retrieveReferredSourceOIDs() – which takes as input parameter a message containing a reference specification expressed in terms of the target Information Object ID, the role of the reference and, optionally, the secondary role of the reference (these last two parameters are used to filter the result), and returns the list of all the Information Object IDs that are in a reference with the specified object as target;
  • retrieveReferredSourceOIDsBulk() – which takes as input parameter a message containing a list of reference specifications expressed in terms of the target Information Object ID, the role of the reference and, optionally, the secondary role of the reference (these last two parameters are used to apply a filtering to the result) and returns the list of references having the specified objects as target; this list of references is specified in terms of the target Information Object ID, the role, the secondary role, the source Information Object ID or an error message, and the sequence number of the entries in the list;
  • retrieveOIDInRelationWithAll() – which takes as input parameter a message containing a reference specification consisting of a list of entries, each made up of a Boolean value indicating whether the searched object is source or target in the reference, the Information Object ID, the role of the reference and, optionally, the secondary role of the reference, and returns a list of Information Object IDs of those objects that have a reference with all the objects in the specification;
  • retrieveObjectInRelationWithAll() – which takes as input parameter a message containing a reference specification consisting of a list of entries, each made up of a Boolean value indicating whether the searched object is source or target in the reference, the Information Object ID, the role of the reference and, optionally, the secondary role of the reference, and returns a list of Information Object descriptions excluding their raw content;
  • setStorageProperty() – which takes as input parameter a message containing the Information Object ID, the property name, the property type and the property value and sets the given property on the specified Information Object;
  • unsetStorageProperty() – which takes as input parameter a message containing the Information Object ID and the property name and removes the given property of the specified Information Object;
  • retrieveStorageProperties() – which takes as input parameter a message containing the Information Object ID and returns the properties attached to the given Information Object;
  • retrieveStorageProperty() – which takes as input parameter a message containing the Information Object ID and the property name and returns the property name, value and type of the specified property;
  • retrieveOIDsByStorageProperty() – which takes as input parameter a message containing the property name and the property value and returns the list of Information Object IDs of those objects having the specified property attached;
  • retrieveObjectsByStorageProperty() – which takes as input parameter a message containing the property name and the property value and returns the list of Information Object Descriptions of those objects having the specified property attached;
  • retrieveOIDsHavingAllStorageProperties() – which takes as input parameter a message containing a list of property names and property values and returns the list of Information Object IDs of those objects having all the specified properties attached;
  • retrieveObjectsHavingAllStorageProperties() – which takes as input parameter a message containing a list of property names and property values and returns the list of Information Object Descriptions of those objects having all the specified properties attached;
  • createInfoObjectWithContent() – which takes as input parameter a message containing the name and the type of the new Information Object to be created, the URI the raw content can be downloaded from or the raw content itself, some storage hints, a list of references with other Information Objects, and a list of properties and returns the Information Object ID of the just created object together with the consumed storage hints and a Boolean flag indicating whether the operation is successful or not;
  • createInfoObjectsWithContent() – which takes as input parameter a message containing a list of Information Object specifications expressed in terms of the name and the type of each Information Object to be created, the URI the raw content can be downloaded from or the raw content itself, some storage hints, a list of references with other Information Objects, and a list of properties and returns the list of Information Object IDs of the just created object together with the consumed storage hints;
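Several of the bulk operations listed above return one entry per requested element, carrying a sequence number, a success flag and an optional error message, and clients are expected to check the flag before using an entry. A minimal client-side sketch of that pattern follows; the BulkEntry type and its field names are hypothetical and do not reproduce the service's message types.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BulkEntry:
    """One element of a bulk result (field names are illustrative)."""
    sequence: int                     # position of the entry in the request list
    success: bool                     # must be checked before using the payload
    target_oid: Optional[str] = None  # payload, meaningful only on success
    error: Optional[str] = None       # message, meaningful only on failure

def split_bulk_result(entries):
    """Separate successful entries from failed ones, keyed by sequence number."""
    ok, failed = {}, {}
    for entry in sorted(entries, key=lambda e: e.sequence):
        if entry.success:
            ok[entry.sequence] = entry.target_oid
        else:
            failed[entry.sequence] = entry.error or "unknown error"
    return ok, failed
```

Keying both results by sequence number lets the caller map each failure back to the element of the request list that caused it and retry only those elements.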