Digital Library Administration


Content & Storage Management

Content Management strictly relies on Storage Management. It is therefore a prerequisite to set up a running instance of Storage Management before Content Management can be successfully started. There are two possibilities to set up Storage Management: a simple one using Apache Derby as the database backend, and an advanced one where an existing database is used via JDBC.

Simple Setup of Storage Management using Apache Derby

Apache Derby is an open source relational database implemented entirely in Java, available under the Apache License, Version 2.0, with a small footprint of about 2 megabytes. It is sufficient as a database backend for getting started with Storage Management. However, when large amounts of data are stored or more elaborate backup & recovery strategies are required, a traditional full-scale RDBMS might be a better choice.

If Storage Management is deployed dynamically or manually from the GAR, its default installation places a configuration file at $GLOBUS_LOCATION/etc/<Service-Gar-Filename>/StorageManager.properties that expects Derby to be available and to have permission to write files under ./StorageManagementService/db/storage_db. Derby is started in embedded mode, which does not even require a username or password. Multiple connections from the same Java Virtual Machine are possible and quite fast, but no two Java VMs can access the database at the same time.

If all dependencies have been installed correctly, the container should start and create a new database if needed.

The lines defining the JDBC connection to the database in the above-mentioned configuration file are:

DefaultRawFileContentManager=jdbc\:derby
DefaultRelationshipAndPropertyManager=jdbc\:derby
# derby settings (Default)
jdbc\:derby.class=org.diligentproject.contentmanagement.baselayer.rdbmsImpl.GenericJDBCDatabase
jdbc\:derby.params.count=4
jdbc\:derby.params.0=local_derby_storage_db
jdbc\:derby.params.1=org.apache.derby.jdbc.EmbeddedDriver
jdbc\:derby.params.2=jdbc\:derby\:./StorageManagementService/db/storage_db;create\=true
jdbc\:derby.params.3=5
By changing the path after derby\: in the line
jdbc\:derby.params.2=jdbc\:derby\:./StorageManagementService/db/storage_db;create\=true
you can choose another place to store the database.
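
For example, to keep the database under /var/lib/storage_db instead (an illustrative path that must be writable by the user running the container), the line would become:

jdbc\:derby.params.2=jdbc\:derby\:/var/lib/storage_db;create\=true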

In this setting, all relationships and properties as well as the raw file content are stored inside the Derby database. This is defined in the first two lines of the configuration snippet shown above.

Advanced Setup of Storage Management using an arbitrary relational JDBC database

Storage Management depends on the following external components:

  1. Apache Jakarta Commons Database Connection Pooling, which itself requires Commons Pool and therefore also Commons Collections
  2. a JDBC driver for the database to be used.

The first one should get deployed dynamically; the second you will have to install yourself, since it depends on the RDBMS you want to use. The most common choice is MySQL, since it is already used by many gLite components such as DPM or LFC, so there is often no need to set up another RDBMS. The corresponding JDBC driver is named Connector/J and is released under a dual-licensing strategy like the MySQL RDBMS itself: a commercial license and the GNU General Public License. For this reason, neither the RDBMS nor the JDBC driver is directly distributed with the gCube software. The JDBC driver must be available to the container, and therefore its .jar file(s) may need to be stored in $GLOBUS_LOCATION/lib/.
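
For illustration only (the exact file name depends on the Connector/J version you downloaded), installing the driver typically means copying its jar into the container's library directory and restarting the container:

cp mysql-connector-java-<version>-bin.jar $GLOBUS_LOCATION/lib/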

You will have to prepare the DBMS manually and create a new database that will be used by Storage Management. For this, you may also want to install mysql-client, MySQL Administrator, and MySQL Query Browser - or a database-independent tool like ExecuteQuery. On Scientific Linux 3, the following steps need to be performed:

apt-get install mysql-server mysql-client
mysqladmin create <dbname>
mysql --user=root <dbname>

This will install the MySQL server (if not already present) and the corresponding command-line client. The next line creates a new, empty database. The last line connects to this database using the command-line client. If the RDBMS has been set up to require a password for the local root account, use the option -p to be prompted for the password. Once you are logged in, you have to create a new user with sufficient rights to connect, to create and alter tables, and to perform all kinds of selects, inserts, updates and deletes on them in this database.

The easiest way to achieve this in MySQL is
GRANT ALL PRIVILEGES ON <dbname>.* TO '<username>'@'%' IDENTIFIED BY '<password>';
(Until version 5.0, MySQL has its very own syntax here instead of a separate CREATE USER statement - see [1] for more details.)

MySQL versions < 5 limit the size of individual database files by default to 4GB, or even 2GB on some filesystems. This might become a problem if you store very many files or just a couple of huge files; MySQL might then start to complain that the "Table is full". In this case, execute the SQL command

ALTER TABLE Raw_Object_Content MAX_ROWS=1000000000 AVG_ROW_LENGTH=1000;

to allocate pointers for bigger tables. See [2] for details.

Due to some inconvenience in the MySQL protocol for transferring BLOBs of several megabytes, you might have to increase the max_allowed_packet variable in the my.cnf. On Scientific Linux this is located in /var/lib/mysql/ - see [3] for details.
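
As a sketch, the corresponding my.cnf entry could look as follows (64M is only an example value; choose it according to the largest BLOBs you expect to transfer):

[mysqld]
max_allowed_packet=64M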

For MySQL, you can use the following lines in the above-mentioned configuration file:

# local mysql settings (template for MySQL instances)
jdbc\:mysql_local.class=org.diligentproject.contentmanagement.baselayer.rdbmsImpl.GenericJDBCDatabase
jdbc\:mysql_local.params.count=4
jdbc\:mysql_local.params.0=local_mysql_db
jdbc\:mysql_local.params.1=com.mysql.jdbc.Driver
jdbc\:mysql_local.params.2=jdbc\:mysql\://127.0.0.1/storage_db?user\=THE_USER&password\=THE_PASS
jdbc\:mysql_local.params.3=100
You will have to change the line
jdbc\:mysql_local.params.2=jdbc\:mysql\://127.0.0.1/storage_db?user\=THE_USER&password\=THE_PASS
to use the correct IP address of your server, the database name, the username and the password. This is nothing else than a regular JDBC connection string (plus a \ in front of each : to escape it in the Java property file), so if you are familiar with that, it should be quite simple to use; otherwise there is plenty of documentation on how to make sense of it, e.g. [4].
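
For example, with the illustrative values 192.0.2.10 as the server address, my_storage_db as the database name and the user storage with password secret, the line would read:

jdbc\:mysql_local.params.2=jdbc\:mysql\://192.0.2.10/my_storage_db?user\=storage&password\=secret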

In addition, you have to set the Storage Manager to use this database by default. To do so, simply edit the lines at the top of the file to:

DefaultRawFileContentManager=jdbc\:mysql_local
DefaultRelationshipAndPropertyManager=jdbc\:mysql_local

If you rename consistently (or copy & paste), there is no need to stick to the name mysql_local.

Other databases, such as PostgreSQL in a version > 8, should also work, depending on their compliance with ANSI SQL92 and their handling of BLOBs. However, this has not been extensively tested yet.

Advanced Setup of Storage Management for protocol handlers

Storage Management is able to use a couple of other protocols to retrieve and store files. The default configuration contains the following entries:

# handlers
protocol.handler.count=4
protocol.handler.0.class=org.diligentproject.contentmanagement.baselayer.inMessageImpl.InMemoryContentManager
protocol.handler.1.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.FTPPseudoContentManager
protocol.handler.2.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.HTTPPseudoContentManager
## Alternative for HTTPPseudoContentManager: Commons HTTPClient, requires additional libraries
# protocol.handler.2.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.CommonsHTTPClientPseudeContentManager
protocol.handler.3.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.GridFTPContentManager
## WARNING: do not run LocalFilesystemStorage handler on productive service, unless it is really, really secure!
# protocol.handler.4.class=org.diligentproject.contentmanagement.baselayer.filesystemImpl.LocalFilesystemStorage

The handlers are used in the order they are defined in the configuration file. If several handlers claim to handle the same protocol, only the one with the lowest number is used. The count must match the number of defined handlers, which are numbered from 0 to count-1.
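
For example, if you were to enable the commented-out LocalFilesystemStorage handler as handler number 4 (mind the security warning in the configuration file), the count would have to be raised accordingly:

protocol.handler.count=5
protocol.handler.4.class=org.diligentproject.contentmanagement.baselayer.filesystemImpl.LocalFilesystemStorage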

The inmessage:// protocol is convenient since it can transfer raw content directly inside the SOAP message and therefore does not require an additional communication protocol, which would require another handshake between client and server and - in a secure environment - a separate authentication & authorization and possibly a completely separate user management. Unfortunately, this protocol does not work for files bigger than approximately 2 megabytes due to limitations of the container. There is no simple modification that would enable GT4 and its underlying Axis to cope with big base64-encoded message parts. Big files have to be transferred using other protocols, like FTP, HTTP, or GridFTP. The only workaround for downloads from SMS is to use chunked downloads. On the other hand, the aforementioned well-established protocols may also provide much better performance for big files, where the handshaking takes comparably little time.

The next two lines set up handlers for downloading from FTP and HTTP locations using the built-in clients of the Java Class Library. For better performance and to work around some security issues in Sun's implementation [5], Apache Jakarta Commons HTTPClient can also be used. For this, simply comment out
protocol.handler.2.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.HTTPPseudoContentManager
and uncomment
protocol.handler.2.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.CommonsHTTPClientPseudeContentManager
This requires that the correct .jar file from the above-mentioned location (or from the Service Archive) is also installed in $GLOBUS_LOCATION/lib/, together with its dependency Apache Jakarta Commons Codec.
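
Again as a sketch only, since the exact file names depend on the shipped versions, this means copying both jars into the container's library directory:

cp commons-httpclient-<version>.jar commons-codec-<version>.jar $GLOBUS_LOCATION/lib/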

Advanced Setup of Storage Management for storing raw content outside a database

A template for using the file system instead of the RDBMS is presented in the configuration file:

# filesystem settings (another template)
file\:fileStorage.class=org.diligentproject.contentmanagement.baselayer.filesystemImpl.LocalFilesystemStorage
file\:fileStorage.params.count=1
file\:fileStorage.params.0=/usr/local/globus/etc/StorageManagementService/Stored_Content
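
The configured directory must exist and be writable by the user running the container; using the path from the template above, it could be prepared like this (the container user name is a placeholder):

mkdir -p /usr/local/globus/etc/StorageManagementService/Stored_Content
chown <container-user> /usr/local/globus/etc/StorageManagementService/Stored_Content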

To make this the default location for storing content, you have to set in the first lines of the configuration file:

DefaultRawFileContentManager=file\:fileStorage

Another option would be to use GridFTP here to store the files in a Storage Element on the Grid.

Setup of Content & Collection Management

Content & Collection Management entirely rely on Storage Management and interact heavily with it. It is therefore a good choice to deploy them on the same node that is hosting Storage Management, so that network communication does not become the performance bottleneck.

The only parameter that might need adjustment can be found in both configuration files at $GLOBUS_LOCATION/etc/<CMS-GAR-Filename>/ContentManager.properties and $GLOBUS_LOCATION/etc/<ColMS-GAR-Filename>/CollectionManager.properties, respectively.

StorageManagementService=http\://127.0.0.1\:8080/wsrf/services/diligentproject/contentmanagement/StorageManagementServiceService

This line must point to the EPR of the Storage Management Service that should be used. If the GT4 container is running on its default port 8080 and all three services are deployed on the same node, there should be no need to adjust this. Otherwise the host and port might need to be corrected (in both configuration files).
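
For example, if Storage Management runs on another host (illustrative host name, same service path), the line might look like:

StorageManagementService=http\://storage.example.org\:8080/wsrf/services/diligentproject/contentmanagement/StorageManagementServiceService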

Metadata Management

The Metadata Management aims at modelling arbitrary metadata relationships (IDB-relationships). The only assumption it makes is that the metadata objects are serialized as well-formed XML documents. The service has a two-fold role:

  • to manage Metadata Objects and Metadata Collections
  • to establish secondary role-typed links. Such relationships can exist between any type of Information Object, either within the scope of a Collection or not

The Metadata Management Components

The main functionality of the Metadata Management components is the management of Metadata Objects, Metadata Collections and their relationships. To operate over Metadata Collections, the Metadata Management instantiates a Collection Manager for each collection. A Collection Manager is the access point to all the possible operations over a specific Metadata Collection. From an architectural point of view, the Metadata Manager adopts the Factory pattern and each Collection Manager is implemented as a GCUBEWSResource. Physically, the service is composed of:

  • the MetadataManagerFactory, a factory service that creates new Collection Managers and offers some cross-Collection operations
  • the MetadataManagerService, a service that operates over Metadata Collections (MCs) and on Metadata Objects as Elements, i.e. members of a specific Metadata Collection

The MetadataManagerFactory

The MetadataManagerFactory service creates new Collection Managers and offers some cross-Collection operations. Moreover, it operates on Metadata Objects as Information Objects related to other Information Objects and not as members of Metadata Collections.

  • createManager(CollectionID, params): This operation takes a Collection ID and a set of creation parameters and creates a new Manager to manage a Metadata Collection bound to that Collection. If a Metadata Collection with the specified characteristics does not exist, the Manager creates the Metadata Collection, binds it to the Document Collection with the given secondary role relationship and publishes its profile in the Information System.

The creation parameters are a set of key-value pairs; the following keys, defined in the MMLibrary, are the mandatory parameters accepted by the operation:

  1. COLLECTIONNAME -> name of the collection
  2. DESCRIPTION -> description
  3. ISUSERCOLLECTION -> if the collection is a user collection or not (“True”/”False”)
  4. ISINDEXABLE -> if the collection is indexable or not (“True”/”False”)
  5. RELATEDCOLLECTION -> the information
  6. METADATAFORMAT -> the metadata name and the metadata language as specified in the ISO 639-2
  7. SECONDARYROLE -> the secondary role

The optional parameters accepted by the operation are:

  1. GENERATEDBY -> the source Metadata Collection from which the current one has been generated (by the Metadata Broker), if any
  2. ISEDITABLE -> if the collection is editable or not (“True”/”False”)
  3. CREATOR -> the name of the creator of the Metadata Collection
  • createManagerFromCollection (MetadataCollectionID): This operation takes a Metadata Collection ID. It returns:
  1. the related CollectionManager, if it already exists
  2. a newly created CollectionManager, if the Metadata Collection exists but no Manager does yet
  3. an error, if the Collection ID is not valid
  • addMetadata(ObjectID, MO, SecondaryRole): This operation takes a new non-collectable Metadata Object and
  1. completes the metadata header information (e.g. the MOID, if it is not specified)
  2. stores (or updates if the MOID is already included in the MO header) the object on the Storage Management Service as Information Object
  3. creates a <is-described-by, <SecondaryRole>> binding in the Storage Management Service between the Metadata Object and the Information Object identified by the given Object ID
  4. returns the assigned MOID
  • deleteMetadata(MOID): This operation deletes from the Storage Management Service the Metadata Object identified by the given ID.
  • getMetadata ((ObjectID, SecondaryRole, CollectionID, Rank)[]): For each given ObjectID, this operation returns the Metadata Objects. They are:
  1. bound with the specified secondary role (the primary role is, of course, is-described-by) to the Information Object identified by that ObjectID
  2. members of the specified Metadata Collection. The operation relies on the String[] retrieveReferred(String targetObjectID, String role, String secondaryrole) operation of the Storage Management Service.

Index Management

Search Management

Each of the Search Framework Services, once deployed along with its dependencies, is designed to be autonomous and needs no user parametrization or supervision. Two issues that may come up and should be mentioned are the following:

  • The user under which the services run must have write permissions to the /tmp directory.
  • The execution of the plan produced by any Search Master Service is determined by the presence of a lock file ($GLOBUS_LOCATION/etc/SearchMaster/BPEL.lock). If this file exists, the plan is forwarded to the Process Execution Service; otherwise, the plan is executed internally by an embedded execution engine component (which does not support secure calls) - see the example below for switching between the two modes.
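
Since only the presence of the lock file is checked, switching to plan execution via the Process Execution Service amounts to creating the (empty) file, for instance:

touch $GLOBUS_LOCATION/etc/SearchMaster/BPEL.lock

Removing the file again switches back to the embedded execution engine.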

Feature Extraction

Currently, feature extraction reuses existing feature extractors that were developed and used in the ISIS/OSIRIS prototype system. These have been implemented in C++ using many libraries that are not easily portable to any platform other than Windows, on which the ISIS system is running. The Feature Extraction Service wraps a demo installation hosted at UNIBAS. The configuration of the service contains the URL of this ISIS service. Since the ISIS service is not a DILIGENT service, it cannot be dynamically retrieved from the DIS. The other configuration parameter is the EPR of Content Management; this is configured in the service for debugging and performance reasons, since it allows assigning the Feature Extraction Service to the closest CMS instance to reduce network traffic. In subsequent releases, the default is expected to change to dynamic retrieval of the CMS to contact, with only an optional configuration to use dedicated instances. The configuration file can be found at $GLOBUS_LOCATION/etc/<FE-GAR-Filename>/FeatureExtraction.properties.

#This file configures the feature extraction service.
contentmanagement.contentmanagementservice.epr=http\://dil03.cs.unibas.ch\:8080/wsrf/services/diligentproject/contentmanagement/ContentManagementServiceService
isis.fee.endpoint=http\://isisdemo.cs.unibas.ch\:9700/ARTE/FEE/ExtractFeature

Process Management

Most Process Management Services do not require any manual configuration after deployment, with the exception of the GLite Job Wrapper Service. The required configuration steps are outlined below.

GLite Job Wrapper Service configuration

There are two settings that must be defined in the JNDI configuration file of the service (usually $GLOBUS_LOCATION/etc/org_diligentproject_glite_jobwrapper/jndi-config.xml). These settings are used to specify the WMProxy endpoint to use for job submissions, and the user certificate for running the jobs. The format in the JNDI configuration file is as follows:


<environment name="proxyCredentialsFile" type="java.lang.String" value="/tmp/x509up_u1000"/>

<environment name="WMProxyURL" type="java.lang.String" value="https://dil01.cs.unibas.ch:7443/glite_wms_wmproxy_server"/>

The rest of the configuration should not be modified.

The proxyCredentialsFile is a VOMS proxy file on the local file system. The administrator of the node is responsible for making sure that this proxy certificate is valid (i.e. not expired) at all times, and that it is a certificate accepted by the WMProxy server pointed to by the WMProxyURL.
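
How the proxy is created and kept fresh depends on the local setup; a hedged sketch using the standard VOMS client tools (the VO name is a placeholder) could look like:

voms-proxy-init -voms <vo-name> -out /tmp/x509up_u1000 -valid 24:00

This command would have to be re-run (for example from a cron job) before the proxy expires, so that the file referenced by proxyCredentialsFile always contains a valid proxy.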