Digital Library Administration

From Gcube Wiki
Revision as of 13:24, 14 November 2007 by Valia (Talk | contribs) (ScenarioCollectionInfo)


VDL Creation and Management

Resources Management

Generic Resources Management

In order to properly set up a VDL, several Generic Resources need to be published on the DIS. The VDL Administrator can create them using the Generic Resource Portlet. Additionally, every time a new schema appears in the VDL, a MetadataSchemaInfo, a PresentationXSLT_<schemaName>_<xsltName> and a MetadataXSLT_<schemaName>_<xsltName> Generic Resource must be created for that schema.

ScenarioCollectionInfo

This Generic Resource contains information about the collections available in a specific VDL and their hierarchical structure. Collections can be clustered into groups so as to help end users identify similar collections and to present the collections in a manageable way.

The VDL Administrator must create a Generic Resource named "ScenarioCollectionInfo" whose body must have the following form:

<DL name="<AbsoluteDLName>">
    <collections name="collection group 1 name" shortname="short name" description="description of group of collections">
        <collection name="collection 1.1 name" reference="reference url for this collection" shortname="short name for the collection" description="collection description"/>
        <collection name="collection 1.2 name" reference="reference url for this collection" shortname="short name for the collection" description="collection description"/>
        <collection name="collection 1.3 name" reference="reference url for this collection" shortname="short name for the collection" description="collection description"/>
        ...
    </collections>
    <collections name="collection group 2 name" shortname="short name" description="description of group of collections">
        <collection name="collection 2.1 name" reference="reference url for this collection" shortname="short name for the collection" description="collection description"/>
        <collection name="collection 2.2 name" reference="reference url for this collection" shortname="short name for the collection" description="collection description"/>
        ...
    </collections>
    ...
</DL>
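As an illustration, a minimal filled-in body might look like the following (the VO, community, DL, group and collection names, and the URL are all hypothetical):

```xml
<DL name="/diligent/ARTE/DemoDL">
    <collections name="Image Collections" shortname="Images" description="Collections of digitised images">
        <collection name="Renaissance Paintings" reference="http://example.org/renaissance"
                    shortname="Paintings" description="Digitised Renaissance paintings"/>
    </collections>
</DL>
```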

The root element is DL and it has an attribute named "name". This attribute is important: it has to be of the form /<VO>/<Community>/<DLName>.
Additionally, the DL element contains an arbitrary number of "collections" elements. Each of these elements represents a group of collections.
Its attributes are:

  1. name: The name of the group
  2. shortname: The shortname of the group
  3. description: Its description


Furthermore, each "collections" element contains an arbitrary number of "collection" elements. Each of these elements represents an actual collection.
Its attributes are:

  1. name: The name of the collection. This name must exactly match the collection name as it exists in the Collection Management service.
  2. shortname: The shortname of the collection
  3. description: Its description
  4. reference: A reference URL for this collection

MetadataSchemaInfo

One such Generic Resource must exist for each schema of the VDL.
It specifies which fields are searchable and which are browsable, as well as the type of search to be applied to each.

The VDL Administrator must create one Generic Resource per schema, named "MetadataSchemaInfo".
The body of this resource must be in the following form:

<schemaName>
    <option>
        <option-name>displayed name in search fields</option-name>
        <option-value>actual xml-element name in metadata</option-value>
        <option-type>type of search to apply</option-type>
        <option-sort>XPath expression to be used for sorting (present only for browsable fields)</option-sort>
    </option>
    ...
</schemaName>
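For example, assuming a Dublin Core schema, the body could look like the sketch below. The element names, XPath, and the search-type value are illustrative placeholders, not values prescribed by the service:

```xml
<dc>
    <option>
        <option-name>Title</option-name>
        <option-value>dc:title</option-value>
        <option-type>fulltext</option-type>
        <option-sort>//dc:title</option-sort>
    </option>
    <option>
        <option-name>Creator</option-name>
        <option-value>dc:creator</option-value>
        <option-type>fulltext</option-type>
    </option>
</dc>
```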

TitleXSLT

GenericXSLT

GoogleXSLT

PresentationXSLT_<schemaName>_<xsltName>

MetadataXSLT_<schemaName>_<xsltName>

VO and Users Management

Content & Storage Management

Content Management strictly relies on Storage Management. It is therefore a prerequisite to set up a running instance of Storage Management before Content Management can be started successfully. There are two ways to set up Storage Management: a simple one using Apache Derby as the database backend, and an advanced one in which an existing database is used via JDBC.

Simple Setup of Storage Management using Apache Derby

Apache Derby is an open source relational database implemented entirely in Java, available under the Apache License, Version 2.0, and with a small footprint of about 2 megabytes. It is sufficient as a database backend for getting started with Storage Management. However, when large amounts of data are stored, or when more elaborate backup & recovery strategies are required, a traditional full-scale RDBMS may be a better choice.

If Storage Management is deployed dynamically or manually from the GAR, its default installation places a configuration file at $GLOBUS_LOCATION/etc/<Service-Gar-Filename>/StorageManager.properties that expects Derby to be available and to have write permission for files under ./StorageManagementService/db/storage_db. Derby is started in embedded mode, for which it does not even need a username or password. Multiple connections from the same Java Virtual Machine are possible and quite fast, but no two Java VMs can access the DB at the same time.

If all dependencies have been installed correctly, the container should start and create a new database if needed.

The lines defining the JDBC connection to the database in the above-mentioned configuration file are:

DefaultRawFileContentManager=jdbc\:derby
DefaultRelationshipAndPropertyManager=jdbc\:derby
# derby settings (Default)
jdbc\:derby.class=org.diligentproject.contentmanagement.baselayer.rdbmsImpl.GenericJDBCDatabase
jdbc\:derby.params.count=4
jdbc\:derby.params.0=local_derby_storage_db
jdbc\:derby.params.1=org.apache.derby.jdbc.EmbeddedDriver
jdbc\:derby.params.2=jdbc\:derby\:./StorageManagementService/db/storage_db;create\=true
jdbc\:derby.params.3=5
By changing the path after derby\: in the line
jdbc\:derby.params.2=jdbc\:derby\:./StorageManagementService/db/storage_db;create\=true
you can choose another place to store the database.
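For instance, to keep the database under a different directory (the path below is purely illustrative), the line would become:

```properties
jdbc\:derby.params.2=jdbc\:derby\:/var/lib/derby/storage_db;create\=true
```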

In this setting, all relationships and properties, as well as the raw file content, are stored inside the Derby database. This is defined in the first two lines of the configuration snippet shown above.

Advanced Setup of Storage Management using an arbitrary relational JDBC database

Storage Management depends on the following external components:

  1. Apache Jakarta Commons Database Connection Pooling, which itself requires Commons Pool and therefore also Commons Collections
  2. a JDBC-driver for the database to use.

The first should get deployed dynamically; the second you will have to install yourself, since it depends only on the RDBMS you want to use. The most common choice is MySQL, since it is already used for many of the gLite components such as DPM or LFC, so there is no need to set up another RDBMS. The corresponding JDBC driver is named Connector/J and is released under a dual-licensing strategy like the MySQL RDBMS itself: a commercial license and the GNU General Public License. For this reason, neither the RDBMS nor the JDBC driver is directly distributed with the gCube software. The JDBC driver must be available to the container, so its .jar file(s) may need to be stored in $GLOBUS_LOCATION/lib/.

You will have to prepare the DBMS manually by creating a new database to be used by Storage Management. For this, you may also want to install mysql-client, MySQL Administrator, and MySQL Query Browser - or a database-independent tool like ExecuteQuery. On Scientific Linux 3, the following steps need to be performed:

apt-get install mysql-server mysql-client
mysqladmin create <dbname>
mysql --user=root <dbname>

This will install the MySQL server (if not already present) and the corresponding command-line client. The next line creates a new, empty database. The last line connects to this database using the command-line client. If the RDBMS has been set up to require a password for the local root account, use the option -p to be prompted for it. Once you are logged in, you have to create a new user with sufficient rights to connect, to create and alter tables, and to perform selects, inserts, updates, and deletes in this database.

The easiest way to achieve this in MySQL is
GRANT ALL PRIVILEGES ON <dbname>.* TO '<username>'@'%' IDENTIFIED BY '<password>';
(Until version 5.0, MySQL uses this syntax of its own instead of CREATE USER - see [1] for more details.)
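On MySQL 5.0 and later, the same effect can be achieved with the standard two-step form (placeholders as above):

```sql
CREATE USER '<username>'@'%' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON <dbname>.* TO '<username>'@'%';
```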

MySQL versions before 5 limit the size of individual database files to 4GB by default, or even 2GB on some filesystems. This can become a problem if you store either very many files or just a couple of huge ones, and MySQL may start to complain that the "Table is full". In this case, execute the SQL command

ALTER TABLE Raw_Object_Content MAX_ROWS=1000000000 AVG_ROW_LENGTH=1000;

to allocate pointers for bigger tables. See [2] for details.

Due to some inconvenience in the MySQL protocol when transferring BLOBs of several megabytes, you might have to increase the max_allowed_packet variable in my.cnf. On Scientific Linux this file is located in /var/lib/mysql/ - see [3] for details.
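For example, the limit could be raised in the [mysqld] section of my.cnf as sketched below; the value of 64M is a suggestion, to be tuned to the largest BLOB you expect to store:

```ini
[mysqld]
max_allowed_packet = 64M
```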

For using MySQL, you can use the following lines in the above mentioned configuration file:

# local mysql settings (template for MySQL instances)
jdbc\:mysql_local.class=org.diligentproject.contentmanagement.baselayer.rdbmsImpl.GenericJDBCDatabase
jdbc\:mysql_local.params.count=4
jdbc\:mysql_local.params.0=local_mysql_db
jdbc\:mysql_local.params.1=com.mysql.jdbc.Driver
jdbc\:mysql_local.params.2=jdbc\:mysql\://127.0.0.1/storage_db?user\=THE_USER&password\=THE_PASS
jdbc\:mysql_local.params.3=100
You will have to change the line
jdbc\:mysql_local.params.2=jdbc\:mysql\://127.0.0.1/storage_db?user\=THE_USER&password\=THE_PASS
in order to use the correct IP address of your server, the database name, the username and the password. This is nothing other than a regular JDBC connection string (plus a \ in front of each : to escape it in the Java property file), so if you are familiar with JDBC it should be quite simple to use; otherwise there is plenty of documentation on how to make sense of it, e.g. [4].

In addition, you have to set the Storage Manager to use this database by default. To do so, simply edit the lines at the top to:

DefaultRawFileContentManager=jdbc\:mysql_local
DefaultRelationshipAndPropertyManager=jdbc\:mysql_local

If you rename consistently when copying & pasting, there is no need to stick to the name mysql_local.
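For example, a consistently renamed copy of the template pointing at a remote server could read as follows; the host name, database name, and credentials are placeholders:

```properties
DefaultRawFileContentManager=jdbc\:mysql_remote
DefaultRelationshipAndPropertyManager=jdbc\:mysql_remote
# remote mysql settings (renamed copy of the template)
jdbc\:mysql_remote.class=org.diligentproject.contentmanagement.baselayer.rdbmsImpl.GenericJDBCDatabase
jdbc\:mysql_remote.params.count=4
jdbc\:mysql_remote.params.0=remote_mysql_db
jdbc\:mysql_remote.params.1=com.mysql.jdbc.Driver
jdbc\:mysql_remote.params.2=jdbc\:mysql\://db.example.org/storage_db?user\=THE_USER&password\=THE_PASS
jdbc\:mysql_remote.params.3=100
```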

Other databases, such as PostgreSQL in a version > 8, should also work, depending on their compliance with the ANSI SQL92 standard and their handling of BLOBs. However, this has not been extensively tested yet.

Advanced Setup of Storage Management for protocol handlers

Storage Management is able to use a couple of other protocols to retrieve and store files. The default configuration contains the following entries:

# handlers
protocol.handler.count=4
protocol.handler.0.class=org.diligentproject.contentmanagement.baselayer.inMessageImpl.InMemoryContentManager
protocol.handler.1.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.FTPPseudoContentManager
protocol.handler.2.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.HTTPPseudoContentManager
## Alternative for HTTPPseudoContentManager: Commons HTTPClient, requires additional libraries
# protocol.handler.2.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.CommonsHTTPClientPseudeContentManager
protocol.handler.3.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.GridFTPContentManager
## WARNING: do not run LocalFilesystemStorage handler on productive service, unless it is really, really secure!
# protocol.handler.4.class=org.diligentproject.contentmanagement.baselayer.filesystemImpl.LocalFilesystemStorage

The handlers are used in the order they are defined in the configuration file. If several handlers claim to handle the same protocol, only the one with the lowest number is used. The count must match the number of handlers defined, numbered 0 to count-1.

The inmessage:// protocol is convenient since it transfers raw content directly inside the SOAP message and therefore does not require an additional communication protocol, which would require another handshake between client and server and - in a secure environment - separate authentication & authorization and possibly a completely separate user management. Unfortunately, this protocol does not work for files bigger than approximately 2 megabytes due to limitations of the container: there is no simple modification that would enable GT4 and its underlying Axis to cope with big base64-encoded message parts. Big files have to be transferred using other protocols such as FTP, HTTP, or GridFTP; the only workaround for downloads from the SMS is to use chunked downloads. On the other hand, these well-established protocols may also provide much better performance for big files, where the handshake takes comparably little time.

The next two lines set up handlers for downloading from FTP and HTTP locations using the built-in clients of the Java Class Library. For better performance, and to deal with some security issues in Sun's implementation [5], Apache Jakarta Commons HTTPClient can be used instead. For this, simply comment out
protocol.handler.2.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.HTTPPseudoContentManager
and uncomment
protocol.handler.2.class=org.diligentproject.contentmanagement.baselayer.networkFileTransfer.CommonsHTTPClientPseudeContentManager
This requires that the correct .jar file from the above-mentioned location (or from the Service Archive) is also installed in $GLOBUS_LOCATION/lib/, together with its dependency Apache Jakarta Commons Codec.

Advanced Setup of Storage Management using the file system instead of a database for raw content

A template for using the file system instead of the RDBMS is presented in the configuration file:

# filesystem settings (another template)
file\:fileStorage.class=org.diligentproject.contentmanagement.baselayer.filesystemImpl.LocalFilesystemStorage
file\:fileStorage.params.count=1
file\:fileStorage.params.0=/usr/local/globus/etc/StorageManagementService/Stored_Content

To make this the default location to store the content, you have to set in the first lines of the configuration file:

DefaultRawFileContentManager=file\:fileStorage

Another option would be to use GridFTP here to store the files in a Storage Element on the Grid.

Setup of Content & Collection Management

Content & Collection Management rely entirely on Storage Management and interact heavily with it. It is therefore a good choice to deploy them on the same node that hosts Storage Management, to avoid network communication becoming the performance bottleneck.

The only parameter that might need adjustment can be found in both configuration files at $GLOBUS_LOCATION/etc/<CMS-GAR-Filename>/ContentManager.properties and $GLOBUS_LOCATION/etc/<ColMS-GAR-Filename>/CollectionManager.properties, respectively.

StorageManagementService=http\://127.0.0.1\:8080/wsrf/services/diligentproject/contentmanagement/StorageManagementServiceService

This line must point to the EPR of the Storage Management Service that should be used. If the GT4 container is running on its default port 8080 and all three services are deployed on the same node, there should be no need to adjust it. Otherwise the port may need to be corrected (in both configuration files).
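For instance, if the Storage Management Service runs on a dedicated node with the container listening on a non-default port, the line would become something like the following (host name and port are placeholders):

```properties
StorageManagementService=http\://storage.example.org\:9090/wsrf/services/diligentproject/contentmanagement/StorageManagementServiceService
```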

Metadata Management

Index Management

Search Management

Each of the Search Framework Services, once deployed along with its dependencies, is designed to be autonomous and needs no user parametrization or supervision. Two issues that may come up should nevertheless be mentioned:

  • The user under which the services run must have write permissions to the /tmp directory.
  • The execution of the plan produced by a Search Master Service is determined by the presence of a lock file ($GLOBUS_LOCATION/etc/SearchMaster/BPEL.lock). If this file exists, the plan is forwarded to the Process Execution Service. Otherwise, the plan is executed internally by an embedded execution engine component (which does not support secure calls).
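The lock-file toggle from the second point can be flipped from the command line; a minimal sketch, assuming $GLOBUS_LOCATION points at your container installation (the fallback path used here is only for illustration):

```shell
# Use the real container location if set; fall back to a demo path for illustration.
GLOBUS_LOCATION="${GLOBUS_LOCATION:-/tmp/globus-demo}"

# Create the lock file: plans are now forwarded to the Process Execution Service.
mkdir -p "$GLOBUS_LOCATION/etc/SearchMaster"
touch "$GLOBUS_LOCATION/etc/SearchMaster/BPEL.lock"

# To switch back to the embedded (non-secure) execution engine, remove it again:
# rm "$GLOBUS_LOCATION/etc/SearchMaster/BPEL.lock"
```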

Feature Extraction

Currently, feature extraction reuses existing feature extractors that were developed and used in the ISIS/OSIRIS prototype system. These have been implemented in C++ using many libraries that are not easily portable to any platform other than Windows, on which the ISIS system runs. The Feature Extraction Service wraps a demo installation hosted at UNIBAS; the configuration of the service contains the URL of this installation. Since this ISIS service is not a DILIGENT service, it cannot be retrieved dynamically from the DIS. The other configuration parameter is the EPR of Content Management; this is configured in the service for debugging and performance reasons, since it allows assigning the Feature Extraction Service to the closest CMS instance to reduce network traffic. In subsequent releases, the default is expected to change to dynamic retrieval of the CMS to contact, with the configuration only optionally used for dedicated instances. The configuration file can be found at $GLOBUS_LOCATION/etc/<FE-GAR-Filename>/FeatureExtraction.properties

#This file configures the feature extraction service.
contentmanagement.contentmanagementservice.epr=http\://dil03.cs.unibas.ch\:8080/wsrf/services/diligentproject/contentmanagement/ContentManagementServiceService
isis.fee.endpoint=http\://isisdemo.cs.unibas.ch\:9700/ARTE/FEE/ExtractFeature

Process Management