ASL: OAI-PMH Implementation

From Gcube Wiki

OAI-PMH Configuration File

D4ScienceII aims to make the D4Science e-Infrastructure interoperate with other data e-Infrastructures that run autonomously, thus creating the core of an e-Infrastructure ecosystem. It brings together many scientific e-Infrastructures established in various areas. Consequently, the content that D4Science publishes through the OAI protocol for metadata harvesting (OAI-PMH) differs depending on the intended consumer. Furthermore, each harvester may have specific requirements regarding the 'sets' that the gCube Data Provider must support, the batch size of the metadata returned, the available metadata formats, etc.

The request types of the protocol have strictly defined parameters (required and optional) and cannot be extended with user-defined ones. As a consequence, the harvester cannot specify the details of the environment in which the harvesting takes place, for instance the gCube scope of the requested objects, or whether a set represents a virtual collection or a gCube collection.

To cover all the different cases that require an agreement with each individual harvester, the implementation is based on a generic configuration file, stored as a generic resource within the D4Science Infrastructure. The OAI Configuration File specifies the details of the harvesting for each consumer. Its format is shown in the example below, and its content is described in the following sections:

<?xml version="1.0" encoding="UTF-8"?>
<user-agents>
<user-agent value="DRIVER">
	<set name="DRIVER" id="DRIVER">
		<scope name="/d4science.research-infrastructures.eu/FARM/AquaMaps">
			<collection>
				<name>AquaMaps: Class Biodiversity Maps</name>
				<id>f451e0f0-2ded-11df-a801-c20ddc2e724e</id>
			</collection>
			<collection>
				<name>AquaMaps: Phylum Biodiversity Maps</name>
				<id>cfa4d200-2dff-11df-a81e-c20ddc2e724e</id>
			</collection>
		</scope>
	</set>
	<harvestBatchSize>100</harvestBatchSize>
	<adminEmail>admin@institution.org</adminEmail>
	<repositoryName>D4Science Repository</repositoryName>
	<requestedMetadataFormats></requestedMetadataFormats>
</user-agent>
<user-agent value="GENESI-DR">
	<set name="NO_SET">
		<scope name="/d4science.research-infrastructures.eu/FARM/FCPPS">
			<collection>
				<name>AquaMaps: Class Biodiversity Maps</name>
				<id>f451e0f0-2ded-11df-a801-c20ddc2e724e</id>
			</collection>
		</scope>
	</set>
	<harvestBatchSize>200</harvestBatchSize>
	<adminEmail>admin@institution.org</adminEmail>
	<repositoryName>D4ScienceII Repository</repositoryName>
	<requestedMetadataFormats>
		<metadataFormat>
			<schemaPrefix>dwc</schemaPrefix>
			<schemaURI>http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd</schemaURI>
			<schemaNamespace>http://rs.tdwg.org/dwc/xsd/simpledarwincore/</schemaNamespace>
		</metadataFormat>
	</requestedMetadataFormats>
</user-agent>
<user-agent value="Any">
	<set name="NO_SET">
		<scope name="/d4science.research-infrastructures.eu/FARM/FCPPS">
			<collection>
				<name>AquaMaps: Class Biodiversity Maps</name>
				<id>f451e0f0-2ded-11df-a801-c20ddc2e724e</id>
			</collection>
		</scope>
	</set>
	<harvestBatchSize>200</harvestBatchSize>
	<adminEmail>admin@institution.org</adminEmail>
	<repositoryName>iMarine Repository</repositoryName>
	<requestedMetadataFormats>
		<metadataFormat>
			<schemaPrefix>dwc</schemaPrefix>
			<schemaURI>http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd</schemaURI>
			<schemaNamespace>http://rs.tdwg.org/dwc/xsd/simpledarwincore/</schemaNamespace>
		</metadataFormat>
	</requestedMetadataFormats>
</user-agent>
</user-agents>
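
To illustrate how such a file could be consumed, the following is a minimal sketch in Python that reads a trimmed-down configuration in the same format into a dictionary keyed by user-agent value. The function name `parse_oai_config` and the shape of the resulting dictionary are assumptions made for this example; they are not part of the gCube implementation.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the format of the OAI configuration file.
CONFIG_XML = """<user-agents>
<user-agent value="DRIVER">
  <set name="DRIVER" id="DRIVER">
    <scope name="/d4science.research-infrastructures.eu/FARM/AquaMaps">
      <collection>
        <name>AquaMaps: Class Biodiversity Maps</name>
        <id>f451e0f0-2ded-11df-a801-c20ddc2e724e</id>
      </collection>
    </scope>
  </set>
  <harvestBatchSize>100</harvestBatchSize>
  <adminEmail>admin@institution.org</adminEmail>
  <repositoryName>D4Science Repository</repositoryName>
  <requestedMetadataFormats></requestedMetadataFormats>
</user-agent>
<user-agent value="Any">
  <set name="NO_SET">
    <scope name="/d4science.research-infrastructures.eu/FARM/FCPPS">
      <collection>
        <name>AquaMaps: Class Biodiversity Maps</name>
        <id>f451e0f0-2ded-11df-a801-c20ddc2e724e</id>
      </collection>
    </scope>
  </set>
  <harvestBatchSize>200</harvestBatchSize>
  <adminEmail>admin@institution.org</adminEmail>
  <repositoryName>iMarine Repository</repositoryName>
  <requestedMetadataFormats>
    <metadataFormat>
      <schemaPrefix>dwc</schemaPrefix>
      <schemaURI>http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd</schemaURI>
      <schemaNamespace>http://rs.tdwg.org/dwc/xsd/simpledarwincore/</schemaNamespace>
    </metadataFormat>
  </requestedMetadataFormats>
</user-agent>
</user-agents>"""


def parse_oai_config(xml_text):
    """Return {user-agent value: settings dict} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    config = {}
    for agent in root.findall("user-agent"):
        sets = []
        for set_el in agent.findall("set"):
            # Flatten scope/collection nesting, keeping the scope with each collection.
            collections = [
                {
                    "scope": scope.get("name"),
                    "name": coll.findtext("name"),
                    "id": coll.findtext("id"),
                }
                for scope in set_el.findall("scope")
                for coll in scope.findall("collection")
            ]
            sets.append({"name": set_el.get("name"), "collections": collections})
        formats = [
            {
                "prefix": fmt.findtext("schemaPrefix"),
                "schema": fmt.findtext("schemaURI"),
                "namespace": fmt.findtext("schemaNamespace"),
            }
            for fmt in agent.findall("requestedMetadataFormats/metadataFormat")
        ]
        config[agent.get("value")] = {
            "sets": sets,
            "batch_size": int(agent.findtext("harvestBatchSize")),
            "admin_email": agent.findtext("adminEmail"),
            "repository_name": agent.findtext("repositoryName"),
            "formats": formats,
        }
    return config


config = parse_oai_config(CONFIG_XML)
print(sorted(config))
```

Keeping the scope next to each collection in the parsed structure reflects the point made above: the scope is resolved from the configuration, not supplied by the harvester.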

Harvester Identification

According to the Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH harvesters should follow the standard practices for HTTP robotic agents. In particular, they should supply the HTTP User-Agent and From headers. The User-Agent header field contains information about the user agent originating the request (it is described in section 14.43 of the HTTP/1.1 specification, RFC 2616). The gCube OAI-PMH data provider identifies the harvester that submits a protocol request by comparing the identifier given as the value of the user-agent element in the OAI configuration file with the content of the supplied User-Agent header. This identifier can be any agreed sequence of characters that is expected to appear in the User-Agent header of the harvester's requests.

The configuration must also contain a 'user-agent' element with the value 'Any', for the general case of harvesters that have not established an agreement on the collections of the infrastructure that will be exposed to them. In this way, the repository is configured to provide the open access collections through the protocol to every OAI harvester.
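
The matching described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `identify_harvester` is hypothetical, the comparison is a plain case-sensitive substring test, and the first matching identifier wins; the actual data provider may compare differently.

```python
def identify_harvester(user_agent_header, configured_agents):
    """Return the configured user-agent value that applies to a request.

    user_agent_header: content of the HTTP User-Agent header (may be None)
    configured_agents: values of the user-agent elements in the configuration
    """
    header = user_agent_header or ""
    for value in configured_agents:
        # 'Any' is the fallback entry, so it never takes part in the matching.
        if value != "Any" and value in header:
            return value
    return "Any"


agents = ["DRIVER", "GENESI-DR", "Any"]
print(identify_harvester("DRIVER-Harvester/1.0", agents))
print(identify_harvester("SomeOtherHarvester/2.3", agents))
```

The header strings in the usage lines are made-up examples, not the User-Agent values of real harvesters.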

Sets & Virtual Sets per Harvester

Once the harvester is identified, the 'sets' that can be exposed to it are extracted from the configuration file. These can be virtual collections or gCube collections, stored under the same or different scopes. For instance, in the case of the DRIVER infrastructure, where a "DRIVER" set must be supported for the harvesting, the virtual grouping of gCube collections into the "DRIVER" set is managed through the OAI configuration file. If one or more virtual sets appear in the configuration file for a harvester, those are the sets returned when a 'ListSets' request arrives. When the harvester submits a 'ListIdentifiers' or 'ListRecords' request for a virtual set, it gets back the records from all the gCube collections contained in that set, without any knowledge of the internal D4Science grouping of the content, the different scopes under which the collections are stored, etc. If no virtual set is specified for a harvester, the gCube content collections listed in the OAI configuration file are returned in response to the 'ListSets' request. Again, the harvester does not need any knowledge of the gCube scopes, since that information is configured in, and extracted from, the OAI configuration file.
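
A minimal sketch of this set resolution is given below. It rests on several assumptions that the text does not confirm: a set named "NO_SET" is taken to mean that no virtual set is defined, a collection exposed as a set uses its collection id as setSpec, and the function names and the in-memory data layout are hypothetical.

```python
NO_SET = "NO_SET"  # assumption: this name marks "no virtual set", as in the example file

AQUAMAPS = "/d4science.research-infrastructures.eu/FARM/AquaMaps"
FCPPS = "/d4science.research-infrastructures.eu/FARM/FCPPS"

# Hypothetical in-memory form of the <set> elements of two harvesters.
DRIVER_SETS = [{"name": "DRIVER", "collections": [
    {"scope": AQUAMAPS, "id": "f451e0f0-2ded-11df-a801-c20ddc2e724e",
     "name": "AquaMaps: Class Biodiversity Maps"},
    {"scope": AQUAMAPS, "id": "cfa4d200-2dff-11df-a81e-c20ddc2e724e",
     "name": "AquaMaps: Phylum Biodiversity Maps"},
]}]
ANY_SETS = [{"name": NO_SET, "collections": [
    {"scope": FCPPS, "id": "f451e0f0-2ded-11df-a801-c20ddc2e724e",
     "name": "AquaMaps: Class Biodiversity Maps"},
]}]


def list_sets(agent_sets):
    """Return (setSpec, setName) pairs for a ListSets response."""
    virtual = [(s["name"], s["name"]) for s in agent_sets if s["name"] != NO_SET]
    if virtual:
        return virtual
    # No virtual set: expose each gCube collection as a set of its own.
    return [(c["id"], c["name"]) for s in agent_sets for c in s["collections"]]


def resolve_set(agent_sets, set_spec):
    """Return the (scope, collection id) pairs to query for a requested set."""
    targets = []
    for s in agent_sets:
        is_virtual_match = s["name"] != NO_SET and s["name"] == set_spec
        for c in s["collections"]:
            if is_virtual_match or c["id"] == set_spec:
                targets.append((c["scope"], c["id"]))
    return targets


print(list_sets(DRIVER_SETS))
print(resolve_set(DRIVER_SETS, "DRIVER"))
```

The harvester only ever sees the set identifiers returned by `list_sets`; the scopes appear solely in the provider-side result of `resolve_set`.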

Harvester Requirements

Apart from the collections and virtual sets per harvester, more information can be configured in the file, based on the agreement with each harvester: the batch size, the repository name, the administrator e-mail address, and the metadata formats that each harvester asks to be supported.

The batch size is the number of records the repository delivers to the harvester per resumption token; it determines how many requests have to be executed to complete a harvest. Using the preferred batch size of each external repository lets the harvesters operate at optimal performance. The repositoryName and the adminEmail can be set differently for different harvesters, according to administrative needs.

The 'requestedMetadataFormats' element of the configuration file lists, as child elements, the metadata prefixes of the formats that a harvester may require the D4Science infrastructure to support for the exposed metadata, together with their schema URIs and namespaces. This option allows external repositories to ask for metadata in a specific format that is not among the gCube metadata schemata but can be generated on the fly through the dynamic use of the gCube transformers.

This solution is also used for the 'oai_dc' schema, which the protocol makes mandatory for basic interoperability. 'oai_dc' is always returned as part of the response to the 'ListMetadataFormats' request, even if the gCube metadata do not exist in that schema. When the harvester requests metadata in that schema, the gCube data provider applies oai_dc converters on the fly; these are stored as generic resources in the gCube Information System (IS). The same strategy is followed for other external metadata formats requested by the harvesters, on the precondition that the corresponding converters from the gCube metadata schemata to the external ones exist in the IS.
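
How the list of advertised formats could be assembled is sketched below. The function `list_metadata_formats` and its parameters are hypothetical; in particular, the set of prefixes for which a converter exists stands in for a lookup in the Information System, whose API is not shown here. The Dublin Core schema URI and namespace are the ones defined by the OAI-PMH specification for the mandatory `oai_dc` prefix.

```python
# Mandatory Dublin Core format, as defined by the OAI-PMH specification.
OAI_DC = {
    "prefix": "oai_dc",
    "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd",
    "namespace": "http://www.openarchives.org/OAI/2.0/oai_dc/",
}


def list_metadata_formats(native_formats, requested_formats, converter_targets):
    """Return the formats to advertise in a ListMetadataFormats response.

    native_formats: formats in which the metadata already exist
    requested_formats: entries of requestedMetadataFormats for this harvester
    converter_targets: prefixes for which a converter is assumed to be available
    """
    formats = {OAI_DC["prefix"]: OAI_DC}  # always advertised
    for fmt in native_formats:
        formats.setdefault(fmt["prefix"], fmt)
    for fmt in requested_formats:
        # A requested format is only usable if it can be generated on the fly.
        if fmt["prefix"] in converter_targets:
            formats.setdefault(fmt["prefix"], fmt)
    return list(formats.values())


dwc = {
    "prefix": "dwc",
    "schema": "http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd",
    "namespace": "http://rs.tdwg.org/dwc/xsd/simpledarwincore/",
}
print([f["prefix"] for f in list_metadata_formats([], [dwc], {"dwc"})])
```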

Flow Control Mechanism

Finally, the gCube OAI-PMH data provider supports flow control. Four of the request types return a list of entries, and three of them may reply with 'large' lists. OAI-PMH supports partitioning, and the data provider can partition the results and control their flow based on a resumption token. To achieve optimal operation of the data provider, which serves subsequent requests from one or more harvesters, a number of caches have been implemented for the ResultSets returned from the gCube MetadataManager and for their readers. In this implementation, the response to a request includes the incomplete list and a resumption token (with its expiration date, the size of the complete list, and the cursor). To obtain the next section, the harvester issues a new request of the same type, passing the resumption token as a parameter and omitting all other parameters. The response includes the next (possibly the last) section of the list and a resumption token. The resumption token is empty when the response contains the last section of the list.
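
The partitioning can be sketched on the provider side as follows. This is a simplified illustration, not the gCube implementation: the class `FlowControl` is hypothetical, it caches plain Python lists instead of ResultSets, tokens are random identifiers, and token expiration is left out.

```python
import uuid


class FlowControl:
    """Serve a long list in batches, linked by resumption tokens."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._cache = {}  # token -> (complete list, cursor of the next batch)

    def first_page(self, records):
        """Answer the initial request, which carries no resumption token."""
        return self._page(list(records), 0)

    def next_page(self, token):
        """Answer a follow-up request; raises KeyError for an unknown token."""
        records, cursor = self._cache.pop(token)
        return self._page(records, cursor)

    def _page(self, records, cursor):
        end = cursor + self.batch_size
        token = ""  # an empty token marks the last section of the list
        if end < len(records):
            token = uuid.uuid4().hex
            self._cache[token] = (records, end)
        return {
            "records": records[cursor:end],
            "resumptionToken": token,
            "cursor": cursor,
            "completeListSize": len(records),
        }


flow = FlowControl(batch_size=100)
page = flow.first_page(range(250))
while page["resumptionToken"]:
    page = flow.next_page(page["resumptionToken"])
print(page["cursor"], len(page["records"]))
```

The loop at the end plays the role of the harvester: it keeps sending back the latest token until it receives an empty one.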

Base URL

The base URL through which the OAI-PMH facilities are reached is composed of the location of the server where the ASL HTTP war is deployed, followed by the name of the protocol implementation servlet. For example:

http://portal.d4science.research-infrastructures.eu/applicationSupportLayerHttp/OAIPublisher?verb=Identify
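
As an illustration, request URLs for the other protocol verbs can be built from this base URL as in the sketch below. The helper `oai_request_url` is hypothetical; the parameter names `metadataPrefix`, `set` and `resumptionToken` are those defined by the OAI-PMH protocol.

```python
from urllib.parse import urlencode

BASE_URL = ("http://portal.d4science.research-infrastructures.eu"
            "/applicationSupportLayerHttp/OAIPublisher")


def oai_request_url(verb, **params):
    """Build an OAI-PMH request URL for the given verb and parameters."""
    query = {"verb": verb}
    # Skip parameters that were passed as None.
    query.update({k: v for k, v in params.items() if v is not None})
    return BASE_URL + "?" + urlencode(query)


print(oai_request_url("Identify"))
print(oai_request_url("ListRecords", metadataPrefix="oai_dc", set="DRIVER"))
```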