Difference between revisions of "OpenSearch Framework"

From Gcube Wiki

Revision as of 17:09, 19 November 2010

Description

The role of the gCube OpenSearch Framework is to enable the gCube Framework to access external providers which publish their results through search engines conforming to the OpenSearch Specification. The framework consists of two components:

  • The OpenSearch Library, which includes a general-purpose library and the OpenSearch Operator which utilizes functionality provided by the former, and
  • The OpenSearch Service, which binds collections with provider-specific information encapsulated in generic resources and invokes the OpenSearch Operator

To resolve ambiguity, the name "OpenSearch Library" will be used when referring to the whole OpenSearch Library component and the name "General-Purpose Library" will be used when referring to the library constituent of the component.

The OpenSearch Library

The General-Purpose OpenSearch Library

Description

The General-Purpose OpenSearch Library conforms to the latest OpenSearch specification and provides general OpenSearch-related functionality to any component which needs to query OpenSearch providers. The OpenSearch Operator, described in a following section, functions atop this library.

Functionality

The central class for exploiting the functionality provided by the library is the DescriptionDocument class. For reasons explained in the following section, the DescriptionDocument class needs to be provided with a pair of URLElementFactory and QueryElementFactory factory classes. Provided that the query parameter namespaces present in the query string are extracted in some way and a namespace-to-factory mapping is available, this pair can be obtained from the FactoryResolver class, as follows:

FactoryPair factories = FactoryResolver.getFactories(queryNamespaces, factoryMapping);

The DescriptionDocument is then instantiated as follows:

DescriptionDocument dd = new DescriptionDocument(descriptionDocumentXML, factories.urlElFactory, factories.queryElFactory);

where the descriptionDocumentXML parameter corresponds to a DOM Document object containing the parsed Description Document. Properly instantiated, the DescriptionDocument class can provide any information relevant to the processed Description Document, as well as a mechanism to formulate search queries to send to the OpenSearch provider described by the Description Document. The latter is achieved by a QueryBuilder object, which can be obtained as follows:

List<QueryBuilder> qbs = dd.getQueryBuilders(rel, MimeType);

where rel is a rel value as described in the OpenSearch Specification, e.g. results, and MimeType is a MIME type, such as application/rss+xml. The returned list contains one QueryBuilder object for each template contained in a URL Element with the specified rel and type attributes. Once the desired QueryBuilder is selected, it can be used to formulate a query by first assigning values to the parameters and then obtaining the constructed query. For example, the searchTerms parameter can be set to some value as follows:

qb.setParameter(OpenSearchConstants.searchTermsQName, searchTerms);

Once all the required parameters are set, the constructed query can be obtained as follows:

try {
    query = qb.getQuery();
} catch (IncompleteQueryException iqe) {
    // Incomplete query exception handling
} catch (MalformedQueryException mqe) {
    // Malformed query exception handling
}

Once the query is properly constructed and is available, it can be sent to the search engine of the provider in order to retrieve results. The returned results should be passed to either HTMLResponse or XMLResponse, depending on the MIME type of the OpenSearch response, in order for the OpenSearch Response Elements and any other available information contained in the response to be processed.

InputStream responseStream = query.openConnection().getInputStream();
OpenSearchResponse response = new XMLResponse(responseStream, factories.queryElFactory, qb, outputEncoding,  dd.getURIToPrefixMappings());

The raw XML data can then be obtained from the OpenSearchResponse object as follows:

response.getResponse();

and any other available information, mainly relevant to paging, can be obtained through one of the methods of the OpenSearchResponse class. For example, the total number of results, as reported by the totalResults Response Element, if present, can be obtained as follows:

response.getTotalResults();

Library Extensibility

The OpenSearch Operator

Description

The role of the OpenSearch operator is to provide support for querying and retrieval of search results via OpenSearch from providers which expose an OpenSearch description document. The operator accepts a set of query terms and parameters and an OpenSearch Resource reference which contains the URL of an OpenSearch description document and various specifications relevant to the OpenSearch provider to be queried. After performing the number of OpenSearch queries required to obtain the desired results, it returns these results wrapping them in a ResultSet.

Extensibility Points

The operator introduces and makes use of a set of functionalities beyond those of the standard OpenSearch specification. These extensions are supported by the introduction of a special OpenSearch Resource structure and by the internal logic of the operator, the latter using standard OpenSearch functionality provided by the general-purpose OpenSearch library. The extra functionalities are summarized as follows:

  • The support of data transformation by the operator. Provided that a transformation specification, in the form of an XPath-XSLT pair, is available for one of the MIME types of the results returned by an OpenSearch-enabled provider, the operator is able to return the obtained results transformed to the desired schema. There is also provision for the tagging of each record with a unique identifier extracted from the results and described by an additional optional XPath expression.
  • Both direct and brokered result processing is supported. Some OpenSearch-enabled providers diverge from the common case of returning a set of direct results and instead provide their results indirectly, by returning a set of links to other OpenSearch-enabled providers. Provided that both a transformation specification used to extract these links from the returned results as well as the OpenSearch resources for each one of the brokered OpenSearch services are available, the operator will return the full set of results provided by the brokered OpenSearch services.
  • The support of a set of fixed parameters, which override the user-provided parameters only at the level of the top provider, i.e. either the broker or, in the direct case, the only direct provider. These parameters serve two purposes, which can also be combined. They facilitate the creation of dynamic collections from results obtained through brokers, since the fixed parameters are taken into account when querying the broker while only the user-defined parameters are used on lower levels. They also allow the behaviour of a provider to be customized to the needs of the gCube Framework.
  • Support for one or more security schemes is planned for a subsequent version of the OpenSearch Library.
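
The fixed-parameter rule listed above can be pictured as a simple map merge. The following is only a minimal sketch under stated assumptions: the class FixedParameterMerge, the method effectiveParameters and the topProvider flag are hypothetical names introduced for illustration and are not part of the OpenSearch Library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the OpenSearch Library API.
final class FixedParameterMerge {

    // Returns the parameters to use when querying one provider.
    // For the top provider (the broker, or the only direct provider) the
    // fixed parameters override user-provided ones; on lower levels only
    // the user-provided parameters are used.
    static Map<String, String> effectiveParameters(Map<String, String> userParams,
                                                   Map<String, String> fixedParams,
                                                   boolean topProvider) {
        Map<String, String> result = new LinkedHashMap<>(userParams);
        if (topProvider) {
            result.putAll(fixedParams); // fixed values win on key clashes
        }
        return result;
    }
}
```

With user parameters {searchTerms=fish, count=10} and fixed parameters {count=50}, this sketch would query the top provider with count=50 and any brokered provider with count=10.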

OpenSearch Resource

The purpose of an OpenSearch resource object is to describe the specifications of an OpenSearch provider. It encapsulates the extensions described in the Extensibility Points section. The attributes included are the following:

  • The name of the resource
  • The URL of the OpenSearch Description Document of the provider to be queried
  • Information about whether the provider returns direct or brokered results, used by the operator to adapt its operation to both kinds of providers.
  • Data transformation specifications for a subset of the MIME types of the results which the result provider returns. The data transformation consists of two or, optionally, three parts:
    • The RecordSplitXPath expression is used to split a page of search results into individual records. For example, for the RSS format, the <item> elements under rss/channel could be of interest.
    • The XSLTLink contains a pointer to an XSLT which is used to transform the individual records to the target schema.
    • The optional RecordIdXPath expression can be used to tag each record with a unique identifier, extracted from the record itself. If the element is found, its payload is added as a DocID record attribute. Otherwise, if the element is not found or is empty, no DocID attribute is added to the record.
  • Security specifications (planned for a future version, when the supported security specifications are decided on). This element is optional, its absence implying the absence of a security scheme.
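
The record splitting and identifier tagging described in the transformation attribute above can be sketched with the standard JDK XPath API. This is an illustration only, not the operator's actual code: the class RecordSplitSketch and the method docIds are hypothetical, the XSLT step is left out, and the record identifier XPath is assumed to be evaluated relative to each record.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of RecordSplitXPath / RecordIdXPath handling.
final class RecordSplitSketch {

    // Splits a page of results into records and returns one entry per record:
    // the DocID value, or null when the id element is missing or empty
    // (in which case no DocID attribute would be added).
    static List<String> docIds(String pageXml, String recordSplitXPath, String recordIdXPath) {
        try {
            Document page = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pageXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList records = (NodeList) xpath.evaluate(recordSplitXPath, page, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                // Assumption: the id expression is relative to the record node.
                String id = xpath.evaluate(recordIdXPath, record).trim();
                ids.add(id.isEmpty() ? null : id);
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not split result page", e);
        }
    }
}
```

For an RSS page, a call such as docIds(page, "/rss/channel/item", "guid") would yield one entry per item, with null for items that carry no guid element.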

The serialization of an OpenSearch Resource can be easily incorporated into a Generic Resource. In its default mode of operation, the OpenSearch Operator in fact obtains the necessary OpenSearch resources by retrieving the corresponding Generic Resources from the IS. There are two types of Generic Resources utilized by the OpenSearch Operator:

  • The OpenSearchResource which contains the body of the OpenSearch Resource as described below
  • The OpenSearchXSLT which contains the XSLT portion of a transformation specification

The XSLT pointer, in this case, contains the name of the OpenSearchXSLT generic resource it points to.

Note that, solely for testing purposes, the OpenSearch Operator also supports a local mode of operation, whereby all OpenSearch Resources are loaded from the local file system. In that case, the XSLTLink element contains a URL pointing to the corresponding XSLT file.

The XML Schema that all OpenSearch Resource serializations should conform to is the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="OpenSearchResource">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="descriptionDocumentURI" type="xs:string"/>
            <xs:element name="brokeredResults" type="xs:boolean"/>
            <xs:element name="transformation" maxOccurs="unbounded">
               <xs:complexType>
                  <xs:sequence>
                     <xs:element name="MIMEType" type="xs:string"/>
                     <xs:element name="recordSplitXPath" type="xs:string"/>
                     <xs:element name="recordIdXPath" type="xs:string" minOccurs="0" maxOccurs="1"/>
                     <xs:element name="XSLTLink" type="xs:string"/>						
                  </xs:sequence>
               </xs:complexType>
            </xs:element>
            <xs:element name="security" minOccurs="0">
               <xs:complexType>
                  <xs:sequence>
                  </xs:sequence>
               </xs:complexType>
            </xs:element>		
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

The transformation element can appear multiple times within an OpenSearch Resource. The usual case is for a single transformation element per provider to be specified, but if transformation elements are present for more than one MIME type, the operator has the alternative of resorting to the next available schema in sequence, in the event of a failure in the transformation phase.

In the case of querying providers which return brokered results, the transformation element is used to specify a data transformation that extracts the URLs of the Description Documents of the brokered OpenSearch services from the initial results provided by the OpenSearch service acting as a broker.

OpenSearch Operator Logic

The OpenSearch Operator employs the functionality provided by the OpenSearch Library in order to extract the required information from the Description Document of the external provider and to obtain the QueryBuilders needed in order to perform queries, in a fashion similar to that described in the OpenSearch Library Functionality section. In order for the proper way of constructing queries to be selected, a preprocessing step is performed by the Operator before issuing any actual queries. Given that an OpenSearch Resource can contain more than one transformation specification and that the number of templates present in a URL Element is not necessarily limited to one, there can be more than one potential template match for the caller's query. The purpose of the preprocessing step is to select the QueryBuilder whose parameters best match the caller's query. First, all MIME types supported by the external provider but lacking an associated transformation specification in the OpenSearch Resource describing the provider are discarded. The QueryBuilders of the first available MIME type for which a transformation specification exists are then processed and reordered according to the following rules:

  • QueryBuilders lacking a parameter contained in the caller's query are discarded
  • QueryBuilders are reordered according to the number of query parameters

The most usual case is for the provider's OpenSearch Resource to be set up with a single transformation specification for the MIME type of interest. Furthermore, most URL Elements provide a single template for each MIME type. This case results in only one QueryBuilder being available to construct queries.
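
The two preprocessing rules can be sketched as a filter followed by a sort. The sketch below is not the operator's implementation: each template is reduced to the set of its parameter names, the class TemplateSelectionSketch and the method candidates are hypothetical, and the ordering direction (fewest parameters first) is an assumption, since the text only states that the number of query parameters determines the order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the QueryBuilder preprocessing step.
final class TemplateSelectionSketch {

    // templateParams: one set of parameter names per available template.
    // callerParams:   the parameter names present in the caller's query.
    static List<Set<String>> candidates(List<Set<String>> templateParams, Set<String> callerParams) {
        List<Set<String>> kept = new ArrayList<>();
        for (Set<String> params : templateParams) {
            // Rule 1: discard templates lacking a parameter of the caller's query.
            if (params.containsAll(callerParams)) {
                kept.add(params);
            }
        }
        // Rule 2: reorder by number of parameters (assumed: ascending).
        kept.sort(Comparator.comparingInt(Set::size));
        return kept;
    }
}
```

Under these assumptions, a caller query containing only searchTerms would discard a template offering only startPage and would prefer a template with searchTerms alone over one that also declares startPage and count.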

The functions performed by the operator in order for a set of results to be retrieved are summarized in the following simplified diagram:

A simplified flowchart of the operations performed by the OpenSearch operator

As shown, the operator accepts a set of query terms and a set of query parameters.

The operator's main course of action is to formulate and send queries requesting pages of search results as long as there still are results to be returned and the caller requirement of the number of results, if present, is not met. A pager component sees that page switching is performed correctly, managing the relevant standard OpenSearch query parameters, namely startPage or startIndex and count. These parameters are therefore abstracted away by the OpenSearch Operator.
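
The paging loop described above can be sketched as follows. This is a simplified illustration with hypothetical names (PagerSketch, PageFetcher, retrieve), using startIndex-based paging only. The PageFetcher interface stands in for sending one OpenSearch query and parsing the returned page, and a negative limit is assumed to mean that the caller imposed no limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the pager logic, not the operator's code.
final class PagerSketch {

    // Stand-in for "send one query with these paging parameters and parse the page".
    interface PageFetcher {
        List<String> fetch(int startIndex, int count);
    }

    // Requests pages until the provider has no more results or the
    // caller's limit is met. A negative limit means "no limit" (assumption).
    static List<String> retrieve(PageFetcher fetcher, int resultsPerPage, int limit) {
        List<String> results = new ArrayList<>();
        int startIndex = 1; // OpenSearch 1.1 index offsets default to 1
        while (limit < 0 || results.size() < limit) {
            int count = limit < 0 ? resultsPerPage
                                  : Math.min(resultsPerPage, limit - results.size());
            List<String> page = fetcher.fetch(startIndex, count);
            if (page.isEmpty()) {
                break; // no results left
            }
            results.addAll(page);
            startIndex += page.size();
        }
        // Guard against providers that return more than the requested count.
        if (limit >= 0 && results.size() > limit) {
            return new ArrayList<>(results.subList(0, limit));
        }
        return results;
    }
}
```

The caller only supplies the limit. The startIndex and count values are computed inside the loop, which mirrors how these parameters are abstracted away by the operator.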

In the case of resources which return brokered results, the operator first retrieves the endpoints of the underlying brokered OpenSearch providers and reads their corresponding OpenSearch Resources so as to be able to loop through these resources while retrieving results. This function is implied in the diagram. Furthermore, if the operator cannot obtain an OpenSearch Resource structure for one or more of the brokered services, it ignores those services and continues with the retrieval of results from the next available brokered service. The same holds if all query formulation attempts for a provider fail.
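
The skip-on-failure behaviour for brokered providers can be sketched as a loop that tolerates missing resources and failed providers. The names BrokeredLoopSketch, ProviderQuery and collect are hypothetical, resources are represented as plain strings, and a RuntimeException stands in for the case where all query formulation attempts for a provider fail. The sketch covers sequential retrieval only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of looping over brokered providers.
final class BrokeredLoopSketch {

    // Stand-in for querying one brokered provider described by its resource.
    interface ProviderQuery {
        List<String> fetchAll(String openSearchResource);
    }

    static List<String> collect(List<String> endpoints,
                                Map<String, String> resourcesByEndpoint,
                                ProviderQuery query) {
        List<String> all = new ArrayList<>();
        for (String endpoint : endpoints) {
            String resource = resourcesByEndpoint.get(endpoint);
            if (resource == null) {
                continue; // no OpenSearch Resource available: ignore this provider
            }
            try {
                all.addAll(query.fetchAll(resource));
            } catch (RuntimeException e) {
                // Query formulation failed for this provider: move on to the next one.
            }
        }
        return all;
    }
}
```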

Configurable Parameters

The OpenSearch Operator can be programmatically configured by passing to it a special configuration construct upon creation. The configuration parameters are the following:

  • The resultsPerPage parameter instructs the operator as to how many results per page should be requested when no other paging restrictions are in effect. The default value for this parameter currently is 100.
  • The sequentialResults parameter controls whether multi-threaded result retrieval from brokered providers is disabled. When it is enabled, the results are retrieved from each provider in a sequential manner, i.e. the results retrieved from different providers are not intermingled. There is, however, a negative impact on performance. The default value for this configuration parameter currently is false.
  • The useLocalResource parameter, when enabled, permits the operator to operate in the absence of an IS. The OpenSearch resources are instead retrieved from the local file system. It is used solely for testing purposes, which is why its default value is, and will remain, false.

An additional configurable element is the set of mappings from query namespaces to the corresponding factories, as described in Library Extensibility. The sequentialResults parameter can also be configured on a per-query basis, by including it in the query string as a query parameter.

Query Format

The OpenSearch operator expects to receive all query parameters, including the search terms, in a single query string. All query parameters should be of the form

<URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"

and should be space-delimited. Note that the presence of a namespace is mandatory for standard OpenSearch parameters as well. Any free-text parameter value should be URL-encoded.

The reserved keyword config, when used as a parameter namespace, denotes a configuration parameter. The query configuration parameters under the special configuration namespace include the sequentialResults parameter described in Configurable Parameters, plus the numOfResults parameter, which can be used to impose a limit on the number of retrieved results. These two query configuration parameters are optional. The following hold for the query configuration parameter values:

  • The sequentialResults parameter should be assigned a value equal to true or false. Its absence implies the default value of the corresponding configurable parameter of the operator.
  • The numOfResults parameter should be assigned an integral value, its absence implying that all available results should be retrieved.

Taking everything into account, an example of a legitimate query for the OpenSearch Operator could be the following:

http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms="Hello+World" config:numOfResults="300"

which instructs the operator to use the string Hello World as the value for the searchTerms standard OpenSearch parameter and to retrieve up to 300 results from the provider.
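
A query string in this format can be taken apart with a regular expression and the JDK's URLDecoder. The sketch below only illustrates the format and is not the operator's parser: the class QueryStringSketch, its Param type and the parse method are hypothetical, and malformed tokens are silently skipped rather than reported.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the operator's query string format.
final class QueryStringSketch {

    static final class Param {
        final String namespace; // decoded namespace URI, or the keyword "config"
        final String name;
        final String value;     // decoded parameter value

        Param(String namespace, String name, String value) {
            this.namespace = namespace;
            this.name = name;
            this.value = value;
        }
    }

    // <URL-encoded namespace>:<name>="<value>"; an encoded namespace
    // contains neither whitespace nor a literal colon.
    private static final Pattern PARAM =
            Pattern.compile("([^\\s:]+):([^\\s:=]+)=\"([^\"]*)\"");

    static List<Param> parse(String query) {
        List<Param> params = new ArrayList<>();
        Matcher m = PARAM.matcher(query);
        while (m.find()) {
            params.add(new Param(decode(m.group(1)), m.group(2), decode(m.group(3))));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
```

Applied to the example query above, this sketch yields two parameters: searchTerms in the namespace http://a9.com/-/spec/opensearch/1.1/ with the value Hello World, and numOfResults in the config namespace with the value 300.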

The OpenSearch Service

Description

The OpenSearch service is a stateful service responsible for the invocation of the OpenSearch Operator in the context of the provider to be queried. It also maintains a cache per provider WS-Resource, which contains the Generic Resources relevant to the top provider, the Generic Resources of all previously queried brokered providers and the corresponding Description Documents.

WS and Generic Resource Interrelation

Provided that a Content/Metadata Collection pair for the provider to be queried is available, the OpenSearch Service uses a WS-Resource for each OpenSearch provider in order to bind the Service to the Metadata Collection corresponding to the provider.

An OpenSearch Service WS-Resource contains the following properties:

  • The Metadata Collection ID of the collection to be used
  • The templates published in the Description Document of the provider. These are extracted using a method supplied by the General-Purpose OpenSearch Library
  • The ID of the Generic Resource of the top-provider, where the top-provider is the broker in the brokered case or the one and only direct provider in the direct case.
  • The URI of the Description Document of the top-provider.
  • A set of Fixed Parameters, which are used in every invocation of the Operator. See also Extensibility Points.

As mentioned above, the WS-Resource contains a reference only to the OpenSearchResource of the top-provider. The Generic Resources of any providers reached through a broker are retrieved through an Information System and are therefore not directly referenced by the WS-Resource. The same holds for all OpenSearchXSLT generic resources.

Some properties are dependent on information residing in the Generic Resources describing the providers and should, therefore, be updated accordingly when these Generic Resources are modified. Examples of such properties include the Description Document URI and the templates.

Resource Caching

For performance and reliability reasons, the OpenSearch Service maintains one cache per WS-Resource which initially contains the Generic Resources (of both OpenSearchResource and OpenSearchXSLT types) and the Description Document of the top-provider. In the brokered case, the cache is updated at run-time by the relevant OpenSearch Operator module with the Generic Resources and Description Documents of all providers reached through the broker.

To account for potential updates of Description Documents and/or the Generic Resources, the cache can be refreshed either on demand or periodically based on a configurable time interval. The periodic refresh operation can be disabled if the Description Documents and the Generic Resource configuration are considered to be stable enough.

The cache refresh cycle policy used is described as follows:

  • The Description Document and Generic Resources of the top-provider are discarded and their updated versions are retrieved from the external site and the Information System respectively. All dependent WS-Resource properties are also updated accordingly.
  • The Description Documents and Generic Resources of all brokered providers, if any, are discarded. Because of the potentially large number of brokered providers, the cache is not repopulated with their updated versions to avoid locking the cache for large periods of time or using a partially updated cache. These pieces of information are re-cached at run-time instead.
  • In the event of failure, the previously cached version is kept.
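
The refresh policy above can be sketched as follows. This is a simplified model with hypothetical names (CacheRefreshSketch and its methods), in which cached documents and resources are plain strings. It assumes that a failure while reloading the top-provider entry leaves the whole previous cache untouched, which is one possible reading of the last rule.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration of the cache refresh cycle.
final class CacheRefreshSketch {

    private String topProviderEntry;
    private final Map<String, String> brokeredEntries = new HashMap<>();

    CacheRefreshSketch(String initialTopProviderEntry) {
        this.topProviderEntry = initialTopProviderEntry;
    }

    // Runs one refresh cycle; returns false if the previous cache was kept.
    synchronized boolean refresh(Supplier<String> topProviderLoader) {
        String fresh;
        try {
            fresh = topProviderLoader.get(); // fetch updated top-provider data
        } catch (RuntimeException e) {
            return false; // failure: keep the previously cached version
        }
        topProviderEntry = fresh;
        brokeredEntries.clear(); // brokered data is re-cached lazily at run-time
        return true;
    }

    // Run-time lookup for a brokered provider, caching on first use.
    synchronized String brokered(String providerId, Supplier<String> loader) {
        return brokeredEntries.computeIfAbsent(providerId, id -> loader.get());
    }

    synchronized String topProvider() {
        return topProviderEntry;
    }

    synchronized int brokeredCount() {
        return brokeredEntries.size();
    }
}
```

Clearing the brokered entries instead of reloading them keeps the refresh cycle short, in line with the stated goal of not locking the cache for long periods.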

Operations

The operations exposed by the OpenSearch Service are the following:

  • The query operation, with a single input message containing the query string to be sent to the operator, whose format is described in Query Format.
  • The refreshCache operation, which sends a request in order to force the cache of the Service to be refreshed. No refresh cycle will be initiated if a periodic refresh cycle is currently in progress.

Configurable Parameters

The Service currently supports two configurable parameters, which are exposed in its deployment descriptor:

  • The clearCacheOnStartup parameter, of boolean type, when enabled instructs the service to discard the stored cache on startup.
  • The cacheRefreshIntervalMillis parameter, of integral type, defines a time interval for the periodic cache refresh operation. The value 0 can be used to disable periodic cache refresh cycles.
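
The effect of the cacheRefreshIntervalMillis parameter can be sketched with the standard ScheduledExecutorService. The class RefreshSchedulingSketch and the method schedule are hypothetical and do not reflect how the Service actually schedules its refresh cycles. Treating negative values like 0 is an additional assumption.

```java
import java.util.Optional;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of periodic cache refresh scheduling.
final class RefreshSchedulingSketch {

    // Schedules the refresh cycle, or returns an empty Optional when
    // periodic refresh is disabled (interval 0; negatives assumed disabled too).
    static Optional<ScheduledFuture<?>> schedule(ScheduledExecutorService executor,
                                                 long cacheRefreshIntervalMillis,
                                                 Runnable refreshCycle) {
        if (cacheRefreshIntervalMillis <= 0) {
            return Optional.empty();
        }
        ScheduledFuture<?> handle = executor.scheduleWithFixedDelay(
                refreshCycle,
                cacheRefreshIntervalMillis,  // initial delay
                cacheRefreshIntervalMillis,  // delay between cycles
                TimeUnit.MILLISECONDS);
        return Optional.of(handle);
    }
}
```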