OpenSearch Framework

From Gcube Wiki
Revision as of 00:11, 20 November 2010 by Gerasimos.farantatos (Talk | contribs) (OpenSearch Operator Logic)

Description

The role of the gCube OpenSearch Framework is to enable the gCube Framework to access external providers which publish their results through search engines conforming to the OpenSearch Specification. The framework consists of two components:

  • The OpenSearch Library, which includes a general-purpose library and the OpenSearch Operator which utilizes functionality provided by the former, and
  • The OpenSearch Service, which binds collections with provider-specific information encapsulated in generic resources and invokes the OpenSearch Operator

To resolve ambiguity, the name "OpenSearch Library" will be used when referring to the whole OpenSearch Library component, whereas the name "General-Purpose Library" will be used when referring to the library constituent of the component.

The OpenSearch Library

The General-Purpose OpenSearch Library

Description

The General-Purpose OpenSearch Library conforms to the latest OpenSearch specification and provides general OpenSearch-related functionality to any component which needs to query OpenSearch providers. It can optionally be extended, as described in the Extensibility section, to support OpenSearch Extensions whose parameters or other elements need special handling. The OpenSearch Operator, described in a following section, functions atop this library.

Functionality

The central class through which the functionality of the library is exploited is the DescriptionDocument class. For reasons explained in the following section, the DescriptionDocument class needs to be provided with a pair of URLElementFactory and QueryElementFactory factory classes. Provided that the query parameter namespaces present in the query string are extracted in some way and a namespace-to-factory mapping is available, this pair can be obtained from the FactoryResolver class, as follows:

FactoryPair factories = FactoryResolver.getFactories(queryNamespaces, factoryMapping);

The DescriptionDocument is then instantiated as follows:

DescriptionDocument dd = new DescriptionDocument(descriptionDocumentXML, factories.urlElFactory, factories.queryElFactory);

where the descriptionDocumentXML parameter corresponds to a DOM Document object containing the parsed Description Document. Properly instantiated, the DescriptionDocument class can provide any information relevant to the processed Description Document, as well as a mechanism to formulate search queries to send to the OpenSearch provider described by the Description Document. The latter is achieved by a QueryBuilder object, which can be obtained as follows:

List<QueryBuilder> qbs = dd.getQueryBuilders(rel, MimeType);

where rel is a rel value as described in the OpenSearch Specification, e.g. results, and MimeType is a MIME type, such as application/rss+xml. The returned list contains one QueryBuilder object for each template contained in a URL Element with the specified rel and type attributes. Once the desired QueryBuilder is selected, it can be used to formulate a query by first assigning values to the parameters and then obtaining the constructed query. For example, the searchTerms parameter can be set to some value as follows:

qb.setParameter(OpenSearchConstants.searchTermsQName, searchTerms);

Once all the required parameters are set, the constructed query can be obtained as follows:

URL query;
try {
    query = qb.getQuery();
} catch (IncompleteQueryException iqe) {
    // Incomplete query exception handling
} catch (MalformedQueryException mqe) {
    // Malformed query exception handling
}

Once the query is properly constructed and is available, it can be sent to the search engine of the provider in order to retrieve results. The returned results should be passed to either HTMLResponse or XMLResponse, depending on the MIME type of the OpenSearch response, in order for the OpenSearch Response Elements and any other available information contained in the response to be processed.

InputStream responseStream = query.openConnection().getInputStream();
OpenSearchResponse response = new XMLResponse(responseStream, factories.queryElFactory, qb, outputEncoding,  dd.getURIToPrefixMappings());

The raw XML data can then be obtained by the OpenSearchResponse object as follows:

response.getResponse();

and any information available, mainly relevant to paging, can be obtained by one of the methods of the OpenSearchResponse class. For example the total number of results, as reported by the totalResults Response Element, if present, can be obtained as follows:

response.getTotalResults();

Library Extensibility

Motivation

The core functionality provided by the GP-OpenSearch Library is not limited to the processing of only standard OpenSearch parameters. More specifically, the basic components of the library treat all extended query parameters in a uniform way, making only the assumptions holding for any OpenSearch parameter, be it standard or extended. Furthermore, any unrecognized markup or element value is simply ignored. An example of an assumption made by the GP-OpenSearch Library, in the form of a requirement, is that all parameter values passed to its QueryBuilder components should be URL-encoded. This requirement is in accordance with the OpenSearch Specification and causes no problems for most OpenSearch parameters. In fact, if a user failed to URL-encode free-text values, query formulation would fail in the query URL construction phase.

There are, however, cases in which the previous requirement proves problematic. For example, the OpenSearch Geo Extension presents examples of parameter values in which the comma character is not URL encoded, regardless of what the specifications state. Such parameter values could therefore call for an extra URL decoding preprocessing step, or otherwise the user should be required to not URL encode the values. Furthermore, it would be quite useful if the library could be aware of the specific format and other peculiarities and rules governing the syntax of extended parameters, for the purpose of query validation and for supporting any extra functionality provided by the extension. The support of value-adding functionality provided by extended OpenSearch elements by the library could also prove useful.
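The encoding issue can be illustrated with a minimal sketch that uses only the standard java.net.URLEncoder and URLDecoder classes, not the GP-OpenSearch Library itself. The class name EncodingSketch and the bounding-box value are made up for the example.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the GP-OpenSearch Library.
class EncodingSketch {
    // Encodes a value the way a caller would before passing it to a QueryBuilder.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8); // Java 10+ overload
    }

    // Reverses the encoding, e.g. as a preprocessing step for comma-separated values.
    static String decode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String box = "-10.5,40.2,5.0,50.1";          // made-up bounding box value
        System.out.println(encode("Hello World"));   // Hello+World
        System.out.println(encode(box));             // -10.5%2C40.2%2C5.0%2C50.1
        System.out.println(decode(encode(box)));     // -10.5,40.2,5.0,50.1
    }
}
```

The encoded comma (%2C) is what a provider expecting a literal comma might reject, which is why an extension-aware component may want to decode such values before placing them in the query URL.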

Extensibility Mechanism

The extensibility mechanism chosen for the library focuses on extensible elements, as described in the OpenSearch Specification, namely URL Elements and Query Elements. Furthermore, QueryBuilder components are included in the extensibility mechanism as dependent on the aforementioned elements.

Given that the number of available OpenSearch extensions is quite large and that no single OpenSearch provider uses all of them at the same time, the extensibility mechanism should allow the easy inclusion of library extensions for specific OpenSearch extensions in a dynamic, pluggable fashion. Furthermore, it should allow extension-related functionality to be dynamically added depending on the complexity of the user's query.

The mechanism found to best satisfy the above requirements, and implemented as the extensibility pattern for the library, is the construction of a Chain of Responsibility for each extensible component. A more detailed explanation follows:

  • The URLElement, QueryElement and QueryBuilder components are interfaces whose implementations support core or extension-related functionality.
  • Core functionality processing takes place in the last link of the chain of responsibility. For example, BasicQueryBuilder implements core QueryBuilder functionality.
  • Each component implementing extension-related functionality contains a reference to the next link in the chain of responsibility. For example if GeoQueryBuilder implements functionality related to the Geo OpenSearch Extension, it contains a reference to a QueryBuilder implementing either core functionality, or functionality related to some other extension.
  • Each link in the chain of responsibility should process whatever information it can handle, otherwise forward the request to the next link in the chain.
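The chain described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's actual code: the SketchQueryBuilder interface, the two implementing classes, the "urn:example:geo" namespace and the four-value check on the box parameter are all hypothetical stand-ins for the real QueryBuilder, BasicQueryBuilder and GeoQueryBuilder components.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the library's QueryBuilder interface.
interface SketchQueryBuilder {
    // Returns true if some link in the chain accepted the parameter.
    boolean setParameter(String namespace, String name, String value);

    // Returns the parameters accepted so far by this link and all following links.
    Map<String, String> parameters();
}

// Last link of the chain: handles core OpenSearch parameters only.
class CoreSketchBuilder implements SketchQueryBuilder {
    static final String NS = "http://a9.com/-/spec/opensearch/1.1/";
    private final Map<String, String> params = new LinkedHashMap<>();

    public boolean setParameter(String namespace, String name, String value) {
        if (!NS.equals(namespace)) {
            return false; // no further link to forward to: the parameter is ignored
        }
        params.put(name, value);
        return true;
    }

    public Map<String, String> parameters() {
        return params;
    }
}

// Extension link: processes its own namespace and forwards everything else.
class GeoSketchBuilder implements SketchQueryBuilder {
    static final String NS = "urn:example:geo"; // placeholder namespace (assumption)
    private final SketchQueryBuilder next;
    private final Map<String, String> geoParams = new LinkedHashMap<>();

    GeoSketchBuilder(SketchQueryBuilder next) {
        this.next = next;
    }

    public boolean setParameter(String namespace, String name, String value) {
        if (!NS.equals(namespace)) {
            return next.setParameter(namespace, name, value); // not ours: forward
        }
        // Extension-specific validation that the core link cannot perform.
        if (name.equals("box") && value.split(",").length != 4) {
            throw new IllegalArgumentException("box needs four comma-separated values");
        }
        geoParams.put(name, value);
        return true;
    }

    public Map<String, String> parameters() {
        Map<String, String> all = new LinkedHashMap<>(next.parameters());
        all.putAll(geoParams);
        return all;
    }
}
```

A chain is assembled by wrapping the core link, e.g. new GeoSketchBuilder(new CoreSketchBuilder()). Callers only ever talk to the first link, so adding or removing an extension does not change the calling code.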


In order for a chain of responsibility to be dynamically created by the DescriptionDocument class, a similar chain of abstract factories should be implemented. The resulting factories, one for URL Elements and another for Query Elements can then be passed to the constructor of the DescriptionDocument in order for it to be able to construct the correct elements. The FactoryResolver utility is responsible for the construction of factories capable of constructing objects supporting no more than the functionality necessary to process a given query. Since the chain structure is already known when constructing QueryBuilder objects, the latter are constructed without explicitly supplying a factory to the DescriptionDocument, by the getQueryBuilder method of the already constructed URLElement.

The FactoryResolver requires that two things be known in order to be able to construct the factories:

  • A set of mappings from namespace URIs to factory class names, one for each component implementing either core functionality (in this case the namespace URI being equal to the OpenSearch namespace) or extension-related functionality. An example of such a mapping is <http://a9.com/-/spec/opensearch/1.1/, (org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElement, org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElement)>, which declares that the implementations responsible for providing core functionality are the BasicURLElement and BasicQueryElement classes.
  • A list of all parameter namespaces present in the query string.

Having the above information, and provided that the implementations of all factories and classes are available, the FactoryResolver will be able to construct, via reflection, the factories which can be used in order for the query to be properly processed. For example, if there are implementations available for core functionality, as well as for the Geo and Time extensions, and the query contains parameters for both of these extensions, the resulting chain of responsibility of the constructed objects will contain all three implementations, the last one being the implementation which supports core functionality, and the other two appearing in the chain in any order. If the query contains only standard OpenSearch parameters, there is no need for the chain to be burdened with links that will never be used, therefore a chain consisting of only the core implementation is constructed. The same holds if, for example, there are no Geo parameters present in the query; the corresponding implementation will not be included in the chain.
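The selection logic can be sketched as below. This is only an illustration of the rule "include a link only if its namespace appears in the query, core last". The ChainPlanner class, the simplified one-name-per-namespace mapping and the implementation names used with it are assumptions; the real FactoryResolver works with factory pairs and instantiates them via reflection, which is omitted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper illustrating which links end up in the chain.
class ChainPlanner {
    static final String CORE_NS = "http://a9.com/-/spec/opensearch/1.1/";

    // Returns the implementation names to chain: needed extensions first, core last.
    static List<String> plan(Map<String, String> nsToImpl, List<String> queryNamespaces) {
        if (!nsToImpl.containsKey(CORE_NS)) {
            throw new IllegalArgumentException("no core implementation mapped");
        }
        Set<String> seen = new LinkedHashSet<>(queryNamespaces); // drop duplicates, keep order
        List<String> chain = new ArrayList<>();
        for (String ns : seen) {
            // Namespaces without a mapped implementation are left to the core link.
            if (!ns.equals(CORE_NS) && nsToImpl.containsKey(ns)) {
                chain.add(nsToImpl.get(ns));
            }
        }
        chain.add(nsToImpl.get(CORE_NS)); // core functionality is always the last link
        return chain;
    }
}
```

With mappings for core, Geo and Time and a query that uses only core and Geo namespaces, plan returns the Geo implementation followed by the core one; the Time implementation is left out.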

It should be stressed again that it is not necessary for all extensions that are expected to be met to be implemented in order for the library to work. The extension of the library remains a purely optional task. Therefore there are two choices when using the library

  • Do not extend the library when in need of using extended parameters, relying on the core functionality provided by the library. In that case, the user should be careful to supply the library with the correct values and format of parameters, so that a query can be constructed, albeit without the option of query validation or the ability to exploit additional functionality related to the extension.
  • Extend the library whenever this proves useful or makes things easier.

Implementing a new Extension

The OpenSearch Operator

Description

The role of the OpenSearch operator is to provide support for querying and retrieval of search results via OpenSearch from providers which expose an OpenSearch description document. The operator accepts a query string consisting of a set of query parameters, which may include a number of search terms, and an OpenSearch Resource reference, which contains the URL of an OpenSearch description document and various specifications relevant to the OpenSearch provider to be queried. After performing the number of OpenSearch queries required to obtain the desired results, it returns these results wrapped in a ResultSet.

Extensibility Points

The operator introduces and makes use of a set of functionalities beyond those of the standard OpenSearch specification. These extensions are supported by the introduction of a special OpenSearch Resource structure and by the internal logic of the operator, the latter using standard OpenSearch functionality provided by the general-purpose OpenSearch library. The extra functionalities are summarized as follows:

  • The support of data transformation by the operator. Provided that a transformation specification, in the form of an XPath-XSLT pair, is available for one of the MIME types of the results returned by an OpenSearch-enabled provider, the operator is able to return the obtained results transformed to the desired schema. There is also provision for the tagging of each record with a unique identifier extracted from the results and described by an additional optional XPath expression.
  • Both direct and brokered result processing is supported. Some OpenSearch-enabled providers diverge from the common case of returning a set of direct results and instead provide their results indirectly, by returning a set of links to other OpenSearch-enabled providers. Provided that both a transformation specification used to extract these links from the returned results as well as the OpenSearch resources for each one of the brokered OpenSearch services are available, the operator will return the full set of results provided by the brokered OpenSearch services.
  • The support of a set of fixed parameters, which override the user-provided parameters only at the level of the top provider, i.e. either the broker or, in the direct case, the single direct provider. These parameters serve two purposes, possibly at once: to facilitate the creation of dynamic collections from results obtained by brokers, by taking the fixed parameters into account while querying the broker and only the user-defined parameters on lower levels; and to customize the behaviour of a provider to the needs of the gCube Framework.
  • Support for one or more security schemes is planned for a subsequent version of the OpenSearch Library.

OpenSearch Resource

The purpose of an OpenSearch resource object is to describe the specifications of an OpenSearch provider. It encapsulates the extensions described in the Extensibility Points section. The attributes included are the following:

  • The name of the resource
  • The URL of the OpenSearch Description Document of the provider to be queried
  • Information about whether the provider returns direct or brokered results, used by the operator to adapt its operation to both kinds of providers.
  • Data transformation specifications for a subset of the MIME types of the results which the result provider returns. The data transformation consists of two or, optionally, three parts:
    • The RecordSplitXPath expression is used to split a page of search results into individual records. For example, for the RSS format, the <item> elements under rss/channel could be of interest.
    • The XSLTLink contains a pointer to an XSLT which is used to transform the individual records to the target schema.
    • The optional RecordIdXPath expression can be used to tag each record with a unique identifier, extracted from the record itself. If the element is found, its payload is added as a DocID record attribute. Otherwise, if the element is not found or is empty, no DocID attribute is added to the record.
  • Security specifications (planned for a future version, when the supported security specifications are decided on). This element is optional, its absence implying the absence of a security scheme.
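The record splitting and identifier extraction steps of a transformation specification can be sketched with the standard javax.xml.xpath API. This is a minimal illustration, not the operator's implementation: the RecordSplitter class is hypothetical, the XSLT step is left out, and a null entry stands for "no DocID attribute is added".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: applies a record-split XPath and a record-id XPath to one result page.
class RecordSplitter {
    // Returns one entry per record: the extracted identifier, or null if it is missing or empty.
    static List<String> extractIds(String pageXml, String recordSplitXPath, String recordIdXPath) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Split the page into individual records.
            NodeList records = (NodeList) xpath.evaluate(
                    recordSplitXPath, new InputSource(new StringReader(pageXml)), XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                // The id expression is evaluated relative to the record itself.
                String id = xpath.evaluate(recordIdXPath, record).trim();
                ids.add(id.isEmpty() ? null : id);
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath expression or unparsable page", e);
        }
    }
}
```

For an RSS page, a split expression such as /rss/channel/item combined with an id expression such as guid would yield one identifier per item, with null for items that carry no guid element.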

The serialization of an OpenSearch Resource can be easily incorporated into a Generic Resource. The default mode of operation for the OpenSearch Operator in fact obtains the necessary OpenSearch resources by retrieving the corresponding Generic Resources from the IS. There are two types of Generic Resources utilized by the OpenSearch Operator

  • The OpenSearchResource which contains the body of the OpenSearch Resource as described below
  • The OpenSearchXSLT which contains the XSLT portion of a transformation specification

The XSLT pointer, in this case, contains the name of the OpenSearchXSLT generic resource it points to.

Note that, solely for testing purposes, the OpenSearch Operator also supports a local mode of operation, whereby all OpenSearch Resources are loaded from the local file system. In that case, the XSLTLink element contains a URL pointing to the corresponding XSLT file.

The XML Schema that all OpenSearch Resource serializations should conform to is the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="OpenSearchResource">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="descriptionDocumentURI" type="xs:string"/>
            <xs:element name="brokeredResults" type="xs:boolean"/>
            <xs:element name="transformation" maxOccurs="unbounded">
               <xs:complexType>
                  <xs:sequence>
                     <xs:element name="MIMEType" type="xs:string"/>
                     <xs:element name="recordSplitXPath" type="xs:string"/>
                     <xs:element name="recordIdXPath" type="xs:string" minOccurs="0" maxOccurs="1"/>
                     <xs:element name="XSLTLink" type="xs:string"/>						
                  </xs:sequence>
               </xs:complexType>
            </xs:element>
            <xs:element name="security" minOccurs="0">
               <xs:complexType>
                  <xs:sequence>
                  </xs:sequence>
               </xs:complexType>
            </xs:element>		
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

The transformation element can appear multiple times within an OpenSearch Resource. The usual case is for a single transformation element per provider to be specified, but if transformation elements are present for more than one MIME type, the operator has the alternative of resorting to the next available transformation in sequence, if the result retrieval procedure fails for some reason. This strategy can only be meaningful if the same amount of information can be obtained from different result MIME types.

In the case of querying providers which return brokered results, the transformation element is used to specify a data transformation that extracts the URLs of the Description Documents of the brokered OpenSearch services from the initial results provided by the OpenSearch service acting as a broker.

OpenSearch Operator Logic

Figure 1: A simplified flowchart of the operations performed by the OpenSearch operator

The OpenSearch Operator employs the functionality provided by the OpenSearch Library in order to extract the required information from the Description Document of the external provider and the QueryBuilders needed in order to perform queries, in a fashion similar to that described in the OpenSearch Library Functionality section.

It should be noted that the OpenSearch Operator abstracts away the MIME type of the results that are to be obtained, treating it as low-level information which can be exploited by the way OpenSearch Resources are structured. Given that the OpenSearch Specification makes no assumptions about differences in the amount of information returned by results of different MIME types, there are two options

  • If the amount of information returned by results of MIME Type A and MIME Type B is different, the desired MIME Type should be selected and an OpenSearch Resource constructed with only this MIME Type present in the transformation specifications. If needed, additional OpenSearch Resources can be constructed to exploit information returned from different MIME types. In this way, the MIME Type is abstracted away by the conceptual level of information detail obtained by the provider.
  • If more than one result MIME type exposes the same amount of information, or contains the same subset of information of interest, there exists the option of specifying more than one transformation specification, in a way which will result in the uniform presentation of the data to the caller. In this way, the MIME Types are abstracted away by unifying result formats to a provider-specific schema. The option of having a single transformation specification is of course available in this case as well.

In order for the proper way of constructing queries to be selected, a preprocessing step is performed by the Operator before issuing any actual queries. Given that an OpenSearch Resource can contain more than one transformation specification and that the number of templates present in a URL Element is not necessarily limited to one, there can be more than one potential template match for the caller's query. The purpose of the preprocessing step is to select the QueryBuilder whose parameters best match the caller's query. First, all MIME types supported by the external provider, but lacking an associated transformation specification in the OpenSearch Resource describing the provider, are discarded. The QueryBuilders of the first available MIME type for which a transformation specification exists are processed and reordered according to the following rules:

  • QueryBuilders whose required parameters are not covered by the parameters of the caller's query are discarded
  • QueryBuilders are reordered so that the first one best matches the caller's query, i.e. all of its required parameters and as many of its optional parameters as possible are covered
  • A QueryBuilder which lacks a parameter present in the caller's query is considered a match. In that case the extra parameter is discarded. This rule assumes that query parameters narrow the search down and is enforced in order to account for brokered providers exposing slightly different sets of parameters than the broker or their siblings.
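The three rules above can be sketched as a filter followed by a stable sort. The TemplateCandidate and TemplateSelector classes are hypothetical, and scoring by the plain count of covered optional parameters is an assumption about what "best matches" means; the operator's actual ordering may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a QueryBuilder: a template with its required and optional parameters.
class TemplateCandidate {
    final String template;
    final Set<String> required;
    final Set<String> optional;

    TemplateCandidate(String template, Set<String> required, Set<String> optional) {
        this.template = template;
        this.required = required;
        this.optional = optional;
    }
}

class TemplateSelector {
    // Number of the candidate's optional parameters that the caller's query supplies.
    static long coveredOptional(TemplateCandidate c, Set<String> queryParams) {
        return c.optional.stream().filter(queryParams::contains).count();
    }

    // Discards candidates with uncovered required parameters, then puts the best match first.
    static List<TemplateCandidate> rank(List<TemplateCandidate> candidates, Set<String> queryParams) {
        List<TemplateCandidate> usable = new ArrayList<>();
        for (TemplateCandidate c : candidates) {
            if (queryParams.containsAll(c.required)) { // rule 1: required parameters must be covered
                usable.add(c);
            }
        }
        // Rule 2: more covered optional parameters first (stable sort keeps ties in input order).
        usable.sort(Comparator
                .comparingLong((TemplateCandidate c) -> coveredOptional(c, queryParams))
                .reversed());
        // Rule 3: query parameters unknown to a candidate do not disqualify it; they are
        // simply not consulted here and would be dropped when the query is built.
        return usable;
    }
}
```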

The most usual case is for the provider's OpenSearch Resource to be set up with a single transformation specification for the MIME type of interest. Furthermore, most URL Elements provide a single template for each MIME type. This case results in only one QueryBuilder being available to construct queries, and therefore in a degenerate reordering step.

The functions performed by the operator in order for a set of results to be retrieved, given that the proper QueryBuilder is selected, are summarized in the simplified diagram of Figure 1.

As shown, the operator accepts a set of query terms and a set of query parameters.

The operator's main course of action is to formulate and send queries requesting pages of search results as long as there still are results to be returned and the caller requirement of the number of results, if present, is not met. A pager component sees that page switching is performed correctly, managing the relevant standard OpenSearch query parameters, namely startPage or startIndex and count. These parameters are therefore abstracted away by the OpenSearch Operator.
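The paging loop can be sketched as follows, assuming index-based paging with startIndex and count. The PagerSketch class, its Page type and the fetch function are hypothetical; fetch stands in for formulating one query and parsing its response, and a negative limit means that no caller limit was given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class PagerSketch {
    // One page of records plus the totalResults value reported by the provider (-1 if absent).
    static class Page {
        final List<String> records;
        final int totalResults;

        Page(List<String> records, int totalResults) {
            this.records = records;
            this.totalResults = totalResults;
        }
    }

    // Requests pages until the provider is exhausted or the caller's limit is reached.
    static List<String> retrieve(BiFunction<Integer, Integer, Page> fetch, int resultsPerPage, int limit) {
        List<String> out = new ArrayList<>();
        int startIndex = 1; // OpenSearch result indices start at 1 by default
        while (limit < 0 || out.size() < limit) {
            int count = limit < 0 ? resultsPerPage : Math.min(resultsPerPage, limit - out.size());
            Page page = fetch.apply(startIndex, count);
            if (page.records.isEmpty()) {
                break; // nothing more to return
            }
            List<String> got = page.records;
            if (limit >= 0 && got.size() > limit - out.size()) {
                got = got.subList(0, limit - out.size()); // provider returned more than requested
            }
            out.addAll(got);
            startIndex += page.records.size();
            if (page.totalResults >= 0 && startIndex > page.totalResults) {
                break; // the reported total has been consumed
            }
        }
        return out;
    }

    // In-memory provider with a fixed number of records, for trying the loop out.
    static BiFunction<Integer, Integer, Page> fakeProvider(int total) {
        return (start, count) -> {
            List<String> recs = new ArrayList<>();
            for (int i = start; i < start + count && i <= total; i++) {
                recs.add("record-" + i);
            }
            return new Page(recs, total);
        };
    }
}
```

With a fake provider holding 250 records and 100 results per page, a limit of 300 yields all 250 records after three requests, while a limit of 120 yields exactly 120 records after two.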

In the case of resources which return brokered results, the operator first retrieves the endpoints of the underlying brokered OpenSearch providers and reads their corresponding OpenSearch Resources so as to be able to retrieve the actual search results from them, either sequentially or concurrently. The extraction of brokered provider endpoints is not explicitly shown in the diagram. Furthermore, if an OpenSearch Resource structure is missing for one or more of the brokered services, the operator skips each such service and continues with the retrieval of results from the next available brokered service. The same holds if all query formulation attempts for a provider fail.

Configurable Parameters

The OpenSearch Operator can be programmatically configured by passing to it a special configuration construct upon creation. The configuration parameters are the following:

  • The resultsPerPage parameter instructs the operator as to how many results per page should be requested when no other paging restrictions are in effect. The default value for this parameter currently is 100.
  • The sequentialResults parameter enables or disables sequential result retrieval from brokered providers; when it is disabled, results are retrieved by multiple threads. When enabled, the results are retrieved from each provider in a sequential manner, i.e. the results retrieved from different providers are not intermingled. There is, however, a negative impact on performance. The default value for this configuration parameter currently is false.
  • The useLocalResource parameter, when enabled, permits the operator to operate in the absence of an IS. The OpenSearch resources are instead retrieved from the local file system. It is used solely for testing, which is why its default value is, and will remain, false.

An additional configurable element is the set of mappings from query namespaces to the corresponding factories, as described in Library Extensibility. The sequentialResults parameter can also be configured on a per-query basis, by including it in the query string as a query parameter.

Query Format

The OpenSearch operator expects to receive all query parameters, including the search terms, in a single query string. All query parameters should be of the form

<URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"

and should be space-delimited. Note that the presence of a namespace is mandatory for standard OpenSearch parameters as well. Any free-text parameter value should be URL-encoded.

The reserved keyword config when used as a parameter namespace denotes a configuration parameter. The query configuration parameters under the special configuration namespace include the sequentialResults parameter described in Configurable Parameters, plus the numOfResults parameter, which can be used to impose a limit on the number of retrieved results. These two query configuration parameters are optional. The following hold for the query configuration parameter values

  • The sequentialResults parameter should be assigned a value equal to true or false. Its absence implies the default value of the corresponding configurable parameter of the operator.
  • The numOfResults parameter should be assigned an integral value, its absence implying that all available results should be retrieved.

Taking everything into account, an example of a legitimate query for the OpenSearch Operator could be the following:

http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms="Hello+World" config:numOfResults="300"

which instructs the operator to use the string Hello World as the value for the searchTerms standard OpenSearch parameter and to retrieve up to 300 results from the provider.
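The query format can be parsed along the following lines. This is a sketch under assumptions, not the operator's parser: the OperatorQueryParser class and the "{namespace}name" key notation are made up for the example, parameter names are assumed to consist of word characters, and values are decoded here only to show the effect, whereas the library itself expects URL-encoded values.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the space-delimited operator query format.
class OperatorQueryParser {
    // <namespace>:<name>="<value>"; the namespace is URL-encoded, so it holds no raw ':' or space.
    private static final Pattern PARAM = Pattern.compile("(\\S+?):(\\w+)=\"([^\"]*)\"");

    // Returns "{namespace}name" keys mapped to decoded values, in order of appearance.
    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        Matcher m = PARAM.matcher(query);
        while (m.find()) {
            String namespace = URLDecoder.decode(m.group(1), StandardCharsets.UTF_8);
            String value = URLDecoder.decode(m.group(3), StandardCharsets.UTF_8);
            params.put("{" + namespace + "}" + m.group(2), value);
        }
        return params;
    }

    public static void main(String[] args) {
        String query = "http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms=\"Hello+World\""
                + " config:numOfResults=\"300\"";
        // Prints the two parsed parameters, one per line.
        parse(query).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

Parameters in the reserved config namespace appear under keys such as {config}numOfResults, which makes it easy to separate them from the parameters that are forwarded to the provider.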

The OpenSearch Service

Description

The OpenSearch service is a stateful service responsible for the invocation of the OpenSearch Operator in the context of the provider to be queried. It also maintains a cache per provider WS-Resource, which contains the Generic Resources relevant to the top provider, the Generic Resources of all previously queried brokered providers and the corresponding Description Documents.

WS and Generic Resource Interrelation

Provided that a Content/Metadata Collection pair for the provider to be queried is available, the OpenSearch Service uses a WS-Resource for each OpenSearch provider in order to bind the Service to the Metadata Collection corresponding to the provider.

An OpenSearch Service WS-Resource contains the following properties

  • The Metadata Collection ID of the collection to be used
  • The templates published in the Description Document of the provider. These are extracted using a method supplied by the General-Purpose OpenSearch Library
  • The ID of the Generic Resource of the top-provider, where the top-provider is the broker in the brokered case or the one and only direct provider in the direct case.
  • The URI of the Description Document of the top-provider.
  • A set of Fixed Parameters, which are used in every invocation of the Operator. See also Extensibility Points.

As mentioned above, the WS-Resource contains a reference only to the OpenSearchResource of the top-provider. The Generic Resources of any providers reached through a broker are retrieved through an Information System and are therefore not directly referenced by the WS-Resource. The same holds for all OpenSearchXSLT generic resources.

Some properties are dependent on information residing in the Generic Resources describing the providers and should, therefore, be updated accordingly when these Generic Resources are modified. Examples of such properties include the Description Document URI and the templates.

Resource Caching

For performance and reliability reasons, the OpenSearch Service maintains one cache per WS-Resource which initially contains the Generic Resources (of both OpenSearchResource and OpenSearchXSLT types) and the Description Document of the top-provider. In the brokered case, the cache is updated at run-time by the relevant OpenSearch Operator module with the Generic Resources and Description Documents of all providers reached through the broker.

To account for potential updates of Description Documents and/or the Generic Resources, the cache can be refreshed either on demand or periodically based on a configurable time interval. The periodic refresh operation can be disabled if the Description Documents and the Generic Resource configuration are considered to be stable enough.

The cache refresh cycle policy used is described as follows:

  • The Description Document and Generic Resources of the top-provider are discarded and their updated versions are retrieved from the external site and the Information System respectively. All dependent WS-Resource properties are also updated accordingly.
  • The Description Documents and Generic Resources of all brokered providers, if any, are discarded. Because of the potentially large number of brokered providers, the cache is not repopulated with their updated versions, to avoid locking the cache for long periods of time or using a partially updated cache. These pieces of information are re-cached at run-time instead.
  • In the event of failure, the previously cached version is kept.
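The refresh policy can be sketched as below. The ProviderCacheSketch class is hypothetical and reduces every cached document to a string. It also assumes that a failed retrieval for the top-provider leaves the whole cache untouched, brokered entries included, which is one possible reading of the last rule.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical per-WS-Resource cache; strings stand in for documents and Generic Resources.
class ProviderCacheSketch {
    private String topProvider;
    private final Map<String, String> brokered = new HashMap<>();

    ProviderCacheSketch(String initialTopProvider) {
        this.topProvider = initialTopProvider;
    }

    // Called at run-time when a brokered provider is reached for the first time.
    synchronized void cacheBrokered(String providerId, String resources) {
        brokered.put(providerId, resources);
    }

    synchronized String topProvider() {
        return topProvider;
    }

    synchronized int brokeredCount() {
        return brokered.size();
    }

    // Runs one refresh cycle; returns false if the retrieval failed and nothing was changed.
    synchronized boolean refresh(Supplier<String> fetchTopProvider) {
        String fresh;
        try {
            fresh = fetchTopProvider.get();
        } catch (RuntimeException e) {
            return false; // failure: the previously cached version is kept
        }
        topProvider = fresh; // the top-provider entry is replaced immediately
        brokered.clear();    // brokered entries are dropped and re-cached lazily
        return true;
    }
}
```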

Operations

The operations exposed by the OpenSearch Service are the following:

  • The query operation, with a single input message containing the query string to be sent to the operator, whose format is described in Query Format.
  • The refreshCache operation, which sends a request in order to force the cache of the Service to be refreshed. No refresh cycle will be initiated if a periodic refresh cycle is currently in progress.

Configurable Parameters

The Service currently supports two configurable parameters, which are exposed in its deployment descriptor:

  • The clearCacheOnStartup parameter, of boolean type, when enabled instructs the service to discard the stored cache on startup.
  • The cacheRefreshIntervalMillis parameter, of integral type, defines a time interval for the periodic cache refresh operation. The value 0 can be used to disable periodic cache refresh cycles.