OpenSearch Framework
Description
The role of the gCube OpenSearch Framework is to enable the gCube Framework to access external providers which publish their results through search engines conforming to the OpenSearch Specification. The framework consists of two components:
- The OpenSearch Library, which includes a general-purpose library and the OpenSearch Operator which utilizes functionality provided by the former, and
- The OpenSearch Service, which binds collections with provider-specific information encapsulated in generic resources and invokes the OpenSearch Operator.
To resolve ambiguity, the name "OpenSearch Library" will be used when referring to the whole OpenSearch Library component, whereas the name "General-Purpose Library" will be used when referring to the library constituent of the component.
The OpenSearch Library
The General-Purpose OpenSearch Library
Description
The General-Purpose OpenSearch Library conforms to the latest OpenSearch specification and provides general OpenSearch-related functionality to any component which needs to query OpenSearch providers. It can optionally be extended, as described in the Library Extensibility section, to support OpenSearch Extensions whose parameters or other elements need special handling. The OpenSearch Operator, described in a following section, functions atop this library.
Functionality
The central class for exploiting the functionality provided by the library is the DescriptionDocument class. For reasons explained in the following section, the DescriptionDocument class needs to be provided with a pair of factory classes, URLElementFactory and QueryElementFactory. Provided that the query parameter namespaces present in the query string have been extracted and a namespace-to-factory mapping is available, this pair can be obtained from the FactoryResolver class as follows:
FactoryPair factories = FactoryResolver.getFactories(queryNamespaces, factoryMapping);
The DescriptionDocument is then instantiated as follows:
DescriptionDocument dd = new DescriptionDocument(descriptionDocumentXML, factories.urlElFactory, factories.queryElFactory);
where the descriptionDocumentXML parameter corresponds to a DOM Document object containing the parsed Description Document.
Properly instantiated, the DescriptionDocument class can provide any information relevant to the processed Description Document, as well as a mechanism to formulate search queries to send to the OpenSearch provider described by the Description Document. The latter is achieved by a QueryBuilder object, which can be obtained as follows:
List<QueryBuilder> qbs = dd.getQueryBuilders(rel, MimeType);
where rel is a rel value as described in the OpenSearch Specification, e.g. results, and MimeType is a MIME type, such as application/rss+xml. The returned list contains one QueryBuilder object for each template contained in a URL Element with the specified rel and type attributes.
Once the desired QueryBuilder is selected, it can be used to formulate a query by first assigning values to the parameters and then obtaining the constructed query.
For example, the searchTerms parameter can be set to some value as follows:
qb.setParameter(OpenSearchConstants.searchTermsQName, searchTerms);
Once all the required parameters are set, the constructed query can be obtained as follows:
URL query;
try {
    query = qb.getQuery();
} catch (IncompleteQueryException iqe) {
    // Incomplete query exception handling
} catch (MalformedQueryException mqe) {
    // Malformed query exception handling
}
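To make the parameter-filling step more concrete, the following is a minimal sketch of OpenSearch 1.1 template substitution in plain Java. It is not the library's implementation: the TemplateFiller class, its fill method and the example template are hypothetical names introduced here, and an IllegalStateException merely plays the role that IncompleteQueryException has in the library. The sketch assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a gCube class): fills an OpenSearch 1.1 URL template. */
final class TemplateFiller {
    // Matches {name} and {name?}; the trailing '?' marks an optional parameter.
    private static final Pattern PARAM = Pattern.compile("\\{([^{}?]+)(\\?)?\\}");

    static String fill(String template, Map<String, String> values) {
        Matcher m = PARAM.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            boolean optional = m.group(2) != null;
            String value = values.get(name);
            if (value == null) {
                if (!optional) {
                    // Stands in for the incomplete-query failure described above.
                    throw new IllegalStateException("Missing required parameter: " + name);
                }
                value = ""; // unset optional parameters become the empty string
            }
            String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
            m.appendReplacement(out, Matcher.quoteReplacement(encoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up template; only searchTerms is required.
        String template = "http://example.com/search?q={searchTerms}&pw={startPage?}";
        System.out.println(fill(template, Map.of("searchTerms", "Hello World")));
    }
}
```

A required parameter that has no value aborts query construction, whereas an optional one (marked with a trailing question mark in the template) is simply left empty.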
Once the query is properly constructed and available, it can be sent to the search engine of the provider in order to retrieve results. The returned results should be passed to either HTMLResponse or XMLResponse, depending on the MIME type of the OpenSearch response, in order for the OpenSearch Response Elements and any other available information contained in the response to be processed.
InputStream responseStream = query.openConnection().getInputStream();
OpenSearchResponse response = new XMLResponse(responseStream, factories.queryElFactory, qb, outputEncoding, dd.getURIToPrefixMappings());
The raw XML data can then be obtained from the OpenSearchResponse object as follows:
response.getResponse();
and any information available, mainly relevant to paging, can be obtained through the methods of the OpenSearchResponse class. For example, the total number of results, as reported by the totalResults Response Element, if present, can be obtained as follows:
response.getTotalResults();
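For readers unfamiliar with OpenSearch Response Elements, the sketch below shows one way the totalResults element could be read from an XML response using only the standard Java XML APIs. It does not reproduce the internals of XMLResponse; the TotalResultsReader class is a hypothetical name, and returning -1 for a missing element is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not a gCube class): extracts opensearch:totalResults. */
final class TotalResultsReader {
    static final String OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/";

    /** Returns the reported total, or -1 if the element is absent (an assumed convention). */
    static long readTotalResults(InputStream xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for namespace-based lookup
            Document doc = dbf.newDocumentBuilder().parse(xml);
            NodeList nodes = doc.getElementsByTagNameNS(OPENSEARCH_NS, "totalResults");
            if (nodes.getLength() == 0) {
                return -1;
            }
            return Long.parseLong(nodes.item(0).getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalArgumentException("Unparseable response", e);
        }
    }

    public static void main(String[] args) {
        // Made-up response fragment for illustration.
        String rss = "<rss version=\"2.0\" xmlns:opensearch=\"" + OPENSEARCH_NS + "\">"
                + "<channel><opensearch:totalResults>4230</opensearch:totalResults></channel></rss>";
        InputStream in = new ByteArrayInputStream(rss.getBytes(StandardCharsets.UTF_8));
        System.out.println(readTotalResults(in));
    }
}
```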
Library Extensibility
The OpenSearch Operator
Description
The role of the OpenSearch operator is to provide support for querying and retrieval of search results via OpenSearch from providers which expose an OpenSearch description document. The operator accepts a query string consisting of a set of query parameters, which may include a number of search terms, and an OpenSearch Resource reference, which contains the URL of an OpenSearch description document and various specifications relevant to the OpenSearch provider to be queried. After performing the number of OpenSearch queries required to obtain the desired results, it returns these results wrapped in a ResultSet.
Extensibility Points
The operator introduces and makes use of a set of functionalities beyond those of the standard OpenSearch specification. These extensions are supported by the introduction of a special OpenSearch Resource structure and by the internal logic of the operator, the latter using standard OpenSearch functionality provided by the general-purpose OpenSearch library. The extra functionalities are summarized as follows:
- The support of data transformation by the operator. Provided that a transformation specification, in the form of an XPath-XSLT pair, is available for one of the MIME types of the results returned by an OpenSearch-enabled provider, the operator is able to return the obtained results transformed to the desired schema. There is also provision for tagging each record with a unique identifier extracted from the results and described by an additional optional XPath expression.
- Both direct and brokered result processing is supported. Some OpenSearch-enabled providers diverge from the common case of returning a set of direct results and instead provide their results indirectly, by returning a set of links to other OpenSearch-enabled providers. Provided that both a transformation specification used to extract these links from the returned results as well as the OpenSearch resources for each one of the brokered OpenSearch services are available, the operator will return the full set of results provided by the brokered OpenSearch services.
- The support of a set of fixed parameters, which override the user-provided parameters only at the level of the top provider, i.e. either the broker or, in the direct case, the only direct provider. These parameters serve two purposes, separately or together: to facilitate the creation of dynamic collections from results obtained by brokers, by taking the fixed parameters into account while querying the broker and only the user-defined parameters on lower levels, and to customize the behaviour of a provider to the needs of the gCube Framework.
- Support for one or more security schemes is planned for a subsequent version of the OpenSearch Library.
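The fixed-parameter rule listed above can be illustrated with a small sketch. The FixedParameters class and its effective method are hypothetical names, and representing parameters as plain string maps is an assumption; the operator's actual data structures are not described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a gCube class): applies fixed parameters at the top provider only. */
final class FixedParameters {
    static Map<String, String> effective(Map<String, String> user,
                                         Map<String, String> fixed,
                                         boolean topProvider) {
        Map<String, String> merged = new LinkedHashMap<>(user);
        if (topProvider) {
            merged.putAll(fixed); // fixed values win over user values on name clashes
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> user = Map.of("searchTerms", "fish");
        Map<String, String> fixed = Map.of("searchTerms", "tuna");
        System.out.println(effective(user, fixed, true));  // query sent to the broker
        System.out.println(effective(user, fixed, false)); // query sent to a brokered provider
    }
}
```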
OpenSearch Resource
The purpose of an OpenSearch resource object is to describe the specifications of an OpenSearch provider. It encapsulates the extensions described in the Extensibility Points section. The attributes included are the following:
- The name of the resource
- The URL of the OpenSearch Description Document of the provider to be queried
- Information about whether the provider returns direct or brokered results, used by the operator to adapt its operation to both kinds of providers.
- Data transformation specifications for a subset of the MIME types of the results which the result provider returns. The data transformation consists of two or, optionally, three parts:
  - The RecordSplitXPath expression is used to split a page of search results into individual records. For example, for the RSS format, the <item> elements under rss/channel could be of interest.
  - The XSLTLink contains a pointer to an XSLT which is used to transform the individual records to the target schema.
  - The optional RecordIdXPath expression can be used to tag each record with a unique identifier, extracted from the record itself. If the element is found, its payload is added as a DocID record attribute. Otherwise, if the element is not found or is empty, no DocID attribute is added to the record.
- Security specifications (planned for a future version, when the supported security specifications are decided on). This element is optional, its absence implying the absence of a security scheme.
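The two XPath parts of a transformation specification can be sketched with the standard Java XPath API as follows. This is an illustration only, not the operator's code: the RecordSplitter class is a hypothetical name, the XSLT step is omitted, and using null to signal "no DocID" is an assumption of the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not a gCube class): splits a result page and extracts record ids. */
final class RecordSplitter {
    /** Returns one entry per record: its id, or null where the id element is missing or empty. */
    static List<String> splitAndIdentify(String pageXml, String recordSplitXPath, String recordIdXPath) {
        try {
            Document page = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pageXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList records = (NodeList) xpath.evaluate(recordSplitXPath, page, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                // The id expression is evaluated relative to the record itself.
                String id = xpath.evaluate(recordIdXPath, record).trim();
                ids.add(id.isEmpty() ? null : id);
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not split page", e);
        }
    }

    public static void main(String[] args) {
        // Made-up RSS page: the second item carries no guid, so it gets no DocID.
        String page = "<rss><channel>"
                + "<item><guid>a1</guid><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(splitAndIdentify(page, "/rss/channel/item", "guid"));
    }
}
```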
The serialization of an OpenSearch Resource can be easily incorporated into a Generic Resource. In fact, the default mode of operation of the OpenSearch Operator obtains the necessary OpenSearch resources by retrieving the corresponding Generic Resources from the IS. There are two types of Generic Resources utilized by the OpenSearch Operator:
- The OpenSearchResource which contains the body of the OpenSearch Resource as described below
- The OpenSearchXSLT which contains the XSLT portion of a transformation specification
The XSLT pointer, in this case, contains the name of the OpenSearchXSLT generic resource it points to.
Note that, solely for testing purposes, the OpenSearch Operator also supports a local mode of operation, whereby all OpenSearch Resources are loaded from the local file system. In that case, the XSLTLink element contains a URL pointing to the corresponding XSLT file.
The XML Schema that all OpenSearch Resource serializations should conform to is the following:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="OpenSearchResource">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="descriptionDocumentURI" type="xs:string"/>
        <xs:element name="brokeredResults" type="xs:boolean"/>
        <xs:element name="transformation" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="MIMEType" type="xs:string"/>
              <xs:element name="recordSplitXPath" type="xs:string"/>
              <xs:element name="recordIdXPath" type="xs:string" minOccurs="0" maxOccurs="1"/>
              <xs:element name="XSLTLink" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="security" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
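As an illustration of what a conforming serialization might look like, the sketch below validates a made-up OpenSearch Resource against the schema using the standard javax.xml.validation API. Every value in the example resource (provider name, URLs, XPath expressions, XSLT name) is hypothetical, as is the OpenSearchResourceValidator class itself; the schema is the one given above, compacted into a string.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper (not a gCube class): checks a resource against the schema. */
final class OpenSearchResourceValidator {
    // The schema from the text, compacted; attribute quotes changed to single quotes.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='OpenSearchResource'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='descriptionDocumentURI' type='xs:string'/>"
      + "<xs:element name='brokeredResults' type='xs:boolean'/>"
      + "<xs:element name='transformation' maxOccurs='unbounded'><xs:complexType><xs:sequence>"
      + "<xs:element name='MIMEType' type='xs:string'/>"
      + "<xs:element name='recordSplitXPath' type='xs:string'/>"
      + "<xs:element name='recordIdXPath' type='xs:string' minOccurs='0' maxOccurs='1'/>"
      + "<xs:element name='XSLTLink' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "<xs:element name='security' minOccurs='0'><xs:complexType><xs:sequence/>"
      + "</xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Hypothetical direct provider returning RSS; all values are invented.
    static final String EXAMPLE =
        "<OpenSearchResource>"
      + "<name>ExampleProvider</name>"
      + "<descriptionDocumentURI>http://example.org/opensearch.xml</descriptionDocumentURI>"
      + "<brokeredResults>false</brokeredResults>"
      + "<transformation>"
      + "<MIMEType>application/rss+xml</MIMEType>"
      + "<recordSplitXPath>/rss/channel/item</recordSplitXPath>"
      + "<recordIdXPath>guid</recordIdXPath>"
      + "<XSLTLink>ExampleProviderXSLT</XSLTLink>"
      + "</transformation>"
      + "</OpenSearchResource>";

    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(SCHEMA)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false; // not well-formed or not schema-valid
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(EXAMPLE));
    }
}
```

Note that the child elements must appear in the order given by the schema, since both the resource and the transformation are declared as sequences.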
The transformation element can appear multiple times within an OpenSearch Resource. The usual case is for a single transformation element per provider to be specified, but if transformation elements are present for more than one MIME type, the operator has the alternative of resorting to the next available transformation in sequence, if the result retrieval procedure fails for some reason. This strategy can only be meaningful if the same amount of information can be obtained from different result MIME types.
In the case of querying providers which return brokered results, the transformation element is used to specify a data transformation that extracts the URLs of the Description Documents of the brokered OpenSearch services from the initial results provided by the OpenSearch service acting as a broker.
OpenSearch Operator Logic
The OpenSearch Operator employs the functionality provided by the OpenSearch Library in order to extract the required information from the Description Document of the external provider and to obtain the QueryBuilders needed to perform queries, in a fashion similar to that described in the OpenSearch Library Functionality section. To select the proper way of constructing queries, the Operator performs a preprocessing step before issuing any actual queries. Given that an OpenSearch Resource can contain more than one transformation specification and that a URL Element is not necessarily limited to a single template, there can be more than one potential template match for the caller's query. The purpose of the preprocessing step is to select the QueryBuilder whose parameters best match the caller's query. First, all MIME types supported by the external provider but lacking an associated transformation specification in the OpenSearch Resource describing the provider are discarded. The QueryBuilders of the first available MIME type for which a transformation specification exists are then processed and reordered according to the following rules:
- QueryBuilders lacking a parameter contained in the caller's query are discarded
- QueryBuilders are reordered according to the number of query parameters
The most usual case is for the provider's OpenSearch Resource to be set up with a single transformation specification for the MIME type of interest. Furthermore, most URL Elements provide a single template for each MIME type. This case results in only one QueryBuilder being available to construct queries.
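The two preprocessing rules can be sketched in a few lines of Java. The sketch is hedged in two ways: Candidate is a hypothetical stand-in for a QueryBuilder that exposes only a template and its parameter names, and the sort direction (fewest parameters first) is an assumption, since the text states only that candidates are reordered by number of query parameters.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not a gCube class): filters and orders candidate templates. */
final class TemplateSelector {
    /** Stand-in for a QueryBuilder: a template and the parameter names it declares. */
    static final class Candidate {
        final String template;
        final Set<String> parameters;

        Candidate(String template, Set<String> parameters) {
            this.template = template;
            this.parameters = parameters;
        }
    }

    static List<Candidate> select(List<Candidate> candidates, Set<String> callerParameters) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            // Rule 1: discard candidates lacking a parameter used in the caller's query.
            if (c.parameters.containsAll(callerParameters)) {
                kept.add(c);
            }
        }
        // Rule 2: order by parameter count; ascending order is an assumption of this sketch.
        kept.sort(Comparator.comparingInt((Candidate c) -> c.parameters.size()));
        return kept;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("t1", Set.of("searchTerms", "startPage", "count")),
                new Candidate("t2", Set.of("startPage")),
                new Candidate("t3", Set.of("searchTerms")));
        for (Candidate c : select(candidates, Set.of("searchTerms"))) {
            System.out.println(c.template);
        }
    }
}
```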
The functions performed by the operator in order for a set of results to be retrieved are summarized in the following simplified diagram
As shown, the operator accepts a set of query terms and a set of query parameters.
The operator's main course of action is to formulate and send queries requesting pages of search results for as long as there are still results to be returned and the caller's requirement on the number of results, if present, has not been met. A pager component ensures that page switching is performed correctly, managing the relevant standard OpenSearch query parameters, namely startPage or startIndex, and count. These parameters are therefore abstracted away by the OpenSearch Operator.
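The paging loop described above might look roughly like the following sketch, which pages by startIndex and count. It is not the operator's pager: the Pager class is a hypothetical name, the fetchPage function stands in for issuing one OpenSearch query, and a negative numOfResults is used here, by assumption, to mean "no limit". The starting index of 1 follows the default index offset of the OpenSearch 1.1 specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical helper (not a gCube class): retrieves results page by page. */
final class Pager {
    /**
     * @param fetchPage      takes (startIndex, count) and returns that page; stands in for one query
     * @param resultsPerPage page size to request when no tighter limit applies
     * @param numOfResults   maximum number of results, or a negative value for no limit
     */
    static List<String> retrieve(BiFunction<Integer, Integer, List<String>> fetchPage,
                                 int resultsPerPage, int numOfResults) {
        List<String> results = new ArrayList<>();
        int startIndex = 1; // OpenSearch indices start at 1 by default
        while (numOfResults < 0 || results.size() < numOfResults) {
            int remaining = numOfResults < 0 ? Integer.MAX_VALUE : numOfResults - results.size();
            int count = Math.min(resultsPerPage, remaining);
            List<String> page = fetchPage.apply(startIndex, count);
            if (page.isEmpty()) {
                break; // the provider has no more results
            }
            // Guard against providers that return more than was asked for.
            results.addAll(page.subList(0, Math.min(page.size(), remaining)));
            startIndex += page.size();
        }
        return results;
    }

    public static void main(String[] args) {
        // Simulated provider holding 250 results.
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            data.add("result-" + i);
        }
        BiFunction<Integer, Integer, List<String>> provider = (start, count) -> {
            int from = Math.min(start - 1, data.size());
            return data.subList(from, Math.min(from + count, data.size()));
        };
        System.out.println(retrieve(provider, 100, 120).size());
    }
}
```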
In the case of resources which return brokered results, the operator first retrieves the endpoints of the underlying brokered OpenSearch providers and reads their corresponding OpenSearch Resources so as to be able to loop through these resources while retrieving results. This function is implied in the diagram. Furthermore, if an OpenSearch Resource structure is missing for one or more of the brokered services, the operator skips each such service and continues with the retrieval of results from the next available brokered service. The same holds if all query formulation attempts for a provider fail.
Configurable Parameters
The OpenSearch Operator can be programmatically configured by passing to it a special configuration construct upon creation. The configuration parameters are the following:
- The resultsPerPage parameter instructs the operator as to how many results per page should be requested when no other paging restrictions are in effect. The default value for this parameter currently is 100.
- The sequentialResults parameter disables or enables multi-threaded result retrieval from brokered providers. When enabled, the results are retrieved from each provider in a sequential manner, i.e. the results retrieved from different providers are not intermingled. There is, however, a negative impact on performance. The default value for this configuration parameter currently is false.
- The useLocalResource parameter, when enabled, permits the operator to operate in the absence of an IS. The OpenSearch resources are instead retrieved from the local file system. It is used solely for testing purposes, which is why its default value is, and will remain, false.
An additional configurable element is the set of mappings from query namespaces to the corresponding factories, as described in Library Extensibility. The sequentialResults parameter can also be configured on a per-query basis by including it in the query string as a query parameter.
Query Format
The OpenSearch operator expects to receive all query parameters, including the search terms, in a single query string. All query parameters should be of the form
<URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"
and should be space-delimited. Note that the presence of a namespace is mandatory for standard OpenSearch parameters as well. Any free-text parameter value should be URL-encoded.
The reserved keyword config, when used as a parameter namespace, denotes a configuration parameter. The query configuration parameters under the special configuration namespace include the sequentialResults parameter described in Configurable Parameters, plus the numOfResults parameter, which can be used to impose a limit on the number of retrieved results. These two query configuration parameters are optional.
The following hold for the query configuration parameter values:
- The sequentialResults parameter should be assigned a value equal to true or false. Its absence implies the default value of the corresponding configurable parameter of the operator.
- The numOfResults parameter should be assigned an integral value, its absence implying that all available results should be retrieved.
Taking everything into account, an example of a legitimate query for the OpenSearch Operator could be the following:
http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms="Hello+World" config:numOfResults="300"
which instructs the operator to use the string Hello World as the value for the searchTerms standard OpenSearch parameter and to retrieve up to 300 results from the provider.
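To show how such a query string decomposes, the following sketch parses it into namespaces, parameter names and decoded values. It is an illustration of the format only, not the operator's parser: the OperatorQueryParser class is a hypothetical name, and the regular expression relies on the assumption that namespaces and values are URL-encoded, so that the only raw colon in a parameter is the namespace separator. It assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a gCube class): parses the operator's query string format. */
final class OperatorQueryParser {
    // <URL-encoded namespace>:<name>="<URL-encoded value>"
    private static final Pattern PARAM =
            Pattern.compile("([^\\s:]+):([^\\s:=]+)=\"([^\"]*)\"");

    /** Returns namespace -> (parameter name -> decoded value), in order of appearance. */
    static Map<String, Map<String, String>> parse(String query) {
        Map<String, Map<String, String>> byNamespace = new LinkedHashMap<>();
        Matcher m = PARAM.matcher(query);
        while (m.find()) {
            String namespace = URLDecoder.decode(m.group(1), StandardCharsets.UTF_8);
            String value = URLDecoder.decode(m.group(3), StandardCharsets.UTF_8);
            byNamespace.computeIfAbsent(namespace, k -> new LinkedHashMap<>())
                       .put(m.group(2), value);
        }
        return byNamespace;
    }

    public static void main(String[] args) {
        String query = "http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms=\"Hello+World\""
                + " config:numOfResults=\"300\"";
        System.out.println(parse(query));
    }
}
```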
The OpenSearch Service
Description
The OpenSearch service is a stateful service responsible for the invocation of the OpenSearch Operator in the context of the provider to be queried. It also maintains a cache per provider WS-Resource, which contains the Generic Resources relevant to the top provider, the Generic Resources of all previously queried brokered providers and the corresponding Description Documents.
WS and Generic Resource Interrelation
Provided that a Content/Metadata Collection pair for the provider to be queried is available, the OpenSearch Service uses a WS-Resource for each OpenSearch provider in order to bind the Service to the Metadata Collection corresponding to the provider.
An OpenSearch Service WS-Resource contains the following properties:
- The Metadata Collection ID of the collection to be used
- The templates published in the Description Document of the provider. These are extracted using a method supplied by the General-Purpose OpenSearch Library
- The ID of the Generic Resource of the top-provider, where the top-provider is the broker in the brokered case or the one and only direct provider in the direct case.
- The URI of the Description Document of the top-provider.
- A set of Fixed Parameters, which are used in every invocation of the Operator. See also Extensibility Points.
As mentioned above, the WS-Resource contains a reference only to the OpenSearchResource of the top-provider. The Generic Resources of any providers reached through a broker are retrieved through an Information System and are therefore not directly referenced by the WS-Resource. The same holds for all OpenSearchXSLT generic resources.
Some properties are dependent on information residing in the Generic Resources describing the providers and should, therefore, be updated accordingly when these Generic Resources are modified. Examples of such properties include the Description Document URI and the templates.
Resource Caching
For performance and reliability reasons, the OpenSearch Service maintains one cache per WS-Resource which initially contains the Generic Resources (of both OpenSearchResource and OpenSearchXSLT types) and the Description Document of the top-provider. In the brokered case, the cache is updated at run-time by the relevant OpenSearch Operator module with the Generic Resources and Description Documents of all providers reached through the broker.
To account for potential updates of Description Documents and/or the Generic Resources, the cache can be refreshed either on demand or periodically, based on a configurable time interval. The periodic refresh operation can be disabled if the Description Documents and the Generic Resource configuration are considered to be stable enough.
The cache refresh cycle policy used is described as follows:
- The Description Document and Generic Resources of the top-provider are discarded and their updated versions are retrieved from the external site and the Information System respectively. All dependent WS-Resource properties are also updated accordingly.
- The Description Documents and Generic Resources of all brokered providers, if any, are discarded. Because of the potentially large number of brokered providers, the cache is not repopulated with their updated versions to avoid locking the cache for large periods of time or using a partially updated cache. These pieces of information are re-cached at run-time instead.
- In the event of failure, the previously cached version is kept.
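The refresh policy above can be summarized in a short sketch. It is deliberately simplified and rests on stated assumptions: ProviderCache is a hypothetical class, a single string stands in for the Description Document and Generic Resources of a provider, and on failure the whole cache, including brokered entries, is left untouched (the text says only that the previously cached version is kept).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper (not a gCube class): models one cache refresh cycle. */
final class ProviderCache {
    private String topProviderEntry; // stands in for Description Document + Generic Resources
    private final Map<String, String> brokeredEntries = new HashMap<>();

    ProviderCache(String initialTopProviderEntry) {
        this.topProviderEntry = initialTopProviderEntry;
    }

    synchronized String top() {
        return topProviderEntry;
    }

    synchronized int brokeredCount() {
        return brokeredEntries.size();
    }

    /** Called at run-time as brokered providers are reached. */
    synchronized void cacheBrokered(String providerId, String entry) {
        brokeredEntries.put(providerId, entry);
    }

    /**
     * One refresh cycle. fetchTopProvider stands in for retrieval from the external
     * site and the Information System; it signals failure by throwing.
     */
    synchronized boolean refresh(Supplier<String> fetchTopProvider) {
        String fresh;
        try {
            fresh = fetchTopProvider.get();
        } catch (RuntimeException e) {
            return false; // failure: keep the previously cached version
        }
        topProviderEntry = fresh;
        brokeredEntries.clear(); // brokered entries are re-cached lazily at run-time
        return true;
    }

    public static void main(String[] args) {
        ProviderCache cache = new ProviderCache("v1");
        cache.cacheBrokered("brokered-1", "resources");
        System.out.println(cache.refresh(() -> "v2") + " " + cache.top() + " " + cache.brokeredCount());
    }
}
```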
Operations
The operations exposed by the OpenSearch Service are the following:
- The query operation, with a single input message containing the query string to be sent to the operator, whose format is described in Query Format.
- The refreshCache operation, which sends a request in order to force the cache of the Service to be refreshed. No refresh cycle will be initiated if a periodic refresh cycle is currently in progress.
Configurable Parameters
The Service currently supports two configurable parameters, which are exposed in its deployment descriptor:
- The clearCacheOnStartup parameter, of boolean type, when enabled instructs the service to discard the stored cache on startup.
- The cacheRefreshIntervalMillis parameter, of integral type, defines a time interval for the periodic cache refresh operation. The value 0 can be used to disable periodic cache refresh cycles.