Distributed Information Retrieval Support Framework

From Gcube Wiki
Revision as of 18:10, 12 November 2008 by Fabio.simeoni (Talk | contribs) (Masters)


The DIR Master service optimises the evaluation of unstructured queries across two or more autonomous collections, the target collections. In particular, the service offers the following functionalities:

  • collection selection: the identification of the subset of target collections that appear to be the most promising candidates for the evaluation of a given query. The process relies upon goodness criteria and selection criteria: the first are used to rank the target collections from the most promising to the least promising, while the second are used to select collections for query evaluation based on their rank. Depending on the choice of goodness and selection criteria, the process may promote the efficiency or the effectiveness of query evaluation. In the first case, the output of the service is used to limit the costs of query evaluation only to the selected collections. In the second case, the output of the service is used to regulate the number of results retrieved from each collection, and thus to limit the number of irrelevant results that are ultimately presented to the user. The two usage models may well coexist within a single collection selection strategy.
  • result fusion: the integration of the partial results obtained by evaluating the queries against (selections of) the target collections. Typically, the process is one of harmonisation of result scores that have been assigned with respect to different retrieval models and content statistics. Accordingly, the process promotes the effectiveness of query evaluation.
  • collection description: the synthesis and maintenance over time of summary information about the content of target collections, from partial inverted indices and language models to result traces for training or past queries. Collection description is only of indirect interest in the context of query evaluation and thus is not exposed by the Master service. Its output forms the basis upon which collections are ranked before being selected for query evaluation. It may also be required to normalise the scores with which the partial results of query evaluation are finally merged.

Collectively, these functionalities characterise the field known as (unstructured) Distributed Information Retrieval (DIR).

Architecture

In a gCube infrastructure, the service operates within the context of the search framework. Its role is twofold: as a query pre-processor, it is invoked to perform collection selection prior to query evaluation; as a search operator, it is invoked to merge results after the query has been evaluated across the selected target collections.

Within the implementation of the service, collection selection and result fusion are local processes. Collection description, however, relies on FullText Index Management services to localise and maintain statistics about the content of the target collections.

The relationships between the Master service and the services that interact with it are illustrated in the following component diagram:

Masterarchitecture.jpg

Design

The design of the service is distributed across two port-types: the Master and the Factory port-types.

Masterdesign1.jpg

Masters

The Master port-type is stateful, in that it maintains (in memory and on the local file system) summary information about the target collections. Such 'proxies' of the target collections are grouped in (potentially overlapping) collection sets that represent the execution scope of a class of distributed queries. Collection sets are then bound to the port-type interface on a per-request basis, in line with the implied resource pattern of WSRF. In particular, pairing the Master interface with a collection set identifies a WS-Resource referred to as a Master.

Masterdesign2.jpg

Masters are created in response to client requests to the Factory port-type. At the time and within the scope of their creation, a selection of the state of Masters is published in the Information System. In WSRF terminology, these are the Resource Properties that identify the target collections and by which Masters can be discovered by clients.


Adding a collection to the Master requires interactions with WS-Resources and Running Instances of Index Management and Collection Management services. The interactions are entirely based on instantiations of the Handler’s framework included in the gCF, and thus inherit best-effort and caching strategies from it. A successful interaction results in the local availability of a term histogram of the collection, i.e. a dictionary of the stems of the most content-bearing words of the collection content, each annotated with its frequency of occurrence within the collection. The histogram is then ingested in the master index, a local inverted index of the union of the content of all the target collections associated with Masters.
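
The relationship between term histograms and the master index can be illustrated with a minimal Python sketch. It is not the service's implementation: the class name MasterIndex, its methods, and the sample stems and frequencies are all hypothetical, and histograms are assumed to arrive already stemmed and filtered.

```python
from collections import defaultdict


class MasterIndex:
    """Hypothetical inverted index over the term histograms of several collections."""

    def __init__(self):
        # stem -> {collection_id: frequency of the stem within that collection}
        self.postings = defaultdict(dict)

    def ingest(self, collection_id, histogram):
        """Add (or overwrite) the histogram of one target collection."""
        for stem, frequency in histogram.items():
            self.postings[stem][collection_id] = frequency

    def frequency(self, stem, collection_id):
        """Frequency of a stem in one collection, 0 if it was never recorded."""
        return self.postings.get(stem, {}).get(collection_id, 0)

    def collections_with(self, stem):
        """Identifiers of all collections whose histogram contains the stem."""
        return set(self.postings.get(stem, {}))


# Illustrative data only: two collections sharing one stem.
index = MasterIndex()
index.ingest("c1", {"retriev": 12, "index": 5})
index.ingest("c2", {"retriev": 3})
```

Keeping one posting map per stem makes both per-collection and cross-collection statistics cheap to read, which is what ranking and fusion need from the index.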

From the master index, collection content statistics can then be retrieved for the application of selection and fusion strategies. These are identified dynamically on a per-request basis. For example, a collection selection strategy begins with the application of an algorithm to rank the target collections with respect to a client-specified query. This is then followed by the application of a selection criterion, also specified by clients, against the resulting collection ranking. Unlike selection criteria, however, ranking algorithms are identified implicitly, by comparing characteristics of the individual request (e.g. the type of the query and the query terms) with those of prototypical examples of the inputs expected by the available algorithms. This dynamic approach allows the port-type to present an extensible interface and thus: (i) support multiple algorithms at any one time, and (ii) evolve to support more algorithms whilst maintaining backwards compatibility. Similar considerations apply to merging strategies.
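
The implicit identification of ranking algorithms might be sketched as follows. This is only one way to realise the idea, under stated assumptions: the names Prototype and AlgorithmRegistry, and the choice of query type and number of terms as the compared characteristics, are hypothetical and not taken from the service interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Prototype:
    """Hypothetical description of the input a ranking algorithm expects."""
    query_type: str
    min_terms: int = 1

    def matches(self, query_type: str, terms: Sequence[str]) -> bool:
        return query_type == self.query_type and len(terms) >= self.min_terms


@dataclass
class AlgorithmRegistry:
    """Hypothetical registry pairing prototypes with ranking algorithms."""
    entries: List[Tuple[Prototype, Callable]] = field(default_factory=list)

    def register(self, prototype: Prototype, algorithm: Callable) -> None:
        # New algorithms are appended; existing clients are unaffected.
        self.entries.append((prototype, algorithm))

    def resolve(self, query_type: str, terms: Sequence[str]) -> Callable:
        # The first algorithm whose prototype accepts the request wins.
        for prototype, algorithm in self.entries:
            if prototype.matches(query_type, terms):
                return algorithm
        raise LookupError(f"no ranking algorithm accepts a {query_type!r} query")
```

Because clients never name an algorithm, registering a new one changes what the port-type can do without changing its interface, which is the backwards-compatibility property described above.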

In the current version of the service, Masters rank target collections by estimating the likelihood that their content will prove relevant to the information needs that underlie queries. The estimate is based on the collection retrieval inference network (CORI), a collection-level generalisation of the Bayesian Inference Network probabilistic model of retrieval for text documents. In the model, probabilities are based upon statistics that are analogous to term occurrence frequency (tf) and inverse document frequency (idf) in classic document retrieval. In particular, term frequency is replaced with document frequency (df) and inverse document frequency is replaced with inverse collection frequency (icf). One of three selection criteria may then be chosen to return the first n target collections in the ranking. In the TopN criterion, n is an absolute constant. In the BestScores criterion, n is derived with respect to a threshold on the relevance score. In the ResultDistribution criterion, n is derived from a distribution of the number of results to be returned to the user.
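
A minimal Python sketch of CORI-style ranking followed by two of the selection criteria is given below. It uses the belief formula and default constants (50, 150 and a default belief of 0.4) as commonly published for CORI; whether the service uses the same constants is not stated here. The layout of the statistics and all function names are hypothetical. ResultDistribution is left out because its exact derivation is not described above.

```python
import math


def cori_belief(df, cf, cw, avg_cw, num_collections, b=0.4):
    """Belief that a collection satisfies one query term.

    df: documents in the collection containing the term
    cf: collections containing the term
    cw: size of the collection in words; avg_cw: average over all collections
    """
    if df == 0 or cf == 0:
        return b  # only the default belief remains
    t = df / (df + 50 + 150 * cw / avg_cw)  # df component
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)  # icf
    return b + (1 - b) * t * i


def rank_collections(query_terms, stats):
    """stats (assumed layout): {collection_id: {"cw": int, "df": {term: int}}}."""
    if not query_terms:
        raise ValueError("at least one query term is required")
    n = len(stats)
    avg_cw = sum(s["cw"] for s in stats.values()) / n
    scores = {}
    for cid, s in stats.items():
        beliefs = []
        for term in query_terms:
            cf = sum(1 for other in stats.values() if other["df"].get(term, 0) > 0)
            beliefs.append(cori_belief(s["df"].get(term, 0), cf, s["cw"], avg_cw, n))
        scores[cid] = sum(beliefs) / len(beliefs)  # average belief over query terms
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def top_n(ranking, n):
    """TopN: keep the first n collections of the ranking."""
    return ranking[:n]


def best_scores(ranking, threshold):
    """BestScores: keep the collections whose score reaches the threshold."""
    return [(cid, score) for cid, score in ranking if score >= threshold]
```

Separating ranking from selection mirrors the text: the same ranking can feed either criterion, so efficiency-oriented and effectiveness-oriented usage can share one scoring pass.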

As to fusion, the current version of the service guarantees a consistent merging of the result sets that emanate from the target collections in response to the evaluation of a given query. In particular, Masters re-compute the relevance estimates of the documents identified in the result sets using non-heuristic techniques. To do so, Masters rely on: (i) global collection content statistics available in the master index, and (ii) term occurrence statistics for each result (such as the number of terms in the corresponding document and the frequency with which the query terms occur in the document). Effectively, the availability of collection-wide and result-wide statistical information allows the service to consistently re-rank the virtual collection composed of all the documents identified in the result sets.
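
The idea of re-scoring every result against global statistics can be sketched in Python as follows. The scoring function here is a plain length-normalised tf-idf chosen for illustration; the formula actually used by Masters is not given above. The result layout ("id", "length", "tf") and the function names are hypothetical.

```python
import math


def global_score(result, query_terms, global_df, total_docs):
    """Re-compute a relevance estimate from global and per-result statistics.

    result: {"id": str, "length": int, "tf": {term: int}}  (assumed layout)
    global_df: term -> number of documents containing it across all collections
    """
    if result["length"] == 0:
        return 0.0
    score = 0.0
    for term in query_terms:
        tf = result["tf"].get(term, 0)
        df = global_df.get(term, 0)
        if tf == 0 or df == 0:
            continue
        idf = math.log(1 + total_docs / df)  # same idf for every collection
        score += (tf / result["length"]) * idf
    return score


def fuse(result_sets, query_terms, global_df, total_docs):
    """Merge partial result sets into one ranking of (document id, score)."""
    merged = [
        (r["id"], global_score(r, query_terms, global_df, total_docs))
        for results in result_sets
        for r in results
    ]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

Because every document is scored with the same statistics, the scores are directly comparable, regardless of the retrieval model that each collection applied locally.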

The Factory

The Factory is the point of contact with the Master service for clients that wish to create Masters for zero or more target collections, starting from their public identifiers. In this role, it is stateless.

Masterdesign3.jpg

The public interface of the Factory port-type can be found here.

THIS SECTION OF GCUBE DOCUMENTATION IS CURRENTLY UNDER UPDATE.