Distributed Information Retrieval Support Framework

From Gcube Wiki
Revision as of 13:08, 6 November 2008 by Fabio.simeoni (Talk | contribs)

The DIR Master service optimises the evaluation of unstructured queries across two or more autonomous collections, the target collections. In particular, the service offers the following functionalities:

  • collection selection: the identification of the subset of target collections that appear to be the most promising candidates for the evaluation of a given query. The process relies upon goodness criteria and selection criteria: the first are used to rank the target collections from the most promising to the least promising, while the second are used to select collections for query evaluation based on their rank. Depending on the choice of goodness and selection criteria, the process may promote the efficiency or the effectiveness of query evaluation. In the first case, the output of the service is used to limit the costs of query evaluation only to the selected collections. In the second case, the output of the service is used to regulate the number of results retrieved from each collection, and thus to limit the number of irrelevant results that are ultimately presented to the user. The two usage models may well coexist within a single collection selection strategy.
  • result fusion: the integration of the partial results obtained by evaluating the queries against (selections of) the target collections. Typically, the process is one of harmonisation of result scores that have been assigned with respect to different retrieval models and content statistics. Accordingly, the process promotes the effectiveness of query evaluation.
  • collection description: the synthesis and maintenance over time of summary information about the content of target collections, from partial inverted indices and language models to result traces for training or past queries. Collection description is only of indirect interest in the context of query evaluation and thus is not exposed by the Master service. Its output forms the basis upon which collections are ranked before being selected for query evaluation. It may also be required to normalise the scores with which the partial results of query evaluation are finally merged.

Collectively, these functionalities characterise the field known as (unstructured) Distributed Information Retrieval (DIR).

Architecture

The DIR Master service embodies the simplest approach to providing collection selection, collection description, and result fusion within a gCube infrastructure. The service participates in the runtime search framework in two different roles: as a query pre-processor, it is invoked to perform collection selection prior to query evaluation; as a search operator, it is invoked to merge results after the query has been evaluated across the selected target collections.

Within the service, collection selection and result fusion are implemented by local algorithms. The algorithms, however, rely on the availability of collection descriptions, which are gathered from other Information Retrieval services asynchronously with respect to the query evaluation workflow. In particular, Index Management services are relied upon to localise descriptions of the target collections.

The relationships between the DIR Master service and other Information Retrieval services are illustrated in the following component diagram:

[Figure: Masterarchitecture.jpg, component diagram of the DIR Master service and related Information Retrieval services]

DIR Master Service

The design of the service is distributed across two port-types: the Master and the Factory port-types.

The Master port-type has the primary task of applying collection selection and result fusion algorithms to a set of target collections. Locally, such sets are materialised in terms of collection ‘proxies’ – containers of summary information about the remote collections – and are maintained both in memory and on the file system. Collectively, the sets form the state of the port-type and are bound to it on a per-request basis, in line with the implied resource pattern of WSRF. In particular, the pairing of the Master interface with a collection set identifies a WS-Resource referred to as a Master.

Masters are created in response to client requests to the Factory port-type and inherit their scope. It is within this scope that a selection of their properties is published in the Information System. In WSRF terminology, these are the Resource Properties that identify: (i) the collections in the bound set, and (ii) the selection and fusion algorithms supported by Masters. At creation time or at any later point in the Master’s lifetime, clients may add or remove target collections from the associated set.

Adding a collection to the Master requires interactions with WS-Resources and Running Instances of Index Management and Collection Management services. The interactions are entirely based on instantiations of the Handler’s framework included in the gCF, and thus inherit best-effort and caching strategies from it. A successful interaction results in the local availability of a term histogram of the collection, i.e. a dictionary of the stems of the most content-bearing words of the collection content, each annotated with its frequency of occurrence within the collection. The histogram is then ingested into the master index, a local inverted index of the union of the content of all the target collections associated with Masters.
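
The ingestion of term histograms into the master index can be pictured with the following minimal Python sketch. It is not the service's implementation: the class `MasterIndex`, its method names, and the in-memory dictionary layout are all hypothetical, and the sketch assumes that a histogram maps each stem to a single integer frequency.

```python
from collections import defaultdict


class MasterIndex:
    """Hypothetical in-memory stand-in for the master index."""

    def __init__(self):
        # stem -> {collection id: frequency of the stem in that collection}
        self.postings = defaultdict(dict)
        # collection id -> sum of the frequencies in its histogram
        self.collection_sizes = {}

    def add_collection(self, collection_id, histogram):
        """Ingest a term histogram (stem -> frequency) for one collection."""
        self.remove_collection(collection_id)  # re-adding replaces old data
        for stem, frequency in histogram.items():
            self.postings[stem][collection_id] = frequency
        self.collection_sizes[collection_id] = sum(histogram.values())

    def remove_collection(self, collection_id):
        """Drop every trace of a collection; unknown ids are ignored."""
        if self.collection_sizes.pop(collection_id, None) is None:
            return
        for stem in list(self.postings):
            self.postings[stem].pop(collection_id, None)
            if not self.postings[stem]:
                del self.postings[stem]

    def frequency(self, stem, collection_id):
        """Frequency of a stem within one collection (0 if absent)."""
        return self.postings.get(stem, {}).get(collection_id, 0)

    def collection_frequency(self, stem):
        """Number of indexed collections whose histogram contains the stem."""
        return len(self.postings.get(stem, {}))
```

Keeping the postings keyed by stem makes the collection-level statistics needed for ranking, such as the number of collections that contain a stem, cheap to read.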

From the master index, collection content statistics can then be retrieved for the application of selection and fusion strategies. These are identified dynamically on a per-request basis. For example, a collection selection strategy begins with the application of an algorithm to rank the target collections with respect to a client-specified query. This is then followed by the application of a selection criterion, also specified by clients, against the resulting collection ranking. Unlike selection criteria, however, ranking algorithms are identified implicitly, by comparing characteristics of the individual request (e.g. the type of the query and the query terms) with those of prototypical examples of the inputs expected by the available algorithms. This dynamic approach allows the port-type to present a very extensible interface and thus: (i) support multiple algorithms at any one time, and (ii) evolve to support more algorithms whilst maintaining backwards compatibility. Similar considerations apply to merging strategies.
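
The implicit identification of a ranking algorithm might look roughly like the Python sketch below. The source does not specify how requests are compared with prototypes, so the matching rule used here (equal query type, and terms present whenever the prototype has terms) is purely an assumption, as are the names `Request`, `RankingAlgorithm`, `matches` and `identify`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Request:
    query_type: str   # illustrative values only, e.g. "keyword"
    terms: List[str]


@dataclass
class RankingAlgorithm:
    name: str
    prototype: Request                      # example of the expected input
    rank: Callable[[Request], List[str]]    # returns ranked collection ids


def matches(request: Request, prototype: Request) -> bool:
    """Assumed rule: same query type, and terms present iff the prototype has terms."""
    return (request.query_type == prototype.query_type
            and bool(request.terms) == bool(prototype.terms))


def identify(request: Request,
             registry: List[RankingAlgorithm]) -> Optional[RankingAlgorithm]:
    """Return the first registered algorithm whose prototype fits the request."""
    for algorithm in registry:
        if matches(request, algorithm.prototype):
            return algorithm
    return None
```

Under this scheme, supporting a new algorithm amounts to appending an entry to the registry; existing requests keep resolving to the algorithms they matched before, which is one way to read the backwards-compatibility claim above.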

In the current version of the service, Masters rank target collections by estimating the likelihood that their content will prove relevant to the information needs that underlie queries. The estimate follows the collection retrieval inference network (CORI) model, a collection-level generalisation of the Bayesian Inference Network probabilistic model of retrieval for text documents. In the model, probabilities are based upon statistics that are analogous to term occurrence frequency (tf) and inverse document frequency (idf) in classic document retrieval. In particular, term frequency is replaced with document frequency (df) and inverse document frequency is replaced with inverse collection frequency (icf). One of three selection criteria may then be chosen to return the first n target collections in the ranking. In the TopN criterion, n is an absolute constant. In the BestScores criterion, n is derived with respect to a threshold on the relevance score. In the ResultDistribution criterion, n is derived from a distribution of the number of results to be returned to the user.
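
As an illustration, the Python sketch below scores collections with the CORI belief formula as it is commonly stated in the DIR literature, using the constants 50 and 150 and a default belief of 0.4, and then applies the TopN and BestScores criteria. Whether the service uses exactly these constants, or averages term beliefs in this way, is an assumption; all function names and the layout of the `stats` dictionary are hypothetical. ResultDistribution is omitted because the text above does not specify the distribution.

```python
import math


def cori_belief(df, cf, cw, avg_cw, num_collections, default_belief=0.4):
    """Belief that a collection satisfies one query term.

    df: documents in the collection that contain the term
    cf: collections that contain the term
    cw: number of words in the collection; avg_cw: mean cw over collections
    """
    if df == 0 or cf == 0:
        return default_belief
    t = df / (df + 50 + 150 * cw / avg_cw)                 # df-based analogue of tf
    i = (math.log((num_collections + 0.5) / cf)
         / math.log(num_collections + 1.0))                # icf analogue of idf
    return default_belief + (1 - default_belief) * t * i


def rank_collections(query_terms, stats, cf, avg_cw):
    """stats: collection id -> {"cw": int, "df": {term: int}}; cf: term -> int."""
    if not query_terms:
        return []
    n = len(stats)
    scores = {}
    for collection_id, s in stats.items():
        beliefs = [cori_belief(s["df"].get(term, 0), cf.get(term, 0),
                               s["cw"], avg_cw, n)
                   for term in query_terms]
        scores[collection_id] = sum(beliefs) / len(beliefs)  # mean term belief
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def top_n(ranking, n):
    """TopN: keep the first n collections of the ranking."""
    return ranking[:n]


def best_scores(ranking, threshold):
    """BestScores: keep the collections whose score reaches the threshold."""
    return [(cid, score) for cid, score in ranking if score >= threshold]
```

A collection that contains none of the query terms scores exactly the default belief, so a BestScores threshold just above 0.4 excludes such collections.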

As to fusion, the current version of the service guarantees a consistent merging of the result sets that emanate from the target collections in response to the evaluation of a given query. In particular, Masters re-compute the relevance estimates of the documents identified in the result sets using non-heuristic techniques. To do so, Masters rely on: (i) global collection content statistics available in the master index, and (ii) term occurrence statistics for each result (such as the number of terms in the corresponding document and the frequency with which the query terms occur in the document). Effectively, the availability of collection-wide and result-wide statistical information allows the service to consistently re-rank the virtual collection that comprises all the documents identified in the result sets.
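
The idea of re-scoring every result against global statistics can be sketched in Python as follows. The source does not name the scoring function used by Masters, so a plain length-normalised tf-idf is swapped in here for illustration only; the `Result` type, its fields, and the functions `global_score` and `fuse` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    doc_id: str
    collection_id: str
    doc_length: int              # number of terms in the document
    term_counts: Dict[str, int]  # occurrences of each query term in the document


def global_score(result: Result, query_terms: List[str],
                 global_df: Dict[str, int], total_docs: int) -> float:
    """Length-normalised tf-idf computed with statistics of the virtual collection."""
    score = 0.0
    for term in query_terms:
        tf = result.term_counts.get(term, 0) / max(result.doc_length, 1)
        df = global_df.get(term, 0)
        if tf and df:
            score += tf * math.log(1 + total_docs / df)
    return score


def fuse(result_sets: List[List[Result]], query_terms: List[str],
         global_df: Dict[str, int], total_docs: int) -> List[Result]:
    """Merge partial result sets and re-rank them by the recomputed score."""
    merged = [result for results in result_sets for result in results]
    return sorted(
        merged,
        key=lambda r: global_score(r, query_terms, global_df, total_docs),
        reverse=True,
    )
```

Because every document is scored with the same global statistics, the scores originally assigned by the individual collections are not needed, and documents from different collections become directly comparable.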


THIS SECTION OF GCUBE DOCUMENTATION IS CURRENTLY UNDER UPDATE.