Search Planner

From Gcube Wiki
Revision as of 17:17, 22 July 2011 by Vassilis.verroios (Talk | contribs) (Search Planner)

Jump to: navigation, search

Search Planner

The outcome of a CQL query is a set of documents that satisfy the criteria defined by the query. Documents are hosted in a gCube infrastructure by Data Sources. Data Sources are able to execute CQL queries, bound to the information they host, and each Data Source supports different CQL capabilities. In most cases the initial CQL query can not be answered solely by a single Data Source. Therefore we need to use the functionality of Search Operators and combine the sets of documents retrieved from various Data Sources. The Search Planner detects which are the Data Sources that must be involved in order to answer the initial query and produces a plan. This plan specifies a)the subquery of the initial query that must be answered by each Data Source and b)how the Data Sources are combined with Search Operators in order to produce the final outcome.

Search Planner takes into account the following facts about the gCube environment:

  • Vertical and Horizontal partitioning of data
  • Data Sources CQL capabilities
  • The costs implied by the execution of a search plan

The information for one document is distributed across multiple Data Sources. We define this information partitioning as the Vertical Data Partitioning. Documents are also divided into collections (see OCMA), forming in such a way a Horizontal Data Partitioning. Multiple Horizontal partitions are hosted by each Data Source. For the example of Figure 1 assume that the documents are divided into collections A, B, C, D, E, which constitute the Horizontal Data Partitioning. The header, type, and location for each document are hosted into different Data Sources, forming the Vertical Data Partitioning. Assume also that DataSource ABh hosts the header information for the documents of collections A and B, source ABt hosts the type information for collections A and B, and ABl hosts the location for documents of A and B. Following the same fashion sources CDh, CDt, CDl, Eh, Et and El host the information for the header, type and location of the documents of collections C, D and E.

Figure 1. CQL query before the first stage of planning

In order to produce the final outcome for a CQL query, different data partitions are implicated. This set of partitions comes of the criteria defined in the CQL query to be answered. To this end, Search planner detects the partitions and the related Data Sources, where the documents satisfying the criteria of the query, belong. The second fact the Search Planner takes into account, is the CQL capabilities supported by each Data Source. Based on the CQL capabilities exposed by Data Sources, the Planner discovers which subqueries of the initial query, can be answered by each Data Source. Taking into account the data partitioning and the CQL capabilities supported by the Data Sources, the Planner ends up with a number of search plans that are sufficient for answering the initial query. The final goal is to find the plan that minimizes the total cost implied by its execution. The parameters affecting the total cost of a search plan are the following:

  • Computational cost for answering a given CQL query by a specific Data Source.
  • Computational cost for combining the outcome of Data Sources by a specific Search Operator.
  • Communication cost for transferring a set of documents from a machine hosting a Data Source or an Operator, to a machine hosting an Operator.
  • Cached results from previous queries that can be used to answer parts of the current query.

The Planning process consists of two stages. In the first stage we apply a set of static rules and we rewrite the initial query. Through this first stage, we aim at detecting the minimum set of partitions that need to be involved in order to answer the query. In the example of Fig. 1 the documents coming from the left subtree of the root can be part of collections C, D or E. The right child of the root, will further filter the documents of the left subtree, so that only the documents that belong to collection E are returned as results. The rewriting rules applied in this example will substitute criteria collection=A, collection=B, which are the right children of the two NOT nodes of the query tree. The rewritten query is depicted in Fig. 2. In this example the Planner detects, after the rewriting stage, that only partition defined by collection E is involved in this query. The static rewriting rules applied in the first stage of planning, are a combination of a)boolean algebra rules and b)rules related to the horizontal and vertical partitioning of the data. For brievity reasons rewriting rules are not described in detail here.