Search Planning and Execution Specification
Overview
A fundamental part of the gCube Information Retrieval (IR) framework consists of the Search Planning and Execution components. The Search Planner enables the on-the-fly integration of CQL-compliant Data Sources. The key concept in this process is the publication of CQL capabilities by the integrated Sources. The Search Planner involves any Source that has published its capabilities on a given infrastructure, as long as that Source contributes to the result of a query.
The optimization mechanisms of the Planner detect the smallest set of Sources required to answer a query. Moreover, using a probabilistic approach, a near-optimal plan for execution is found. The algorithms of the Planning and Optimization stages allow the IR framework to scale in the number of Data Sources that are integrated in an infrastructure.
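The idea of detecting a small set of Sources can be illustrated with a minimal sketch. It assumes, purely for illustration, that each Source publishes the set of fields it can search and that a query needs a given set of fields; the greedy set-cover heuristic used here is a stand-in and is not claimed to be the Planner's actual algorithm. The names `select_sources`, `published` and the source identifiers are hypothetical.

```python
from typing import Dict, List, Set


def select_sources(capabilities: Dict[str, Set[str]], needed: Set[str]) -> List[str]:
    """Greedy set cover: pick sources until every needed field is covered."""
    uncovered = set(needed)
    chosen: List[str] = []
    while uncovered:
        # Pick the source covering the most still-uncovered fields;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(
            sorted(capabilities),
            key=lambda s: len(capabilities[s] & uncovered),
            default=None,
        )
        gained = capabilities[best] & uncovered if best is not None else set()
        if not gained:
            raise ValueError(f"no source can search: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical published capabilities: source name -> searchable fields.
published = {
    "source_a": {"title", "author"},
    "source_b": {"author"},
    "source_c": {"abstract", "title"},
}
plan_sources = select_sources(published, {"title", "author", "abstract"})
```

In this example `source_b` is left out because `source_a` and `source_c` already cover every field the query needs. A greedy cover is only an approximation of the smallest set, which is in keeping with the near-optimal (rather than optimal) plans described above.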
The Search Planner produces a plan that combines Data Sources with a selection of Search Operators. A distributed execution environment ensures the efficient execution of the plan. Information travels from Data Sources to Search Operators and, in turn, to the search clients through a pipelined data transfer mechanism that provides low latency, high throughput and flow control.
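The pipelining behaviour described above can be sketched with a bounded buffer between a producer and a consumer. This is only an illustration of the general technique using Python's standard library; it is not the gRS2 API, and the names `pipe`, `source` and `to_upper` are hypothetical.

```python
import queue
import threading
from typing import Iterable, Iterator

_END = object()  # sentinel marking the end of the stream


def pipe(records: Iterable[str], capacity: int = 2) -> Iterator[str]:
    """Stream records through a bounded buffer filled by a writer thread."""
    buffer: "queue.Queue[object]" = queue.Queue(maxsize=capacity)

    def writer() -> None:
        for record in records:
            buffer.put(record)  # blocks while the buffer is full (flow control)
        buffer.put(_END)

    threading.Thread(target=writer, daemon=True).start()
    while True:
        item = buffer.get()  # returns as soon as one record is available
        if item is _END:
            return
        yield item


def source() -> Iterator[str]:
    # Stand-in for a Data Source producing results one at a time.
    for i in range(5):
        yield f"record-{i}"


def to_upper(stream: Iterable[str]) -> Iterator[str]:
    # Stand-in for a Search Operator transforming records as they arrive.
    for record in stream:
        yield record.upper()


results = list(to_upper(pipe(source())))
```

The bounded buffer gives flow control because the writer cannot run more than `capacity` records ahead of the reader, while the reader sees the first record without waiting for the whole result set.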
Key features
- On-the-fly Integration of New Data Sources
  - CQL-compliant Sources that publish their capabilities are dynamically involved in the IR process.
- Involves the minimum number of Data Sources
  - Detection of all the Sources that contribute to the result of a query.
- Scalability in the number of Sources integrated in an Infrastructure
  - Planning and Execution components designed to scale.
- Dynamic Integration of New Search Operators
  - Operators with new functionality can be dynamically integrated in the IR process.
- Pipelining Mechanism
  - Offers flow control, low latency and high throughput.
Design
Philosophy
The Search Planning and Execution components are designed to:
- allow the efficient and flexible integration of new Sources and Operators.
- exploit the IR capabilities and functionality of various information providers.
- scale in environments with a large number of heterogeneous Sources.
- decouple and eliminate the dependencies among the Planner, the Execution environment and the information providers.
Architecture
The main components of the Search system are the Planner, the Search Operators and the distributed Execution engine. The architecture is shown in the following figure:
The Planner is aware of the available Data Sources through the Resource Registry, which provides an interaction mechanism with the Information System. Data Sources use the Information System to publish their capabilities. The Planner computes a plan for answering a CQL query received from a search client. This plan also contains hints about specific functionality required from the involved Search Operators.
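To make the notion of a plan more concrete, the following sketch models it as a small tree in which leaves are Data Sources and inner nodes are Search Operators carrying hints. The actual plan representation is not specified in this document, so every class, field, operator name and hint key below is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class SourceNode:
    source_id: str  # hypothetical identifier of a Data Source
    cql: str        # the CQL sub-query sent to that source


@dataclass
class OperatorNode:
    operator: str                # hypothetical operator name, e.g. "merge"
    children: List["PlanNode"]   # inputs of the operator
    hints: Dict[str, str] = field(default_factory=dict)  # planner hints


PlanNode = Union[SourceNode, OperatorNode]


def involved_sources(node: PlanNode) -> List[str]:
    """Collect the Data Sources a plan touches, left to right."""
    if isinstance(node, SourceNode):
        return [node.source_id]
    found: List[str] = []
    for child in node.children:
        found.extend(involved_sources(child))
    return found


# A plan that merges the results of the same CQL query from two sources.
plan = OperatorNode(
    operator="merge",
    hints={"sort_by": "rank"},
    children=[
        SourceNode("source_a", 'title = "fish"'),
        SourceNode("source_c", 'title = "fish"'),
    ],
)
```

Calling `involved_sources(plan)` walks the tree and lists the sources the plan depends on, which is the kind of information an execution environment would need before allocating resources.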
The distributed Execution environment computes a preferred allocation for the execution of received plans through the Execution planner. During this procedure the possible invocation options for each Data Source and Search Operator are taken into account. The outcome of execution is the endpoint of a gRS2 pipeline that can transfer the results to the search client that initiated the query.
Deployment
The deployment schema must be decided based on:
1. the workload from the search clients
2. the extent of the Data Sources' space
Large Deployment
The Planner, Search Operators and Execution Engine are deployed on gCore containers. The gRS2 pipelining mechanism and the Resource Registry must also be present on each node. When search clients issue queries at a high rate, multiple Planner, Operator and Execution Engine instances have to be deployed. Such instances can be co-deployed on the same node to minimize the various types of overhead. A Planner instance can use only the co-deployed Execution Engine instance or, if the Data Sources' space is very large, it can exploit the computational resources of the whole distributed Execution Engine deployed across all nodes. Deployment is usually performed at the VRE level.
Small Deployment
At a smaller scale, with a low rate of received queries and a Sources' space of reasonable size, a single node with the Planner, Operators and Execution Engine co-deployed is preferable.
Use Cases
The suitability of the gCube Data Source specifications for IR components is strongly related to the two standards adopted:
- CQL: IR providers that support functionality which can be directly mapped in the CQL standard are good candidates for being wrapped into Data Sources.
- OpenSearch: IR providers that implement the OpenSearch API can be directly wrapped into Data Sources.
Well suited Use Cases
Components that provide IR functionality are well suited for forming Data Sources, depending on their relation to the above standards. Integrating an IR provider through the OpenSearch Data Source is preferable when the provider's functionality cannot be mapped directly to the CQL standard. However, if CQL can accurately express the provided IR capabilities, integrating the corresponding IR component directly as a separate Data Source can be advantageous, mainly because the component's IR functionality is then exploited more fully. Note that CQL is chosen as the standard in our framework because it is a highly expressive query language that suits the IR functionality of most general-purpose IR systems.
Less well suited Use Cases
If a data provider cannot be associated with either of the two standards, the alternative approach is to add an intermediate step that inserts the provider's data into an Index partition. In this case the provider's information is exploited through the Index System functionality. However, this alternative implies a significant overhead when the content of the provider is frequently updated.
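The guidance in this section can be summarised as a small decision function. It is a sketch of the rules as stated here, with hypothetical names (`Integration`, `choose_integration`) and two boolean inputs standing in for what would be a manual assessment of the provider in practice.

```python
from enum import Enum


class Integration(Enum):
    DEDICATED_DATA_SOURCE = "dedicated CQL Data Source"
    OPENSEARCH_DATA_SOURCE = "OpenSearch Data Source"
    INDEX_PARTITION = "Index partition"


def choose_integration(maps_to_cql: bool, implements_opensearch: bool) -> Integration:
    """Pick an integration path for an IR provider, following the use cases above."""
    if maps_to_cql:
        # CQL expresses the provider's capabilities accurately:
        # a dedicated Data Source exploits them best.
        return Integration.DEDICATED_DATA_SOURCE
    if implements_opensearch:
        # No direct CQL mapping, but the OpenSearch API is available.
        return Integration.OPENSEARCH_DATA_SOURCE
    # Neither standard applies: feed the data into an Index partition,
    # accepting the overhead when the content changes frequently.
    return Integration.INDEX_PARTITION
```

For example, `choose_integration(False, True)` selects the OpenSearch Data Source, while a provider matching neither standard falls through to the Index partition.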