Search Framework

From Gcube Wiki
Jump to: navigation, search

Contents

Search Intro

The Search Framework consists of three major component categories: Search Master Service, Search Library and Search Operators. The first two categories are atomic in the sense that they consist of a single entity. The third one is a family of gCube services which expose various search functionalities.

The Search Master Service provides the access point to the gCube Search Engine. It receives a user query from the Portal, and along with some environment information received from the Information Service (IS), it initiates the query processing procedure. This procedure comes up with an ExecutionPlan which is fed the appropriate Execution Engine Connector which, in turn, forwards it to its corresponding execution engine (external to the Search Framework).

The Search Library is an all-in-one bundle that incorporates all the query processing, as well as the actual implementation of the Search Operators. The most critical subcomponents are: Query Processor Bundle (QueryObject, QueryParser, QueryPlanner, QueryOptimizer), Search Operator Library and the eXecution ENgine API (XENA). We will analyze these components here.

Finally, the Search Operators bundle is decomposed into several physical packages offering low level data processing elements. It is a family of gCube services, each of which is dedicated to a specific data processing facility. As previously mentioned, the SearchOperators bundle is a thin wrapper over the SearchLibrary which provides the actual implementation.

In the following paragraphs we shall present these three components along with their constituent subcomponents, and attempt to analyze their inner mechanics and relationships with the rest of the gCore/gCube context.

Query Processing Chain

The query processing chain in the Search framework follows (more or less) the chain of a traditional search engine:

  1. Submittion of query expression to search
  2. SearchMaster initiates the query parsing procedure
    1. Search Master invokes the Query Parser, passing the query expression as its argument
    2. Query Parser returns a query tree or exception in case of error; in the later case, Search Master forwards this exception to the caller
  3. SearchMaster initiates the query preprocessing procedure
    1. Search Master invokes the registered query preprocessors successively, passing the query tree as their argument
    2. Each preprocessor returns a potentially new query expression or exception in case of error; in the later case, Search Master forwards this exception to the caller
  4. SearchMaster initiates the query planning procedure
    1. Search Master invokes the Query Planner, passing the query tree as its argument
    2. Query Parser returns an execution plan or exception in case of error; in the later case, Search Master forwards this exception to the caller
  5. SearchMaster initiates the execution procedure
    1. An appropriate execution engine is selected
    2. Search Master invokes the selected engine, passing the execution plan as its argument
    3. The execution engine returns a result set endpoint reference or exception in case of error; in the later case, Search Master forwards this exception to the caller
  6. Search Master returns back the result set endpoint reference to the caller.

Search from the User Perspective: Querying

In general, a query is a form of questioning, in a line of inquiry; a statement of information needs, typically keywords combined with boolean operators and other modifiers; a specification of a result to be calculated from one or more sets of data. In the gCube environment, a query object contains the information for a simple or complex search operation on one or more collections of data. The principle behind the Query is that for each search operation there is a respective query node class. So, for example, there is a Join class which represents the Join search operation. Each and every class of the Query framework, is a sub-class of the SearchOperation. In this way, we can build query trees, with a query node class define one or more query node classes as its child. The root of this query tree is handled by the Query class, which provides methods for (de) serializing a query and locating a specific query node in the tree. In order to be able to express the appropriate information, a query must describe the following elements:

  • Search operation: Every search operation corresponds to a search service, which implements it. Available operations are:
    • FieldedSearch
    • FullTextSearch
    • Join
    • KeepTop
    • Merge
    • Project
    • QueryExternalSource
    • SimilaritySearch
    • SpatialSearch
    • Conditional
    • Sort
    • TranformResultSet
    • Source Selection: The Sources Selection part is responsible for identifying the resources on which the Query should be performed defining the collections against which the query criteria should be executed. There are also provisions for the case in which the submitted query is to be performed against the ResultSet computed by a previous query. Since the ResultSet is to be treated as a WS-Resource, this is done by passing the End Point Reference of the previous result set in the Metadata Source subpart of the Source definition section of any following query.


Available Operations

Operation Semantics
project Perform projection over a result set on a set of elements. By default, all header data (DocID, RankID, CollID) are kept in the projected result set; that is, they need not to be specified. If the projected term set is empty, then only the header data are kept.
sort Perform a sort operation on a result set based on an element of that result set, either ascending (ASC) or descending (DESC).
merge Concatenate two or more result sets. The records are non-deterministically concatenated in the final result set.
join Join two result sets on a join key, using one of the available modes (currently only 'inner' is implemented). The semantics of the inner join is similar to the respective operation found in the relational algebra, with the difference that in our case, only the join key from the left RS is kept and joined with the right payload.
fielded search Keep only those records of a source, which conform to a selection expression. This expression is a relation between a key and a value. The key is an element of the result set and the relation is one of the: ==, !=, >, <, <=, >=, contains. The 'contains' relation refers to whether a string is a substring of another string. Using this comparison function, one can use the wildcard '*', which means any. We discriminate these cases:
  • '*ad*'. It can match any of these: 'ad', 'add', 'mad', 'ladder'
  • '*der'. It can match any of these: 'der', 'ladder', but not 'derm' ot 'ladders'.
  • 'ad*'. It can match any of these: 'ad', 'additional', but not 'mad' or 'ladder'.
  • 'ad'. It can only match 'ad'.

If we search on a text field, then contains refers to any of its consisting words. For example, if we search on the field title which is the rain in spain stays mainly in the plane, then the matching criteria '*ain*' refers to any of the 'rain', 'spain', 'mainly'. If the predicate is '==' then we search for exact match; that is, in the previous example, the title == 'stays', won't succeed. Predicates can be combined with ORs, ANDs (currently under development). The source of this operation can be either a result set generated by another search operation or a content source. In the last case, you should use a source string identifier.

full text search Perform a full text search on a given source based on a set of terms. The full text search source must be a string identifier of the content source.

Each full text search term may contain a single or multiple words. In both cases, all terms are combined with a logical AND. In the second case, is a term is e.g. 'hello nasty', we search for the words 'hello' and 'nasty', with the latter following the former, as stated in the term; Text that does not contain such exact succession of the two words, it won't match the search criteria. Another feature of fulltextsearch is the lemmatization. In a few words, the terms are processed and a set of relative words is generated and also used in the full text search.

filter by xpath, xslt, math, beanshell Perform a low level xpath or a xslt operation on result set. The math type refers to a mathematical language and is used by advanced users who are acquainted with that language. For more details about the semantics and syntax of that language, please see the documentation for the ResultSetScanner service, which implements this language. The beanshell type refers to filtering RSRecords using beanshell expressions.
keep top Keep only a given number of records of a result set.
retrieve metadata Retrieve ALL metadata associated to the search results of a previous search operation.
read Read a result set endpoint reference, in order to process it. This operation can be used for further processing the results of a previous search operation.
external search (deprecated) Perform a search on external (diligent-disabled) source. Currently, google, RDBMS and the OSIRIS infrastructures can be queried. Depending on the source, the query string can vary. As far as google is concerned, the query string must conform to the query language of google. In the case of RDBMS, the query must have the following form, in order to be executed successfully:
<root>
<driverName>your jdbc driver</driverName>
<connectionString>your jdbc connection string</connectionString>
<query>your sql queryt</query>
</root>
Finally, in the OSIRIS case, the query string must have the following format:
<root>
<collection>your osiris collection</collection>
<imageURL>your image URL to be searched for similar images</imageURL>
<numberOfResults>the number of results</numberOfResults>
</root>
similarity search Perform a similarity search on a source for a multimedia content (currently, only images). The image URL is defined, along with the source string identifier and pairs of feature and weight.
spatial search Perform a classic spatial search against a used defined shape (polygon, to be exact) and a spatial relation (contains, crosses, disjoint, equals, inside, intersect, overlaps, touches.
conditional search Classic If-Then-Else construct. The hypothesis clause involves the (potentially aggragated) value of one or more fields which are part of the result of previous search operation(s). The central predicate involves a comparison of two clauses, which are combinations (with the basic math functions +, -, *, /) of these values

Syntax

  <function> ::= <project_fun> | <sort_fun> | <filter_fun> | <merge_fun> | <join_fun> | <keeptop_fun> | <fulltexts_fun> | 
     <fieldedsearch_fun> | <extsearch_fun> | <read_fun> | <similsearch_fun> | <spatialsearch_fun> | <retrieve_metadata_fun> 

  <read_fun> ::= <read_fun_name> <epr>
  <read_fun_name> ::= 'read'
  <epr> ::= string

  <project_fun> ::= <project_fun_name> <by> <project_key> <project_source>
  <project_fun_name> ::= 'project'
  <project_key> ::= string
  <project_source> ::= <non_leaf_source>

  <sort_fun> ::= <sort_fun_name> <sort_order> <by> <sort_key> <sort_source>
  <sort_fun_name> ::= 'sort'
  <sort_key> ::= string
  <sort_order> ::= 'ASC' | 'DESC'
  <sort_source> ::= <non_leaf_source>

  <filter_fun> ::= <filter_fun_name> <filter_type> <by> <filter_statement> <filter_source>
  <filter_fun_name> ::= 'filter'
  <filter_type> ::= string
  <filter_statement> ::= string
  <filter_source> ::= <non_leaf_source> | <leaf_source>

  <merge_fun> ::= <merge_fun_name> <on> <merge_sources>
  <merge_fun_name> ::= 'merge'
  <merge_sources> ::= <merge_source> <and> <merge_source> <merge_sources2>
  <merge_sources2> ::= <and> <merge_source> <merge_sources2> | φ
  <merge_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <join_fun> ::= <join_fun_name> <join_type> <by> <join_key> <on> <join_source> <and> <join_source>
  <join_fun_name> ::= 'join'
  <join_key> ::= string
  <join_type> ::= 'inner' | 'fullOuter' | 'leftOuter' | 'rightOuter'
  <join_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <keeptop_fun> ::= <keeptop_fun_name> <keeptop_number> <keeptop_source>
  <keeptop_fun_name> ::= 'keeptop'
  <keeptop_number> ::= integer
  <keeptop_source> ::= <non_leaf_source>

  <fulltexts_fun> ::= <fulltexts_fun_name> <by> <fulltexts_term> <fulltexts_terms> <in> <language> <on> <fulltexts_sources>
  <fulltexts_fun_name> ::= 'fulltextsearch'
  <fulltexts_terms> ::= <comma> <fulltexts_term> <fulltexts_terms> | φ
  <fulltexts_sources> ::= <fulltexts_source> <fulltexts_sources_2>
  <fulltexts_sources_2> ::= <comma> <fulltexts_source> <fulltexts_source> | φ
  <fulltexts_source> ::= string

  <fieldedsearch_fun> ::= <fieldedsearch_fun_name> <by> <query> <fieldedsearch_source>
  <fieldedsearch_fun_name> ::= 'fieldedsearch'
  <query> ::= string
  <fieldedsearch_source> ::= <non_leaf_source> | <leaf_source>

  <extsearch_fun> ::= <extsearch_fun_name> <by> <extsearch_query> <on> <extsearch_source>
  <extsearch_fun_name> ::= 'externalsearch'
  <extsearch_query> ::= string
  <extsearch_source> ::= string

  <similsearch_fun> ::= <similaritysearch_fun_name> <as> <URL> <by> <pair> <pairs> <similarity_source>
  <similsearch_fun_name> ::= 'similaritysearch'
  <URL> ::= string
  <pair> ::= <feature> <equal> <weight>
  <pairs> ::= <and> <pair> <pairs> | φ
  <similarity_source> ::= <leaf_source>

  <if-syntax> ::= <if> <left_parenthesis> <function-st> <compare-sign> <function-st> <right_parenthesis> <then> <search-op> <else> <search-op>
  <compare-sign> ::= '==' | '>' | '<' | '>=' | '<='
  <function-st> ::= <left-op> <math-op> <right-op> | <left-op>
  <math-op> ::= '+' | '-' | '*' | '/'
  <left-op> ::= <function> <left_parenthesis> <left-op> <right_parenthesis> | <literal>
  <function> ::= <max-fun> | <min-fun> | <sum-fun> | <av-fun> | <va r-fun> | <size-fun>
  <max-fun> ::= 'max' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <min-fun> ::= 'min' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <sun-fun> ::= 'sum' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <av-fun> ::= 'av' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <va r-fun> ::= 'var' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <size-fun> ::= size' <left_parenthesis> <search-op> <right_parenthesis>
  <right-op> ::= <function-st> | <left-op>
  <element> ::= an element of the result set payload (either XML element, or XML attribute)

  <retrieve_metadata_fun> ::= <rm_fun_name> <in> <language> <on> <rm_source> <as> <schema>
  <rm_fun_name> ::= 'retrievemetadata'
  <schema> ::= string
  <rm_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <spatialsearch_fun> ::= <spatialsearch_fun_name> <relation> <geometry> [<timeBoundary>] <spatial_source>
  <spatialsearch_fun_name> ::= 'spatialsearch'
  <relation> ::= {'intersects', 'contains', 'isContained'}
  <geometry> ::= <polygon_name> <left_parenthesis> <points> <right_parenthesis>
  <polygon_name> ::= 'polygon'
  <timeBoundary> ::= 'within' <startTime> <stopTime>
  <startTime> ::= double
  <stopTime> ::= double
  <spatial_source> ::= <leaf_source>
  <points> ::= <point> {<comma> <point>}+
  <point> ::= <x> <y>
  <x> ::= long
  <y> ::= long

  <leaf_source> ::= [<in> <languager>] <on> <my_source> [<as> <schema>]
  <non_leaf_source>  ::= <on> <left_parenthesis> <function> <right_parenthesis>

  <language>  ::= 'AFRIKAANS' | 'ARABIC' | 'AZERI' | 'BYELORUSSIAN' | 'BULGARIAN' | 'BANGLA' | 'BRETON' | 'BOSNIAN' | 'CATALAN' | 
     'CZECH' | 'WELSH' |    'DANISH' | 'GERMAN' | 'GREEK' | 'ENGLISH' | 'ESPERANTO' | 'SPANISH' | 'ESTONIAN' | 'BASQUE' | 'FARSI' |
     'FINNISH' | 'FAEROESE' | 'FRENCH' | 'FRISIAN' | 'IRISH_GAELIC' | 'GALICIAN' | 'HAUSA' | 'HEBREW' | 'HINDI' | 'CROATIAN' | 
     'HUNGARIAN' | 'ARMENIAN' | 'INDONESIAN' | 'ICELANDIC' | 'ITALIAN' | 'JAPANESE' | 'GEORGIAN' | 'KAZAKH' | 'GREENLANDIC' | 'KOREAN' |
     'KURDISH' | 'KIRGHIZ' | 'LATIN' | 'LETZEBURGESCH' | 'LITHUANIAN' | 'LATVIAN' | 'MAORI' | 'MONGOLIAN' | 'MALAY' | 'MALTESE' |
     'NORWEGIAN_BOKMAAL' | 'DUTCH' | 'NORWEGIAN_NYNORSK' | 'POLISH' | 'PASHTO' | 'PORTUGUESE' | 'RHAETO_ROMANCE' | 'ROMANIAN' | 'RUSSIAN' | 
     'SAMI_NORTHERN' | 'SLOVAK' | 'SLOVENIAN' | 'ALBANIAN' | 'SERBIAN' | 'SWEDISH' | 'SWAHILI' | 'TAMIL' | 'THAI' | 'FILIPINO' | 'TURKISH' |
     'UKRAINIAN' | 'URDU' | 'UZBEK' | 'VIETNAMESE' | 'SORBIAN' | 'YIDDISH' | 'CHINESE_SIMPLIFIED' | 'CHINESE_TRADITIONAL' | 'ZULU'
  <my_source> ::= string
  <schema>  ::= string
  <left_parenthesis> ::= '('
  <right_parenthesis> ::= ')'
  <comma> ::= ','
  <and> ::= 'and'
  <on> ::= 'on'
  <as> ::= 'as'
  <by> ::= 'by'
  <sort_by> ::= 'sort'
  <from> ::= 'from'
  <if> ::= 'if'
  <then> ::= 'then'
  <else> ::= 'else'

Examples

Example 1
User Request Give me back all documents whose metadata contain the word woman from the collection identified by the triplet <ENGLISH, 0a952bf0-fa44-11db-aab8-f715cb72c9ff, dc>
Actual Query
fulltextsearch by 'woman' in 'ENGLISH' on '0a952bf0-fa44-11db-aab8-f715cb72c9ff' as 'dc'
Explanation We perform the fulltextsearch operation, using the woman term in the data source identified by the laguage ENGLISH, source number 0a952bf0-fa44-11db-aab8-f715cb72c9ff and schema dc

Example 2
User Request Give me back all documents from the collection identified by the triplet <ENGLISH, 0a952bf0-fa44-11db-aab8-f715cb72c9ff, dc> that are created by dorothea; that is, the creator's name contains the separate word dorothea, e.g. Hemans, Felicia Dorothea Browne
Actual Query
fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc'
Explanation We perform the fieldedsearch operation in the data source identified by the laguage ENGLISH, source number 568a5220-fa43-11db-82de-905c553f17c3 and schema dc and retrieve only those that their creator's name contain the word 'dorothea'. CAUTION: This does not cover creator names such as 'abcdorothea'. In this case, users should use the wildcard '*'. The absence of '*' implies string delimiter. E.g. '*dorothea' matches 'abcdorothea' but not 'dorotheas', 'dorothea*' matches 'dorotheas' but not 'abcdorothea'. Another critical issue is the data source identifier. Example 1 and Example 2 refer to the 'A Celebration of Women Writers' collection. However, in Example 1 we refer to the metadata of this collection, whereas in Example 2 we refer to the actual content.

Example 3
User Request Give me back the creator and subject the first 10 documents from the collection identified by the triplet <ENGLISH, 0a952bf0-fa44-11db-aab8-f715cb72c9ff, dc> sorted by the DocID field, whose creator's name contains ro.
Actual Query
project by 'creator', 'subject' on 
   (keeptop '10' on 
      (sort 'ASC' by 'DocID' on 
         (fieldedsearch by 'creator' contains '*ro*' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')))
Explanation First of all, we perform the fieldedsearch operation in the data source identified by the laguage ENGLISH, source number 568a5220-fa43-11db-82de-905c553f17c3 and schema dc and retrieve only those that their creator's name contain the word 'ro'. On this result set, we apply the sort operation on the DocID field. Then, we apply the keep top operation, in order to keep only the first 10 sorted documents. Finally, we apply the project operation, keeping only the creator and subject fields.

Example 4
User Request Perform spatial search against the collection identified by the triplet <ENGLISH,6cbe79b0-fbe0-11db-857b-d6a400c8bdbb,eiDB> defining a search rectagular (0,0), (0,50), (50,50), (50,0).
Actual Query
spatialsearch contains polygon(0 0, 0 50, 50 50, 50 0) 
   in 'ENGLISH' on '6cbe79b0-fbe0-11db-857b-d6a400c8bdbb' as 'eiDB'
Explanation Search in collection identified by the triplet <ENGLISH,6cbe79b0-fbe0-11db-857b-d6a400c8bdbb,eiDB> for records that define geometric shapes which include the rectagular identified by the points {(0,0), (0,50), (50,50), (50,0)}.

Example 5
User Request Give me back all documents from the collection identified by the triplet <ENGLISH, 0a952bf0-fa44-11db-aab8-f715cb72c9ff, dc> that are created by dorothea and all documents from the same collection whose title contain the word woman
Actual Query
merge on 
    (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') 
and (fieldedsearch by 'title' contains 'woman' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')
Explanation This is an example of how the user can merge the results of more than one subqueries.

Example 6
User Request Give me back all documents from the collection identified by the triplet <ENGLISH, 0a952bf0-fa44-11db-aab8-f715cb72c9ff, dc> that are created by dorothea AND whose description contain the word London
Actual Query
join inner by 'DocID' on 
    (fieldedsearch by 'description' contains 'London' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') 
and (fieldedsearch by 'title' contains 'woman' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')
Explanation This is an example of how the user can perform a join (logical AND) of subqueries. We perform the join operation in the field DocID, which is the document unique identifier. In this way, documents that are members of both the result sets of the subqueries, participate in the final result set.


Search Orchestrator (Search Master Service)

The Search Master Service (SMS) is the main entry point to the functionality of the search engine. It contains the elements that will organize the execution of the search operators for the various tasks the Search engine is responsible for.

There are two steps in achieving the necessary Query processing before it can be forwarded to the Search engine for the actual execution and evaluation of the net results. The first step is the transformation of the abstract Query terms in a series of WS invocations. The output of this step is an enriched execution plan mapping the abstract Query to a workflow of service invocations. These invocations are calls to Search Service operators providing basic functionality called Search Operators. The second step is the optimization of the calculated execution plan.

The SMS is responsible for the first stage of query processing. This stage produces a query execution plan, which in the gCube implementation is a directed acyclic graph of Search Operator invocations. This element is responsible for gathering the whole set of information that is expected to be needed by the various search services and provides it as context to the processed query. In this manner, delays for gathering info at the various services are significantly reduced and assist responsiveness.

The information gathered is produced by various components or services of the gCube Infrastructure. They include the gCube Information Service (IS), Content and Metadata Management, Indexing service etc. The process of gathering all needed information proves to be very time consuming. To this end, the SMS keeps a cache of previously discovered information and state.

The SMS validates the received Query using SearchLibrary elements. It validates the user supplied query against the elements of the specific Virtual Organisation (VO). This ensures that content collections are available, metadata elements (e.g. fields) are present, operators (i.e. services) are accessible etc. Afterwards it performs a number of preprocessing steps invoking functionality offered by services such as the Query Personalisation and the DIR (former Content Source Selection) service, in order to refine the context of the search or inject extra information at the query. These are specializations of the general Query Preprocessor Element. An order of Query Preprocessor calls is necessary in the case where they might inject conflicting information. Otherwise, a method for weighting the source of the conflicting information importance is necessary. Furthermore, a number of exceptions may occur during the operation of a preprocessor, as during the normal flow of any operation. The difference is that, although useful in the context of gCube, preprocessors are not necessary for a Search execution. So errors during Query Preprocessing must not stop the search flow of execution.

The above statement is a sub case of the more general need of a framework for defining fatal errors and warnings. During the entire Search operation a number of errors and/or warnings may emerge. Depending on the context in which they appear, they may have great or no significance to the overall progress of execution. Currently, these cases are handled separately but a uniform management may come into play as the specifications of each service’s needs in the grand plan of the execution become more apparent at a low enough level of detail.

After the above pre-processing steps are completed successfully, the SMS dispatches a QueryPlanner thread to create the Query Execution Plan. Its job is firstly to map the provided Query that has been enriched by the preprocessors to a concrete workflow of WS invocations. Subsequently, the Query Planner uses the information encapsulated with the provided Query, the information gathered by the SMS for the available gCube environment and a set of transformation rules to come up with a near optimal plan. When certain conditions are met (e.g. the Query Planner has finished, time has elapsed, all plans have been evaluated), the planer returns to its caller the best plan calculated. If more than one Query Planners are utilized, the plans calculated by each Query Planner are gathered by the SMS. He then chooses the overall optimal plan and passes it to a suitable execution engine, where execution and scheduling is being achieved in a generic manner. The actual integration with the available execution engines and the formalization of their interaction with the SMS is accomplished through the introduction of the eXecution Engine Api (XENA), which is thoroughly analyzed in the Search Library section. In this formal methodology, the SMS is able to selected among the various available engines, such as the Process Execution Service, the Internal Execution Engine or any other WS-workflow engine. These engines are free to enforce their own optimization strategies, as long as they respect to the semantic invariants dictated by the original Execution Plan.

Finally, the SMS receives the final ResultSet from the execution engine and pass its end point reference back to the requestor.

Description of Search-related VRE Resources

Through the Search Master, external services can receive a structured overview of the VRE resources available and usable during a search operation. An example of this summarization is shown bellow:

  <SearchConfig>
    <collections>
      <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
        <TYPE>DATA</TYPE>
        <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
        <ASSOCIATEDWITH>g45612f7-dth5-23fg-45df-45dfg5b1r34s</ASSOCIATEDWITH>
      </collection>
      <collection name="Example Collection Name 2" id="c3f685b0-fdb6-11db-a573-e4518f2111ab">
        <TYPE>DATA</TYPE>
        <ASSOCIATEDWITH>7bb87410-fdb7-11db-8476-f715cb72c9ff</ASSOCIATEDWITH>
        <INDEX>FEATURE</INDEX>
      </collection>
      <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
        <TYPE>METADATA</TYPE>
        <LANGUAGE>en</LANGUAGE>
        <SCHEMA>dc</SCHEMA>
        <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
        <INDEX>FTS</INDEX>
        <INDEX>XML</INDEX>
      </collection>
      <collection name="Example Collection Name 4" id="g45612f7-dth5-23fg-45df-45dfg5b1r34s">
        <TYPE>METADATA</TYPE>
        <LANGUAGE>en</LANGUAGE>
        <SCHEMA>tei</SCHEMA>
        <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
        <INDEX>FTS</INDEX>
        <INDEX>XML</INDEX>
      </collection>
      <collection name="Example Collection Name 5" id="7bb87410-fdb7-11db-8476-f715cb72c9ff">
        <TYPE>METADATA</TYPE>
        <LANGUAGE>en</LANGUAGE>
        <SCHEMA>dc</SCHEMA>
        <ASSOCIATEDWITH>c3f685b0-fdb6-11db-a573-e4518f2111ab</ASSOCIATEDWITH>
        <INDEX>FTS</INDEX>
        <INDEX>XML</INDEX>
      </collection>
    </collections>
  </SearchConfig>

Query Processing (Search Library)

The core query processing functionality is provided by the Search Library component and is orchestrated by the SMS. In the following paragraphs we analyze the major subcomponents of the Search Library component.

Query Parser

The Query Parser is responsible for transforming query expressions, coming either from the Search Portlet or directly provided by the end-user, into instances of the Query (and its subordinate) class. These instances are afterwards forwarded to the Query Planner and along with some environment information (extracted by the SMS from the IS) initiate the query processing process. To minimize external dependencies as well as the total execution time we chose not to use a pre-existing java compiler, such as JavaCC, but instead build our own.


Query Planner

Intro

The objective of query processing in the gCube distributed context is to transform a high-level query into an efficient execution strategy that takes into account the status and nature of the infrastructure, which lies beneath. The need for optimization is evident in many use cases, including the mis-utilization of resources and the beneficial use of existing structures. The task of query processing includes the mapping of the Query to an execution plan, consisting of WS invocations, and an initial domain specific optimisation of the produced execution plan. This task is the responsibility of the QueryPlanner, which will be thoroughly analysed in this paragraph.

Generally there are two main layers, which are involved in mapping the query into an optimized sequence of local operations, each acting on a specific node. These layers perform the functions of query materialization and query optimization.

Query Materialization

The main role of this layer is to transform the query operations into concrete search operators, which are provided by the Search Service, escorted by the corresponding gHNs that host these operators. The appropriate information concerning the existent operators and their hosting gHNs is passed to the QueryPlanner by the SearchMasterService, which in turn receives it from the gCube Information Service. This information is the set of available service profiles, described above, and the set of hosting nodes and their respective services. The generating execution plan is constructed in steps, by matching a query node to the respective search service. There is an M-N relation among sub queries and service invocations, m, n>= 1. The output of this layer is an initial query execution plan which can be forwarded to the Process Execution service and run as is (the BPEL part, see previous paragraph). However, this may lead to poor performance, since no optimization is performed upon the query, which can lead to a substantial execution cost increase.

Query Optimization

The goal of query optimization is to find an execution strategy for the query which is as close to optimal as possible. However, selecting the optimal execution strategy for a query is an NP-hard problem. For complex queries with many sources (collections) involving a large number of operations hosted in a complex infrastructure, this can incur a prohibitive optimization cost. Therefore, the actual objective of the optimization is to find a strategy close to optimal within a logical amount of time. The output of the optimization layer, which is also the final output of the QueryPlanner, is an optimized QueryExecutionPlan object. The optimization process followed by the planner makes extensive use of service instances response time statistics and operator cardinalities. Service instances are ranked based on their response times and query operations are ranked based on their selectivity factor. Query Planner enumerates the equivalent plans for a given query and selects the best plan, according to its execution cost. However, an exhaustive search in all possible plans leads to prohibitive time and resource consumption. Therefore, we employ a greedy heuristic of minimizing the cost in each step of the plan construction. In the first step, the initial query tree is re-written using a set of XSL Transformations, in order to produce a plan with minimum intermediate results size. In the second step, the query tree produced earlier is transformed to its corresponding execution plan, selecting the best service instances that can compute the sub queries of the query tree. If we come up to two candidate services that correspond to a given sub query, then we select only the one that minimizes the total cost up to this point. We note that only abstract services are selected in this step. The actual scheduling takes place in the third and final optimization step, by the ProcessOptimizationService component (POS). The POS is responsible for allocating tasks to gHNs, based on some criteria, such as gHN load.

To sum up, the QueryPlanner gets a query and return a near optimal query execution plan object that contains a sequence of concrete search operations. The final mapping of operations and gHNs will be done at runtime by the execution service.

Planning Artifacts

The most important architectural aspects of the query planner subcomponent are two:

  • Search Service Profiles: Metadata descriptors of available services, based on which, the planning procedure takes place
  • Execution Plan: Output of the planning phase

Search Service Profiles

Each WSRF service that participates in the search framework is accompanied with a profile. This profile contains vital information concerning the usability and applicability of the service. It is primarily used for describing the connection between the service itself and the query it can execute. For example, “serviceA” declares in its profile that it can answer to a sort query. During the production of the execution plan from a query tree, the query planner matches the various query nodes to the services which can answer them, thus constructing the execution plan step-by-step.

More analytically, each service, first, describes a set of invocation information, which include its service path name (without the host machine address), its port type and operation and optionally its resource Endpoint Reference (if the service is stateful). Besides the execution information, services declare a semantic descriptor of which queries can they compute. For that reason, we employ XML Schema Definitions (XSD), so that every service can define the Schema of the XML queries that are able to answer to. Moreover, each service defines a generic transformation of the matching XML query to produce its invocation message (actually the body of the invocation SOAP message). This is done using XSL Transformations of the XML query to the XML SOAP message body. Finally, in order to accommodate the combination of more that one service invocations computing to one sub query, we have introduced a generic replacement strategy, through which a pseudo-service can compute one sub query, by splitting it into many sub-queries, each of which can be directly computed by a single service instance. This is also done using XSL Transformations of the matching XML query to the combination of other XML sub queries.

Service Profiles are basically XML documents that can be found in the DIS component. However, the query planner deserializes them into java classes via a data-binding technology, JAXB. These classes are located in the SearchLibrary. Since the service profiles are only internally used by the Planner component, their design will not be analyzed further.

Matching Rules

As previously stated (#Service_Profiles), the maching between subqueries and services is performed via the SearchServiceProfiles. Under the assumption that all D4S services are up-n-running, then the matching rules are the following:

Operation Corresponding Service
project TransformByXSLTOperator
sort SortOperator
merge MergeOperator
join If join inner then JoinInnerOperator otherwise no match
fielded search If value starts with '*" then XMLIndexer, otherwise if it exists forward to FullTextIndex if not, the XMLIndexer. If the FTIndex is selected then a retrievemetadata operation is added (if not already) in order to get all metadata. If the language is set to ontology then the query is answered by the OntologyManagement service.
full text search FullTextIndex if it exists, otherwise XMLIndexer. In the former case, a retrievemetadata is also added (if it's not already there)
filter by xpath, xslt FilterXPath, TransformByXSLT respectively.
keep top KeepTopOperator
retrieve metadata MetadataManagerFactory
read build-in (does not correspond to any service)
external search Perform a search on external source. Currently, only google can be queried.
similarity search  ?
spatial search GeoIndexLookup
conditional search build-in (does not correspond to any service)



ExecutionPlan

The output of the Planner component is an instance of the ExecutionPlan class, which in principle is just a java graph of service invocations. Edges denote parent-child relationships and nodes specific service invocations. Note that these invocations are abstract in the sense that no specific endpoint reference is defined. To accommodate the WS scheduling that will be performed by the execution engine, the ExecutionPlan also includes a set of candidate concrete service endpoint references, so as to be selected later on by the execution engine.

The best way to describe an execution plan is through its Schema definition:

 <?xml version="1.0" encoding="UTF-8"?>
 <schema xmlns="http://www.w3.org/2001/XMLSchema" 
       xmlns:xsd="http://www.w3.org/2001/XMLSchema"
       targetNamespace="http://gcube.org/searchservice/qep"
       xmlns:tns="http://gcube.org/searchservice/qep"
       xmlns:jxb="http://java.sun.com/xml/ns/jaxb" jxb:version="2.0"
       xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
 
       <xsd:import namespace="http://schemas.xmlsoap.org/ws/2004/03/addressing"
            schemaLocation="../../external/WS-Addressing.xsd" />
 
       <xsd:annotation>
               <xsd:appinfo>
                       <jxb:schemaBindings>
                               <jxb:package name="org.gcube.searchservice.searchlibrary.qep"/>
                       </jxb:schemaBindings>
               </xsd:appinfo>
       </xsd:annotation>
 
       <xsd:complexType name="securityType">
               <xsd:complexContent>
                       <xsd:restriction base="xsd:anyType" />
               </xsd:complexContent>
       </xsd:complexType>
 
 
 
       <xsd:simpleType name="scopeType">
               <xsd:restriction base="xsd:string" />
       </xsd:simpleType>
 
       <xsd:simpleType name="wsdlType">
               <xsd:restriction base="xsd:string" />
       </xsd:simpleType>
 
       <xsd:simpleType name="substituteTypeType">
               <xsd:annotation>
                       <xsd:documentation>
                               The 'substituteTypeType' defines the role of the input
                               message's element to be substituted. There are 2
                               possible values: 'child' and 'children'. The 'child'
                               value expresses the fact that the element to be
                               substotuted holds the endpoint reference of one input
                               source of the service to be invoked. The 'children'
                               value expresses the fact that the element to be
                               substotuted holds the endpoint references of ALL input
                               sources of the service to be invoked; that is, an array
                               of endpoint references.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:restriction base="xsd:string">
                       <xsd:enumeration value="child" />
                       <xsd:enumeration value="children" />
               </xsd:restriction>
       </xsd:simpleType>
 
       <xsd:simpleType name="substituteNameType">
               <xsd:annotation>
                       <xsd:documentation>
                               The 'substituteNameType' defines the element of a
                               service's input message, that should be substituted.
                               Actually, a population of this element takes places,
                               with values not available at runtime. These values
                               concern the output of the children services of that
                               specific service. This output refers to the result set
                               endpoint reference of the resource of the results of a
                               child service. Note that a service is child to another
                               service if its former's output is the later's input.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:restriction base="xsd:string" />
       </xsd:simpleType>
 
       <xsd:simpleType name="messageType">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the type of the message used as input to a
                               service.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:restriction base="xsd:string" />
       </xsd:simpleType>
 
       <xsd:simpleType name="portTypeType">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the type of a portType; actually, an xsd string
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:restriction base="xsd:string" />
       </xsd:simpleType>
 
       <xsd:simpleType name="operationType">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the type of an operation; actually, an xsd
                               string
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:restriction base="xsd:string" />
       </xsd:simpleType>
 
       <xsd:simpleType name="msgType">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the type of a msgType; actually, an xsd string
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:restriction base="xsd:string" />
       </xsd:simpleType>
 
       <xsd:simpleType name="partType">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the type of a message part; actually, an xsd
                               string
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:restriction base="xsd:string" />
       </xsd:simpleType>
 
       <xsd:simpleType name="elmtType">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the type of a part element; actually, a
                               qualified name
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:restriction base="xsd:QName" />
       </xsd:simpleType>
 
 
 
       <xsd:complexType name="resourceType">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the type of the resource of a web service. Note
                               that a service's resource is an identifier of one of its
                               states.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:complexContent>
                       <xsd:restriction base="xsd:anyType" />
               </xsd:complexContent>
       </xsd:complexType>
 
       <xsd:complexType name="substituteType">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the type of a message's element to be
                               substituted/populated ate runtime. This elements should
                               be populated with the output of the children services of
                               another service. For further detail see the
                               'substituteNameType' and 'substituteTypeType'
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:sequence>
                       <xsd:element name="name" type="tns:substituteNameType" />
                       <xsd:element name="type" type="tns:substituteTypeType" />
               </xsd:sequence>
       </xsd:complexType>
 
       <xsd:complexType name="inputMessageType">
               <xsd:annotation>
                       <xsd:documentation>
                               The 'inputMessageType' defines a complex the type of the
                               message that serves as input to a service, plus the set
                               of elements to be populated at runtime with the output
                               of the children services.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:sequence>
                       <xsd:element name="message" minOccurs="0" maxOccurs="1"
                               type="tns:messageType" />
                       <xsd:element name="substitute" minOccurs="0"
                               maxOccurs="unbounded" type="tns:substituteType" />
               </xsd:sequence>
       </xsd:complexType>
 
       <xsd:complexType name="wsResourceType">
               <xsd:annotation>
                       <xsd:documentation>
                               The 'wsResourceType' defines a WSRF resource (web
                               service + resource). The complex type is consisted of
                               the web service's wsdl URI and the XML representation of
                               a resource.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:sequence>
                       <xsd:element name="wsdl" type="tns:wsdlType" />
                       <xsd:element name="resource" type="xsd:string" />
               </xsd:sequence>
       </xsd:complexType>
 
       <xsd:complexType name="executionEnvelopeType">
               <xsd:annotation>
                       <xsd:documentation>
                               This element describes some execution information
                               concerning the service invocation. This info include,
                               namespace, portType, operation, I/O msg, part, element.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:sequence>
                       <xsd:element name="namespace" type="xsd:string" />
                       <xsd:element name="portType" type="tns:portTypeType" />
                       <xsd:element name="operation" type="tns:operationType" />
                       <xsd:element name="inputMsg" type="tns:messageType" />
                       <xsd:element name="outputMsg" type="tns:messageType" />
                       <xsd:element name="inputPart" type="tns:partType" />
                       <xsd:element name="outputPart" type="tns:partType" />
                       <xsd:element name="inputElement" type="tns:elmtType" />
                       <xsd:element name="outputElement" type="tns:elmtType" />
                       <xsd:element name="endpointReference" type="wsa:EndpointReferenceType"/>
               </xsd:sequence>
       </xsd:complexType>
 
       <xsd:complexType name="qepNode">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the datatype of the query execution plan node.
                               It declares the ws resource, its input message and the
                               set of children nodes.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:sequence>
                       <xsd:element name="serviceClass" type="xsd:string" minOccurs="0" maxOccurs="1" default="Search"/>
                       <xsd:element name="wsResource" type="tns:wsResourceType" />
                       <xsd:element name="inputMessage"
                               type="tns:inputMessageType" />
                       <xsd:element name="executionEnvelope"
                               type="tns:executionEnvelopeType" />
                       <xsd:element name="managementInfo" type="tns:managementInfoType" minOccurs="0" maxOccurs="1"/>
                       <xsd:element name="sources" type="tns:qepNode" minOccurs="0"
                               maxOccurs="unbounded" />
               </xsd:sequence>
       </xsd:complexType>
 
       <xsd:complexType name="managementInfoType">
               <xsd:annotation>
                       <xsd:documentation>
                               The 'managementInfoType' type includes define management information
                               that might be useful to the execution engine. This information include
                               security parameters, (D4S) scope attributes etc. It can be applied either
                               globally (affecting all service invocations), or locally (affecting a single service)
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:sequence>
                       <xsd:element name="scope" type="tns:scopeType" minOccurs="0" maxOccurs="1"/>
                       <xsd:element name="security" type="tns:securityType" minOccurs="0" maxOccurs="1"/>
               </xsd:sequence>
       </xsd:complexType>
 
 
       <xsd:element name="queryExecutionPlan">
               <xsd:annotation>
                       <xsd:documentation>
                               This is the head of the query execution plan. Actually
                               is it a 'pointer' to the qep root node.
                       </xsd:documentation>
               </xsd:annotation>
               <xsd:complexType>
                       <xsd:sequence>
                               <xsd:element name="rootNode" type="tns:qepNode" />
                               <xsd:element name="managementInfo" type="tns:managementInfoType" minOccurs="0" maxOccurs="1"/>
                       </xsd:sequence>
               </xsd:complexType>
       </xsd:element>
 
 </schema>

As you can see, an execution plan is a graph of qepNode instances, along with a global managementInfo envelope. This envelope refers to some managerial factors, such as scope handling and security. Currently, only scope handling is employed. Each qepNode instance has an arbitrary number of qepNode children instances, its local managementInfo envelope (valid for only this instance, not its children instances), the message payload and its execution context. The message payload consists of the raw message as well as the part of the message that will be dynamically populated by the invocation of other qepNode instance executions. In other words, each qepNode consumes the results of its children qepNode instances and this consumption in described in its input variables which store the produced output. Finally, the execution context refers to the service invocation details, such as the service EPR.

Search Operators Core Library

The classes that belong to the Search Operators set are the core classes of the data retrieval and processing procedure. They implement the necessary processing algorithms and are wrapped by the respective Search Services, in order to expose their functionality to the gCube infrastructure. However, the SearchOperator classes implement only a subset of the whole gCube functionality; there are some services that do not rely on these classes and incorporate the full functionality, without any dependencies to any library (apart from the ResultSet and ws-core, of course). These services are the IndexLookupService, GeoIndexLookupService, ForwardIndex, FeatureIndex and Fuse; they will be analyzed in separate, following paragraphs.

Execution (Engine) Integration

The term Execution (Engine) Integration refers to a logical group of components which materialize the bridge between query planning and plan execution. As previously written, the QueryPlanner produces an ExecutionPlan which should be forwarded to the execution engine(s). The procedure of feeding the engine(s) with the ExecutionPlan is performed by the components described in this paragraph.

Search-Execution integration is by far not a trivial task because of the obvious fact that components from each side were developed by different groups, resulting in heterogeneity regarding their interfaces. The purpose of decoupling components is exactly the ability to employ them from various institutions. All in all, the problem comes up to the fact that we have an ExecutionPlan, a set of available execution engines, which potentially have completely different interfaces, and we must find a way to feed the engine with a corresponding execution plan, expressed in its native language, but obviously based on the original ExecutionPlan. The solution to this integration problem is the introduction of the eXecution Engine Api (XENA), which is described in the following paragraphs. Before that, we present the available execution engines, which are or will be employed in D4Science. Currently, there are 3 execution engines available in gCube, ProcessExecutionService, the internal QEPExecution engine and finally ActiveBPEL, which is an open-source, BPEL engine, very popular in the web service world. For further details regarding the execution engines, please refer to #Execution_Engines.

eXecution ENgine Api (XENA)

As the name reveals XENA provides the abstractions needed by any engine to execute plans generated by the Search Framework, namely instances of the ExecutionPlan class. XENA can be thought of as a middleware between the Search Framework and the execution engine, with the responsibility of ‘translating’ the ExecutionPlan instances to the engine’s internal plan representation.

However, the integration issue cannot be treated so simplistically. First of all, one cannot assume that every engine should respect to the XENA paradigm and therefore offer an inherent support/integration to the Search Framework. Another problem is that the XENA design itself must be very flexible and expressive in order to encompass all of the significant syntactic and semantic attributes of a process execution. Therefore, it is imperative to clarify two major factors: the real-world architectural solution to the search-execution integration and the data model that XENA adopts.

The idea behind XENA is the following: Between The XENA API and the execution engine, there is a proxy component called Connector, which is responsible for translating the XENA API model to the engine’s internal model. The idea is not new. It has been applied in many middleware solutions and mostly known from the JDBC API. The ala-JDBC paradigm dictates the introduction of connectors that bridge JDBC and the underlying RDBMS. For every different RDBMS product there is a different software connector. So, in our context, we employ different XENA connectors for different workflow engines. Each connector, apart from the XENA artifacts, is fed with a set of execution engine endpoint references, in order for the connector to know which engine instance(s) it may refer to. Each XENA connector is first registered in a formal way to a special class called ExecutionEngineConnectorFactoryBuilder, publishing itself and the set of supported features. These features are key-value(s) pairs and describe the abilities of the execution engine. They may include persistence, automatic recovery, performance metrics, quality of service, etc. The registration procedure accommodates the dynamic binding to a specific connector (and thus execution engine) made by the SearchMasterService, based on some user preferences or predefined system parameteres. So, for example, if a user desires maximum availability in his/her processes, then the SearchMasterService can select an execution engine / respective connector that supports persistence and failsafe capabilities. The dynamic binding of XENA API to a specific connector is accomplished by employing some of the Java Classloader capabilities.

Regarding the data model that is adopted by XENA, it has borrowed the design philosophy of another middleware solution in the area of web service registries, JAXR, a Java Specification (JSR) for that field. The main abstract classes are:

  • ExecutionPlan: analyzed in #ExecutionPlan
  • ExecutionVariable: Variables that participate either as data transfer containers or as control flow variables.
  • ExecutionResult: The current result of the plan execution. We don’t use the term final, since the ExecutionResult can be retrieved at anytime during the process execution. See the description regarding ExecutioConnector. Through this class, one can get the current ExecutionTrace.
  • ExecutionTrace: Current trace of execution. It contains the image of the already done/committed actions defined in the execution plan and the ExecutionVariables.
  • ExecutionConnector: Provides the actual abstraction of the execution engine. It declares some ‘execution’ methods. The most primitive task is a plain, blocking execute method. However, users may want a more fine-grained control over their process executions. For that reason we have defined three Connector levels:
    • Basic Level: it defines a single, blocking execute method, which receives an ExecutionPlan and returns back an ExecutionResult.
    • Advanced Procedural Level: Basic Level + process management methods, such as executeAsync, pause, resume, cancel, getStatus, waitCompletion.
    • Event-Driven Level: Apart from the purely procedural level, there are many workflow engines which adopt a different paradigm, the event-driven one. According to this, any action that takes place during a process execution, e.g. service invocation finished, or variable initialized, produces a message which can be handled by a corresponding callback method. So the “service invocation finished” event causes the invocation of its associated callback method, within which the developer can perform any management action, housekeeping, logging, etc. The Event-Driven Level offers all the mechanics for registering the set of event which the developer wishes to handle and their associated callback methods. Note that XENA does not offer a callback system. This is the responsibility of the underlying engine. Consequently, only engines that adopt the event model should implement the Event-Driven ExecutionConnector Level.

Since there are three available execution engines (PES, QEPExecution, ActiveBPEL), we the initial desing includes three corresponding connectors:

  • A connector for the workflow (process) language employed by PES is a shortened version of BPEL.
  • An ActiveBPEL connector which fully conforms to the BPEL OASIS standard.
  • QEPExecution, internally implemented within the Search Framework and it can directly execute ExecutionPlans..

Query Preprocessors

The query preprocessor architecture allows for dynamic addition/deletion of preprocessors, (re-)positioning in the preprocessor chain and (re-)setting of necessity. If a preprocessor is marked as critical then if it fails the whole query processing chain fails as well.

PersonalisationPreprocessor

Alters queries based on the user preferences, retrieved from the Personalisation Service. These preferences refer to adding new query operators (such as sorting or projection of certain fields), adding/removing search sources and refining the execution context.

CSSPreprocessor

Injects specific data sources to abstract fulltextsearch queries (a.k.a. with no specific sources defined). The business logic lies in the DIR service. This preprocessor contacts DIR, sending the search terms and retrieving a set of collections with higher ranking in respect to the frequencies of those terms. The retrieved collections are then injected to the query.

StopwordsRemovePreprocessor

Removes stopwords from fulltextsearch queries. It employs a cache with stopword lists per language, which are retrieved from IS. This cache is renewed in regular intervals.

SourceResolver

We provide various source definition methods. We discern the following cases:

  • <datasource_operation> [in ' '] on ' <data_collection_id> ' [as ' ']
    • Perform search in all metadata collections of the given data collection, plus the data collection.
  • <datasource_operation> in '<language>' on ' <data_collection_id> ' as ' <schema> '
    • Perform search in the metadata collection which belongs to the given data collection and is defined in the given schema and language.
  • <datasource_operation> [in ' '] on ' <data_collection_id> ' as '<schema>'
    • Perform search in the metadata collection which belongs to the given data collection and is defined in the given schema.
  • <datasource_operation> in ' <language> ' on ' <data_collection_id> ' [as ' ']
    • Perform search in the metadata collection which belongs to the given data collection and is defined in the given language.
  • <datasource_operation> [in ' '] on ' <metadata_collection_id> ' [as ' ']
    • Perform search in given the metadata collection.
  • <datasource_operation> in ' <language> ' on ' <metadata_collection_id> ' as ' <schema> '
    • Perform search in the given metadata collection which is defined in the specified schema and language (for validation purposes).

The expressions in [] are optional within queries. The source resolver is responsible for performing these changes in queries. For that purpose, it employs an embedded derby database storing the association of data, metadata and corresponding data source services (currently FTIndex and XMLIndexer). If all sources are answered by XMLIndexer resources then any retrievemetadata operations are discarded. Otherwise, a new retrievemetadata operation is added, if it's not already defined.

Another responsibility of this preprocessor is to set the merging behaviour of the defined sources. If FTIndex resources are the only source providers, then IndexFuse merge is defined; if only content collections are defined then SimpleFuse merge is used; Otherwise, simple merge is chosen.

Search Operators

Introduction

The Search Operator family of services are the building blocks of any search operation. These along with external to the Search services handle the production, filtering and refinement of available data according to the user queries. The various intermediate steps towards producing the final search output are handled by Search Operator services. In this section we will only describe the Search Service internal Services listed below, although the Search Operator Framework reaches out to "integrate" on a high level other services too that can be utilized within a Search operation context.

The following operators are implemented as stateless services. They receive their input and produce their output in the context of a single invocation without holding any intermediate state. In case any data transferring is necessary either as input to a service or as output from the processing, the ResultSet Framework is employed.

The search operators cover the basic functionality that could be encountered in a typical search operation. A search can be decomposed in undividable units consisting of the above operators and their interaction can construct a workflow producing the net result delivered to the requester. The external source search and the service invocation services provide some extendibility for future operators by offering a method for invoking an \u201cunknown\u201d to the Search framework service, importing its results to the search operator workflow. The distinguished search operators at present time are listed below.

Example Code

Search Operators Usage Examples

Operators

BooleanOperator

Description

The Boolean Operator is used in conditional execution and more specifically, in evaluating the condition. So, it actually offers the ability of selecting alternative execution plans. For example, one can follow a plan (let\u2019s say a projection on a given field of a set of data), if a given precondition is valid; otherwise, she may follow the alternative plan (e.g. a projection on another field of the same set of data and then sort on the field). The precondition validation is the responsibility of this Service.

The condition is a Boolean expression. Basically, it involves comparisons using the operations: equal, not_equal, greater_than, lower_than, greater_equal, lower_equal. The comparing parts are either literals (date, string, integer, double literals are supported) or aggregate functions on the results of a search service execution. These aggregate functions include max, min, average, size, sum and they can be applied to a given field of the result set of a search service execution, by referring to that field employing xPath expressions.

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

FilterByXPathOperator

Description

The role of the FilterByXPath Operator is to perform search through an expression to be evaluated against an XML structure. Such an expression could be an xPath query. The XML structure against which the expression is to be evaluated is a ResultSet, previously constructed by an other operator or complete search execution. The result of the operation is a new ResultSet and the end point reference to this is returned to the caller.

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

JoinInnerOperator

Description

The role of the JoinResultSetService is to perform a join operation on a specific field using a set of ResultSets whose end point references are provided. This operation produces a new ResultSet, leaving the input untouched. The newly created ResultSet is wrapped around a WS-Resource and its endpoint reference is returned to the caller. An in memory hash \u2013 join algorithm has been implemented to perform the Joining functionality.

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

KeepTopOperator

Description

The role of the KeepTop Operator is to perform a simple filtering operation on its input ResultSet and to produce as output a new ResultSet that holds a defined number of leading records.

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

MergeOperator

Description

The role of the Merge Operator is to perform a merge operation using a set of ResultSets whose end point references are provided. This operation produces a new ResultSet leaving the input untouched. The newly created ResultSet is wrapped around a WS-Resource and its endpoint reference is returned to the caller.

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

GoogleOperator

Description

The role of the GoogleOperator is to redirect a query to the Google search engine through its Web Service interface and wrapping the output produced by the external service in a ResultSet, whose endpoint reference returns to its caller. The above mentioned functionality is supported by elements residing in the SearchLibrary.

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

SortOperator

Description

The role of the Sort Operator is to sort the provided ResultSet using as key a specific field. This operation produces a new ResultSet leaving the input untouched. The newly created ResultSet is wrapped around a WS-Resource and its end point reference is returned to the caller. The algorithm used is merge sort. The comparison rules differ depending on the type of the elements to be sorted. The key of the sort operator can be expressed in one of the ways defined in the following method org.diligentproject.searchservice.searchlibrary.resultset.elements.ResultElementGeneric#extractValue

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

TransformByXSLTOperator

Description

The role of the TransformByXSLT Operator is to transform a ResultSet it receives as input from one schema to another through a transformation technology such as XSL / XSLT. These transformations are directly supplied as input to the service. The output of the transformation, which could be a projection of the initial ResultSet, is a new ResultSet wrapped in a WS-Resource whose endpoint reference is returned to the caller.

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

ScriptFilterOperator

Description

This operator performs a generic filter on RSRecord basis, using a filtering expression on some scripting language. For every language, there is the corresponding connector which is responsible for performing the filtering and returning the filtered RSRecord. Scripting languages are registered in the JNDI file of this service. Currently, only beanshell (see here) is supported, but personally I don't think that anyone could ask for more. However, the architecture is fully extendible and supports seamless addition of any scripting language one may ask for.

The current JNDI configuration is the following:

  <environment name="scriptLangsNumber" value="1" type="java.lang.Integer" />
  <environment name="scriptLang.1" value="org.gcube.searchservice.scriptfilterservice.beanshell.BeanShell" type="java.lang.String" />

Dependencies

  • jdk 1.5
  • gCore
  • bsh.jar (beanshell jar file)

IndexFuseOperator

Description

This operator receives as input Result Sets that contain information for the content or metadata of objects, that belong to one or more collections. These Result Sets are produced through the Index Subsystem. Each Result set contains scores, that express the relevance of metadata or content of objects, to the initial search query that triggered the use of the operator by the Search Subsystem. The operator combines the scores of the metadata and content for a given ObjectID(OID) and produces a total score for each OID. There is no limit in the number of Results sets used for metadata(e.g. 2 metadata ResultSets for the basic metadata collection and an annotations' collection). The output of the operator is a Result Set that contains one total score for each OID. The elements of the produced Result Set are sorted based on the total score. The whole operation is executed in an asynchronous manner.

Dependencies

  • jdk 1.5
  • gCore
  • SearchLibrary

DuplicateEliminatorOperator

Description

This operator is responsible for removing all duplicate RSRecords of a ResultSet. Duplicates are considered the RSRecords that have the same DocID attribute value.

Dependencies

  • jdk 1.5
  • gCore

Execution Engines

As previously writter, there are 3 execution engines available in D4Science:

  • Process Execution Engine: This is the official project's engine. Further details can be found here.
  • ActiveBPEL: This is a widely-spread, open-source, BPEL-enabled execution engine. It supports various high-level features, such as fault-tolerance, persistency, advanced monitoring, etc. Further details can be found here.
  • QEPExecution: The QEPExecution is the internal, simple, generic, ws-compliant, execution engine of the Search Infrastructure. It is basically a web service execution engine, which orchestrates the execution of a set of services. It is designed to work with any web service and therefore communicates via the exchange of SOAP messages. Its input is the raw ExecutionPlan, produced by the QueryPlanner and its output is a string of the final results Endpoint Reference (exactly like the output of the PES). The QepExecution is able to work with any WSRF service (both stateful and stateless). However, since the engine is designed to be used internally as a final solution, it lacks various features of a full fledged execution engine, such as advanced error handling. In the later versions, QEPExecution is also selectively parallelized; administrators may also set it in sequential mode. Sibling execution nodes in the execution graph are executed in parallel.


Performance

See Search Performance

System Issues

Distributed Logging

In order to have an automatic processing procedure over the logs we need to define a formal structure over them. Below you'll find the proposed formal syntax that log entries should respect to if they intend to be machine-readable. Note that not all log entries need to be machine-readable and therefore respect to the syntax.

 logEntryFormat         ::=	openToken sessionID closeToken
				openToken componentName colon componentClass:scope closeToken
				openToken state closeToken
				[openToken stateParamKey closeToken openToken stateParamValue closeToken]*
 openToken 		::= '<#'
 closeToken		::= '#>'
 sessionID		::= <string>
 componentName		::= serviceName colon resourceEPR | simpleComponentName
 componentClass         ::= <string>
 scope                  ::= <string>
 serviceName		::= <string>
 colon			::= ':'
 resourceEPR		::= <string>
 simpleComponentName	::= <string>
 state			::= <string>
 stateParamKey		::= <string>
 stateParamValue	::= <string>

Now, some comments regarding these entities:

A sessionID is an identifier for a sequence of actions that are logically grouped together (a.k.a. session).

A state may be a resource enumeration or an operation application. Examples of state:

  • query parsing
  • create new resource
  • apply transformation

Reserved stateParamKeys and corresponding values:

 username			::= <string> (valid only if actor == 'user')
 timing                         ::= <long> (millis from 1/1/1970)
 operation			::= <string>
 actorType			::= {'user' | 'system'}

Note that, most probably, developers do not need to write the sessionID, componentName, componentClass and scope data themselves, since this can be automatically done by the wrapping GCUBELog class.

The next step is to offer the log entries to the log processing system. This is transparently taken care of adding a new Log Appender (for details regarding log appenders in log4j, you can refer to the official site of the project), to the existing ones. This appender will send all log entries to a centralized message sink, which will appropriately store them in a permanent storage layer, thus providing the means to extract useful statistics over the log entries. We are currently developing an administration portlet which will generate and visualize these statistics.

Search Use Cases: Problematic Conditions

A user logs into the system and clicks the search portlet. We enumerate the following cases:

  1. Search portlet does not even load:
    • There may not any search service RI available
    • Security problems
    • IS may be offline
    • Generally speaking, some critical component has completely failed (hard to identify the actual cause)
  2. Search portlet is loaded but it shows no collections:
    • These collections are not registered in IS, or they are registered within the last few minutes (until search master refreshes its own cache from IS).
    • Search Master has an internal problem and returns wrong collections [1]
  3. Search portlet is loaded, it shows the collections but when you click on search button then you wait forever and you never see the results page:
    • IS returns erroneous Search Master RIs: the portlet tries to access a search master RI which is simply not there, so it waits until it gets back a connection timeout. This situation is not frequent.
    • IS returns the correct Search Master RIs but some of them are marked as 'unreachable': Search Master does not handle this situation so it returns all RIs to the portlet and potentially the portlet selects the wrong RI to submit the query to.
    • Search Master has not refreshed the RIs returned from IS and emits old/obsolete RIs. In regular intervals the search master refreshes its own cache but there is a chance that some of its data has changed, therefore invalidating the cache.
  4. Search portlet is loaded, it shows the collections, when you click search it submits the query to a proper search master RI but you get a runtime exception:
    • Internal Search error.[2]
    • Problems with the WSDLs of services used by search. There is a slim chance of facing this situation, since the WSDLs are validated and rarely change.
    • Search tries to use a non-existent service RI. Problems regarding freshness of data either in IS or in the search internal cache.
    • Some service emits an exception.
  5. Search portlet is loaded, it shows the collections, when you click search it submits the query to a proper search master RI but you get no results:
    • Perhaps your query does not produce any results
    • One service employed by search may not be working properly and blocks the results
    • Internal Search Problem[1]
  6. Search portlet is loaded, it shows the collections, when you click search it submits the query to a proper search master RI and you DO get results: Everything worked just fine.

DIR Feeding

This section will move to a more appropriate place

Assumptions

  • There is a single DIR resource
  • This DIR resource must be fed with metadatacollections that are indexed by FTS resources in the same VRE

Feeding Process

  1. First an empty resource
    1. If there is not any, then create one
    2. If there are multiple resources, then either delete all but one, or choose one
  2. Retrieve all data collections
  3. Retrieve all metadata collections along with their referenced data collections
  4. Retrieve all FTS resources along with their referenced metadata collections
  5. Do a pivot and select the metadata collections that have a corresponding existing data collection and are indexed by an FTS resource
  6. From this set, choose only one metadata collection from the collections that refer to the same data collection
  7. From this set, choose only metadata collections that belong to the same schema
    1. If there are multiple selections, then choose the biggest set

  1. 1.0 1.1 Such a problem was encountered in the past and was related to a bug of Search Master which was running on scope A but harvested information from IS found in scope B.
  2. If the dedicated memory to the JVM is less that 100MB and dependending on the size of the VRE that Search runs into, OutOfMemoryExceptions may occur