Search Management
THIS SECTION OF GCUBE DOCUMENTATION IS CURRENTLY UNDER UPDATE.
Contents
Search Management
Example Code
Search Management Services Usage Examples
Search Master
Introduction
The SearchMasterService is the main entry point to the functionality of the search engine. It contains the elements that organize the execution of the search operators for the various tasks the search engine is responsible for.
The SearchMasterService is responsible for the first stage of query processing. This stage produces a query execution plan, which in the DILIGENT implementation is a directed acyclic graph of SearchOperator invocations. The service also gathers the information that the various search services are expected to need and provides it as context to the processed query. This reduces the delay each service would otherwise incur by gathering that information itself, and so improves responsiveness.
The information gathered is produced by various components or services of the DILIGENT infrastructure, including the DILIGENT Information Service (DIS), Content and Metadata Management, and the Indexing service. Gathering all of this information is very time consuming, so the SearchMasterService keeps a cache of previously discovered information and state.
The SearchMaster validates each received query using Search Library elements, checking the user-supplied query against the elements of the specific Digital Library instance. This ensures that the content collections are available, the metadata elements (e.g. fields) are present, the operators (i.e. services) are accessible, and so on.
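The validation step can be pictured with a small sketch. It is only an illustration under stated assumptions: `SourceRef`, `validate_sources` and the shape of the `available` mapping are hypothetical names invented here, not part of the Search Library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    """Hypothetical (language, collection id, schema) triplet named by a query."""
    language: str
    collection_id: str
    schema: str


def validate_sources(sources, available):
    """Return a list of problems; an empty list means the query sources are valid.

    `available` maps a collection id to its (language, schema) pair. It is a
    hypothetical snapshot of what the Digital Library instance exposes.
    """
    problems = []
    for ref in sources:
        known = available.get(ref.collection_id)
        if known is None:
            problems.append(f"unknown collection {ref.collection_id}")
            continue
        language, schema = known
        if ref.language != language:
            problems.append(
                f"collection {ref.collection_id} has no {ref.language} content")
        if ref.schema != schema:
            problems.append(
                f"collection {ref.collection_id} does not use schema {ref.schema}")
    return problems


# Hypothetical snapshot with a single English Dublin Core collection.
available = {"0a952bf0-fa44-11db-aab8-f715cb72c9ff": ("ENGLISH", "dc")}
good = SourceRef("ENGLISH", "0a952bf0-fa44-11db-aab8-f715cb72c9ff", "dc")
bad = SourceRef("ENGLISH", "no-such-collection", "dc")
```

Collecting every problem instead of failing on the first one lets a client report all invalid sources of a query at once.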
DL Description
Through the Search Master, external services can receive a structured overview of the Digital Library resources that are available and usable during a search operation. An example of this summary is shown below:
<SearchConfig>
  <collections>
    <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
      <TYPE>DATA</TYPE>
      <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
      <ASSOCIATEDWITH>g45612f7-dth5-23fg-45df-45dfg5b1r34s</ASSOCIATEDWITH>
    </collection>
    <collection name="Example Collection Name 2" id="c3f685b0-fdb6-11db-a573-e4518f2111ab">
      <TYPE>DATA</TYPE>
      <ASSOCIATEDWITH>7bb87410-fdb7-11db-8476-f715cb72c9ff</ASSOCIATEDWITH>
      <INDEX>FEATURE</INDEX>
    </collection>
    <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
      <TYPE>METADATA</TYPE>
      <LANGUAGE>en</LANGUAGE>
      <SCHEMA>dc</SCHEMA>
      <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
      <INDEX>FTS</INDEX>
      <INDEX>XML</INDEX>
    </collection>
    <collection name="Example Collection Name 4" id="g45612f7-dth5-23fg-45df-45dfg5b1r34s">
      <TYPE>METADATA</TYPE>
      <LANGUAGE>en</LANGUAGE>
      <SCHEMA>tei</SCHEMA>
      <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
      <INDEX>FTS</INDEX>
      <INDEX>XML</INDEX>
    </collection>
    <collection name="Example Collection Name 5" id="7bb87410-fdb7-11db-8476-f715cb72c9ff">
      <TYPE>METADATA</TYPE>
      <LANGUAGE>en</LANGUAGE>
      <SCHEMA>dc</SCHEMA>
      <ASSOCIATEDWITH>c3f685b0-fdb6-11db-a573-e4518f2111ab</ASSOCIATEDWITH>
      <INDEX>FTS</INDEX>
      <INDEX>XML</INDEX>
    </collection>
  </collections>
</SearchConfig>
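A client that receives such a description might turn it into a lookup table before building queries. The following is a minimal sketch using only the Python standard library. The function name `parse_search_config` and the dictionary layout are assumptions made for illustration, and the sample is a two-collection excerpt of the description above.

```python
import xml.etree.ElementTree as ET

# Two-collection excerpt of the SearchConfig example above.
SAMPLE = """<SearchConfig>
  <collections>
    <collection name="Example Collection Name 2" id="c3f685b0-fdb6-11db-a573-e4518f2111ab">
      <TYPE>DATA</TYPE>
      <ASSOCIATEDWITH>7bb87410-fdb7-11db-8476-f715cb72c9ff</ASSOCIATEDWITH>
      <INDEX>FEATURE</INDEX>
    </collection>
    <collection name="Example Collection Name 5" id="7bb87410-fdb7-11db-8476-f715cb72c9ff">
      <TYPE>METADATA</TYPE>
      <LANGUAGE>en</LANGUAGE>
      <SCHEMA>dc</SCHEMA>
      <ASSOCIATEDWITH>c3f685b0-fdb6-11db-a573-e4518f2111ab</ASSOCIATEDWITH>
      <INDEX>FTS</INDEX>
      <INDEX>XML</INDEX>
    </collection>
  </collections>
</SearchConfig>"""


def parse_search_config(xml_text):
    """Map each collection id to its properties (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    collections = {}
    for node in root.iter("collection"):
        collections[node.get("id")] = {
            "name": node.get("name"),
            "type": node.findtext("TYPE"),
            # LANGUAGE and SCHEMA are absent on DATA collections, giving None.
            "language": node.findtext("LANGUAGE"),
            "schema": node.findtext("SCHEMA"),
            "associated_with": [e.text for e in node.findall("ASSOCIATEDWITH")],
            "indices": [e.text for e in node.findall("INDEX")],
        }
    return collections


config = parse_search_config(SAMPLE)
metadata_ids = [cid for cid, c in config.items() if c["type"] == "METADATA"]
```

With this table, a client can follow the ASSOCIATEDWITH links from a data collection to its metadata collections and check which indices (FTS, XML, FEATURE) each one offers.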
Query Language
Available Operations
Operation | Semantics |
---|---|
project | Perform projection over a result set on a set of elements. By default, all header data (DocID, RankID, CollID) are kept in the projected result set; that is, they need not be specified. If the projected term set is empty, then only the header data are kept. |
sort | Perform a sort operation on a result set based on an element of that result set, either ascending (ASC) or descending (DESC). |
merge | Concatenate two or more result sets. The records are non-deterministically concatenated in the final result set. |
join | Join two result sets on a join key, using one of the available modes (currently only 'inner' is implemented). The semantics of the inner join is similar to the respective operation found in the relational algebra, with the difference that in our case, only the join key from the left RS is kept and joined with the right payload. |
fielded search | Keep only those records of a source that conform to a selection expression. This expression is a relation between a key and a value. The key is an element of the result set and the relation is one of: ==, !=, >, <, <=, >=, contains. The 'contains' relation tests whether a string is a substring of another string. With this comparison function, one can use the wildcard '*', which stands for any sequence of characters. On a text field, 'contains' applies to each of the words the field consists of. For example, if we search on a title field whose value is 'the rain in spain stays mainly in the plane', then the matching criterion '*ain*' matches any of 'rain', 'spain' and 'mainly'. If the predicate is '==', then we search for an exact match; in the previous example, title == 'stays' won't succeed. Predicates can be combined with ORs and ANDs (currently under development). The source of this operation can be either a result set generated by another search operation or a content source. In the latter case, you should use a source string identifier. |
full text search | Perform a full text search on a given source based on a set of terms. The full text search source must be a string identifier of the content source. Each full text search term may contain a single word or multiple words. In both cases, all terms are combined with a logical AND. In the second case, if a term is e.g. 'hello nasty', we search for the words 'hello' and 'nasty', with the latter following the former, as stated in the term; text that does not contain this exact succession of the two words won't match the search criteria. Another feature of fulltextsearch is lemmatization: the terms are processed and a set of related words is generated and also used in the full text search. |
filter by xpath, xslt, math | Perform a low-level XPath or XSLT operation on a result set. The math type refers to a mathematical language and is intended for advanced users who are acquainted with that language. For more details about the semantics and syntax of that language, please see the documentation for the ResultSetScanner service, which implements this language. |
keep top | Keep only a given number of records of a result set. |
retrieve metadata | Retrieve ALL metadata associated to the search results of a previous search operation. |
read | Read a result set endpoint reference, in order to process it. This operation can be used for further processing the results of a previous search operation. |
external search (deprecated) | Perform a search on an external (diligent-disabled) source. Currently, Google, RDBMS and OSIRIS infrastructures can be queried. Depending on the source, the query string can vary. For Google, the query string must conform to Google's query language. For an RDBMS, the query must have the following form in order to be executed successfully: <root> <driverName>your jdbc driver</driverName> <connectionString>your jdbc connection string</connectionString> <query>your sql query</query> </root> Finally, in the OSIRIS case, the query string must have the following format: <root> <collection>your osiris collection</collection> <imageURL>your image URL to be searched for similar images</imageURL> <numberOfResults>the number of results</numberOfResults> </root> |
similarity search | Perform a similarity search on a source for a multimedia content (currently, only images). The image URL is defined, along with the source string identifier and pairs of feature and weight. |
spatial search | Perform a classic spatial search against a user-defined shape (a polygon, to be exact) and a spatial relation (contains, crosses, disjoint, equals, inside, intersect, overlaps, touches). |
conditional search | Classic If-Then-Else construct. The hypothesis clause involves the (potentially aggregated) value of one or more fields which are part of the result of previous search operation(s). The central predicate involves a comparison of two clauses, which are combinations (with the basic math functions +, -, *, /) of these values. |
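The word-level behaviour of the 'contains' relation in the fielded search row can be made concrete with a small sketch. It is not the actual operator implementation: the function names are hypothetical, and splitting the field into words with `\w+` and comparing case-insensitively are assumptions made here so that a criterion such as 'dorothea' can match a capitalised name.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a criterion where '*' means 'any characters' into a regex."""
    literal_parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(literal_parts), re.IGNORECASE)


def matching_words(field_text, pattern):
    """Return the words of the field that match the criterion as a whole word."""
    regex = wildcard_to_regex(pattern)
    return [w for w in re.findall(r"\w+", field_text) if regex.fullmatch(w)]


def contains(field_text, pattern):
    """'contains' succeeds if at least one word of the field matches."""
    return bool(matching_words(field_text, pattern))


def equals(field_text, value):
    """'==' is an exact match against the whole field value."""
    return field_text == value


title = "the rain in spain stays mainly in the plane"
```

Because each word must match the criterion completely, the absence of '*' at either end acts as a word boundary: 'dorothea' does not match 'abcdorothea', while '*dorothea' does.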
Syntax
<function> ::= <project_fun> | <sort_fun> | <filter_fun> | <merge_fun> | <join_fun> | <keeptop_fun> | <fulltexts_fun> | <fieldedsearch_fun> | <extsearch_fun> | <read_fun> | <similsearch_fun> | <spatialsearch_fun> | <retrieve_metadata_fun>

<read_fun> ::= <read_fun_name> <epr>
<read_fun_name> ::= 'read'
<epr> ::= string

<project_fun> ::= <project_fun_name> <by> <project_key> <project_source>
<project_fun_name> ::= 'project'
<project_key> ::= string
<project_source> ::= <non_leaf_source>

<sort_fun> ::= <sort_fun_name> <sort_order> <by> <sort_key> <sort_source>
<sort_fun_name> ::= 'sort'
<sort_key> ::= string
<sort_order> ::= 'ASC' | 'DESC'
<sort_source> ::= <non_leaf_source>

<filter_fun> ::= <filter_fun_name> <filter_type> <by> <filter_statement> <filter_source>
<filter_fun_name> ::= 'filter'
<filter_type> ::= string
<filter_statement> ::= string
<filter_source> ::= <non_leaf_source> | <leaf_source>

<merge_fun> ::= <merge_fun_name> <on> <merge_sources>
<merge_fun_name> ::= 'merge'
<merge_sources> ::= <merge_source> <and> <merge_source> <merge_sources2>
<merge_sources2> ::= <and> <merge_source> <merge_sources2> | φ
<merge_source> ::= <left_parenthesis> <function> <right_parenthesis>

<join_fun> ::= <join_fun_name> <join_type> <by> <join_key> <on> <join_source> <and> <join_source>
<join_fun_name> ::= 'join'
<join_key> ::= string
<join_type> ::= 'inner' | 'fullOuter' | 'leftOuter' | 'rightOuter'
<join_source> ::= <left_parenthesis> <function> <right_parenthesis>

<keeptop_fun> ::= <keeptop_fun_name> <keeptop_number> <keeptop_source>
<keeptop_fun_name> ::= 'keeptop'
<keeptop_number> ::= integer
<keeptop_source> ::= <non_leaf_source>

<fulltexts_fun> ::= <fulltexts_fun_name> <by> <fulltexts_term> <fulltexts_terms> <in> <language> <on> <fulltexts_sources>
<fulltexts_fun_name> ::= 'fulltextsearch'
<fulltexts_terms> ::= <comma> <fulltexts_term> <fulltexts_terms> | φ
<fulltexts_sources> ::= <fulltexts_source> <fulltexts_sources_2>
<fulltexts_sources_2> ::= <comma> <fulltexts_source> <fulltexts_sources_2> | φ
<fulltexts_source> ::= string

<fieldedsearch_fun> ::= <fieldedsearch_fun_name> <by> <query> <fieldedsearch_source>
<fieldedsearch_fun_name> ::= 'fieldedsearch'
<query> ::= string
<fieldedsearch_source> ::= <non_leaf_source> | <leaf_source>

<extsearch_fun> ::= <extsearch_fun_name> <by> <extsearch_query> <on> <extsearch_source>
<extsearch_fun_name> ::= 'externalsearch'
<extsearch_query> ::= string
<extsearch_source> ::= string

<similsearch_fun> ::= <similsearch_fun_name> <as> <URL> <by> <pair> <pairs> <similarity_source>
<similsearch_fun_name> ::= 'similaritysearch'
<URL> ::= string
<pair> ::= <feature> <equal> <weight>
<pairs> ::= <and> <pair> <pairs> | φ
<similarity_source> ::= <leaf_source>

<if-syntax> ::= <if> <left_parenthesis> <function-st> <compare-sign> <function-st> <right_parenthesis> <then> <search-op> <else> <search-op>
<compare-sign> ::= '==' | '>' | '<' | '>=' | '<='
<function-st> ::= <left-op> <math-op> <right-op> | <left-op>
<math-op> ::= '+' | '-' | '*' | '/'
<left-op> ::= <function> <left_parenthesis> <left-op> <right_parenthesis> | <literal>
<function> ::= <max-fun> | <min-fun> | <sum-fun> | <av-fun> | <var-fun> | <size-fun>
<max-fun> ::= 'max' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
<min-fun> ::= 'min' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
<sum-fun> ::= 'sum' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
<av-fun> ::= 'av' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
<var-fun> ::= 'var' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
<size-fun> ::= 'size' <left_parenthesis> <search-op> <right_parenthesis>
<right-op> ::= <function-st> | <left-op>
<element> ::= an element of the result set payload (either XML element, or XML attribute)

<retrieve_metadata_fun> ::= <rm_fun_name> <in> <language> <on> <rm_source> <as> <schema>
<rm_fun_name> ::= 'retrievemetadata'
<schema> ::= string
<rm_source> ::= <left_parenthesis> <function> <right_parenthesis>

<spatialsearch_fun> ::= <spatialsearch_fun_name> <relation> <geometry> [<timeBoundary>] <spatial_source>
<spatialsearch_fun_name> ::= 'spatialsearch'
<relation> ::= {'intersects', 'contains', 'isContained'}
<geometry> ::= <polygon_name> <left_parenthesis> <points> <right_parenthesis>
<polygon_name> ::= 'polygon'
<timeBoundary> ::= 'within' <startTime> <stopTime>
<startTime> ::= double
<stopTime> ::= double
<spatial_source> ::= <leaf_source>
<points> ::= <point> {<comma> <point>}+
<point> ::= <x> <y>
<x> ::= long
<y> ::= long
<leaf_source> ::= [<in> <language>] <on> <source>
    [<as> <schema>]
<non_leaf_source> ::= <on> <left_parenthesis> <function> <right_parenthesis>
----
<language> ::= 'AFRIKAANS' | 'ARABIC' | 'AZERI' | 'BYELORUSSIAN' | 'BULGARIAN' | 'BANGLA' | 'BRETON' | 'BOSNIAN' | 'CATALAN' | 'CZECH' | 'WELSH' | 'DANISH' | 'GERMAN' | 'GREEK' | 'ENGLISH' | 'ESPERANTO' | 'SPANISH' | 'ESTONIAN' | 'BASQUE' | 'FARSI' | 'FINNISH' | 'FAEROESE' | 'FRENCH' | 'FRISIAN' | 'IRISH_GAELIC' | 'GALICIAN' | 'HAUSA' | 'HEBREW' | 'HINDI' | 'CROATIAN' | 'HUNGARIAN' | 'ARMENIAN' | 'INDONESIAN' | 'ICELANDIC' | 'ITALIAN' | 'JAPANESE' | 'GEORGIAN' | 'KAZAKH' | 'GREENLANDIC' | 'KOREAN' | 'KURDISH' | 'KIRGHIZ' | 'LATIN' | 'LETZEBURGESCH' | 'LITHUANIAN' | 'LATVIAN' | 'MAORI' | 'MONGOLIAN' | 'MALAY' | 'MALTESE' | 'NORWEGIAN_BOKMAAL' | 'DUTCH' | 'NORWEGIAN_NYNORSK' | 'POLISH' | 'PASHTO' | 'PORTUGUESE' | 'RHAETO_ROMANCE' | 'ROMANIAN' | 'RUSSIAN' | 'SAMI_NORTHERN' | 'SLOVAK' | 'SLOVENIAN' | 'ALBANIAN' | 'SERBIAN' | 'SWEDISH' | 'SWAHILI' | 'TAMIL' | 'THAI' | 'FILIPINO' | 'TURKISH' | 'UKRAINIAN' | 'URDU' | 'UZBEK' | 'VIETNAMESE' | 'SORBIAN' | 'YIDDISH' | 'CHINESE_SIMPLIFIED' | 'CHINESE_TRADITIONAL' | 'ZULU'
<source> ::= string
<schema> ::= string
<left_parenthesis> ::= '('
<right_parenthesis> ::= ')'
<comma> ::= ','
<and> ::= 'and'
<on> ::= 'on'
<as> ::= 'as'
<by> ::= 'by'
<sort_by> ::= 'sort'
<from> ::= 'from'
<if> ::= 'if'
<then> ::= 'then'
<else> ::= 'else'
====== Examples ======
{| border="1"
|+ '''Example 1'''
|-
! User Request
| Give me back all documents whose metadata contain the word ''woman'' from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''>
|-
! Actual Query
| <pre>fulltextsearch by 'woman' in 'ENGLISH' on '0a952bf0-fa44-11db-aab8-f715cb72c9ff' as 'dc'</pre>
|-
! Explanation
| We perform the ''fulltextsearch'' operation, using the term ''woman'', on the data source identified by the language ''ENGLISH'', source number ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'' and schema ''dc''.
|}
----
{| border="1"
|+ '''Example 2'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea; that is, the creator's name contains the separate word dorothea, e.g. ''Hemans, Felicia Dorothea Browne''
|-
! Actual Query
| <pre>fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc'</pre>
|-
! Explanation
| We perform the ''fieldedsearch'' operation on the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'', and retrieve only those documents whose creator's name contains the word 'dorothea'. CAUTION: This does not cover creator names such as 'abcdorothea'. In this case, users should use the wildcard '*'. The absence of '*' implies a string delimiter. E.g. '*dorothea' matches 'abcdorothea' but not 'dorotheas', while 'dorothea*' matches 'dorotheas' but not 'abcdorothea'. Another critical issue is the data source identifier. Example 1 and Example 2 both refer to the 'A Celebration of Women Writers' collection. However, in Example 1 we refer to the metadata of this collection, whereas in Example 2 we refer to the actual content.
|}
----
{| border="1"
|+ '''Example 3'''
|-
! User Request
| Give me back the ''creator'' and ''subject'' of the first 10 documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''>, sorted by the ''DocID'' field, whose creator's name contains ''ro''.
|-
! Actual Query
| <pre>project by 'creator', 'subject' on (keeptop '10' on (sort 'ASC' by 'DocID' on (fieldedsearch by 'creator' contains '*ro*' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')))</pre>
|-
! Explanation
| First of all, we perform the ''fieldedsearch'' operation on the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'', and retrieve only those documents whose creator's name contains the substring 'ro'. On this result set, we apply the sort operation on the ''DocID'' field. Then, we apply the keep top operation, in order to keep only the first 10 sorted documents. Finally, we apply the project operation, keeping only the ''creator'' and ''subject'' fields.
|}
----
{| border="1"
|+ '''Example 4'''
|-
! User Request
| Perform a spatial search against the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''>, defining a search rectangle (0,0), (0,50), (50,50), (50,0).
|-
! Actual Query
| <pre>spatialsearch contains polygon(0 0, 0 50, 50 50, 50 0) in 'ENGLISH' on '6cbe79b0-fbe0-11db-857b-d6a400c8bdbb' as 'eiDB'</pre>
|-
! Explanation
| Search in the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''> for records that define geometric shapes which include the rectangle identified by the points {(0,0), (0,50), (50,50), (50,0)}.
|}
----
{| border="1"
|+ '''Example 5'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea and all documents from the same collection whose title contains the word ''woman''
|-
! Actual Query
| <pre>merge on (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') and (fieldedsearch by 'title' contains 'woman' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
|-
! Explanation
| This is an example of how the user can merge the results of more than one subquery.
|}
----
{| border="1"
|+ '''Example 6'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea AND whose description contains the word ''London''
|-
! Actual Query
| <pre>join inner by 'DocID' on (fieldedsearch by 'description' contains 'London' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') and (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
|-
! Explanation
| This is an example of how the user can perform a join (logical AND) of subqueries. We perform the ''join'' operation on the field ''DocID'', which is the unique document identifier. In this way, only documents that are members of the result sets of both subqueries participate in the final result set.
|}
----
==== Search Manager ====
===== Introduction =====
An alternative entry point to the Search functionality is the SearchManager Service. This service provides an abstraction over the SearchMasterService, enabling non-blocking query submission and result retrieval. Through this service the client can submit a query, check the execution progress of a specific query and finally retrieve the endpoint reference of the results. Upon submission of a query, the SearchManager Service creates a resource that is the placeholder of the query’s status. The endpoint reference (EPR) of this resource is returned to the client, so that the client can retrieve, at a later time, the relevant information regarding the progress of the query. Internally the SearchManager Service has to:
*Communicate with a SearchMaster Service and wait for the search operation to terminate.
*Return the EPR of a status resource that actually shows that the query is queued.
To this end, this service spawns a new thread that hides the blocking functionality of the SearchMaster Service. The above-mentioned resource contains the status of the search request and the endpoint reference of the final results, if available. This resource is retrieved, through its corresponding EPR, each time a client asks for the status of its request.
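The submit-then-poll pattern described above can be sketched as follows. This is an illustration only, not the SearchManager implementation: the class and function names, the status values (QUEUED, RUNNING, DONE, FAILED) and the use of a plain string identifier in place of a real EPR are all assumptions made for the example, and `blocking_search` stands in for the blocking call to the SearchMaster Service.

```python
import threading
import time
import uuid


class SearchManagerSketch:
    """Hypothetical non-blocking wrapper around a blocking search call."""

    def __init__(self, blocking_search):
        self._blocking_search = blocking_search  # stand-in for SearchMaster
        self._resources = {}                     # resource id -> status record
        self._lock = threading.Lock()

    def submit(self, query):
        """Create the status resource, start a worker, return at once."""
        resource_id = str(uuid.uuid4())          # plays the role of the EPR
        with self._lock:
            self._resources[resource_id] = {
                "status": "QUEUED", "result_epr": None, "error": None}
        worker = threading.Thread(
            target=self._run, args=(resource_id, query), daemon=True)
        worker.start()
        return resource_id

    def _run(self, resource_id, query):
        """Worker thread: hides the blocking search from the client."""
        self._update(resource_id, status="RUNNING")
        try:
            result_epr = self._blocking_search(query)
        except Exception as exc:
            self._update(resource_id, status="FAILED", error=str(exc))
        else:
            self._update(resource_id, status="DONE", result_epr=result_epr)

    def _update(self, resource_id, **fields):
        with self._lock:
            self._resources[resource_id].update(fields)

    def status(self, resource_id):
        """Return a copy of the status record for the given resource."""
        with self._lock:
            return dict(self._resources[resource_id])


def wait_until_finished(manager, resource_id, timeout=5.0, poll=0.01):
    """Client-side polling loop; returns the last status record seen."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        snapshot = manager.status(resource_id)
        if snapshot["status"] in ("DONE", "FAILED"):
            return snapshot
        time.sleep(poll)
    return manager.status(resource_id)
```

A client would call `submit`, keep the returned identifier, and call `status` (or a loop such as `wait_until_finished`) until the record reports that results are available.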