Difference between revisions of "Data Fusion (New)"

From Gcube Wiki

Latest revision as of 12:50, 7 January 2014

Data Fusion

Introduction

Data fusion is an operator used by the gCube Search System to merge the search results from different datasources and sort them by their score. Fusion is enabled when two or more datasources participate in a search query.

Note that the new score (after re-ranking) may be completely different from the initial score of each result; the goal of data fusion is the final ordering of the results.

The Data Fusion library is available in our Maven repositories with the following coordinates:

<groupId>org.gcube.search</groupId>
<artifactId>data-fusion</artifactId>
<version>...</version>

Data Fusion Procedure

The following steps are executed in the data fusion procedure:

  1. Execution of the query on all appropriate datasources (like ordinary search).
  2. Collection of the records from the datasources(*).
  3. Re-ranking (re-indexing) of the records based on some field(s) or actual content (see Fusion Fields).
  4. Execution of the initial query against the new index.
  5. Optionally, boosting of the (new) score with the position that each record had at the origin datasource.

The Search System invokes data fusion through the operators library when the fuse keyword appears at the end of the CQL query, followed by the search term. Example:

(gDocCollectionID == "ColID") and (title = tuna) project * fuse tuna

(*) Record retrieval: Data fusion comes with a custom iterator that multiplexes ranked and unranked result sets into one, keeps the merged records sorted by their score, and removes duplicates. Unranked records are considered of higher value. Sorting can improve the performance of the fusion process when combined with count.
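The record retrieval step can be illustrated with a small k-way merge. This is only a sketch and not the library's iterator: the class MergeSketch, the record type Rec, and the merge method are hypothetical names. It also assumes that each result set is already sorted, that duplicates are identified by a record id, and that "considered of higher value" means an unranked record (null score) sorts ahead of every ranked one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

public class MergeSketch {

    /** Hypothetical record type: an id plus a score (a null score means unranked). */
    public static final class Rec {
        public final String id;
        public final Double score;

        public Rec(String id, Double score) {
            this.id = id;
            this.score = score;
        }

        // Assumption: unranked records sort ahead of all ranked records.
        double sortKey() {
            return score == null ? Double.POSITIVE_INFINITY : score;
        }
    }

    /** Cursor over one non-empty result set, holding its current head record. */
    private static final class Cursor {
        final Iterator<Rec> it;
        Rec head;

        Cursor(Iterator<Rec> it) {
            this.it = it;
            this.head = it.next();
        }

        boolean advance() {
            if (it.hasNext()) {
                head = it.next();
                return true;
            }
            return false;
        }
    }

    /**
     * Merges result sets that are each sorted by descending sort key into one
     * list in the same order, keeping only the first occurrence of each id.
     */
    public static List<Rec> merge(List<List<Rec>> resultSets) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                Comparator.comparingDouble((Cursor c) -> c.head.sortKey()).reversed());
        for (List<Rec> rs : resultSets) {
            if (!rs.isEmpty()) {
                queue.add(new Cursor(rs.iterator()));
            }
        }
        Set<String> seen = new HashSet<>();
        List<Rec> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            if (seen.add(c.head.id)) { // add() returns false for a duplicate id
                merged.add(c.head);
            }
            if (c.advance()) {
                queue.add(c);
            }
        }
        return merged;
    }
}
```

Because every input is pre-sorted, the queue only ever holds one head record per datasource, so the merge can stop early once enough records have been produced.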

Data Fusion Configuration

The Data Fusion component can be parametrized through the properties file fusion.properties.

Fusion Fields

The field on which the records are re-ranked can be customized by defining a list of fields in the properties file. For example:

snippet-fields=S,description,title

For each record, the selected field is the first field in the list that exists in the record. If none of the listed fields is found, the actual content of the record is retrieved instead.
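The selection rule above can be sketched as follows. The class FusionFieldSketch and its methods are hypothetical, and a record is modeled as a plain Map from field name to value, which is an assumption rather than the library's record type. Only the snippet-fields property name comes from the configuration described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

public class FusionFieldSketch {

    /** Parses the comma-separated snippet-fields property into an ordered list. */
    public static List<String> parseSnippetFields(Properties props) {
        String raw = props.getProperty("snippet-fields", "");
        List<String> fields = new ArrayList<>();
        for (String part : raw.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                fields.add(name);
            }
        }
        return fields;
    }

    /**
     * Returns the first configured field that exists in the record.
     * An empty result means the caller should fall back to the record's content.
     */
    public static Optional<String> selectField(List<String> fields, Map<String, String> record) {
        for (String field : fields) {
            if (record.containsKey(field)) {
                return Optional.of(field);
            }
        }
        return Optional.empty();
    }
}
```

With snippet-fields=S,description,title, a record that has only description and title would be re-ranked on description, since S is missing and description comes before title in the list.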

Positional Boost

Because unranked datasources may still return their records in order, the initial position of each record can be used to boost its final score. Currently the following Position Score formula is used (experimental and naive):

s(p) = a / (b ^ p), where (a = 0.986..., b = 1.025...)

This feature can be enabled or disabled through the include-position property in the properties file. For example:

include-position=false
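As an illustration, the Position Score formula and the include-position switch might be sketched like this. The class and method names are hypothetical. The constants a and b are passed in by the caller because the text gives only their leading digits (0.986..., 1.025...). Whether p starts at 0 or 1, how s(p) is combined with the re-ranked score, and the default used when the property is absent are not stated here, so the sketch leaves the first two open and assumes false for the third.

```java
import java.util.Properties;

public class PositionScoreSketch {

    /** Computes s(p) = a / (b ^ p) for a record at position p. */
    public static double positionScore(int p, double a, double b) {
        return a / Math.pow(b, p);
    }

    /**
     * Reads the include-position property.
     * Assumption: a missing property is treated as false.
     */
    public static boolean includePosition(Properties props) {
        return Boolean.parseBoolean(props.getProperty("include-position", "false"));
    }

    public static void main(String[] args) {
        // Truncated constants, for illustration only.
        double a = 0.986;
        double b = 1.025;
        for (int p = 0; p <= 3; p++) {
            System.out.println("s(" + p + ") = " + positionScore(p, a, b));
        }
    }
}
```

Since b is greater than 1, s(p) decreases as p grows, so records that appeared earlier at their origin datasource receive a larger boost.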