Difference between revisions of "Data Fusion (New)"

 
Revision as of 14:48, 20 December 2013

Data Fusion

Introduction

Data fusion is an operator used by the gCube Search System to merge the search results from different datasources and sort them by their score. Fusion is enabled when two or more datasources participate in a search query.

Note that the new score (after re-ranking) may be completely different from the initial score of each result; the goal of data fusion is the final ordering of the results, not the score values themselves.

The Data Fusion library is available in our Maven repositories with the following coordinates:

<groupId>org.gcube.search</groupId>
<artifactId>data-fusion</artifactId>
<version>...</version>

Data Fusion Procedure

The following steps are executed in the data fusion procedure:

  1. Execution of the query on all appropriate datasources (like ordinary search)
  2. Collection of the records from the datasources (*)
  3. Re-ranking (re-indexing) of the records based on some field(s) or on their actual content (see Data Fusion Configuration)
  4. Execution of the initial query against the new index
  5. Optionally, boosting of the (new) score with the position that each record had at the origin datasource
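
The steps above can be illustrated with a toy sketch. It is not the library's implementation: the function name `fuse`, the record layout, and the use of plain term frequency as the "new index" score are all assumptions made for illustration, and the optional positional boost of step 5 is left out.

```python
def fuse(result_sets, term, fields=("title",)):
    """Toy fusion: collect records, re-score them against the term, sort."""
    # Step 2: collect the records from every datasource, remembering
    # the position each record had at its origin datasource.
    collected = []
    for source_name, records in result_sets.items():
        for position, record in enumerate(records):
            collected.append(
                {"source": source_name, "position": position, "record": record}
            )

    # Steps 3-4: re-rank by running the search term against the chosen field.
    # Term frequency is an illustrative stand-in for the real scoring.
    for item in collected:
        text = next(
            (item["record"][f] for f in fields if f in item["record"]), ""
        )
        item["score"] = text.lower().split().count(term.lower())

    # Final ordering: highest new score first.
    return sorted(collected, key=lambda item: item["score"], reverse=True)


# Step 1 (querying the datasources) is assumed to have produced these sets.
results = fuse(
    {
        "ds1": [{"title": "Tuna fishing"}, {"title": "Salmon"}],
        "ds2": [{"title": "Tuna tuna recipes"}],
    },
    term="tuna",
)
```

Here the record from `ds2` ends up first because it matches the term most often, regardless of the score or position it had at its origin datasource.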

The Search System invokes data fusion through the operators library when the fuse keyword appears at the end of the CQL query, followed by the search term. For example:

(gDocCollectionID == \"ColID\") and (title = tuna) project * fuse tuna

(*) Record retrieval: Data fusion comes with a custom iterator that multiplexes ranked and unranked result sets into one, keeps the records sorted by their score, and removes duplicates. Unranked records are considered of higher value. Sorting can improve the performance of the fusion process when combined with count.
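
A minimal sketch of such a multiplexing iterator follows. It assumes that each result set is already sorted by descending score, that unranked records carry a score of `None`, that "higher value" means unranked records are emitted first, and that duplicates are detected by a record identifier. None of these details are confirmed by the library; the name `multiplex` is hypothetical.

```python
import heapq
import math
from itertools import islice


def multiplex(result_sets):
    """Lazily merge result sets by descending score, dropping duplicates.

    Each result set is an iterable of (record_id, score) pairs sorted by
    descending score; score is None for records from unranked datasources.
    """
    def sort_key(pair):
        _, score = pair
        # Unranked records are treated as the most valuable: they sort first.
        return -math.inf if score is None else -score

    seen = set()
    for record_id, score in heapq.merge(*result_sets, key=sort_key):
        if record_id in seen:
            continue  # keep only the first (best-placed) occurrence
        seen.add(record_id)
        yield record_id, score


ranked_a = [("r1", 0.9), ("r2", 0.4)]
ranked_b = [("r3", 0.7), ("r1", 0.5)]
unranked = [("u1", None), ("r2", None)]

merged = list(multiplex([ranked_a, ranked_b, unranked]))
# Because the merge is lazy, a count limit only pulls as many records as needed.
top_two = list(islice(multiplex([ranked_a, ranked_b, unranked]), 2))
```

Since the inputs are consumed lazily and in order, a consumer that only needs the first few records can stop early, which is one way sorting and a count limit could work together.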

Data Fusion Configuration

The Data Fusion component can be parametrised through the properties file fusion.properties.

Fusion Fields

The field on which the records are re-ranked can be customized by defining a list of fields in the properties file. For example:

snippet-fields=S,description,title

For each record, the selected field is the first field in the list that exists in the record. If none of the listed fields is found, the actual content of the record is retrieved instead.
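
The selection rule can be sketched as follows. The helper names `parse_properties` and `select_fusion_text`, the dictionary record layout, and the `fetch_content` callback are hypothetical; the sketch also assumes that a field "exists" when its key is present in the record.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def select_fusion_text(record, snippet_fields, fetch_content):
    """Return the first listed field present in the record, else its content."""
    for field in snippet_fields:
        if field in record:
            return record[field]
    # No configured field exists: fall back to the record's actual content.
    return fetch_content(record)


props = parse_properties("snippet-fields=S,description,title\n")
fields = props["snippet-fields"].split(",")

with_field = select_fusion_text(
    {"title": "Tuna", "description": "A fish"}, fields, lambda r: "full content"
)
without_field = select_fusion_text({"id": 1}, fields, lambda r: "full content")
```

With the list `S,description,title`, a record holding both `description` and `title` is re-ranked on `description`, because it comes earlier in the list.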

Positional Boost

If the unranked datasources are ordered, the initial position of each record can be exploited to boost its final score. Position score formula (experimental and naive):

s(p) = a / (b^p) , where (a = 0.986..., b =  1.025...)

This feature can be enabled or disabled by changing the include-position property in the properties file. For example:

include-position=false
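
To give a feel for how quickly the position score decays, here is a small sketch of the formula. The constants are only the truncated values shown above (0.986 and 1.025), so the results are approximate. Whether positions start at 0 or 1, and how the position score is combined with the re-ranked score, are not specified here; the sketch assumes positions start at 0 and does not combine the scores. The function names are hypothetical.

```python
def position_score(position, a=0.986, b=1.025):
    """s(p) = a / b**p, using the truncated constants as defaults."""
    return a / (b ** position)


def is_position_included(value):
    """Interpret the include-position property value ("true" or "false")."""
    return value.strip().lower() == "true"


# Scores for the first few positions: each step divides the score by b.
scores = [position_score(p) for p in range(4)]
enabled = is_position_included("false")
```

With these truncated constants each position lowers the score by about 2.4%, so the boost fades gradually rather than cutting off sharply.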