SRU Facilities (Gcube Wiki), revision of 2014-07-17 by Alex.antoniadi: /* SRU RDBMS Adapter */
<hr />
<div>=Introduction=<br />
<br />
The [http://www.loc.gov/standards/sru/sru-1-1.html SRU] components have been created in the context of [http://en.wikipedia.org/wiki/Federated_search Federated Search]. Federated Search has the following benefits:<br />
*The search space is expanded as new datasources come in<br />
*The complexity of search and indexing is delegated to the datasource providers<br />
<br />
By enabling Federated Search, gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
To realize these benefits, we have developed the following components:<br />
*an SRU consumer service, used to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs, used to provide SRU capabilities to RDBMSs<br />
*an SRU adapter service for the SearchSystem, used to provide SRU capabilities to the [https://gcube.wiki.gcube-system.org/gcube/index.php/Search_Framework_2.0 gCube SearchSystem]<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that provides SRU capabilities on top of the gCube SearchSystem, so that the SearchSystem can be queried by external SRU clients. The service can be used directly through its HTTP API, or through its Java client. <br />
<br />
The SRU Search System Adapter consists of a few components, available in our Maven repositories under the following coordinates:<br />
<br />
<source lang="xml"><br />
<!-- sru search adapter service web app --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-search-adapter-service</artifactId><br />
  <version>...</version><br />
</dependency><br />
<br />
<!-- sru search adapter service commons library --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-search-adapter-commons</artifactId><br />
  <version>...</version><br />
</dependency><br />
<br />
<!-- sru search adapter service client library --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-search-adapter-client</artifactId><br />
  <version>...</version><br />
</dependency><br />
</source><br />
<br />
==Implementation Details==<br />
The SRU Search Adapter Service can run either as a stateful or as a stateless service, each with its own pros and cons. If the service is stateful, a single running instance can provide SRU to SearchSystem Services from multiple scopes; however, this requires smartgears to be deployed along with the service and the scope to be passed on each HTTP call, which means that a standard SRU client might not work. A stateless deployment, on the other hand, requires one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend deploying the SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the SRU Search System Adapter service on a node, the following are needed:<br />
* sru-search-adapter-service-''{version}''.war<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
* (stateful case only) smartgears-distribution-''{version}''.tar.gz, to publish the running instance of the service on the IS and make it discoverable<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation instructions<br />
<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if the SRU Search Adapter Service will operate as a stateless service.<br />
<br />
A few things need to be configured in order for the service to be functional. All service configuration is done in the file deploy.properties, which ships within the service war. This file should be on the classpath so that it can be read; its default location (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node, as well as the scope the node is running on, have to be set in the variables hostname and scope in deploy.properties:<br />
<br />
'''NOTE:''' The stateless SRU Search Adapter Service must run in the same (VRE) scope as the SearchSystem to which it provides SRU capabilities. The SRU Search System Adapter also runs without the ResourceRegistry.<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
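As a sketch of how such values are consumed, the snippet below parses the example deploy.properties contents with java.util.Properties. The class name and inline string are illustrative only; in the deployed service the file would instead be loaded from WEB-INF/classes on the classpath.<br />

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DeployConfigExample {
    public static void main(String[] args) throws IOException {
        // Inline copy of the example deploy.properties contents; a real
        // service would load the file from the classpath instead.
        String example = "hostname=dl015.madgik.di.uoa.gr\n"
                       + "scope=/gcube/devNext/NextNext\n";

        Properties props = new Properties();
        props.load(new StringReader(example));

        System.out.println(props.getProperty("hostname"));
        System.out.println(props.getProperty("scope"));
    }
}
```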
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = "oai_dc";<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
		.endpoint(endpoint)<br />
		.scope(scope)<br />
		.build();<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
		.endpoint(endpoint)<br />
		.scope(scope)<br />
		.build();<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
		.endpoint(endpoint)<br />
		.scope(scope)<br />
		.resourceID(resourceID)<br />
		.build();<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain: http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve<br />
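The search URL above is simply the SRU searchRetrieve parameters appended to the service endpoint, with the CQL query URL-encoded. A minimal sketch of building such a URL (the class name is illustrative; the endpoint is the example host from this page):<br />

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SruRequestUrlExample {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";
        String cql = "title = \"tuna\"";

        // URLEncoder produces the form-encoded style used in the example
        // above: spaces become '+', '=' becomes %3D, '"' becomes %22.
        String url = endpoint
                + "?query=" + URLEncoder.encode(cql, "UTF-8")
                + "&version=1.1"
                + "&operation=searchRetrieve";

        System.out.println(url);
    }
}
```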
<br />
<br />
=SRU Consumer=<br />
<br />
The SRU Consumer consists of a few components, available in our Maven repositories under the following coordinates:<br />
<br />
<source lang="xml"><br />
<!-- sru consumer service web app --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-consumer-service</artifactId><br />
  <version>...</version><br />
</dependency><br />
<br />
<!-- sru consumer service commons library --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-consumer-commons</artifactId><br />
  <version>...</version><br />
</dependency><br />
<br />
<!-- sru consumer service client library --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-consumer-client</artifactId><br />
  <version>...</version><br />
</dependency><br />
</source><br />
==Deployment Instructions==<br />
<br />
==Usage Examples==<br />
<br />
=SRU RDBMS Adapter=<br />
<br />
The SRU RDBMS Adapter consists of a few components, available in our Maven repositories under the following coordinates:<br />
<br />
<source lang="xml"><br />
<!-- sru db adapter service web app --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-db-adapter-service</artifactId><br />
  <version>...</version><br />
</dependency><br />
<br />
<!-- sru db adapter service commons library --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-db-adapter-commons</artifactId><br />
  <version>...</version><br />
</dependency><br />
<br />
<!-- sru db adapter service client library --><br />
<dependency><br />
  <groupId>org.gcube.search.sru</groupId><br />
  <artifactId>sru-db-adapter-client</artifactId><br />
  <version>...</version><br />
</dependency><br />
</source><br />
<br />
==Deployment Instructions==<br />
<br />
==Usage Examples==</div>
<hr />
<div>=Introduction=<br />
<br />
The [http://www.loc.gov/standards/sru/sru-1-1.html SRU] components have been created in the context of the [http://en.wikipedia.org/wiki/Federated_search Federated Search]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to [https://gcube.wiki.gcube-system.org/gcube/index.php/Search_Framework_2.0 gCube SearchSystem]<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- sru search adapter service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- sru search adapter service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- sru search adapter service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend to use and deploy SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run SRU Search System Adapter service on a node we will need the following:<br />
* sru-search-adapter-service-''{version}''.war<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
* (in case of stateful only) smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if SRU Search Adapter Service will operate as a stateless service<br />
<br />
There are a few things that need to configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node as well as the scope that the node is running on have to set in the in the variables hostname and scope in the deploy.properties.<br />
<br />
'''NOTE:''' The Stateless SRU Search Adapter Service must run on the same (VRE) scope as the SearchSystem on which it will provide the SRU capabilities. Also SRU Search System Adapter runs without ResourceRegistry<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = “oai_dc”;<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain : http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve<br />
<br />
<br />
=SRU Consumer=<br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- sru consumer service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-consumer-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- sru consumer service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-consumer-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- sru consumer service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-consumer-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
==Deployment Instructions==<br />
<br />
==Usage Examples==<br />
<br />
=SRU RDBMS Adapter=<br />
<br />
==Deployment Instructions==<br />
<br />
==Usage Examples==</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=SRU_Facilities&diff=21772SRU Facilities2014-07-17T12:49:11Z<p>Alex.antoniadi: /* Introduction */</p>
<hr />
<div>=Introduction=<br />
<br />
The [http://www.loc.gov/standards/sru/sru-1-1.html SRU] components have been created in the context of the [http://en.wikipedia.org/wiki/Federated_search Federated Search]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to [https://gcube.wiki.gcube-system.org/gcube/index.php/Search_Framework_2.0 gCube SearchSystem]<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- sru search adapter service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- sru search adapter service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- sru search adapter service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend to use and deploy SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run SRU Search System Adapter service on a node we will need the following:<br />
* sru-search-adapter-service-''{version}''.war<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
* (in case of stateful only) smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if SRU Search Adapter Service will operate as a stateless service<br />
<br />
There are a few things that need to configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node as well as the scope that the node is running on have to set in the in the variables hostname and scope in the deploy.properties.<br />
<br />
'''NOTE:''' The Stateless SRU Search Adapter Service must run on the same (VRE) scope as the SearchSystem on which it will provide the SRU capabilities. Also SRU Search System Adapter runs without ResourceRegistry<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = “oai_dc”;<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain : http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=SRU_Facilities&diff=21771SRU Facilities2014-07-17T12:48:40Z<p>Alex.antoniadi: /* Introduction */</p>
<hr />
<div>=Introduction=<br />
<br />
The [http://www.loc.gov/standards/sru/sru-1-1.html SRU] components have been created in the context of the [http://en.wikipedia.org/wiki/Federated_search Federated Search]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to gCube SearchSystem<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- sru search adapter service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- sru search adapter service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- sru search adapter service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend to use and deploy SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run SRU Search System Adapter service on a node we will need the following:<br />
* sru-search-adapter-service-''{version}''.war<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
* (in case of stateful only) smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if SRU Search Adapter Service will operate as a stateless service<br />
<br />
There are a few things that need to configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node as well as the scope that the node is running on have to set in the in the variables hostname and scope in the deploy.properties.<br />
<br />
'''NOTE:''' The Stateless SRU Search Adapter Service must run on the same (VRE) scope as the SearchSystem on which it will provide the SRU capabilities. Also SRU Search System Adapter runs without ResourceRegistry<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = “oai_dc”;<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain : http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=SRU_Facilities&diff=21770SRU Facilities2014-07-17T12:48:05Z<p>Alex.antoniadi: /* Introduction */</p>
<hr />
<div>=Introduction=<br />
<br />
The [http://www.loc.gov/standards/sru/sru-1-1.html SRU] components have been created in the context of the Federated Search[link to wiki page]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to gCube SearchSystem<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- sru search adapter service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- sru search adapter service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- sru search adapter service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend deploying and using the SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the SRU Search System Adapter service on a node, the following are needed:<br />
* sru-search-adapter-service-''{version}''.war<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
* (in case of stateful only) smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if the SRU Search Adapter Service will operate as a stateless service.<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded from the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node, as well as the scope that the node is running on, have to be set in the variables hostname and scope in deploy.properties.<br />
<br />
'''NOTE:''' The Stateless SRU Search Adapter Service must run in the same (VRE) scope as the SearchSystem to which it will provide the SRU capabilities. Also, the SRU Search System Adapter runs without the ResourceRegistry.<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
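Since deploy.properties is a standard Java properties file, its two required keys can be read with java.util.Properties. The sketch below is illustrative only (the class and method names are not part of the service API); in the deployed service the same content would be loaded from the classpath rather than from a string:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DeployConfigSketch {

    // Parses deploy.properties content into a Properties object.
    // In the deployed service the file would be read from the classpath
    // (webapps/service/WEB-INF/classes/deploy.properties) instead.
    static Properties loadFromString(String content) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(content));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Equivalent to the deploy.properties example above
        String deployProperties =
                "hostname=dl015.madgik.di.uoa.gr\n" +
                "scope=/gcube/devNext/NextNext\n";
        Properties props = loadFromString(deployProperties);
        System.out.println(props.getProperty("hostname"));  // dl015.madgik.di.uoa.gr
        System.out.println(props.getProperty("scope"));     // /gcube/devNext/NextNext
    }
}
```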
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = "oai_dc";<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
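The searchRetrieve call returns a standard SRU 1.1 searchRetrieveResponse XML document. As a minimal sketch (the class name and the sample response below are illustrative; only the SRU 1.1 namespace is standard), the number of matching records can be extracted with the JDK's DOM parser:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SruResponseSketch {

    // Extracts numberOfRecords from an SRU 1.1 searchRetrieveResponse.
    // "http://www.loc.gov/zing/srw/" is the standard SRU 1.1 namespace.
    static int numberOfRecords(String responseXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(responseXml)));
        String n = doc.getElementsByTagNameNS(
                        "http://www.loc.gov/zing/srw/", "numberOfRecords")
                .item(0).getTextContent();
        return Integer.parseInt(n.trim());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample response, for illustration only
        String sample =
                "<searchRetrieveResponse xmlns=\"http://www.loc.gov/zing/srw/\">" +
                "<version>1.1</version><numberOfRecords>2</numberOfRecords>" +
                "</searchRetrieveResponse>";
        System.out.println(numberOfRecords(sample));  // 2
    }
}
```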
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain: http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve</div>
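The search URL above can also be built programmatically by percent-encoding the CQL query. The sketch below (class and method names are illustrative, not part of the service) uses the JDK's URLEncoder and reproduces the same query string as the example URL, though with the parameters in a different order:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SruUrlSketch {

    // Builds an SRU 1.1 searchRetrieve URL for the adapter endpoint.
    // URLEncoder encodes '=' as %3D, '"' as %22 and spaces as '+'.
    static String searchRetrieveUrl(String endpoint, String cqlQuery)
            throws UnsupportedEncodingException {
        return endpoint
                + "?operation=searchRetrieve"
                + "&version=1.1"
                + "&query=" + URLEncoder.encode(cqlQuery, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";
        // Prints the endpoint followed by
        // ?operation=searchRetrieve&version=1.1&query=title+%3D+%22tuna%22
        System.out.println(searchRetrieveUrl(endpoint, "title = \"tuna\""));
    }
}
```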
<hr />
<div>=Introduction=<br />
<br />
The SRU components have been created in the context of the Federated Search[link to wiki page]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to gCube SearchSystem<br />
<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- sru search adapter service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- sru search adapter service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- sru search adapter service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend to use and deploy SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run SRU Search System Adapter service on a node we will need the following:<br />
* sru-search-adapter-service-''{version}''.war<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
* (in case of stateful only) smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if SRU Search Adapter Service will operate as a stateless service<br />
<br />
There are a few things that need to configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node as well as the scope that the node is running on have to set in the in the variables hostname and scope in the deploy.properties.<br />
<br />
'''NOTE:''' The Stateless SRU Search Adapter Service must run on the same (VRE) scope as the SearchSystem on which it will provide the SRU capabilities. Also SRU Search System Adapter runs without ResourceRegistry<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = “oai_dc”;<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain : http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=SRU_Facilities&diff=21768SRU Facilities2014-07-17T12:44:38Z<p>Alex.antoniadi: /* SRU Search System Adapter */</p>
<hr />
<div>=Introduction=<br />
<br />
The SRU components have been created in the context of the Federated Search[link to wiki page]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to gCube SearchSystem<br />
<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- sru search adapter service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- sru search adapter service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- sru search adapter service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend to use and deploy SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run SRU Search System Adapter service on a node we will need the following:<br />
* sru-search-adapter-service-''{version}''.war<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
* (in case of stateful only) smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if SRU Search Adapter Service will operate as a stateless service<br />
<br />
There are a few things that need to configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node as well as the scope that the node is running on have to set in the in the variables hostname and scope in the deploy.properties.<br />
<br />
'''NOTE:''' The Stateless SRU Search Adapter Service must run on the same (VRE) scope as the SearchSystem on which it will provide the SRU capabilities. Also SRU Search System Adapter runs without ResourceRegistry<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = “oai_dc”;<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain : http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=SRU_Facilities&diff=21767SRU Facilities2014-07-17T12:44:17Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Introduction=<br />
<br />
The SRU components have been created in the context of the Federated Search[link to wiki page]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to gCube SearchSystem<br />
<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- index service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend to use and deploy SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run SRU Search System Adapter service on a node we will need the following:<br />
* sru-search-adapter-service-''{version}''.war<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
* (in case of stateful only) smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if SRU Search Adapter Service will operate as a stateless service<br />
<br />
There are a few things that need to configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node as well as the scope that the node is running on have to set in the in the variables hostname and scope in the deploy.properties.<br />
<br />
'''NOTE:''' The Stateless SRU Search Adapter Service must run on the same (VRE) scope as the SearchSystem on which it will provide the SRU capabilities. Also SRU Search System Adapter runs without ResourceRegistry<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = “oai_dc”;<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain : http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=SRU_Facilities&diff=21766SRU Facilities2014-07-17T12:43:44Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Introduction=<br />
<br />
The SRU components have been created in the context of the Federated Search[link to wiki page]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to gCube SearchSystem<br />
<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- index service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend to use and deploy SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run SRU Search System Adapter service on a node we will need the following:<br />
* sru-search-adapter-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
<br />
'''NOTE:''' smartgears should not be deployed if SRU Search Adapter Service will operate as a stateless service<br />
<br />
There are a few things that need to configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.<br />
<br />
The hostname of the node as well as the scope that the node is running on have to set in the in the variables hostname and scope in the deploy.properties.<br />
<br />
'''NOTE:''' The Stateless SRU Search Adapter Service must run on the same (VRE) scope as the SearchSystem on which it will provide the SRU capabilities. Also SRU Search System Adapter runs without ResourceRegistry<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = “oai_dc”;<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain : http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=SRU_Facilities&diff=21765SRU Facilities2014-07-17T12:41:21Z<p>Alex.antoniadi: </p>
<hr />
<div>=Introduction=<br />
<br />
The SRU components have been created in the context of the Federated Search[link to wiki page]. Federated Search has the following benefits:<br />
*Search space is expanded as new datasources come in<br />
*The complexity of search and index to the datasource providers<br />
<br />
By enabling Federated Search gCube can benefit by:<br />
*Getting more external sources:<br />
**External providers that comply with SRU can be used directly as external datasources<br />
**External providers that do not comply with SRU can be used indirectly as external datasources through SRU Adapters<br />
*Providing gCube datasources to others:<br />
**gCube SearchSystem can be used by others<br />
<br />
<br />
In order to utilize the above benefits we have developed the following components:<br />
*an SRU consumer service that is used in order to exploit external SRU providers<br />
*an SRU adapter service for RDBMSs that is used in order to provide SRU capabilities to RBMSs<br />
*an SRU adapter service for SearchSystem that is used in order to provide SRU capabilities to gCube SearchSystem<br />
<br />
<br />
=SRU Search System Adapter=<br />
<br />
The SRU Search System Adapter is a service that is used in order to provide SRU capabilities to the gCube SearchSystem and thus be used by external datasources. Although the SRU Search System Adapter service can be used directly from the HTTP API it can also be used through the Java client of the service. <br />
<br />
SRU Search System Adapter is consisted by a few components that are available in our Maven repositories with the following coordinates<br />
<br />
<source lang="xml"><br />
<!-- index service web app --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.search.sru</groupId><br />
<artifactId>sru-search-adapter-client</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Details==<br />
SRU Search Adapter Service can run both as a stateful or as a stateless service. Each one has some pros and cons. If the service is stateful then only one running instance is required in order to provide SRU to SearchSystem Services from multiple scopes but this requires smartgears to be deployed along with the service and pass scope on each HTTP call to the service, which means that a standard SRU client might not work. On the other hand, stateless requires to have one running instance for each SearchSystem Service. <br />
<br />
We strongly recommend deploying and using the SRU Search Adapter as a stateless service.<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the SRU Search System Adapter service on a node we will need the following:<br />
<br />
* sru-search-adapter-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (only for the stateful mode, to publish the running instance of the service on the IS and make it discoverable)<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
Note that smartgears should not be deployed if the SRU Search Adapter Service will operate as a stateless service.<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the scope that the node is running on, have to be set in the variables ''hostname'' and ''scope'' in the ''deploy.properties''.<br />
<br />
NOTE: The stateless SRU Search Adapter Service must run in the same (VRE) scope as the SearchSystem to which it provides the SRU capabilities. Also note that the SRU Search System Adapter runs without the ResourceRegistry.<br />
<br />
<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext/NextNext<br />
</pre><br />
<br />
==Usage Example==<br />
===Java Client (Stateless)===<br />
<br />
<source lang="java"><br />
<br />
String query = "title = tuna";<br />
Integer maxRecords = 4;<br />
String recordSchema = "oai_dc";<br />
<br />
<br />
final String scope = "/gcube/devNext/NextNext";<br />
final String endpoint = "http://dl08.di.uoa.gr:8080/sru-search-adapter-service";<br />
<br />
<br />
SruSearchAdapterStatelessClient client = new SruSearchAdapterStatelessClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
<br />
String searchResponse = client.searchRetrieve(1.1f, "", query, maxRecords, recordSchema);<br />
String explain = client.explain();<br />
</source><br />
<br />
<br />
===Java Client (Stateful)===<br />
<br />
<source lang="java"><br />
SruSearchAdapterFactoryClient factory = new SruSearchAdapterFactoryClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
String searchSystemEndpoint = "http://localhost:8080/searchsystemservice";<br />
String hostname = "localhost";<br />
<br />
SruSearchAdapterResource resource = new SruSearchAdapterResource();<br />
resource.setHostname(hostname);<br />
resource.setPort(8080);<br />
resource.setSearchSystemEndpoint(searchSystemEndpoint);<br />
<br />
String resourceID = factory.createResource(resource, scope);<br />
<br />
SruSearchAdapterClient client = new SruSearchAdapterClient.Builder()<br />
.endpoint(endpoint)<br />
.scope(scope)<br />
.resourceID(resourceID)<br />
.build();<br />
<br />
</source><br />
<br />
===HTTP===<br />
<br />
*explain : http://dl08.di.uoa.gr:8080/sru-search-adapter-service<br />
*search: http://dl08.di.uoa.gr:8080/sru-search-adapter-service?query=title+%3D+%22tuna%22&version=1.1&operation=searchRetrieve</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21722Index Management Framework2014-06-19T12:37:40Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. All of them are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
Index Service exposes a REST API, thus it can be used by different general purpose libraries that support REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
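Such a call can also be composed programmatically. The following is a minimal sketch (the host, resource ID and CQL query below are illustrative placeholders, not real deployment values) showing how the query string should be URL-encoded before issuing the GET request:<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class IndexQueryUrl {

    // Builds the lookup URL of an Index Service resource.
    // host and resourceID are illustrative placeholders.
    static String buildQueryUrl(String host, String resourceID, String cqlQuery) {
        // The CQL query contains spaces and parentheses, so it must be URL-encoded
        String encoded = URLEncoder.encode(cqlQuery, StandardCharsets.UTF_8);
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/"
                + resourceID + "/query?queryString=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(buildQueryUrl("localhost:8080", "my-resource-id",
                "(title = tuna)"));
    }
}
```

The resulting URL can then be fetched with any HTTP client library.<br />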
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, which follows the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for the management, lookup and updating of the node. It is a consolidation of the 3 services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason the creation of multiple resources of the IndexNode service is discouraged; instead, the best practice is to have one resource (one node) in each container that constitutes the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows the number of replicas and shards of each index to be configured. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file.<br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the matching content from the presentable fields. This snippet is usually a concatenation of a number of fragments from those fields that match the query. The maximum size of each fragment, as well as the maximum number of fragments that will be used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy.properties''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=./resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
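The triple transformation in the table above can be sketched as a simple routine. This is an illustrative re-implementation of the mapping rules, not the service's actual translator:<br />

```java
public class CqlToLucene {

    // Transforms a single CQL Index-Relation-Term triple into its Lucene
    // equivalent, following the mapping table above. Illustrative only.
    static String toLucene(String index, String relation, String term) {
        switch (relation) {
            case "adj":        // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":      // fuzzy term query
                return index + ":" + term + "~";
            case "proximity": { // term is "<distance> <words...>"
                int sp = term.indexOf(' ');
                return index + ":\"" + term.substring(sp + 1) + "\"~"
                        + term.substring(0, sp);
            }
            case "within": {   // term is "<lower> <upper>"
                String[] bounds = term.split("\\s+");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:           // = and the other simple relations
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(toLucene("title", "adj", "sun is up"));
        System.out.println(toLucene("date", "within", "2005 2008"));
    }
}
```

The single triples produced this way are then connected with AND, OR, NOT exactly as described above.<br />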
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
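A ROWSET like the one above can be assembled with plain string handling. The following sketch (field names and values are illustrative, and the values are assumed not to need XML escaping) produces one ROW element, emitting the required "ObjectID" field first:<br />

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowsetBuilder {

    // Serializes one document as a ROW element; the ObjectID field is
    // emitted first because it is required for every document.
    static String toRow(String objectId, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<ROW>");
        sb.append("<FIELD name=\"ObjectID\">").append(objectId).append("</FIELD>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append("<FIELD name=\"").append(f.getKey()).append("\">")
              .append(f.getValue()).append("</FIELD>");
        }
        return sb.append("</ROW>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "How to create an Index");
        fields.put("contents", "Just read the WIKI");
        // The rows would then be wrapped in a
        // <ROWSET idxType="..." colID="..."> element before feeding.
        System.out.println(toRow("doc1", fields));
    }
}
```
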
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run Index Service on a node we will need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
'''NOTE''': it is important to note that the ''resourcesFoldername'' and ''dataDir'' properties have relative paths as their default values. In some cases these values may be evaluated relative to the folder from which the container was started, so in order to avoid problems related to this behavior it is better for these properties to be given absolute paths as values.<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and the IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a factory client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try {<br />
	indexClient.feedLocator(locator);<br />
	indexClient.query(query);<br />
} catch (IndexException e) {<br />
	// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except references. Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Note that queries using wildcards will not return usable query statistics.<br />
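As a rough illustration of this translation step (this is not the actual gCube translator), a single CQL triple such as "title = car" maps onto Lucene's "field:term" syntax:<br />

```java
// Illustrative sketch only: maps one CQL triple such as "title = car"
// onto Lucene's "field:term" syntax. The real translator handles the
// full CQL grammar (boolean operators, modifiers, wildcards).
public class CqlToLuceneSketch {
    public static String translate(String cqlTriple) {
        String[] parts = cqlTriple.trim().split("\\s+");
        if (parts.length != 3 || !parts[1].equals("=")) {
            throw new IllegalArgumentException("only \"index = term\" is handled here");
        }
        return parts[0] + ":" + parts[2];
    }
}
```

For example, translate("title = car") yields "title:car".<br />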
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns the empty string "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and for the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
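The mechanism can be sketched as follows; the interface and the dummy class below are illustrative stand-ins, not the real gCube classes:<br />

```java
// Minimal sketch of the factory mechanism described above. The interface
// and the dummy class are stand-ins for the real gCube plugin classes.
public class LangidFactorySketch {
    public interface LanguageIdPlugin {
        String identify(String text);
    }

    // Mirrors the dummy plugin: always reports "nolang".
    public static class DummyLangidPlugin implements LanguageIdPlugin {
        public String identify(String text) { return "nolang"; }
    }

    // Same mechanism the factories use: the plugin is chosen by its
    // fully qualified class name at runtime.
    public static LanguageIdPlugin load(String pluginName) {
        try {
            return (LanguageIdPlugin) Class.forName(pluginName).newInstance();
        } catch (Exception e) {
            throw new RuntimeException("could not load plugin: " + pluginName, e);
        }
    }
}
```

Because the plugin is selected by name at runtime, new implementations can be deployed without recompiling the service.<br />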
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The following plugin implementations can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no configuration or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory for RHE3 and one for RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call (see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may contain the language per field in the document. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin. <br />
Since language is assigned at the collection level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
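A hypothetical helper (not part of the gCube API) makes the prefix syntax shown above explicit:<br />

```java
// Hypothetical helper, not part of the gCube API: assembles the
// language-restriction prefixes shown in the examples above.
public class LangQuerySketch {
    // Restrict every term in the query to one language.
    public static String queryLang(String lang, String query) {
        return "_querylang_" + lang + ": " + query;
    }

    // Restrict a single field:term pair to one language.
    public static String termLang(String lang, String field, String term) {
        return "_lang_" + lang + "_" + field + ":" + term;
    }
}
```

For example, termLang("en", "title", "bus") produces the "_lang_en_title:bus" form used in the second example above.<br />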
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++, and the C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory for RHE3 and one for RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
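The expansion step can be sketched as follows; the form table below is a toy stand-in for the .aut dictionaries, not the actual lemmatizer data:<br />

```java
import java.util.List;
import java.util.Map;

// Sketch of query-side lemma expansion: a term is replaced by the OR of
// its known forms. The form table is a toy stand-in for the .aut
// dictionaries used by the Fast lemmatizer.
public class ExpansionSketch {
    static final Map<String, List<String>> FORMS =
            Map.of("car", List.of("car", "cars"));

    public static String expand(String term) {
        // Terms without dictionary entries are passed through unchanged.
        List<String> forms = FORMS.getOrDefault(term, List.of(term));
        return "(" + String.join(" OR ", forms) + ")";
    }
}
```

With this table, expand("car") produces "(car OR cars)", so a query matches all known forms of the word.<br />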
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer modules is valid through March 2008.<br />
<br />
If a new license is required, please contact Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which did not fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resources with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the higher score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that can produce no results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
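For axis-aligned bounding boxes, the intersection test behind this cut-off rule is straightforward; the sketch below is illustrative and not the actual Geo Index code:<br />

```java
// Illustrative cut-off test for axis-aligned bounding boxes: if the two
// query regions do not intersect, the AND of the two criteria cannot
// match any document, so the subquery can be eliminated outright.
public class CutOffSketch {
    public static boolean intersects(double ax1, double ay1, double ax2, double ay2,
                                     double bx1, double by1, double bx2, double by2) {
        // Boxes overlap iff they overlap on both the X and the Y axis.
        return ax1 <= bx2 && bx1 <= ax2 && ay1 <= by2 && by1 <= ay2;
    }
}
```

Eliminating such subqueries before touching any R-tree avoids wasted lookup work.<br />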
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
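The union step can be sketched as a k-way merge of per-R-tree result streams, each already sorted by descending score; this is a toy version, not the actual MergeSorter code, and flow control is omitted:<br />

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy version of the MergeSorter's union step: several per-R-tree result
// streams, each already sorted by descending score, are merged into one
// globally ordered stream.
public class MergeSorterSketch {
    public static class Hit {
        public final String docId;
        public final double score;
        public Hit(String docId, double score) { this.docId = docId; this.score = score; }
    }

    private static class Head {
        final Hit hit;
        final Iterator<Hit> rest;
        Head(Hit hit, Iterator<Hit> rest) { this.hit = hit; this.rest = rest; }
    }

    public static List<Hit> merge(List<List<Hit>> streams) {
        // Max-heap keyed on score; each entry carries its source iterator.
        PriorityQueue<Head> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Head h) -> h.hit.score).reversed());
        for (List<Hit> stream : streams) {
            Iterator<Hit> it = stream.iterator();
            if (it.hasNext()) heap.add(new Head(it.next(), it));
        }
        List<Hit> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head head = heap.poll();
            out.add(head.hit);
            if (head.rest.hasNext()) heap.add(new Head(head.rest.next(), head.rest));
        }
        return out;
    }
}
```

Because only the heads of the streams are held in the heap, the streams can be consumed lazily, which is what makes pausing the workers (flow control) possible in the real component.<br />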
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, which is mandatory, and x2, set equal to x1 if not provided), start and end Y coordinates (y1, which is mandatory, and y2, set equal to y1 if not provided), and any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. As with fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list with all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible GeoIndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
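Since only the year component of a DATE value is mandatory, field values may arrive at several levels of precision. The following is a minimal, hypothetical sketch (not part of the service; the pattern list is an assumption) of how such values could be normalized to seconds since the Epoch, which is the representation the GeoIndex uses internally for dates:<br />

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Hypothetical helper (not part of the service) showing how DATE field
// values in the yyyy-MM-dd'T'HH:mm:ss.s format, where only yyyy is
// mandatory, could be normalized to seconds since the Epoch.
public class GeoDateParser {
    // Most specific pattern first, so a full timestamp is not
    // accidentally matched by a shorter prefix pattern.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss.SSS",
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd",
        "yyyy"
    };

    public static long toEpochSeconds(String value) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern);
            format.setLenient(false);
            try {
                return format.parse(value).getTime() / 1000L;
            } catch (ParseException ignored) {
                // Try the next, less specific pattern.
            }
        }
        throw new IllegalArgumentException("Unparsable date: " + value);
    }
}
```

A value such as "2001" then yields the same timestamp as "2001-01-01", so truncated dates behave like their fully specified equivalents.<br />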
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to answer a query. First, a very efficient filtering step retrieves all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally by the index, the refinement step is handled by a refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches a query or not.<br />
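To illustrate the TwoStep idea outside the index, here is a tiny self-contained sketch (purely illustrative geometry, not the index's internals). The "query region" is a circle whose MBR is the enclosing axis-aligned square; the cheap MBR test admits false hits near the square's corners, and the exact test then eliminates them:<br />

```java
// Purely illustrative sketch of TwoStep query processing (not the
// index's internals): filter on the query region's MBR, then refine
// with an exact containment test.
public class TwoStepQuery {
    // Filtering step: cheap bounding-box containment test for a point
    // against the MBR of a circle with center (cx, cy) and radius r.
    static boolean inMbr(double cx, double cy, double r, double x, double y) {
        return x >= cx - r && x <= cx + r && y >= cy - r && y <= cy + r;
    }

    // Refinement step: exact (and in general costlier) containment test.
    static boolean inCircle(double cx, double cy, double r, double x, double y) {
        double dx = x - cx, dy = y - cy;
        return dx * dx + dy * dy <= r * r;
    }
}
```

A point such as (0.9, 0.9) passes the MBR filter for the unit circle but is rejected by the exact test: a false hit removed during refinement.<br />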
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the RankEvaluator plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span a long period of time are of less interest than objects which span a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan and divide 1 by this number (plus one, to avoid division by zero) in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
        return 1.0 / (spanSize + 1); // use a double literal: integer division would always yield 0<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span a period longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add the entry to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
 geoManagementFactory = geoManagementFactoryLocator<br />
				.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
 org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
        new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
        geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
 geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
 org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
        .createResource(geoUpdaterCreateArguments);<br />
 geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
 geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
 geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
 geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design closely follows the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
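The reason the schema must be known up front can be illustrated with a small, hypothetical comparator sketch (not the service's actual classes): range queries only make sense if key values are ordered according to their declared type. For example, "9" precedes "10" as an integer but follows it as a string:<br />

```java
import java.util.Comparator;

// Hypothetical sketch (not the service's actual classes) of a
// type-aware comparison over the string-encoded key values, driven by
// the key type declared in the index schema.
public class TypedKeyComparator implements Comparator<String> {
    public enum KeyType { INTEGER, FLOAT, STRING, DATE }

    private final KeyType type;

    public TypedKeyComparator(KeyType type) { this.type = type; }

    @Override
    public int compare(String a, String b) {
        switch (type) {
            case INTEGER: return Long.compare(Long.parseLong(a), Long.parseLong(b));
            case FLOAT:   return Double.compare(Double.parseDouble(a), Double.parseDouble(b));
            case DATE:    // ISO-8601 date strings order correctly as plain text
            default:      return a.compareTo(b);
        }
    }
}
```

With the wrong type, a range such as [9, 10] would silently match the wrong documents, which is why the schema cannot be inferred after the fact.<br />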
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and update operations on the node. It is a consolidation of the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
Clusters are distinguished through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can be configured with a number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service, it is important to change some properties in deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' of each bucket-index can also be specified in deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once, after Couchbase server installation, before it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
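The distribution step of this transformation can be sketched as follows (a simplified, assumed model, not the service's code): distributing an AND over an OR combines each conjunction on the left with each conjunction on the right, yielding a union of range-query conjunctions:<br />

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, assumed model of the CQL transformation above (not the
// service's code): a query is a union (outer list) of conjunctions of
// single criteria (inner lists of strings such as "year>=2000").
public class CqlToDnf {
    // Distribute AND over OR: combine each left conjunction with each
    // right conjunction, producing the union of range queries.
    public static List<List<String>> andOverOr(List<List<String>> left,
                                               List<List<String>> right) {
        List<List<String>> union = new ArrayList<>();
        for (List<String> l : left) {
            for (List<String> r : right) {
                List<String> conjunction = new ArrayList<>(l);
                conjunction.addAll(r);
                union.add(conjunction);
            }
        }
        return union;
    }
}
```

For instance, ((year>=2000 AND year<=2010) OR title=sun) AND lang=en becomes the union of (year>=2000 AND year<=2010 AND lang=en) and (title=sun AND lang=en), each of which is a range query over indexed keys.<br />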
<br />
===RowSet===<br />
The content to be fed into an index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get a reference to the created index<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference instead:<br />
//StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents whose key values fall within the corresponding range. The '''Forward Index Service''' design follows the same pattern as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing two keys (one that implements java.util.Comparator). <br />
The objects stored in the database can be anything.<br />
There is no limit on the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services, all of which are implemented according to the factory-instance pattern:<br />
*An instance of the '''ForwardIndexManagement Service''' represents and manages an index. The life-cycle of the index is the same as the life-cycle of the management instance: the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' when getting content from the index. The Content Management service is used for safe storage of an index: a logical file is established in Content Management when the index is created, the index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node, and the logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key-value pairs. A '''ForwardIndexUpdater Service''' resource updates a single index, and one index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created with the EPR of the '''ForwardIndexManagement''' resource connected to the index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index and returning responses that match the queries. The '''ForwardIndexLookup''' gets, upon creation, a reference to the '''ForwardIndexManagement''' instance that is managing the index, and can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. Each '''ForwardIndexLookup''' instance gets the index from Content Management and establishes a local copy of the index on the file system, which is then queried. The local copy is kept up to date by subscribing to the index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource executes internally is a conjunction of single range criteria, each of which refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of the Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner to the Geo-Spatial Index Lookup, the Forward Index Lookup exploits every opportunity to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation on the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient resides in the SVN module test/client.<br />
The ForwardIndexClient uses a property file, ForwardIndex.properties, which contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The properties Host and Port must be edited to point to the VO of interest.<br />
<br />
The test client gets the EPRs of the Factory services and uses the factory services<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-Resource of each type, inserts some data using the updater resource, and queries<br />
the data using the lookup WS-Resource.<br />
<br />
Inserting data and deleting tuples:<br />
*insertingPair(key,value) / deletingPair(key) - simple methods to insert / delete single tuples.<br />
*process(rowSet) - method to insert / delete a series of tuples.<br />
*procesResultSet - method to insert / delete a series of tuples contained in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by:<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=OpenSearch_Framework&diff=21721OpenSearch Framework2014-06-19T12:37:26Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>==Description==<br />
The role of the gCube ''OpenSearch Framework'' is to enable the gCube Framework to access external providers which publish their results through search engines conforming to the [http://www.opensearch.org/Specifications/OpenSearch OpenSearch Specification]. The framework consists of two components:<br />
*The ''OpenSearch Library'', which includes a core library providing general-purpose OpenSearch functionality, and the ''OpenSearch Operator'', which utilizes functionality provided by the former.<br />
*The ''OpenSearch Service'' (also called ''OpenSearchDataSource Service''), which binds collections with provider-specific information encapsulated in generic resources and invokes the ''OpenSearch Operator''.<br />
To avoid ambiguity, the name "OpenSearch Library" will be used when referring to the whole ''OpenSearch Library'' component, whereas the name "OpenSearch Core Library" will be used when referring to the library constituent of the component.<br />
<br />
A client library for the OpenSearch Service, called ''OpenSearchDataSource Client Library'', also exists in order to assist the programmatic use of the service.<br />
<br />
''OpenSearch Library'', ''OpenSearch Service'' and ''OpenSearchDataSource Client Library'' are available in our Maven repositories with the following coordinates:<br />
<source lang="xml"><br />
<!-- OpenSearch Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchlibrary</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearch Service --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-service</artifactId><br />
<version>...</version><br />
<br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-stubs</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearchDataSource Client Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-client-library</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==The OpenSearch Library==<br />
<br />
===The OpenSearch Core Library===<br />
====Description====<br />
The ''OpenSearch Core Library'' conforms to the latest OpenSearch specification and provides general OpenSearch-related functionality to any component which needs to query OpenSearch providers.<br />
It can be optionally extended, as described in the [[#Extensibility|Extensibility]] section, in order for OpenSearch Extensions whose parameters or other elements need special handling to be supported.<br />
The ''OpenSearch Operator'', described in a [[#The OpenSearch Operator|following]] section functions atop this library.<br />
<br />
====Functionality====<br />
The central class which can be used in order to exploit the functionality provided by the library, is the ''DescriptionDocument'' class. For reasons explained in the [[#Library Extensibility|following]] section, the ''DescriptionDocument'' class needs to be provided with a pair of ''URLElementFactory'' and ''QueryElementFactory'' factory classes. Provided that the query parameter namespaces present in the query string are extracted in some way and a namespace-to-factory mapping is available, this pair can be obtained by the ''FactoryResolver'' class, as follows:<br />
<source lang="java5"><br />
FactoryPair factories = FactoryResolver.getFactories(queryNamespaces, factoryMapping);<br />
</source><br />
The ''DescriptionDocument'' is then instantiated as follows:<br />
<source lang="java5"><br />
DescriptionDocument dd = new DescriptionDocument(descriptionDocumentXML, factories.urlElFactory, factories.queryElFactory);<br />
</source><br />
where the ''descriptionDocumentXML'' parameter corresponds to a DOM Document object containing the parsed Description Document.<br />
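The DOM Document can be obtained from the raw Description Document XML with standard JAXP parsing. The sketch below is illustrative (the helper class name is ours, not part of the library); it enables namespace awareness, since OpenSearch elements are namespace-qualified.<br />

```java
import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class DescriptionDocumentParser {
    // Parses raw Description Document XML into the kind of namespace-aware
    // DOM Document that the DescriptionDocument constructor expects.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // OpenSearch elements are namespace-qualified
        return dbf.newDocumentBuilder()
                  .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    }
}
```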
Properly instantiated, the ''DescriptionDocument'' class can provide any information relevant to the processed Description Document, as well as a mechanism to formulate search queries to send to the OpenSearch provider described by the Description Document. The latter is achieved by a ''QueryBuilder'' object, which can be obtained as follows:<br />
<source lang="java5"><br />
List<QueryBuilder> qbs = dd.getQueryBuilders(rel, MimeType);<br />
</source><br />
where <code>rel</code> is a rel value as described in the [http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_4#Url_rel_values OpenSearch Specification], e.g. <code>results</code> and <code>MimeType</code> is a MIME type, such as <code>application/rss+xml</code>. The returned list contains one ''QueryBuilder'' instance for each template contained in a URL Element with the specified <code>rel</code> and <code>type</code> attributes.<br />
Once the desired ''QueryBuilder'' is selected, it can be used to formulate a query by first assigning values to the parameters and then obtaining the constructed query.<br />
For example, the <code>searchTerms</code> parameter can be set to some value as follows:<br />
<source lang="java5"><br />
qb.setParameter(OpenSearchConstants.searchTermsQName, searchTerms); <br />
</source><br />
Once all the required parameters are set, the constructed query can be obtained as follows:<br />
<source lang="java5"><br />
URL query;<br />
try {<br />
query = qb.getQuery();<br />
} catch (IncompleteQueryException iqe) {<br />
//Incomplete query exception handling<br />
} catch (MalformedQueryException mqe) {<br />
//Malformed query exception handling<br />
}<br />
</source><br />
<br />
Once the query is properly constructed and is available, it can be sent to the search engine of the provider in order to retrieve results. The returned results should be passed to either ''HTMLResponse'' or ''XMLResponse'', depending on the MIME type of the OpenSearch response, in order for the [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements] and any other available information contained in the response to be processed.<br />
<source lang="java5"><br />
InputStream responseStream = query.openConnection().getInputStream();<br />
OpenSearchResponse response = new XMLResponse(responseStream, factories.queryElFactory, qb, outputEncoding, dd.getURIToPrefixMappings());<br />
</source><br />
<br />
The raw XML data can then be obtained from the ''OpenSearchResponse'' object as follows:<br />
<source lang="java5"><br />
response.getResponse();<br />
</source><br />
and any information available, mainly relevant to paging, can be obtained by one of the methods of the ''OpenSearchResponse'' class. For example the total number of results, as reported by the <code>totalResults</code> Response Element, if present, can be obtained as follows:<br />
<source lang="java5"><br />
response.getTotalResults();<br />
</source><br />
<br />
====Library Extensibility====<br />
=====Motivation=====<br />
The core functionality provided by the ''OpenSearch Core Library'' is not limited to the processing of only standard OpenSearch parameters. More specifically, the basic components of the library treat all extended query parameters in a uniform way, making only the assumptions that hold for any OpenSearch parameter, be it standard or extended. Furthermore, any unrecognized markup or element value is simply ignored.<br />
An example of an assumption made by the ''OpenSearch Core Library'', in the form of a requirement, is that all parameter values passed to its ''QueryBuilder'' components should be URL-encoded. This requirement is in accordance with the OpenSearch Specification and causes no problems for most OpenSearch parameters. In fact, if a client failed to URL-encode free-text values, query formulation would fail in the query URL construction phase.<br />
<br />
There are, however, cases in which the previous requirement proves problematic. For example, the OpenSearch Geo Extension presents examples of parameter values in which the comma character is not URL encoded, regardless of what the RFC specifications state. Such parameter values could therefore call for an extra URL decoding preprocessing step, or otherwise the caller should be required to not URL encode the values. Furthermore, it would be quite useful if the library could be aware of the specific format and other peculiarities and rules governing the syntax of extended parameters, for the purpose of query validation and for supporting any extra functionality provided by the extension. The support of value-adding functionality provided by extended OpenSearch elements by the library could also prove useful.<br />
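To illustrate the friction, a strict URL-encoding pass (here via the standard java.net.URLEncoder; the wrapper class is ours, for illustration only) percent-encodes the comma, which is exactly what the Geo Extension examples do not do:<br />

```java
import java.net.URLEncoder;

public class ParameterEncoding {
    // The QueryBuilder components expect parameter values that are already
    // URL-encoded, as the OpenSearch Specification requires.
    public static String encode(String value) throws Exception {
        return URLEncoder.encode(value, "UTF-8");
    }
}
```

A Geo "box" value such as <code>142.0,43.0,144.0,45.0</code> therefore becomes <code>142.0%2C43.0%2C144.0%2C45.0</code> under strict encoding, which is why extension-aware handling of such values can be useful.<br />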
<br />
=====Extensibility Mechanism=====<br />
The extensibility mechanism chosen for the library focuses on extensible elements, as described in the OpenSearch Specification, namely URL Elements and Query Elements. Furthermore, ''QueryBuilder'' components are included in the extensibility mechanism as they depend on the aforementioned elements.<br />
<br />
Given that the number of available OpenSearch extensions is quite large, and that no single OpenSearch provider utilizes all of them at the same time, the extensibility mechanism should allow the easy inclusion of library extensions for specific OpenSearch extensions in a dynamic, pluggable fashion. Furthermore, it should allow extension-related functionality to be dynamically added depending on the complexity of the query of the caller.<br />
<br />
The mechanism found to best satisfy the above requirements and implemented as the extensibility pattern for the library, is the construction of a Chain of Responsibility for each extensible component. A more detailed explanation follows:<br />
*The ''URLElement'', ''QueryElement'' and ''QueryBuilder'' components are interfaces whose implementations support core or extension-related functionality.<br />
*Core functionality processing takes place in the last link of the chain of responsibility. For example, ''BasicQueryBuilder'' implements core ''QueryBuilder'' functionality.<br />
*Each component implementing extension-related functionality contains a reference to the next link in the chain of responsibility. For example if ''GeoQueryBuilder'' implements functionality related to the Geo OpenSearch Extension, it contains a reference to a ''QueryBuilder'' implementing either core functionality, or functionality related to some other extension.<br />
*Each link in the chain of responsibility should process whatever information it can handle, otherwise forward the request to the next link in the chain.<br />
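A minimal sketch of this chain structure follows; the interface and class names are invented for illustration and are not the library's actual API. Each link keeps the parameters of the namespace it understands and forwards everything else to the next link, with the core implementation terminating the chain.<br />

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative chain-of-responsibility sketch; names are hypothetical.
interface Builder {
    boolean setParameter(String namespaceURI, String name, String value);
    Map<String, String> parameters();
}

// Last link in the chain: handles only standard OpenSearch parameters.
class BasicBuilder implements Builder {
    static final String OS_NS = "http://a9.com/-/spec/opensearch/1.1/";
    private final Map<String, String> params = new HashMap<String, String>();
    public boolean setParameter(String ns, String name, String value) {
        if (!OS_NS.equals(ns)) return false; // end of chain: unknown parameter
        params.put(name, value);
        return true;
    }
    public Map<String, String> parameters() { return params; }
}

// Extension link: handles Geo parameters, forwards everything else.
class GeoBuilder implements Builder {
    static final String GEO_NS = "http://a9.com/-/opensearch/extensions/geo/1.0/";
    private final Builder next;
    private final Map<String, String> geoParams = new HashMap<String, String>();
    GeoBuilder(Builder next) { this.next = next; }
    public boolean setParameter(String ns, String name, String value) {
        if (GEO_NS.equals(ns)) {
            // extension-specific validation (e.g. of a "box" value) would go here
            geoParams.put(name, value);
            return true;
        }
        return next.setParameter(ns, name, value); // forward what we cannot handle
    }
    public Map<String, String> parameters() {
        Map<String, String> all = new HashMap<String, String>(next.parameters());
        all.putAll(geoParams);
        return all;
    }
}
```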
<br />
<br />
In order for a chain of responsibility to be dynamically created by the ''DescriptionDocument'' class, a similar chain of abstract factories should be implemented. The resulting factories, one for URL Elements and another for Query Elements can then be passed to the constructor of the ''DescriptionDocument'' in order for it to be able to construct the correct elements. The ''FactoryResolver'' utility is responsible for the construction of factories capable of constructing instances supporting no more than the functionality necessary to process a given query. Since the chain structure is already known when constructing ''QueryBuilder'' instances, the latter are constructed without explicitly supplying a factory to the ''DescriptionDocument'', by the ''getQueryBuilder'' method of the already constructed ''URLElement''.<br />
<br />
The ''FactoryResolver'' requires that two things be known in order to be able to construct the factories:<br />
*A set of mappings from namespace URIs to factory class names, one for each component implementing either core functionality (in this case the namespace URI being equal to the OpenSearch namespace) or extension-related functionality. An example of such a mapping could be: <code><<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>, (org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElement, org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElement)></code>, which declares that the implementations responsible for providing core functionality are the ''BasicURLElement'' and ''BasicQueryElement'' classes.<br />
*A list of all parameter namespaces present in the query string.<br />
<br />
Having the above information, and provided that the implementations of all factories and classes are available, the ''FactoryResolver'' will be able to construct, via reflection, the factories which can be used in order for the query to be properly processed. For example, if implementations are available for core functionality as well as for the Geo and Time extensions, and the query contains parameters of both of these extensions, the resulting chain of responsibility of the constructed instances will contain all three implementations, the last one being the implementation which supports core functionality, and the other two appearing in the chain in any order. If the query contains only standard OpenSearch parameters, there is no need for the chain to be burdened with links that will never be used, therefore a chain consisting of only the core implementation is constructed. The same holds if, for example, no Geo parameters are present in the query; the corresponding implementation will not be included in the chain.<br />
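The reflection step can be sketched as follows. All names here are invented for illustration (the real ''FactoryResolver'' works on ''URLElementFactory''/''QueryElementFactory'' pairs), but the construction order mirrors the description above: the core factory is instantiated first via a no-argument constructor, and each extension factory wraps the chain built so far via a one-argument constructor.<br />

```java
import java.util.List;

// Hypothetical factory interface, standing in for the library's factories.
interface ChainFactory {
    String describe();
}

class CoreChainFactory implements ChainFactory {
    public CoreChainFactory() {}
    public String describe() { return "core"; }
}

class TimeChainFactory implements ChainFactory {
    private final ChainFactory next;
    public TimeChainFactory(ChainFactory next) { this.next = next; }
    public String describe() { return "time->" + next.describe(); }
}

public class FactoryChainBuilder {
    // Builds the chain innermost-first via reflection: element 0 is the core
    // factory; every further class name wraps the chain built so far.
    public static ChainFactory build(List<String> classNames) throws Exception {
        ChainFactory chain = (ChainFactory) Class.forName(classNames.get(0))
                .getConstructor().newInstance();
        for (int i = 1; i < classNames.size(); i++) {
            chain = (ChainFactory) Class.forName(classNames.get(i))
                    .getConstructor(ChainFactory.class).newInstance(chain);
        }
        return chain;
    }
}
```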
<br />
It should be stressed again that it is not necessary for all extensions that are expected to be met to be implemented in order for the library to work. The extension of the library remains a purely optional task. There are, therefore, two choices when using the library:<br />
*Do not extend the library when in need of using extended parameters, relying on the core functionality provided by the library. In that case, the caller should be careful to supply the library with the correct values and formats of parameters, so that a query can be constructed, albeit without the option of query validation or the ability to exploit additional functionality related to the extension.<br />
*Extend the library whenever this proves useful or makes things easier.<br />
<br />
=====Implementing a new Extension=====<br />
In order to implement a new extension that will be correctly incorporated into the already existing library functionality, one should do the following:<br />
*Implement ''URLElement'', ''QueryElement'' and ''QueryBuilder'' interface implementations, the constructors of which accept at least a reference to a corresponding upcasted object which will be next in the chain of responsibility. All requests that cannot be handled, or that require additional processing by subsequent links in the chain, should be forwarded to the next link in the chain.<br />
*Implement ''URLElementFactory'' and ''QueryElementFactory'' interface implementations, the constructors of which accept a single argument: a reference to an upcasted factory of the same type, corresponding to the factory used to create the instances next in the chain of responsibility. An example of a ''URLElementFactory'' used for the construction of ''GeoURLElement''s which implement Geo extension functionality is the following:<br />
<source lang="java5"><br />
public class GeoURLElementFactory implements URLElementFactory {<br />
<br />
URLElementFactory f;<br />
<br />
public GeoURLElementFactory(URLElementFactory f) {<br />
this.f = f;<br />
}<br />
<br />
public GeoURLElement newInstance(Element url, Map<String, String> nsPrefixes) throws Exception {<br />
URLElement el = f.newInstance(url, nsPrefixes);<br />
return new GeoURLElement(url, el);<br />
}<br />
}<br />
</source><br />
*Ensure that the ''getQueryBuilder'' method of the ''URLElement'' implementation of the new extension correctly constructs a ''QueryBuilder'' instance. For example, the ''getQueryBuilder'' method of the ''GeoURLElement'' presented above could look like this:<br />
<source lang="java5"><br />
public QueryBuilder getQueryBuilder() throws Exception {<br />
return new GeoQueryBuilder(el.getQueryBuilder());<br />
}<br />
</source><br />
where <code>el</code> is the next ''URLElement'' in the chain.<br />
*Add a mapping for the two new factories to the set of mappings passed to the ''FactoryResolver'' utility upon initialization of the library. For example, given that ''GeoURLElementFactory'' and ''GeoQueryElementFactory'' are implemented for Geo Extensions, one could add the mapping as follows:<br />
<source lang=java5><br />
factoryMappings.add("http://a9.com/-/opensearch/extensions/geo/1.0/",<br />
new FactoryClassNamePair("org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElement", "org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElement"));<br />
</source><br />
<br />
===The OpenSearch Operator===<br />
====Description====<br />
The role of the ''OpenSearch Operator'' is to provide support for querying and retrieval of search results via [http://www.opensearch.org/Home OpenSearch] from providers which expose an [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch description document]. The operator accepts a query string consisting of a set of query parameters, which may include a number of search terms, and an [[#OpenSearch Resource|OpenSearch Resource]] reference which contains the URL of an OpenSearch description document and various specifications relevant to the OpenSearch provider to be queried. After performing the number of OpenSearch queries required to obtain the desired results, it returns these results wrapped in a ResultSet.<br />
<br />
====Extensibility Points====<br />
The operator introduces and makes use of a set of functionalities beyond those of the standard OpenSearch specification. These extensions are supported by the introduction of a special [[#OpenSearch Resource|OpenSearch Resource]] structure and by the internal logic of the operator, the latter using standard OpenSearch functionality provided by the ''OpenSearch Core Library''. The extra functionalities are summarized as follows:<br />
*Both direct and brokered result processing is supported. Some OpenSearch-enabled providers diverge from the common case of returning a set of direct results and instead provide their results indirectly, by returning a set of links to other OpenSearch-enabled providers. Provided that both a transformation specification used to extract these links from the returned results and the OpenSearch resources for each one of the brokered OpenSearch services are available, the operator will return the full set of results provided by the brokered OpenSearch services.<br />
*The support of a set of fixed parameters, which override the user-provided parameters only at the level of the top provider, i.e. either the broker or, in the direct provider case, the single direct provider. The purpose of these parameters is, first, to facilitate the creation of dynamic collections from results obtained by brokers, by superseding the caller's query parameters while querying the broker and using the full set of the caller's parameters only on lower levels and, second, to customize the behaviour of some provider to the needs of the gCube Framework (for example, to set a value for a required query parameter that the framework cannot handle). These options can also be used in tandem, if desired.<br />
*Support for one or more security schemes is planned for a subsequent version of the ''OpenSearch Library''.<br />
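The fixed-parameter override described above can be sketched as follows. This is an illustrative fragment, not the operator's actual code; the class, method and parameter names are hypothetical.<br />

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fixed-parameter override rule: fixed parameters
// take precedence over the caller's parameters, but only for the top provider
// (the broker, or the single provider in the direct case).
public class FixedParams {
    public static Map<String, String> effectiveParams(Map<String, String> callerParams,
                                                      Map<String, String> fixedParams,
                                                      boolean isTopProvider) {
        Map<String, String> result = new HashMap<>(callerParams);
        if (isTopProvider) {
            result.putAll(fixedParams); // fixed values supersede the caller's at the top level
        }
        return result;
    }
}
```

Lower-level (brokered) providers see the caller's full parameter set, unchanged.<br />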
<br />
====OpenSearch Resource====<br />
The purpose of an ''OpenSearch Resource'' object is to describe the specifications of an OpenSearch provider. It encapsulates the extensions described in the [[#Extensibility Points|Extensibility Points]] section. The attributes included are the following:<br />
* The name of the resource<br />
* The URL of the OpenSearch Description Document of the provider to be queried<br />
* Information about whether the provider returns direct or brokered results, used by the operator to adapt its operation to both kinds of providers.<br />
* Data transformation specifications for a subset of the MIME types of the results which the result provider returns. The data transformation consists of two or, optionally, three parts:<br />
**The ''RecordSplitXPath'' expression is used to split a page of search results into individual records. For example, for the RSS format, the <code><item></code> elements under <code>rss/channel</code> could be of interest<br />
**The ''presentationInfo'' element, which is a map between field names and XPath expressions used to extract the desired values from the response.<br />
**The optional ''RecordIdXPath'' expression can be used to tag each record with a unique identifier, extracted from the record itself. If the element is found, its payload is added as a <code>DocID</code> record attribute; if it is not found or is empty, no DocID attribute is added to the record.<br />
* Security specifications (planned for a future version, when the supported security specifications are decided on). This element is optional, its absence implying the absence of a security scheme.<br />
<br />
The serialization of an ''OpenSearch Resource'' can be easily incorporated into a Generic Resource. The default mode of operation for the ''OpenSearch Operator'' in fact obtains the necessary OpenSearch resources by retrieving the corresponding Generic Resource from the [[Information_System|IS]].<br />
The Generic Resource utilized by the ''OpenSearch Operator'' is:<br />
*The ''OpenSearchResource'' which contains the body of the OpenSearch Resource as described below<br />
<br />
<br />
Note that, solely for testing purposes, the ''OpenSearch Operator'' also supports a local mode of operation, whereby all ''OpenSearch Resources'' are loaded from the local file system. <br />
<br />
The XML Schema that all OpenSearch Resource serializations should conform to is the following:<br />
<source lang="xml"><br />
<?xml version="1.0"?><br />
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"><br />
<xs:element name="OpenSearchResource"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="name" type="xs:string"/><br />
<xs:element name="descriptionDocumentURI" type="xs:string"/><br />
<xs:element name="brokeredResults" type="xs:boolean"/><br />
<xs:element name="transformation" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="MIMEType" type="xs:string"/><br />
<xs:element name="recordSplitXPath" type="xs:string"/><br />
<xs:element name="recordIdXPath" type="xs:string" minOccurs="0" maxOccurs="1"/><br />
<xs:element name="presentationInfo" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="presentable" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="fieldName" type="xs:string"/><br />
<xs:element name="expression" type="xs:string"/><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
<xs:element name="security" minOccurs="0"><br />
<xs:complexType><br />
<xs:sequence><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element> <br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:schema><br />
</source><br />
<br />
The transformation element can appear multiple times within an ''OpenSearch Resource''. The usual case is for a single transformation element per provider to be specified, but if transformation elements are present for more than one MIME type, the operator has the alternative of resorting to the next available transformation in sequence, if the result retrieval procedure fails for some reason. This strategy can only be meaningful if the same amount of information can be obtained from different result MIME types. <br />
<br />
In the case of querying providers which return brokered results, the transformation element is used to specify a data transformation that extracts the URLs of the Description Documents of the brokered OpenSearch providers from the initial results provided by the OpenSearch provider acting as a broker.<br />
<br />
An example of an ''OpenSearch Resource'' serialization describing the [http://www.bing.com/ Bing] external repository as a direct OpenSearch provider, currently in use by the gCube Framework, is the following:<br />
<source lang="xml"><br />
<OpenSearchResource><br />
<name>Bing</name><br />
<descriptionDocumentURI>http://imarine.web.cern.ch/imarine/OpenSearch/bing.xml</descriptionDocumentURI><br />
<brokeredResults>false</brokeredResults><br />
<parameters><br />
<parameter><br />
<fieldName>allIndexes</fieldName><br />
<qName>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms</qName><br />
</parameter><br />
</parameters><br />
<transformation><br />
<MIMEType>application/rss+xml</MIMEType><br />
<recordSplitXPath>*[local-name()='rss']/*[local-name()='channel']/*[local-name()='item']</recordSplitXPath><br />
<recordIdXPath>//*[local-name()='item']/*[local-name()='link']</recordIdXPath><br />
<presentationInfo><br />
<presentable><br />
<fieldName>title</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='title']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>link</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='link']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>description</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>S</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>pubDate</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='pubDate']</expression><br />
</presentable><br />
</presentationInfo><br />
</transformation><br />
</OpenSearchResource><br />
</source><br />
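To illustrate how such a resource drives result extraction, the following sketch applies a ''recordSplitXPath'' expression and one presentable expression to an RSS page using the JDK's XPath API. This is not the operator's actual code; for the purposes of the sketch the field expression is written relative to each record node.<br />

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch: split an RSS result page into records with a
// recordSplitXPath expression, then extract one presentable field per record.
public class TransformSketch {
    public static List<String> extract(String rssPage, String splitXPath, String fieldXPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssPage)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // recordSplitXPath yields one node per individual record
        NodeList records = (NodeList) xpath.evaluate(splitXPath, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            // the presentable expression is evaluated against each record node
            values.add(xpath.evaluate(fieldXPath, records.item(i)));
        }
        return values;
    }
}
```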
<br />
In case the datasource is a broker, the ''brokeredResults'' element should be set to true:<br />
<br />
<source lang="xml"><br />
<brokeredResults>true</brokeredResults><br />
</source><br />
<br />
====OpenSearch Operator Logic====<br />
[[File:Opensearch_op_flowchart.png|right|Figure 1: A simplified flowchart of the operations performed by the OpenSearch operator]]<br />
The ''OpenSearch Operator'' employs the functionality provided by the [[#The OpenSearch Core Library|OpenSearch Core Library]] in order to extract the required information from the Description Document of the external provider and the ''QueryBuilder''s needed in order to perform queries, in a fashion similar to that described in the [[#Functionality|OpenSearch Library Functionality]] section. <br />
<br />
It should be noted that the ''OpenSearch Operator'' abstracts away the MIME type of the results to be obtained, treating it as low-level information which can be exploited through the way ''OpenSearch Resource''s are structured. Given that the OpenSearch Specification makes no assumptions about differences in the amount of information returned by results of different MIME types, there are two options:<br />
*If the amount of information returned by results of MIME Type A and MIME Type B differs, the desired MIME Type should be selected and an OpenSearch Resource constructed with only this MIME Type present in the transformation specifications. If needed, additional ''OpenSearch Resource''s can be constructed to exploit information returned from different MIME types. In this way, the MIME Type is abstracted away by the conceptual level of information detail obtained from the provider.<br />
*If more than one result MIME type exposes the same amount of information, or contains the same subset of information of interest, there exists the option of specifying more than one transformation specification, in a way which results in the uniform presentation of the data to the caller. In this way, the MIME Types are abstracted away by unifying result formats to a provider-specific schema. The option of having a single transformation specification is of course available in this case as well.<br />
<br />
In order for the proper way of constructing queries to be selected, a preprocessing step is performed by the operator, before issuing any actual queries. <br />
Given that an ''OpenSearch Resource'' can contain more than one transformation specification and that the number of templates present in a [http://www.opensearch.org/Specifications/OpenSearch/1.1#The_.22Url.22_element URL Element] is not necessarily limited to one, there can be more than one potential template match for the caller's query. The purpose of the preprocessing step is to select the ''QueryBuilder'' whose parameters best match the caller's query. First, all MIME types supported by the external provider but lacking an associated transformation specification in the ''OpenSearch Resource'' describing the provider are discarded. The ''QueryBuilder''s of the first available MIME type for which a transformation specification exists are then processed and reordered according to the following rules:<br />
*''QueryBuilder''s whose required parameters are not covered by the parameters of the caller's query are discarded<br />
*''QueryBuilder''s are reordered so that the first one best matches the caller's query, i.e. all of its required parameters and as many of its optional parameters as possible are covered<br />
*A ''QueryBuilder'' which lacks a parameter present in the caller's query is considered a match. In that case the extra parameter is discarded. This rule assumes that query parameters narrow the search down and is enforced in order to account for brokered providers exposing slightly different sets of parameters than the broker or their siblings.<br />
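The three rules above can be sketched with a simplified model of a ''QueryBuilder'' (hypothetical class and field names; the real builders come from the ''OpenSearch Core Library''):<br />

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified model of the preprocessing step: discard builders whose required
// parameters are not covered by the caller's query, then reorder so the builder
// covering most caller parameters comes first.
public class BuilderSelection {
    public static class Builder {
        public final String name;
        public final Set<String> required;
        public final Set<String> optional;
        public Builder(String name, Set<String> required, Set<String> optional) {
            this.name = name; this.required = required; this.optional = optional;
        }
        int coverage(Set<String> queryParams) {
            Set<String> covered = new HashSet<>(required);
            covered.addAll(optional);
            covered.retainAll(queryParams);
            return covered.size(); // caller parameters this builder can place in a template
        }
    }
    public static List<Builder> select(List<Builder> builders, Set<String> queryParams) {
        return builders.stream()
                .filter(b -> queryParams.containsAll(b.required)) // rule 1: required params must be covered
                .sorted(Comparator.comparingInt((Builder b) -> b.coverage(queryParams)).reversed()) // rule 2
                .collect(Collectors.toList()); // rule 3: uncovered extra caller params are simply dropped
    }
}
```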
<br />
The most usual case is for the provider's ''OpenSearch Resource'' to be set up with a single transformation specification for the MIME type of interest. Furthermore, most URL Elements provide a single template for each MIME type. This case results in only one ''QueryBuilder'' being available to construct queries, thereby rendering the reordering step degenerate.<br />
<br />
The functions performed by the operator in order for a set of results to be retrieved, given that the proper ''QueryBuilder'' is selected, are summarized in the simplified diagram of Figure 1.<br />
<br />
As shown, the operator accepts a set of query terms and a set of query parameters.<br />
<br />
The operator's main course of action is to formulate and send queries requesting pages of search results as long as there still are results to be returned and the caller requirement of the number of results, if present, is not met. A pager component sees that page switching is performed correctly, managing the relevant standard OpenSearch query parameters, namely <code>startPage</code> or <code>startIndex</code> and <code>count</code>. These parameters are therefore abstracted away by the OpenSearch Operator. <br />
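The pager's bookkeeping for the bounded case (a ''numOfResults'' limit is present) can be sketched as follows; when no limit is present, the operator instead keeps requesting pages until the provider is exhausted. The class and method names are illustrative.<br />

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the pager's bookkeeping: produce the (startIndex, count)
// pairs needed to fetch up to numOfResults results in pages of resultsPerPage,
// using 1-based OpenSearch indices.
public class PagerSketch {
    public static List<int[]> pages(int resultsPerPage, int numOfResults) {
        List<int[]> pages = new ArrayList<>();
        int startIndex = 1; // the OpenSearch startIndex parameter defaults to 1
        int fetched = 0;
        while (fetched < numOfResults) {
            // the last page may request fewer results than a full page
            int count = Math.min(resultsPerPage, numOfResults - fetched);
            pages.add(new int[]{startIndex, count});
            fetched += count;
            startIndex += count;
        }
        return pages;
    }
}
```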
<br />
In the case of resources which return brokered results, the operator first retrieves the endpoints of the underlying brokered OpenSearch providers and reads their corresponding OpenSearch Resources so as to be able to retrieve the actual search results from them, either sequentially or concurrently. The extraction of brokered provider endpoints is not explicitly shown in the diagram. Furthermore, if an OpenSearch Resource structure is missing for one or more of the brokered services, the operator continues with the retrieval of results from the next available brokered service, ignoring it if it cannot obtain information for it. The same holds if all query formulation attempts for a provider fail.<br />
<br />
====Configurable Parameters====<br />
The ''OpenSearch Operator'' can be programmatically configured by passing to it a special configuration construct upon creation. The configuration parameters are the following:<br />
*The ''resultsPerPage'' parameter instructs the operator as to how many results per page should be requested when no other paging restrictions are in effect. The default value for this parameter currently is ''100''.<br />
*The ''sequentialResults'' parameter disables or enables multi-threaded result retrieval from brokered providers. When enabled, the results are retrieved from each provider in a sequential manner, i.e. the results retrieved from different providers are not intermingled. There is, however, a negative impact on performance. The default value for this configuration parameter currently is ''false''.<br />
*The ''useLocalResource'' parameter, when enabled, permits the operator to operate in the absence of an IS. The ''OpenSearch Resources'' are instead retrieved from the local file system. It is used solely for testing purposes, which is why its default value is, and will remain, ''false''.<br />
<br />
An additional configurable element are the mappings from query namespaces to the corresponding factories, as described in [[#Library Extensibility|Library Extensibility]].<br />
The ''sequentialResults'' parameter can also be configured in a per-query manner, including it in the query string as a query parameter.<br />
<br />
====Query Format====<br />
The ''OpenSearch Operator'' expects to receive all query parameters, including the search terms, in a single query string. All query parameters should be of the form<br />
<br />
<code><br />
<URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"<br />
</code><br />
<br />
and should be space-delimited.<br />
Note that the presence of a namespace is mandatory for standard OpenSearch parameters as well.<br />
Any free-text parameter value should be URL-encoded.<br />
<br />
The reserved keyword ''config'' when used as a parameter namespace denotes a configuration parameter. The query configuration parameters under the special configuration namespace include the<br />
sequentialResults parameter described in [[#Configurable Parameters|Configurable Parameters]], plus the numOfResults parameter, which can be used to impose a limit on the number of retrieved results. These two query configuration parameters are optional.<br />
The following hold for the query configuration parameter values:<br />
*The ''sequentialResults'' parameter should be assigned a value equal to ''true'' or ''false''. Its absence implies the default value of the corresponding configurable parameter of the operator.<br />
*The ''numOfResults'' parameter should be assigned an integer value, its absence implying that all available results should be retrieved.<br />
<br />
Taking everything into account, an example of a legitimate query for the ''OpenSearch Operator'' could be the following:<br />
<br />
<code>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms="Hello+World" config:numOfResults="300"</code><br />
<br />
which instructs the operator to use the string <code>Hello World</code> as the value for the <code>SearchTerms</code> standard OpenSearch parameter and to retrieve up to 300 results from the provider.<br />
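A parameter in this format can be produced mechanically with the JDK's URL encoder; the following sketch (illustrative, not part of any client API) reconstructs the example query above.<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: build a query parameter in the form
//   <URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"
public class QueryFormat {
    public static String param(String namespaceURI, String name, String value) throws Exception {
        String ns = URLEncoder.encode(namespaceURI, StandardCharsets.UTF_8.name());
        String v = URLEncoder.encode(value, StandardCharsets.UTF_8.name()); // free-text values are URL-encoded
        return ns + ":" + name + "=\"" + v + "\"";
    }
}
```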
<br />
==The OpenSearch Service==<br />
===Description===<br />
The ''OpenSearch Service'' is a stateful web service responsible for the invocation of the ''OpenSearch Operator'' in the context of the provider to be queried.<br />
It also maintains a cache per provider WS-Resource, which contains the Generic Resources relevant to the top provider, the Generic Resources of all<br />
previously queried brokered providers and the corresponding Description Documents.<br />
<br />
<br />
===Deployment Instructions===<br />
<br />
In order to deploy and run the OpenSearch Service on a node we will need the following:<br />
* opensearchdatasource-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be on the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl08.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
'''NOTE''': it is important to note that the ''resourcesFoldername'' property has a relative path as its default value (./resources/opensearch). In some cases this value may be resolved against the folder from which the container was started, so in order to avoid problems related to this behavior it is better to set an absolute path as the value of this property.<br />
<br />
===WS and Generic Resource Interrelation===<br />
Provided that a Collection for the provider to be queried is available, the ''OpenSearch Service'' uses a WS-Resource for each OpenSearch provider in order to bind the Service to the Collection corresponding to the provider.<br />
<br />
An ''OpenSearch Service'' WS-Resource contains the following properties:<br />
*The ''AdaptorID'', which is unique for every WS-Resource and is used for referencing the right WS-Resource when querying (it is optional).<br />
*The ''CollectionID'' of the collection to be used.<br />
*The ID of the Generic Resource of the top-provider, where the top-provider is the broker in the brokered case or the one and only direct provider in the direct case.<br />
*The URI of the Description Document (''DescriptionDocumentURI'') of the top-provider.<br />
*A set of fields (presentables and searchables) as extracted from the ''OpenSearchGenericResource''.<br />
*A set of ''FixedParameters'', which are used in every invocation of the Operator. See also [[#Extensibility Points|Extensibility Points]].<br />
<br />
As mentioned above, the WS-Resource contains a reference only to the ''OpenSearchResource'' of the top-provider. The Generic Resources of any providers reached through a broker are retrieved through an Information System implementation and are therefore not directly referenced by the WS-Resource.<br />
<br />
Some properties are dependent on information residing in the Generic Resources describing the providers and should, therefore, be updated accordingly when these Generic Resources are modified. Examples of such properties include the Description Document URI and the templates.<br />
<br />
On WS-Resource creation, only the Metadata Collection ID and the ID of the ''OpenSearchResource'' of the top-provider need to be supplied to the ''create'' operation of the service's factory. All other properties are created internally by the service itself.<br />
<br />
===Resource Caching===<br />
For performance and reliability reasons, the ''OpenSearch Service'' maintains one cache per WS-Resource which initially contains the Generic Resource (''OpenSearchResource'') and the Description Document of the top-provider. In the brokered case, the cache is updated at run-time by the relevant ''OpenSearch Operator'' module with the Generic Resources and Description Documents of all providers reached through the broker.<br />
<br />
To account for potential updates of Description Documents and/or the Generic Resources, the cache can be refreshed either on demand or periodically, based on a configurable time interval. The periodic refresh operation can be disabled if the Description Documents and the Generic Resource configuration is considered to be stable enough.<br />
<br />
The cache refresh cycle policy used is described as follows:<br />
*The Description Document and Generic Resources of the top-provider are discarded and their updated versions are retrieved from the external site and the Information System respectively. All dependent WS-Resource properties are also updated accordingly.<br />
*The Description Documents and Generic Resources of all brokered providers, if any, are discarded. Because of the potentially large number of brokered providers, the cache is not repopulated with their updated versions to avoid locking the cache for large periods of time or using a partially updated cache. These pieces of information are re-cached at run-time instead.<br />
*In the event of failure, the previously cached version is kept.<br />
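The keep-on-failure rule can be sketched as follows (hypothetical class; the service's actual cache also handles Description Documents, WS-Resource properties and locking):<br />

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the refresh rule: an entry is replaced only if its
// updated version can be fetched; on failure the previously cached copy is kept.
public class RefreshSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public void put(String key, String value) { cache.put(key, value); }
    public String get(String key) { return cache.get(key); }

    public void refresh(String key, Callable<String> fetchUpdated) {
        try {
            cache.put(key, fetchUpdated.call()); // replace with the updated version
        } catch (Exception e) {
            // fetch failed: keep the previously cached version
        }
    }
}
```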
<br />
===Operations===<br />
The operations exposed by the OpenSearch Service are the following:<br />
*The ''query'' operation, with a single input message containing the query string to be sent to the operator, whose format is described in [[#Query Format|Query Format]].<br />
*The ''refreshCache'' operation, which sends a request in order to force the cache of the service to be refreshed. No refresh cycle will be initiated if a periodic refresh cycle is currently in progress.<br />
<br />
===Configurable Parameters===<br />
The Service currently supports three configurable parameters, which are exposed through its deployment descriptor:<br />
*The ''clearCacheOnStartup'' parameter, of ''boolean'' type, when enabled instructs the service to discard the stored cache on startup.<br />
*The ''cacheRefreshIntervalMillis'' parameter, of integral type, defines a time interval for the periodic cache refresh operation. The value ''0'' can be used to disable periodic cache refresh cycles.<br />
* The ''openSearchLibraryFactories'' parameter, of ''string'' type, is used to supply the ''OpenSearch Core Library'' with the factory mappings for all namespaces for which there exists an implementation of a library extension. For more information on the mappings, see also the section referring to the [[#Extensibility Mechanism|extensibility mechanism]] of the library.<br />
<br />
The ''openSearchLibraryFactories'' parameter is encoded as a sequence of mappings from strings to pairs, where each mapping is enclosed in braces, association is denoted by the ''='' sign and each pair is enclosed in parentheses. For example, given that there are implementations for core functionality, Geo and Time extensions, the value of this configuration parameter could be the following:<br />
<br />
<code><br />
[<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/geo/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/time/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.time.TimeURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.time.TimeQueryElementFactory)]<br />
</code><br />
<br />
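A parser for this encoding could look like the following sketch (illustrative; the actual parsing is performed internally by the service):<br />

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the mapping encoding:
//   [namespace=(URLElementFactoryClass,QueryElementFactoryClass)]
public class FactoryMappings {
    private static final Pattern ENTRY = Pattern.compile("\\[([^=\\]]+)=\\(([^,)]+),([^)]+)\\)\\]");

    public static Map<String, String[]> parse(String encoded) {
        Map<String, String[]> mappings = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(encoded);
        while (m.find()) {
            // group 1: namespace URI, groups 2-3: the factory class pair
            mappings.put(m.group(1).trim(), new String[]{m.group(2).trim(), m.group(3).trim()});
        }
        return mappings;
    }
}
```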
==The OpenSearchDataSource Client Library==<br />
In this section some examples of usage of the ''OpenSearchDataSource Client Library'' are provided.<br />
<br />
'''Query''' example:<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext";<br />
<br />
final OpenSearchClient client = new OpenSearchClient.Builder()<br />
// .endpoint(endpoint) // you can also give a specific endpoint<br />
.scope(scope)<br />
.build();<br />
<br />
final String queryString = "((((gDocCollectionID == \"ea5b9c70-01ce-4b45-96e7-6db037ebf2bc\") and (gDocCollectionLang == \"en\"))) and (5575bbdb-6d47-4297-ad12-2259b3405ce7 = greece)) project d91f3c47-e46e-4737-9496-a0f72361a397 3e3584f0-eed3-4089-99cd-86a7def1471e";<br />
<br />
String grs2Locator = client.query(queryString);<br />
<br />
// or<br />
<br />
List<Map<String, String>> records = client.queryAndRead(queryString);<br />
<br />
</source><br />
<br />
'''Create Resource''' method example:<br />
<source lang="java"><br />
<br />
static void createResource(List<String> fieldParameters, List<String> fixedParameters, String collectionID, String openSearchResourceID, String scope) throws OpenSearchClientException {<br />
<br />
// 'injector' is assumed to be a dependency injector configured elsewhere with the client bindings<br />
OpenSearchFactoryClient factory = injector.getInstance(OpenSearchFactoryClient.Builder.class)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
List<String> fieldParams = Lists.newArrayList();<br />
<br />
for (int i=0; i<fieldParameters.size(); i++) {<br />
fieldParams.add(collectionID + ":" + fieldParameters.get(i));<br />
System.out.println("Field parameter: " + (i+1) + " " + fieldParams.get(i));<br />
}<br />
<br />
Provider p = new Provider();<br />
p.setCollectionID(collectionID);<br />
p.setOpenSearchResourceID(openSearchResourceID);<br />
p.setFixedParameters(fixedParameters);<br />
<br />
List<Provider> providers = Lists.newArrayList();<br />
providers.add(p);<br />
<br />
factory.createResource(fieldParams, providers, scope);<br />
}<br />
<br />
</source><br />
<br />
==External Links==<br />
Some useful external links for further reading are provided here:<br />
*[http://www.opensearch.org/Home OpenSearch Home]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_5 OpenSearch Specification]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch Description Document]</div>Alex.antoniadi
https://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21720 Index Management Framework 2014-06-19T12:35:17Z
<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. All of them are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
Index Service exposes a REST API, thus it can be used by different general purpose libraries that support REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
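When issuing such a call programmatically, the ''queryString'' parameter value must be URL-encoded. The following sketch (illustrative names; not part of the service API) builds a well-formed query URL:<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: assemble the REST query URL shown above,
// URL-encoding the queryString parameter value.
public class IndexQueryUrl {
    public static String build(String host, String webappPath, String resourceID, String queryString) throws Exception {
        return "http://" + host + "/" + webappPath + "/" + resourceID + "/query?queryString="
                + URLEncoder.encode(queryString, StandardCharsets.UTF_8.name());
    }
}
```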
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and update of the node. It is a consolidation of the 3 services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; instead, the best practice is to have one resource (one node) in each container that constitutes the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
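The resulting rule can be summarized in a one-line sketch (illustrative, not the service's actual code):<br />

```java
// Minimal sketch of the cluster-distinction rule driven by defaultSameCluster:
// true selects the indexID as clusterID, false selects the scope.
public class ClusterId {
    public static String clusterID(boolean defaultSameCluster, String indexID, String scope) {
        return defaultSameCluster ? indexID : scope;
    }
}
```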
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (as it was in the old Full Text Index). If highlighting is enabled, the index returns a snippet for each record matching a query performed on the presentable fields. This snippet is usually a concatenation of a number of matching fragments of those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether or not the Resource Registry should be used (for the translation of field ids to field names), change the value of the variable ''useRRAdaptor'' in the ''deploy.properties''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=./resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
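<br />
The mapping described above can be sketched as a small helper. The class and method names below are hypothetical illustrations of the table's transformation rules, not the service's actual code.<br />

```java
// Hypothetical sketch of the CQL-triple-to-Lucene mapping shown in the table above;
// class and method names are illustrative, not the service's actual code.
public class CqlToLucene {

    // maps a single Index-Relation-Term triple to its Lucene equivalent
    static String triple(String index, String relation, String term) {
        switch (relation) {
            case "adj":           // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":         // fuzzy term query
                return index + ":" + term + "~";
            case "within": {      // term is "low high" -> range query
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "proximity": {   // term is "distance word word ..." -> proximity query
                String[] parts = term.split(" ", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            }
            default:              // plain term query
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(triple("title", "adj", "sun is up"));      // title:"sun is up"
        System.out.println(triple("date", "within", "2005 2008"));    // date:[2005 TO 2008]
        System.out.println(triple("title", "proximity", "5 sun up")); // title:"sun up"~5
    }
}
```

A complete query would then be built by joining such fragments with AND, OR and NOT, as explained above.<br />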
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that the "ObjectID" field, which specifies a document's unique identifier, is required for each document.<br />
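<br />
For illustration only, a ROWSET like the one above can be assembled with plain string handling; this is just a sketch of the document shape described here, not an official feeding API.<br />

```java
// Illustration only: assembling a minimal ROWSET document as a string,
// following the schema described above (ObjectID is the required identifier).
public class RowsetExample {

    // wraps one field value in a FIELD element
    static String field(String name, String value) {
        return "    <FIELD name=\"" + name + "\">" + value + "</FIELD>\n";
    }

    static String rowset() {
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"IndexTypeName\" colID=\"colA\" lang=\"en\">\n");
        sb.append("  <ROW>\n");
        sb.append(field("ObjectID", "doc1")); // required: the document's unique identifier
        sb.append(field("title", "How to create an Index"));
        sb.append("  </ROW>\n");
        sb.append("</ROWSET>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(rowset());
    }
}
```
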
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
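<br />
As a sketch of how these yes/no flags might be read from a single "field" element, the hypothetical helper below parses one element with the standard DOM API; it is not the service's actual parser, and it assumes (per the description above) that ''highlightable'' defaults to yes when absent.<br />

```java
// Hypothetical sketch (not the service's actual parser): reading the yes/no
// flags of one <field> element from an IndexType into boolean values.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class FieldSpec {

    // returns the yes/no flag of a child element, or dflt when the element is absent
    static boolean flag(Element field, String tag, boolean dflt) {
        NodeList nl = field.getElementsByTagName(tag);
        return nl.getLength() == 0 ? dflt : "yes".equals(nl.item(0).getTextContent().trim());
    }

    static String describe(String xml) throws Exception {
        Element f = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8"))).getDocumentElement();
        // per the description above, highlightable defaults to "yes" when absent
        return f.getAttribute("name")
                + " index=" + flag(f, "index", false)
                + " store=" + flag(f, "store", false)
                + " highlightable=" + flag(f, "highlightable", true);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe(
            "<field name=\"title\"><index>yes</index><store>yes</store></field>"));
    }
}
```
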
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run Index Service on a node we will need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
'''NOTE''': the ''resourcesFoldername'' and ''dataDir'' properties have relative paths as their default values. In some cases these values may be evaluated relative to the folder from which the container was started, so in order to avoid problems related to this behavior it is better to give these properties absolute paths as values.<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and the IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try {<br />
    indexClient.feedLocator(locator);<br />
    indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that the "ObjectID" field, which specifies a document's unique identifier, is required for each document.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
   <field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
   <field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, always returning "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factory uses the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created are:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the package name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may contain the language per field in the document. If present, this specified language is used when indexing the document and the language id module is not used. <br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used. The terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerIdFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory with RHE3 shared objects (.so) and one with RHE4 shared objects. The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so.<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
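For example, in a Bourne-compatible shell this can be done with:<br />
<br />
<pre><br />
export LD_LIBRARY_PATH=$GLOBUS_LOCATION/lib:$LD_LIBRARY_PATH<br />
</pre><br />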
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml file contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the GLOBUS_LOCATION environment variable. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin is described under ''Linguistics Licenses'' below.<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same method used to specify the language for a language-aware query is also used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language-aware query) will be made available shortly.<br />
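As a simple illustration (the expansions below are hypothetical; the actual forms come from the language's .aut dictionary), an English query term is rewritten into a disjunction of its known forms:<br />
<br />
<pre><br />
mouse  -->  (mouse OR mice)<br />
go     -->  (go OR goes OR went OR gone OR going)<br />
</pre><br />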
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastalngid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefand.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service. In cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content which did not fit on the first resource. Partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication in the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR and NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, we could eliminate this subquery, since it could not produce any results (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
<br />
The transformation of an initial CQL query into a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set equal to x1 if not provided), and start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided), in addition to any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to full-text indices, a row in a GeoROWSET may contain only some of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which holds all the fields that can be stored in order to be presented in the query results or used for refinement. The following is a possible GeoIndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of 100 (or so...)<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
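To illustrate the DATE format above, the following self-contained sketch parses a full timestamp and a year-only value as they could appear in a FIELD of type "date" (the class name and the use of SimpleDateFormat with the "SSS" fractional-second pattern are illustrative only; the index's own parser is internal):<br />
<br />
<pre><br />
import java.text.SimpleDateFormat;<br />
import java.util.Date;<br />
<br />
public class DateFieldExample {<br />
    public static void main(String[] args) throws Exception {<br />
        // Full timestamp, all components present<br />
        SimpleDateFormat full = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");<br />
        Date d = full.parse("2001-05-27T14:35:25.523");<br />
        // Year-only value, since only yyyy is mandatory<br />
        Date y = new SimpleDateFormat("yyyy").parse("2001");<br />
        System.out.println(y.before(d));<br />
    }<br />
}<br />
</pre><br />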
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. First, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches a query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the RankEvaluator plugin, providing the plugin with any arguments supplied to it. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the time span, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the ResultSet page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds since the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0/(spanSize + 1);<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
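Assuming the plugin has been registered under the ID "SpanSizeRanker" (this ID is hypothetical; the actual registration mechanism determines it), it could then be referenced from a CQL query through the ranker modifier shown earlier, with "false" requesting normal ordering and no further arguments needed:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/ranker="SpanSizeRanker false" "1 1 1 10 10 1 10 10"<br />
</pre><br />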
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, providing the plugin with any arguments supplied to it. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time bigger than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However, in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element: a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, but we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add the entry to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
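The filtering rule itself can be sanity-checked outside the service with a small self-contained sketch. Note that TimedEntry below is a hypothetical stand-in for the GeoTools Entry/Data pair, so only the span-size logic carries over from the plugin:<br />

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sanity check of the span-size filtering rule used by the
// SpanSizeRefiner. TimedEntry is a hypothetical stand-in for the
// gCube/GeoTools Entry + Data pair; only the filtering logic is real here.
public class SpanSizeFilterDemo {

    public record TimedEntry(long startTime, long endTime) {}

    public static List<TimedEntry> refine(List<TimedEntry> entries, long maxSpanSize) {
        List<TimedEntry> kept = new ArrayList<>();
        for (TimedEntry e : entries) {
            // Tolerate swapped bounds, as the plugin's refine() does
            long start = Math.min(e.startTime(), e.endTime());
            long end = Math.max(e.startTime(), e.endTime());
            if (end - start <= maxSpanSize) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TimedEntry> entries = List.of(
                new TimedEntry(0, 50_000),        // 50 s span: kept
                new TimedEntry(0, 200_000),       // 200 s span: dropped
                new TimedEntry(150_000, 100_000)  // swapped bounds, 50 s span: kept
        );
        System.out.println(refine(entries, 100_000).size()); // prints 2
    }
}
```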
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design is similar to the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason the creation of multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small ones.<br />
Clusters are distinguished by a clusterID, which is either the indexID or the scope. The deployer of the service can choose between these two by setting the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. All the Couchbase servers within the cluster must share the same credentials in order to connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once, after the Couchbase server installation, before it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
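A purely illustrative sketch of this transformation (the field names are invented, not taken from an actual gCube index):<br />

```
Original CQL query:
  (year >= 2000 and year <= 2010) and (lang = "en" or lang = "es")

Equivalent union of range queries, each a conjunction of single-key criteria:
  (year >= 2000 and year <= 2010 and lang = "en")
  OR
  (year >= 2000 and year <= 2010 and lang = "es")
```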
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
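Since the ROWSET format is plain XML, the indexable keys of a tuple can be extracted with the JDK's built-in DOM parser. The following standalone sketch (the class name is hypothetical, not part of the gCube API) illustrates how the KEY name/value pairs map to a simple key-value view:<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: extract the indexable KEY name/value pairs from a
// ROWSET document using the JDK's built-in DOM parser.
public class RowsetKeyReader {

    public static Map<String, String> readKeys(String rowsetXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowsetXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> keys = new LinkedHashMap<>();
        NodeList keyNodes = doc.getElementsByTagName("KEY");
        for (int i = 0; i < keyNodes.getLength(); i++) {
            Element key = (Element) keyNodes.item(i);
            String name = key.getElementsByTagName("KEYNAME").item(0).getTextContent();
            String value = key.getElementsByTagName("KEYVALUE").item(0).getTextContent();
            keys.put(name, value);
        }
        return keys;
    }

    public static void main(String[] args) throws Exception {
        String rowset = "<ROWSET><INSERT><TUPLE>"
                + "<KEY><KEYNAME>title</KEYNAME><KEYVALUE>sun is up</KEYVALUE></KEY>"
                + "<KEY><KEYNAME>gDocCollectionLang</KEYNAME><KEYVALUE>es</KEYVALUE></KEY>"
                + "</TUPLE></INSERT></ROWSET>";
        System.out.println(readKeys(rowset)); // prints {title=sun is up, gDocCollectionLang=es}
    }
}
```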
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design is similar to the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing two keys (implementing java.util.Comparator).<br />
The objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index, and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. Each '''ForwardIndexLookup''' instance gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria, and each single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client.<br />
The ForwardIndexClient uses a property file, ForwardIndex.properties.<br />
<br />
The property file contains the following properties:<br />
<pre><br />
ForwardIndexManagementFactoryResource=/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
ForwardIndexUpdaterFactoryResource=/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
</pre><br />
<br />
The Host and Port properties must be edited to point to the VO of interest.<br />
<br />
The test client gets the EPRs of the factory services and uses them<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-Resource of each type, inserts some data using the updater resource, and queries<br />
the data using the lookup WS-Resource.<br />
<br />
Inserting data and deleting tuples:<br />
Tuples can be inserted and deleted by:<br />
*insertingPair(key,value) / deletingPair(key): simple methods to insert / delete tuples.<br />
*process(rowSet): method to insert / delete a series of tuples.<br />
*procesResultSet: method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by:<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21719Index Management Framework2014-06-19T12:34:42Z<p>Alex.antoniadi: /* Services */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used by any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
<br />
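As an illustration, a query URL of the shape shown above can be assembled programmatically. The following sketch is hypothetical (the host, resource ID and CQL string are placeholder values, and URL-encoding of the query parameter is assumed to be required):<br />

```java
import java.net.URLEncoder;

public class IndexQueryUrl {

    // Builds a query URL following the index-service REST call shown above.
    // Host, resource ID and CQL expression are caller-supplied placeholders.
    static String buildQueryUrl(String host, String resourceId, String cql)
            throws Exception {
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/"
                + resourceId + "/query?queryString="
                + URLEncoder.encode(cql, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildQueryUrl("localhost:8080", "my-resource-id",
                "(title = tuna)"));
    }
}
```

The resulting string can then be fetched with any HTTP client.<br />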
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and update operations on the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) on each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, can configure the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the content matching the query, taken from the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the index data are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy.properties''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=./resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into a Lucene query. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT equivalently.<br />
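The triple-by-triple mapping in the table above can be sketched as a small translation routine. This is only an illustrative mock-up, not the service's actual implementation; the method name and the argument conventions (e.g. the leading distance in proximity terms, the space-separated bounds in within terms) are assumptions:<br />

```java
// Illustrative sketch of the CQL-triple -> Lucene translation described above.
public class CqlToLucene {

    static String toLucene(String index, String relation, String term) {
        switch (relation) {
            case "adj":   // phrase search
                return index + ":\"" + term + "\"";
            case "fuzzy": // approximate term match
                return index + ":" + term + "~";
            case "proximity": { // term is "<distance> <words...>", e.g. "5 sun up"
                int sp = term.indexOf(' ');
                return index + ":\"" + term.substring(sp + 1) + "\"~"
                        + term.substring(0, sp);
            }
            case "within": { // term is "<low> <high>", e.g. "2005 2008"
                String[] b = term.split(" ");
                return index + ":[" + b[0] + " TO " + b[1] + "]";
            }
            default:      // "=", "==" and other equality-style relations
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(toLucene("title", "adj", "sun is up"));
        System.out.println(toLucene("date", "within", "2005 2008"));
    }
}
```

A full query translator would apply this per triple and join the results with AND, OR and NOT, as described above.<br />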
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field that specifies its unique identifier.<br />
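The two rules above (required idxType/colID attributes, and a mandatory "ObjectID" field in every document) can be checked with a short helper. This is a hypothetical utility, not part of the gCube API:<br />

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RowSetCheck {

    // Returns true only if the document is a ROWSET with the required
    // idxType/colID attributes and every ROW carries an "ObjectID" FIELD.
    static boolean isValidRowSet(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Element rowset = doc.getDocumentElement();
        if (!"ROWSET".equals(rowset.getTagName())
                || rowset.getAttribute("idxType").isEmpty()
                || rowset.getAttribute("colID").isEmpty()) {
            return false;
        }
        NodeList rows = rowset.getElementsByTagName("ROW");
        for (int i = 0; i < rows.getLength(); i++) {
            NodeList fields = ((Element) rows.item(i)).getElementsByTagName("FIELD");
            boolean hasObjectId = false;
            for (int j = 0; j < fields.getLength(); j++) {
                if ("ObjectID".equals(((Element) fields.item(j)).getAttribute("name"))) {
                    hasObjectId = true;
                }
            }
            if (!hasObjectId) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        String ok = "<ROWSET idxType=\"t\" colID=\"c\"><ROW>"
                + "<FIELD name=\"ObjectID\">doc1</FIELD></ROW></ROWSET>";
        System.out.println(isValidRowSet(ok));
    }
}
```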
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no language is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run Index Service on a node we will need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
'''NOTE''': it is important to note that the '''resourcesFoldername''' and '''dataDir''' properties have relative paths in their default values. In some cases these values may be evaluated relative to the folder from which the container was started, so in order to avoid problems related to this behavior it is better if these properties take absolute paths as values.<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resource with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into a Lucene query. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no language is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hitsin all its sub-fields, which is all fields except references. Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits in from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disks pace.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no), and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implements the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implements the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
Both factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization tool written in Java. It implements the N-Gram-Based Text Categorization algorithm described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovak, Slovenian, Danish and Hungarian.<br />
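The n-gram idea behind this kind of categorization can be illustrated with a small sketch (a hypothetical helper, not part of JTextCat): texts are reduced to character n-gram frequency profiles, and a document is assigned the language whose reference profile it matches best.<br />

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the N-gram idea behind JTextCat: texts in the same
// language share frequent character n-grams. This helper (not part of
// JTextCat) builds a trigram frequency profile; real categorization compares
// the rank order of such profiles against per-language statistics.
class NGramProfile {
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String padded = "_" + text.toLowerCase() + "_"; // mark word boundaries
        for (int i = 0; i + 3 <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }
}
```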
<br />
JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
JTextCat requires no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library.<br />
The Fast langid plugin interfaces with a Java wrapper that loads the shared library and calls the native C++ code.<br />
The shared libraries are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory with RHE3 shared objects (.so) and one with RHE4 shared objects. The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library liblangid.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is selected when the resource is created, as part of the create resource call (see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The documents being fed may specify the language per field. If present, this language is used when indexing the document and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for each language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch and Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++, and the C++ code is loaded as a shared library. The Fast lemmatizer plugin interfaces with a Java wrapper that loads the shared library and calls the native C++ code. The shared libraries are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory with RHE3 shared objects (.so) and one with RHE4 shared objects. The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization: a word in a query is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand it with. Currently, the same mechanism used to specify the language for a language-aware query is used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language-aware query) will be made available later.<br />
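The expansion approach can be sketched as follows (an illustrative toy, assuming a tiny hand-made form dictionary; the real module uses Fast's .aut lexicon files):<br />

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of expansion-based lemmatization: each query term is
// replaced by the disjunction of its known forms. The tiny dictionary below is
// invented for illustration; the real module uses Fast's .aut lexicon files.
class ExpansionLemmatizer {
    private static final Map<String, List<String>> FORMS = new HashMap<>();
    static {
        FORMS.put("car", Arrays.asList("car", "cars"));
        FORMS.put("house", Arrays.asList("house", "houses"));
    }

    // Expand a single term into "(form1 OR form2 ...)"; unknown terms pass through.
    static String expand(String term) {
        List<String> forms = FORMS.get(term);
        return forms == null ? term : "(" + String.join(" OR ", forms) + ")";
    }
}
```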
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer modules is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service: when there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource is created to hold the content which did not fit on the first resource. Partitioning is handled automatically and is transparent when performing a query; the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and for exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, ordered by "rankerA", which takes "arg1 arg2" as arguments for the ranking operation, and returned as the output of this simple CQL query. The "false" flag passed to the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections provide the details of the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that would produce zero results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND also contained in polygon P2. Note that if the collections in the two criteria were different, we could eliminate this subquery, since it could not produce any results (each document belongs to exactly one collection). Since the two criteria specify the same collection, we must examine the relation between the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
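The cut-off check described above can be sketched as follows, simplified to axis-aligned bounding rectangles (the real index works with arbitrary polygons; the names here are illustrative):<br />

```java
// Sketch of the cut-off check described above, simplified to axis-aligned
// bounding rectangles (the real index works with arbitrary polygons; the
// names here are illustrative).
class Rect {
    final double x1, y1, x2, y2;
    Rect(double x1, double y1, double x2, double y2) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
    }
    boolean intersects(Rect o) {
        return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
    }
}

class CutOff {
    // An AND of two containment criteria can be eliminated when the collections
    // differ (a document belongs to exactly one collection) or when the two
    // query regions are disjoint (no area can contain the documents).
    static boolean canEliminate(String colA, Rect pA, String colB, Rect pB) {
        return !colA.equals(colB) || !pA.intersects(pB);
    }
}
```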
<br />
The transformation of an initial CQL query into a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation is a lookup for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. The MergeSorter also supports flow control, pausing and synchronizing the workers that execute the single R-tree operations depending on the behavior of the client that reads the results.<br />
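The union performed by the MergeSorter is essentially a k-way merge by score, which can be sketched like this (illustrative names; flow control omitted):<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the MergeSorter idea: merge several already-ranked R-tree result
// streams into a single stream ordered by descending score (names are
// illustrative; the real component also implements flow control).
class ScoredHit {
    final String id;
    final double score;
    ScoredHit(String id, double score) { this.id = id; this.score = score; }
}

class MergeSorter {
    // Classic k-way merge: each input list is assumed sorted by descending
    // score, and a priority queue of (stream, position) pairs repeatedly picks
    // the head with the highest score.
    static List<ScoredHit> merge(List<List<ScoredHit>> streams) {
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) ->
                Double.compare(streams.get(b[0]).get(b[1]).score,
                               streams.get(a[0]).get(a[1]).score));
        for (int s = 0; s < streams.size(); s++) {
            if (!streams.get(s).isEmpty()) pq.add(new int[]{s, 0});
        }
        List<ScoredHit> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] head = pq.poll();
            List<ScoredHit> stream = streams.get(head[0]);
            out.add(stream.get(head[1]));
            if (head[1] + 1 < stream.size()) pq.add(new int[]{head[0], head[1] + 1});
        }
        return out;
    }
}
```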
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set equal to x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided), plus any number of FIELD elements containing a name attribute and information to be stored and possibly used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. As with fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list of all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the kind of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element define how that field should be handled. The meaning and expected content of each is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing for queries. First, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information to eliminate the false hits. While the filtering step is handled internally by the index, the refinement step is handled by a refiner plugin. The plugin is provided with the query region, the object data, the GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information to determine whether an object matches the query or not.<br />
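The two-step scheme can be illustrated with a circular query region: the MBR filter admits points near the rectangle's corners that the exact refinement step then discards (a hypothetical helper, not the actual refiner plugin API):<br />

```java
// Illustration of two-step query processing with a circular query region:
// the cheap MBR filter admits some false hits near the rectangle's corners,
// and the exact refinement step discards them (hypothetical helper, not the
// actual refiner plugin API).
class TwoStepQuery {
    // Step 1: cheap test of a point against the circle's minimal bounding rectangle.
    static boolean mbrFilter(double cx, double cy, double r, double px, double py) {
        return px >= cx - r && px <= cx + r && py >= cy - r && py <= cy + r;
    }

    // Step 2: costly exact test; eliminates the false hits that pass step 1.
    static boolean refine(double cx, double cy, double r, double px, double py) {
        double dx = px - cx, dy = py - cy;
        return dx * dx + dy * dy <= r * r;
    }
}
```

A point such as (0.9, 0.9) for a unit circle centered at the origin passes the MBR filter but fails refinement, which is exactly the kind of false hit the second step exists to remove.<br />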
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied to it. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds since the Epoch. If we wanted to evaluate the Minimum Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1); // floating-point division; 1/(spanSize + 1) would truncate to 0<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, supplying the plugin with any arguments provided by the caller. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Refiner|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a period longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, however we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design closely follows the Full Text Index Service design. The forward index supports the following schema for each key-value pair:<br />
key: integer, value: string<br />
key: float, value: string<br />
key: string, value: string<br />
key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase Node. For this reason the creation of multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
The number of replicas for each index can be configured in Couchbase, which is the underlying technology of the new Forward Index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml in order for the service to communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and also the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they know how to connect with each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once after the Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
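The outcome of the transformation above can be illustrated with a minimal evaluator. The types below are hypothetical stand-ins, not the service's actual classes: a document matches the union if it satisfies every criterion of at least one conjunctive range query, and the merge step simply deduplicates documents matched by more than one range query.<br />

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative evaluator for a union of conjunctive range criteria,
// mirroring the transformed CQL form described above (hypothetical types).
public class RangeUnionSketch {
    // A single range criterion on one indexed key: min <= value <= max.
    static class Criterion {
        final String key; final int min, max;
        Criterion(String key, int min, int max) { this.key = key; this.min = min; this.max = max; }
        boolean matches(Map<String, Integer> doc) {
            Integer v = doc.get(key);
            return v != null && v >= min && v <= max;
        }
    }

    // Each inner list is a conjunction of criteria; the full query is their union.
    static Set<String> evaluate(Map<String, Map<String, Integer>> docs,
                                List<List<Criterion>> union) {
        Set<String> result = new TreeSet<>(); // the set performs the merge (deduplication)
        for (Map.Entry<String, Map<String, Integer>> e : docs.entrySet()) {
            for (List<Criterion> conjunction : union) {
                boolean all = true;
                for (Criterion c : conjunction) all &= c.matches(e.getValue());
                if (all) { result.add(e.getKey()); break; }
            }
        }
        return result;
    }
}
```

A "cut-off" rule would correspond to dropping a conjunction that can be proven empty (e.g. min greater than max) before evaluation.<br />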
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
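A tuple of this shape can be assembled programmatically. The sketch below is plain string building (no gCube API involved); the RowsetTupleBuilder class and its method are illustrative only, with the key/field names taken from the example above.<br />

```java
import java.util.Map;

// Illustrative builder for a single ROWSET tuple: KEYs carry the indexable
// information, FIELDs under VALUE carry the presentable information.
public class RowsetTupleBuilder {
    static String buildTuple(Map<String, String> keys, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<TUPLE>");
        for (Map.Entry<String, String> k : keys.entrySet()) {
            sb.append("<KEY><KEYNAME>").append(k.getKey())
              .append("</KEYNAME><KEYVALUE>").append(k.getValue())
              .append("</KEYVALUE></KEY>");
        }
        sb.append("<VALUE>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append("<FIELD name=\"").append(f.getKey()).append("\">")
              .append(f.getValue()).append("</FIELD>");
        }
        return sb.append("</VALUE></TUPLE>").toString();
    }
}
```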
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design closely follows the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key-value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing two keys (one that implements java.util.Comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of the '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index, and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances get the index from Content Management, and establish a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, each of which refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner to the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
An example row set conforming to this "schema":<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
<pre><br />
ForwardIndexManagementFactoryResource=/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
</pre><br />
<br />
The properties Host and Port must be edited to point to the VO of interest.<br />
<br />
The test client obtains the EPRs of the factory services and uses them<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-Resource of each type, inserts some data through the updater resource, and queries<br />
the data by using the lookup WS-Resource.<br />
<br />
Inserting data and deleting tuples:<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) - simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
processResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21718Index Management Framework2014-06-19T12:34:20Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used by different general-purpose libraries that support REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
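Because the CQL expression contains spaces and reserved characters, it must be URL-encoded before being sent. The following sketch (not part of the official client library; the host and resource ID are placeholders) shows how such a call could be built with plain JDK classes:<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds the HTTP GET query URL shown above;
// the host and resource ID passed in are placeholders, not real endpoints.
public class IndexQueryUrl {

    public static String build(String host, String resourceId, String cql) {
        // The CQL expression must be URL-encoded before being appended
        // as the queryString parameter.
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/"
                + resourceId + "/query?queryString="
                + URLEncoder.encode(cql, StandardCharsets.UTF_8);
    }
}
```

The resulting URL can then be fetched with any HTTP client library.<br />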
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for managing, updating and querying the node. It consolidates the 3 services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
The service is actually a wrapper over ElasticSearch, and each IndexNode has a one-to-one relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) in each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between the two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet showing where the query matched the presentable fields. This snippet is usually a concatenation of a number of matching fragments in those fields. The maximum size of each fragment, as well as the maximum number of fragments that will be used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index is stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use Resource Registry or not (for translation of field ids to field names) we can change the value of the variable ''useRRAdaptor'' in the ''deploy.properties''<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT accordingly.<br />
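As an illustration only (this is not the actual service code), the transformation rules from the table above for a single Index-Relation-Term triple could be sketched as follows:<br />

```java
// Illustrative sketch of the CQL-triple-to-Lucene mapping described in
// the table above. The multi-field expansion of "allIndexes" and the
// boolean combination of triples are omitted for brevity.
public class CqlToLucene {

    public static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase: title adj "sun is up" -> title:"sun is up"
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy: title fuzzy "invorvement" -> title:invorvement~
                return index + ":" + term + "~";
            case "proximity": // "5 sun up" -> title:"sun up"~5 (first token is the distance)
                String[] parts = term.split(" ", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            case "within":    // "2005 2008" -> date:[2005 TO 2008]
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            default:          // '=' and '==' fall back to a plain term query
                return index + ":" + term;
        }
    }
}
```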
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field that specifies its unique identifier.<br />
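A hypothetical helper (not part of any gCube library) that assembles such a ROWSET document could look like the following; the field names and index type used are illustrative:<br />

```java
// Hypothetical builder for ROWSET documents as described above.
// It performs no XML escaping; real content would need proper escaping.
public class RowSetBuilder {

    // Build one ROW element with the mandatory ObjectID field plus
    // example title/contents fields.
    public static String buildRow(String objectId, String title, String contents) {
        return "<ROW>"
             + "<FIELD name=\"ObjectID\">" + objectId + "</FIELD>"
             + "<FIELD name=\"title\">" + title + "</FIELD>"
             + "<FIELD name=\"contents\">" + contents + "</FIELD>"
             + "</ROW>";
    }

    // Wrap rows in a ROWSET element carrying the required idxType and
    // colID attributes and the optional lang attribute.
    public static String buildRowSet(String idxType, String colId, String lang, String... rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colId)
          .append("\" lang=\"").append(lang).append("\">");
        for (String row : rows) {
            sb.append(row);
        }
        sb.append("</ROWSET>");
        return sb.toString();
    }
}
```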
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node we need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be on the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
'''NOTE''': it is important to note that the '''resourcesFoldername''' and '''dataDir''' properties have relative paths in their default values. In some cases these values may be evaluated relative to the folder from which the container was started, so in order to avoid problems related to this behavior it is better if these properties are given absolute paths as values.<br />
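Putting together the properties described in this and the previous sections, a complete ''deploy.properties'' might look like the following (all values are illustrative):<br />

```properties
hostname=dl015.madgik.di.uoa.gr
port=8080
scope=/gcube/devNext
defaultSameCluster=true
noReplicas=1
noShards=2
maxFragmentCnt=5
maxFragmentSize=80
dataDir=/var/lib/index/data
resourcesFoldername=/tmp/resources/index
useRRAdaptor=true
clientMode=false
```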
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained bellow:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
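A consumer of an IndexType can read these options with a few lines of DOM code. The following is a minimal, hypothetical sketch (illustrative class and method names, flat field list only; a real consumer would read all six options and handle sub-fields):<br />

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class IndexTypeReader {
    // Return the names of the fields whose <index> option is "yes".
    static List<String> indexedFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList fields = doc.getElementsByTagName("field");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fields.getLength(); i++) {
            Element f = (Element) fields.item(i);
            // each option element holds "yes" or "no" (boost holds a float)
            String indexed = f.getElementsByTagName("index").item(0).getTextContent();
            if ("yes".equals(indexed)) names.add(f.getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<index-type><field-list>"
                + "<field name=\"title\"><index>yes</index><store>yes</store></field>"
                + "<field name=\"contents\"><index>no</index><store>no</store></field>"
                + "</field-list></index-type>";
        System.out.println(indexedFields(xml));   // [title]
    }
}
```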
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, i.e. in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
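As a rough illustration, a simple CQL index/relation/term triple maps naturally onto Lucene's "field:term" syntax, and CQL's boolean connectives pass through unchanged. The translate helper below is a hypothetical sketch, not the service's actual transformer (which handles full CQL, modifiers and statistics):<br />

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CqlToLucene {
    // Minimal, illustrative mapping: "field = term" becomes "field:term";
    // AND/OR/NOT are left as-is since Lucene uses the same connectives.
    static String translate(String cql) {
        Matcher m = Pattern.compile("(\\w+)\\s*=\\s*(\\S+)").matcher(cql);
        return m.replaceAll("$1:$2");
    }

    public static void main(String[] args) {
        System.out.println(translate("title = whale AND contents = hunting"));
        // title:whale AND contents:hunting
    }
}
```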
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin that does nothing and always returns "nolang". The lemmatizer module has one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin that always returns an empty String ("").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and for the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see below). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see how to install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin frameworks for the language identifier and the lemmatizer are in the SVN module "common". The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
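The loading mechanism described above can be sketched in plain Java. The interface and class names below are illustrative stand-ins for the real plugin classes:<br />

```java
public class PluginDemo {
    // Illustrative stand-in for the real LemmatizerPlugin interface.
    public interface LemmatizerPlugin {
        String lemmatize(String term, String lang);
    }

    // Mirrors the dummy plugin described above: always returns "".
    public static class DummyLemmatizerPlugin implements LemmatizerPlugin {
        public String lemmatize(String term, String lang) { return ""; }
    }

    // The factory loads an implementation by class name at runtime,
    // exactly as the index service factories do.
    static LemmatizerPlugin load(String pluginName) throws Exception {
        return (LemmatizerPlugin) Class.forName(pluginName).newInstance();
    }

    public static void main(String[] args) throws Exception {
        LemmatizerPlugin p = load("PluginDemo$DummyLemmatizerPlugin");
        System.out.println("[" + p.lemmatize("cars", "en") + "]");   // []
    }
}
```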
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created are:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization tool written in Java. It implements the N-Gram-Based Text Categorization algorithm described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
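The core of the N-Gram-Based Text Categorization algorithm is a frequency-ranked profile of character n-grams, compared between the document and per-language reference profiles by rank distance. The following self-contained sketch (not JTextCat's actual API) builds such a profile:<br />

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramProfile {
    // Build a frequency-ranked character n-gram profile of a text.
    static List<String> profile(String text, int n, int top) {
        Map<String, Integer> freq = new HashMap<>();
        // Word boundaries are marked with "_" as in the original algorithm.
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        for (int i = 0; i + n <= padded.length(); i++)
            freq.merge(padded.substring(i, i + n), 1, Integer::sum);
        List<String> grams = new ArrayList<>(freq.keySet());
        grams.sort((a, b) -> freq.get(b) - freq.get(a));
        return grams.subList(0, Math.min(top, grams.size()));
    }

    public static void main(String[] args) {
        // Classification would compare this profile against per-language
        // profiles; here we only build the document profile.
        System.out.println(profile("the quick brown fox", 3, 5));
    }
}
```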
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one subcatalog with RHE3 and one with RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may specify the language per field. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
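A hypothetical helper for decomposing such language-restricted terms might look as follows (it handles only the "_lang_xx" and "_lang_xx_field" prefixes shown above; names are illustrative, not part of the service API):<br />

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LangTermParser {
    // Decompose a term like "_lang_en_title:bus" into {lang, field, term}.
    // The field part is optional: "_lang_en:bus" yields a null field,
    // meaning the restriction applies to the default field.
    static String[] parse(String token) {
        Matcher m = Pattern.compile("_lang_([a-z]+)(?:_(\\w+))?:(\\S+)").matcher(token);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parse("_lang_en_title:bus");
        System.out.println(p[0] + " " + p[1] + " " + p[2]);   // en title bus
    }
}
```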
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for each language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++, and the C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib catalog contains one subcatalog with RHE3 and one with RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' catalogue must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml file contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand it with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
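Query-side expansion can be sketched as follows. The form table here is a toy stand-in for the per-language .aut automata, and all names are illustrative:<br />

```java
import java.util.List;
import java.util.Map;

public class LemmaExpansion {
    // Toy lemma table: each known form of a term (stand-in for the
    // per-language .aut automata shipped with the real lemmatizer).
    static final Map<String, List<String>> FORMS = Map.of(
            "car", List.of("car", "cars"),
            "mouse", List.of("mouse", "mice"));

    // Expansion: every known form of the term is OR-ed into the query.
    static String expand(String term) {
        List<String> forms = FORMS.getOrDefault(term, List.of(term));
        return "(" + String.join(" OR ", forms) + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("mouse") + " AND " + expand("trap"));
        // (mouse OR mice) AND (trap)
    }
}
```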
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer modules is valid through March 2008.<br />
<br />
If a new license is required, please contact stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for a new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content which did not fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and for exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the higher scores must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
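The inclusion types can be illustrated with a minimal, hypothetical sketch using axis-aligned bounding boxes (the real service indexes arbitrary polygons through Geotools; the class and method names below are illustrative only):<br />

```java
public class Inclusion {
    // A box is {x1, y1, x2, y2} with x1 <= x2 and y1 <= y2, standing in
    // for document footprints and query polygons.

    // Inclusion type 0 ("intersects"): the two boxes overlap.
    static boolean intersects(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
    }

    // Inclusion type 1 ("contains"): the document box is contained
    // in the query box.
    static boolean contains(double[] doc, double[] query) {
        return query[0] <= doc[0] && query[1] <= doc[1]
            && doc[2] <= query[2] && doc[3] <= query[3];
    }

    public static void main(String[] args) {
        double[] query = {1, 1, 10, 10};
        double[] doc   = {2, 2, 5, 5};
        System.out.println(intersects(doc, query) + " " + contains(doc, query));
        // true true
    }
}
```

The same intersects check also explains the "cut-off" rule described below: if two query boxes of a conjunction do not intersect, no document can satisfy both criteria.<br />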
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that have zero results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any results (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then<br />
there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
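The cut-off rule above can be sketched in a few lines of Java. This is only an illustration, using axis-aligned bounding rectangles in place of full polygon geometry; the class and method names are hypothetical and not part of the service API.<br />

```java
// Illustrative "cut-off" check: a conjunction of two containment criteria
// can be dropped when it can never match any document.
public class CutOffRule {

    // A minimal axis-aligned rectangle standing in for a polygon's MBR.
    static final class Mbr {
        final double x1, y1, x2, y2;

        Mbr(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        boolean intersects(Mbr o) {
            return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
        }
    }

    // True when (contained in regionA of colA) AND (contained in regionB of colB)
    // is unsatisfiable: either the collections differ (a document belongs to
    // exactly one collection), or the two regions share no common area.
    static boolean canEliminate(String colA, Mbr regionA, String colB, Mbr regionB) {
        if (!colA.equals(colB)) {
            return true;
        }
        return !regionA.intersects(regionB);
    }

    public static void main(String[] args) {
        Mbr p1 = new Mbr(0, 0, 10, 10);
        Mbr p2 = new Mbr(20, 20, 30, 30); // disjoint from p1
        System.out.println(canEliminate("colA", p1, "colA", p2)); // disjoint: drop the subquery
        System.out.println(canEliminate("colA", p1, "colA", p1)); // same region: must execute it
    }
}
```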
<br />
The transformation of an initial CQL query into a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
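The union performed by the MergeSorter can be pictured as a k-way merge of per-R-tree result streams ordered by score. The following is a simplified sketch, under the assumption that each stream is already sorted by descending score; the names are illustrative and not taken from the service code.<br />

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified k-way merge by descending score, in the spirit of the
// MergeSorter's union of the individual R-tree operations.
public class ScoreMerge {

    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // A cursor over one stream: the current head plus the rest of the stream.
    static final class Cursor {
        Hit head;
        final Iterator<Hit> rest;

        Cursor(Iterator<Hit> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // Each input list must already be sorted by descending score (one per R-tree).
    static List<Hit> merge(List<List<Hit>> streams) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                Comparator.comparingDouble((Cursor c) -> c.head.score).reversed());
        for (List<Hit> s : streams) {
            if (!s.isEmpty()) {
                queue.add(new Cursor(s.iterator()));
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll(); // best remaining score across all streams
            merged.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                queue.add(c); // re-insert with the stream's next head
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> a = Arrays.asList(new Hit("d1", 0.9), new Hit("d3", 0.5));
        List<Hit> b = Arrays.asList(new Hit("d2", 0.7), new Hit("d4", 0.1));
        for (Hit h : merge(Arrays.asList(a, b))) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```

The real component additionally applies flow control, pausing the workers that feed the streams when the client reads results slowly.<br />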
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory; x2 is set equal to x1 if not provided) as well as start and end Y coordinates (y1 is mandatory; y2 is set equal to y1 if not provided). In addition, it may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which enumerates all the fields that can be stored in order to be presented in the query results or used for refinement. The following is a possible GeoIndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of 100 (or so...)<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
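As an aside on the DATE type, the format above implies that components missing from a value default sensibly. A hedged sketch of such parsing, reducing the value to seconds since the Epoch (the representation the index uses internally, as noted in the plugin examples), might look as follows; the pattern list, class and method names are assumptions for illustration, not the service's actual parsing code.<br />

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Illustrative parsing of the DATE field format yyyy-MM-dd'T'HH:mm:ss.s,
// where only yyyy is mandatory. Missing components default to their minimum
// values, and the result is reduced to seconds since the Epoch.
public class DateField {

    // Candidate patterns, tried from most to least specific.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss.S", "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd'T'HH:mm", "yyyy-MM-dd", "yyyy"
    };

    static long toEpochSeconds(String value) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern);
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            fmt.setLenient(false);
            try {
                return fmt.parse(value).getTime() / 1000L;
            } catch (ParseException ignored) {
                // fall through to the next, less specific pattern
            }
        }
        throw new IllegalArgumentException("unparsable DATE value: " + value);
    }

    public static void main(String[] args) {
        System.out.println(toEpochSeconds("2001-06-27"));
        System.out.println(toEpochSeconds("2001")); // only the year is mandatory
    }
}
```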
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and the ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to return a meaningful rank for each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to process a query. First, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (mbr) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object is within the query region or not.<br />
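A toy illustration of the two-step idea (not the service's code; a circle-shaped query region is chosen here only so that the cheap filter and the exact check actually differ):<br />

```java
// Illustrative two-step query processing: a cheap filter on the query region's
// minimal bounding rectangle (mbr) admits candidates (including false hits),
// and a refinement step checks the exact region to eliminate them.
public class TwoStepQuery {

    static final class Point {
        final double x, y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    // Step 1: test against the MBR of a circle centered at (cx, cy) with radius r.
    static boolean mbrFilter(Point p, double cx, double cy, double r) {
        return p.x >= cx - r && p.x <= cx + r && p.y >= cy - r && p.y <= cy + r;
    }

    // Step 2: exact containment in the circle, eliminating the false hits.
    static boolean refine(Point p, double cx, double cy, double r) {
        double dx = p.x - cx, dy = p.y - cy;
        return dx * dx + dy * dy <= r * r;
    }

    public static void main(String[] args) {
        Point corner = new Point(0.9, 0.9); // inside the MBR, outside the circle
        System.out.println(mbrFilter(corner, 0, 0, 1)); // survives filtering
        System.out.println(refine(corner, 0, 0, 1));    // removed by refinement
    }
}
```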
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1); // 1.0, not 1: integer division would always yield 0<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design mirrors the Full Text Index Service design. The Forward Index supports the following schema for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to be able to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml in order for the service to communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server, and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they know how to connect with each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that there is a possibility for a ForwardIndexNode to stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, for the first time after Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# To initialize a node, run:<br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down, you also need to remove the Couchbase server from the cluster and reinitialize it so that it can be restarted later; run:<br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want, for some reason, to delete the bucket (index) in order to rebuild it, simply run:<br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference:<br />
//StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, for retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design is similar to the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of the '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index, and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances get the index from Content Management, and establish a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client locates the Factory services (gets their EPRs) and uses them<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-Resource of each type, inserts some data through the updater resource, and queries<br />
the data by using the lookup WS-Resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21717Index Management Framework2014-06-19T12:29:13Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used by any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
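Such a call can be issued from the command line as sketched below. The host name and resource ID are placeholders (not real values), and the CQL string must be URL-encoded before it is appended as a GET parameter:<br />
<br />
```shell
# Hypothetical values -- substitute the gHN host:port and the resource ID of your index.
HOST="localhost:8080"
RESOURCE_ID="my-resource-id"
QUERY='((e24f6285-46a2-4395-a402-99330b326fad = tuna) and ((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f))) project ae58ca58-55b7-47d1-a877-da783a758302'

# URL-encode the CQL query string and build the request URL.
ENCODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' "$QUERY")
URL="http://${HOST}/index-service-1.0.0-SNAPSHOT/${RESOURCE_ID}/query?queryString=${ENCODED}"
echo "$URL"
# curl "$URL"   # uncomment to actually issue the HTTP GET
```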
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for the management, lookup and updating of the node, and consolidates the three services that were used in the old Full Text Index.<br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, the creation of multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) in each container that constitutes the cluster.<br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small ones.<br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example:<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file.<br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old one). If highlighting is enabled, the index returns a snippet of the presentable fields that match the query. This snippet is usually a concatenation of a number of matching fragments of those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use Resource Registry or not (for translation of field ids to field names) we can change the value of the variable ''useRRAdaptor'' in the ''deploy.properties''<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
The Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT equivalently.<br />
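As an illustration only (the class and method below are hypothetical helpers, not part of the service API), the single-triple mappings from the table above can be reproduced with plain string building; the multi-field expansion of allIndexes is omitted:<br />
<br />
```java
// Sketch of the CQL triple -> Lucene query-string mapping shown in the table above.
public class CqlTripleSketch {

    static String toLucene(String index, String relation, String term) {
        switch (relation) {
            case "adj":        // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":      // fuzzy term query
                return index + ":" + term + "~";
            case "proximity": { // term is "<distance> <words...>", e.g. "5 sun up"
                int sp = term.indexOf(' ');
                return index + ":\"" + term.substring(sp + 1) + "\"~" + term.substring(0, sp);
            }
            case "within": {   // term is "<low> <high>", e.g. "2005 2008"
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:           // "=" and "==" map to a plain term query
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(toLucene("title", "adj", "sun is up"));   // title:"sun is up"
        System.out.println(toLucene("date", "within", "2005 2008")); // date:[2005 TO 2008]
    }
}
```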
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if none is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node, the following are needed:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
'''NOTE''': it is important to note that the '''resourcesFoldername''' and '''dataDir''' properties have relative paths as their default values. In some cases these values may be evaluated relative to the folder from which the container was started, so in order to avoid problems related to this behavior it is better if these properties take absolute paths as values.<br />
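Putting the properties discussed above together, a complete ''deploy.properties'' might look like the following sketch (all values are illustrative and must be adapted to the target node):<br />
<br />
```properties
# cluster naming: clusterID equals the indexID (true) or the scope (false)
defaultSameCluster=true

# ElasticSearch index layout
noReplicas=1
noShards=2

# highlighting snippet construction
maxFragmentCnt=5
maxFragmentSize=80

# storage locations -- absolute paths are recommended
dataDir=/home/gcube/index/data
resourcesFoldername=/home/gcube/index/resources

# Resource Registry usage
useRRAdaptor=true
clientMode=false

# node identity
hostname=dl015.madgik.di.uoa.gr
port=8080
scope=/gcube/devNext
```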
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and the IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except references. Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, always returning "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no), and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implements the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implements the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factory use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the package name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages are contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the package name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The documents being fed may specify the language per field. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerIdFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' must therefore be include in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the env variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand it with. Currently the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType));<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory =lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(factoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(instanceEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complex queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output for this simple CQL query. The "false" indication in the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
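The modifier syntax above can be assembled mechanically. The following sketch is a hypothetical helper class (not part of the gCube API) that builds a geosearch triple from its optional parts; the parameter names mirror the modifiers described above:

```java
public class GeoSearchQueryBuilder {
    // Hypothetical helper: assembles a geosearch CQL triple from its parts.
    // The parameter names (colID, lang, inclusion, ranker, refiner) mirror
    // the modifiers described above; pass null (or a negative inclusion)
    // to omit an optional modifier.
    static String build(String colID, String lang, int inclusion,
                        String ranker, String refiner, String polygon) {
        StringBuilder sb = new StringBuilder("geo geosearch");
        if (colID != null)   sb.append("/colID=\"").append(colID).append("\"");
        if (lang != null)    sb.append("/lang=\"").append(lang).append("\"");
        if (inclusion >= 0)  sb.append("/inclusion=\"").append(inclusion).append("\"");
        if (ranker != null)  sb.append("/ranker=\"").append(ranker).append("\"");
        if (refiner != null) sb.append("/refiner=\"").append(refiner).append("\"");
        sb.append(" \"").append(polygon).append("\"");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Reproduces the example query shown above
        System.out.println(build("colA", "en", 0,
                "rankerA false arg1 arg2", "refinerA arg1 arg2 arg3",
                "1 1 1 10 10 1 10 10"));
    }
}
```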
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that cannot produce any results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
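The cut-off rule boils down to two checks: the criteria must name the same collection, and the two query polygons must have a non-empty intersection. A minimal sketch, using java.awt geometry as a stand-in for the geometry library actually used by the index:

```java
import java.awt.Polygon;
import java.awt.geom.Area;

public class CutOffCheck {
    // Hypothetical sketch of the cut-off rule: a conjunction of two
    // containment criteria can only match if both name the same collection
    // and the two query polygons intersect. java.awt geometry stands in
    // for the geometry library actually used by the index.
    static boolean mayHaveResults(String colA, Polygon p1, String colB, Polygon p2) {
        if (!colA.equals(colB)) return false;  // different collections: no document in both
        Area intersection = new Area(p1);
        intersection.intersect(new Area(p2));  // constructive-area intersection
        return !intersection.isEmpty();        // empty: no region to be contained in
    }

    public static void main(String[] args) {
        Polygon p1 = new Polygon(new int[]{0, 10, 10, 0}, new int[]{0, 0, 10, 10}, 4);
        Polygon p2 = new Polygon(new int[]{20, 30, 30, 20}, new int[]{20, 20, 30, 30}, 4);
        System.out.println(mayHaveResults("colA", p1, "colA", p2)); // disjoint polygons: false
    }
}
```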
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
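The union performed by the MergeSorter amounts to a k-way merge of score-ordered worker streams. The class below is a hypothetical illustration of that merge (worker results modeled as in-memory lists of id/score pairs, duplicates across workers dropped), not the actual gCube component, and it omits the flow-control aspect described above:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

public class MergeSorterSketch {
    // Hypothetical sketch of the MergeSorter's union: each R-tree worker
    // yields results ordered by descending score; a priority queue merges
    // them into one globally ordered stream, dropping an id that already
    // appeared in another worker's output.
    static List<String> merge(List<List<Map.Entry<String, Double>>> workers) {
        // Each heap entry holds {score, workerIndex, positionInWorker}
        PriorityQueue<double[]> heap =
                new PriorityQueue<>((a, b) -> Double.compare(b[0], a[0]));
        for (int w = 0; w < workers.size(); w++) {
            if (!workers.get(w).isEmpty()) {
                heap.add(new double[]{workers.get(w).get(0).getValue(), w, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        while (!heap.isEmpty()) {
            double[] top = heap.poll();
            int w = (int) top[1], i = (int) top[2];
            String id = workers.get(w).get(i).getKey();
            if (seen.add(id)) merged.add(id);          // skip duplicate ids
            if (i + 1 < workers.get(w).size()) {       // advance that worker's stream
                heap.add(new double[]{workers.get(w).get(i + 1).getValue(), w, i + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Double>>> workers = List.of(
                List.of(Map.entry("a", 0.9), Map.entry("c", 0.4)),
                List.of(Map.entry("b", 0.7), Map.entry("a", 0.3)));
        System.out.println(merge(workers)); // [a, b, c]
    }
}
```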
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory; x2 is set to equal x1 if not provided) as well as start and end Y coordinates (y1 is mandatory; y2 is set to equal y1 if not provided). In addition, a ROW may contain any number of FIELD elements, each carrying a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
  <ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of 100 (or so...)<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
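Since dates are internally represented as a long holding seconds from the Epoch (as noted in the ranking example below), parsing the DATE format with its optional components could be sketched as follows. The helper class and its pattern list are illustrative, not part of the index code:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class DateFieldParser {
    // Hypothetical sketch: parse the DATE field format yyyy-MM-dd'T'HH:mm:ss.s,
    // where only the year is mandatory, into seconds since the Epoch.
    // Patterns are tried from most to least specific.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss.SSS", "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd'T'HH:mm", "yyyy-MM-dd", "yyyy-MM", "yyyy"
    };

    static long toEpochSeconds(String value) throws ParseException {
        for (String pattern : PATTERNS) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern);
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            fmt.setLenient(false);
            try {
                return fmt.parse(value).getTime() / 1000L;
            } catch (ParseException ignored) { /* try the next, shorter pattern */ }
        }
        throw new ParseException("Unparseable date: " + value, 0);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toEpochSeconds("1970-01-01T00:00:10")); // 10
        System.out.println(toEpochSeconds("2001-06-27"));
    }
}
```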
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. First, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
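The two-step scheme can be illustrated with plain Java geometry: an MBR test stands in for the R-tree filtering step, and an exact polygon test for the refinement step. A hypothetical sketch, with entries modeled simply by their bounding rectangles:

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class TwoStepQuery {
    // Hypothetical sketch of two-step query processing: a cheap filtering
    // step against the query's minimal bounding rectangle (as the R-tree
    // does), followed by an exact refinement step against the polygon.
    static List<Rectangle2D> filterAndRefine(Path2D queryPolygon, List<Rectangle2D> entries) {
        Rectangle2D mbr = queryPolygon.getBounds2D();
        List<Rectangle2D> hits = new ArrayList<>();
        for (Rectangle2D entryBounds : entries) {
            if (!mbr.intersects(entryBounds)) continue;   // filtering step: MBR test only
            if (queryPolygon.intersects(entryBounds)) {   // refinement step: exact geometry
                hits.add(entryBounds);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Path2D.Double triangle = new Path2D.Double();
        triangle.moveTo(0, 0); triangle.lineTo(10, 0); triangle.lineTo(0, 10); triangle.closePath();
        List<Rectangle2D> entries = new ArrayList<>();
        entries.add(new Rectangle2D.Double(1, 1, 1, 1));   // true hit
        entries.add(new Rectangle2D.Double(8, 8, 1, 1));   // false hit: in the MBR, outside the triangle
        entries.add(new Rectangle2D.Double(20, 20, 1, 1)); // outside the MBR
        System.out.println(filterAndRefine(triangle, entries).size()); // 1
    }
}
```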
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied in the query's RankingRequest. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
        return 1.0 / (spanSize + 1); //double literal: 1/(spanSize + 1) would be integer division<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments supplied in the query's RefinementRequest. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time bigger than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
        .getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
    new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
    geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
    .createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
    new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design pattern is similar to the Full Text Index Service design. The forward index supports the following schema for each key-value pair:<br />
key: integer, value: string<br />
key: float, value: string<br />
key: string, value: string<br />
key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to be able to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It is a consolidation of the 3 services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml in order for the service to communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and also the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they know how to connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that it is possible for a ForwardIndexNode to stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately the first time after Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design is similar to the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index, and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances get the index from Content Management, and establish a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner to the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets their EPRs) and uses the factory services<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that together make up the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-resource of each type, inserts some data into the updater, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=OpenSearch_Framework&diff=21693OpenSearch Framework2014-06-11T12:57:00Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>==Description==<br />
The role of the gCube ''OpenSearch Framework'' is to enable the gCube Framework to access external providers which publish their results through search engines conforming to the [http://www.opensearch.org/Specifications/OpenSearch OpenSearch Specification]. The framework consists of two components:<br />
*The ''OpenSearch Library'', which includes a core library providing general-purpose OpenSearch functionality, and the ''OpenSearch Operator'' which utilizes functionality provided by the former.<br />
*The ''OpenSearch Service'' (also called ''OpenSearchDataSource Service''), which binds collections with provider-specific information encapsulated in generic resources and invokes the ''OpenSearch Operator''<br />
To resolve ambiguity, the name "OpenSearch Library" will be used when referring to the whole ''OpenSearch Library'' component, whereas the name "OpenSearch Core Library" will be used when referring to the library constituent of the component.<br />
<br />
A client library for the OpenSearch Service, called ''OpenSearchDataSource Client Library'', also exists in order to assist the programmatic use of the service. <br />
<br />
''OpenSearch Library'', ''OpenSearch Service'' and ''OpenSearchDataSource Client Library'' are available in our Maven repositories with the following coordinates:<br />
<source lang="xml"><br />
<!-- OpenSearch Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchlibrary</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearch Service --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-service</artifactId><br />
<version>...</version><br />
<br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-stubs</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearchDataSource Client Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-client-library</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==The OpenSearch Library==<br />
<br />
===The OpenSearch Core Library===<br />
====Description====<br />
The ''OpenSearch Core Library'' conforms to the latest OpenSearch specification and provides general OpenSearch-related functionality to any component which needs to query OpenSearch providers.<br />
It can be optionally extended, as described in the [[#Library Extensibility|Extensibility]] section, in order for OpenSearch Extensions whose parameters or other elements need special handling to be supported.<br />
The ''OpenSearch Operator'', described in a [[#The OpenSearch Operator|following]] section functions atop this library.<br />
<br />
====Functionality====<br />
The central class which can be used in order to exploit the functionality provided by the library, is the ''DescriptionDocument'' class. For reasons explained in the [[#Library Extensibility|following]] section, the ''DescriptionDocument'' class needs to be provided with a pair of ''URLElementFactory'' and ''QueryElementFactory'' factory classes. Provided that the query parameter namespaces present in the query string are extracted in some way and a namespace-to-factory mapping is available, this pair can be obtained by the ''FactoryResolver'' class, as follows:<br />
<source lang="java5"><br />
FactoryPair factories = FactoryResolver.getFactories(queryNamespaces, factoryMapping);<br />
</source><br />
The ''DescriptionDocument'' is then instantiated as follows:<br />
<source lang="java5"><br />
DescriptionDocument dd = new DescriptionDocument(descriptionDocumentXML, factories.urlElFactory, factories.queryElFactory);<br />
</source><br />
where the ''descriptionDocumentXML'' parameter corresponds to a DOM Document object containing the parsed Description Document.<br />
Properly instantiated, the ''DescriptionDocument'' class can provide any information relevant to the processed Description Document, as well as a mechanism to formulate search queries to send to the OpenSearch provider described by the Description Document. The latter is achieved by a ''QueryBuilder'' object, which can be obtained as follows:<br />
<source lang="java5"><br />
List<QueryBuilder> qbs = dd.getQueryBuilders(rel, MimeType);<br />
</source><br />
where <code>rel</code> is a rel value as described in the [http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_4#Url_rel_values OpenSearch Specification], e.g. <code>results</code> and <code>MimeType</code> is a MIME type, such as <code>application/rss+xml</code>. The returned list contains one ''QueryBuilder'' instance for each template contained in a URL Element with the specified <code>rel</code> and <code>type</code> attributes.<br />
Once the desired ''QueryBuilder'' is selected, it can be used to formulate a query by first assigning values to the parameters and then obtaining the constructed query.<br />
For example, the <code>searchTerms</code> parameter can be set to some value as follows:<br />
<source lang="java5"><br />
qb.setParameter(OpenSearchConstants.searchTermsQName, searchTerms); <br />
</source><br />
Once all the required parameters are set, the constructed query can be obtained as follows:<br />
<source lang="java5"><br />
URL query;<br />
try {<br />
query = qb.getQuery();<br />
}catch(IncompleteQueryException iqe) {<br />
//Incomplete query exception handling<br />
}catch(MalformedQueryException mqe) {<br />
//Malformed query exception handling<br />
}<br />
</source><br />
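Conceptually, query construction fills the URL template from the Description Document, substituting the parameters that were set and dropping unset optional ones (those written <code>{name?}</code>). The following self-contained sketch illustrates the template mechanics defined by the OpenSearch specification; it is an illustration only, not the library's actual implementation, and the template URL is hypothetical.<br />

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateSketch {
    // Replaces {name} / {name?} placeholders in an OpenSearch URL template.
    // A mandatory parameter left unset makes the query incomplete; an unset
    // optional parameter ({name?}) is substituted with the empty string.
    static String expand(String template, Map<String, String> params) {
        Pattern p = Pattern.compile("\\{([^}?]+)(\\??)\\}");
        Matcher m = p.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            boolean optional = !m.group(2).isEmpty();
            if (value == null) {
                if (!optional)
                    throw new IllegalStateException("Incomplete query: missing " + m.group(1));
                value = "";
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("searchTerms", "sun+is+up"); // values are assumed already URL-encoded
        String template = "http://example.com/search?q={searchTerms}&pw={startPage?}";
        System.out.println(expand(template, params));
        // prints http://example.com/search?q=sun+is+up&pw=
    }
}
```

Leaving a mandatory parameter unset raises the exception here, mirroring the ''IncompleteQueryException'' case in the snippet above.<br />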
<br />
Once the query is properly constructed and is available, it can be sent to the search engine of the provider in order to retrieve results. The returned results should be passed to either ''HTMLResponse'' or ''XMLResponse'', depending on the MIME type of the OpenSearch response, in order for the [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements] and any other available information contained in the response to be processed.<br />
<source lang="java5"><br />
InputStream responseStream = query.openConnection().getInputStream();<br />
OpenSearchResponse response = new XMLResponse(responseStream, factories.queryElFactory, qb, outputEncoding, dd.getURIToPrefixMappings());<br />
</source><br />
<br />
The raw XML data can then be obtained by the ''OpenSearchResponse'' object as follows:<br />
<source lang="java5"><br />
response.getResponse();<br />
</source><br />
and any information available, mainly relevant to paging, can be obtained by one of the methods of the ''OpenSearchResponse'' class. For example the total number of results, as reported by the <code>totalResults</code> Response Element, if present, can be obtained as follows:<br />
<source lang="java5"><br />
response.getTotalResults();<br />
</source><br />
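These response elements (<code>totalResults</code>, <code>startIndex</code>, <code>itemsPerPage</code>) are what a caller needs for paging. A minimal sketch of the arithmetic, assuming the OpenSearch default of a 1-based <code>startIndex</code> (the helper below is illustrative and not part of the library):<br />

```java
public class PagingSketch {
    // Computes the startIndex of the next page, or -1 if the current page
    // already reaches the end of the result set (1-based startIndex, per
    // the OpenSearch default).
    static long nextStartIndex(long totalResults, long startIndex, long itemsPerPage) {
        long next = startIndex + itemsPerPage;
        return (next > totalResults) ? -1 : next;
    }

    public static void main(String[] args) {
        // 23 total results, pages of 10: pages start at 1, 11, 21
        System.out.println(nextStartIndex(23, 1, 10));  // 11
        System.out.println(nextStartIndex(23, 21, 10)); // -1, last page
    }
}
```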
<br />
====Library Extensibility====<br />
=====Motivation=====<br />
The core functionality provided by the ''OpenSearch Core Library'' is not limited to the processing of only standard OpenSearch parameters. More specifically, the basic components of the library treat all extended query parameters in a uniform way, making only the assumptions that hold for any OpenSearch parameter, be it standard or extended. Furthermore, any unrecognized markup or element value is simply ignored.<br />
An example of an assumption made by the ''OpenSearch Core Library'', in the form of a requirement, is that all parameter values passed to its ''QueryBuilder'' components should be URL-encoded. This requirement is in accordance with the OpenSearch Specification and causes no problems for most OpenSearch parameters. In fact, if a client failed to URL-encode free-text values, query formulation would fail in the query URL construction phase.<br />
<br />
There are, however, cases in which the previous requirement proves problematic. For example, the OpenSearch Geo Extension presents examples of parameter values in which the comma character is not URL encoded, regardless of what the RFC specifications state. Such parameter values could therefore call for an extra URL decoding preprocessing step, or otherwise the caller should be required to not URL encode the values. Furthermore, it would be quite useful if the library could be aware of the specific format and other peculiarities and rules governing the syntax of extended parameters, for the purpose of query validation and for supporting any extra functionality provided by the extension. The support of value-adding functionality provided by extended OpenSearch elements by the library could also prove useful.<br />
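To make the encoding conflict concrete: strictly URL-encoding a Geo Extension <code>box</code> value (the bounding box below is hypothetical) turns its commas into <code>%2C</code>, while the extension's own examples show the commas literal. The JDK's <code>URLEncoder</code> demonstrates the strict behaviour:<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    public static void main(String[] args) {
        String box = "142.0,44.0,143.0,45.0"; // west,south,east,north
        // Strict encoding, as the OpenSearch specification requires of parameter values
        String encoded = URLEncoder.encode(box, StandardCharsets.UTF_8);
        System.out.println(encoded); // 142.0%2C44.0%2C143.0%2C45.0
        // ...while the Geo Extension examples use the literal value:
        System.out.println("box=" + box);
    }
}
```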
<br />
=====Extensibility Mechanism=====<br />
The extensibility mechanism chosen for the library focuses on extensible elements, as described in the OpenSearch Specification, namely URL Elements and Query Elements. Furthermore, ''QueryBuilder'' components are included in the extensibility mechanism as they depend on the aforementioned elements.<br />
<br />
Given that the number of available OpenSearch extensions is quite large, and that no single OpenSearch provider utilizes all of them at the same time, the extensibility mechanism should allow library extensions for specific OpenSearch extensions to be included easily, in a dynamic, pluggable fashion. Furthermore, it should allow extension-related functionality to be added dynamically, depending on the complexity of the caller's query.<br />
<br />
The mechanism found to best satisfy the above requirements, and implemented as the extensibility pattern for the library, is the construction of a Chain of Responsibility for each extensible component. A more detailed explanation follows:<br />
*The ''URLElement'', ''QueryElement'' and ''QueryBuilder'' components are interfaces whose implementations support core or extension-related functionality.<br />
*Core functionality processing takes place in the last link of the chain of responsibility. For example, ''BasicQueryBuilder'' implements core ''QueryBuilder'' functionality.<br />
*Each component implementing extension-related functionality contains a reference to the next link in the chain of responsibility. For example if ''GeoQueryBuilder'' implements functionality related to the Geo OpenSearch Extension, it contains a reference to a ''QueryBuilder'' implementing either core functionality, or functionality related to some other extension.<br />
*Each link in the chain of responsibility processes whatever information it can handle and forwards everything else to the next link in the chain.<br />
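As an illustration, the chain structure just described can be sketched as follows. The single-method interface, the <code>render</code> method and the parameter handling shown here are simplifying assumptions made for this sketch, not the library's actual API; only the chaining pattern mirrors the text.<br />

```java
// Minimal sketch of the Chain of Responsibility used for QueryBuilder components.
interface QueryBuilder {
    // Renders one query parameter, or delegates to the next link.
    String render(String namespace, String name, String value);
}

// Core functionality sits at the end of the chain.
class BasicQueryBuilder implements QueryBuilder {
    public String render(String namespace, String name, String value) {
        return name + "=" + value; // core handling for any OpenSearch parameter
    }
}

// An extension link handles what it recognizes and forwards the rest.
class GeoQueryBuilder implements QueryBuilder {
    private final QueryBuilder next;
    GeoQueryBuilder(QueryBuilder next) { this.next = next; }
    public String render(String namespace, String name, String value) {
        if ("http://a9.com/-/opensearch/extensions/geo/1.0/".equals(namespace)) {
            return "geo:" + name + "=" + value; // extension-specific handling
        }
        return next.render(namespace, name, value); // forward along the chain
    }
}
```

A caller would compose the chain from the outermost extension link inwards, with the core implementation always forming the last link.<br />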
<br />
<br />
In order for a chain of responsibility to be dynamically created by the ''DescriptionDocument'' class, a similar chain of abstract factories should be implemented. The resulting factories, one for URL Elements and another for Query Elements, can then be passed to the constructor of the ''DescriptionDocument'' so that it can construct the correct elements. The ''FactoryResolver'' utility is responsible for constructing factories that create instances supporting no more functionality than is necessary to process a given query. Since the chain structure is already known when ''QueryBuilder'' instances are constructed, the latter are created by the ''getQueryBuilder'' method of the already constructed ''URLElement'', without a factory being explicitly supplied to the ''DescriptionDocument''.<br />
<br />
The ''FactoryResolver'' requires that two things be known in order to be able to construct the factories:<br />
*A set of mappings from namespace URIs to factory class names, one for each component implementing either core functionality (in this case the namespace URI being equal to the OpenSearch namespace) or extension-related functionality. An example of such a mapping could be: <code><<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>, (org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElementFactory, org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElementFactory)></code>, which declares that the factories responsible for creating the implementations of core functionality are the ''BasicURLElementFactory'' and ''BasicQueryElementFactory'' classes.<br />
*A list of all parameter namespaces present in the query string.<br />
<br />
Having the above information, and provided that the implementations of all factories and classes are available, the ''FactoryResolver'' is able to construct, via reflection, the factories needed for the query to be properly processed. For example, if implementations are available for core functionality as well as for the Geo and Time extensions, and the query contains parameters of both of these extensions, the resulting chain of responsibility of the constructed instances will contain all three implementations: the last one being the implementation which supports core functionality, and the other two appearing in the chain in any order. If the query contains only standard OpenSearch parameters, there is no need for the chain to be burdened with links that will never be used, so a chain consisting of only the core implementation is constructed. The same holds if, for example, no Geo parameters are present in the query; the corresponding implementation will not be included in the chain.<br />
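A minimal sketch of the mapping structure and of the reflective instantiation step follows; the <code>FactoryMappings</code> helper and its methods are hypothetical names introduced for illustration, not part of the library.<br />

```java
// Hypothetical sketch of the mapping structure consumed by the FactoryResolver
// and of reflection-based instantiation; the real resolver API may differ.
import java.util.HashMap;
import java.util.Map;

class FactoryMappings {
    // namespace URI -> {URL-element factory class name, query-element factory class name}
    static Map<String, String[]> defaultMappings() {
        Map<String, String[]> m = new HashMap<>();
        m.put("http://a9.com/-/spec/opensearch/1.1/",
              new String[] {
                  "org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElementFactory",
                  "org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElementFactory" });
        return m;
    }

    // Instantiate a factory by class name, as the resolver would do via reflection.
    static Object instantiate(String className) throws Exception {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }
}
```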
<br />
It should be stressed again that it is not necessary to implement all extensions that are expected to be encountered in order for the library to work. Extending the library remains a purely optional task. There are, therefore, two choices when using the library:<br />
*Do not extend the library when extended parameters are needed, relying instead on its core functionality. In that case, the caller should be careful to supply the library with parameter values in the correct format, so that a query can be constructed, albeit without the option of query validation or the ability to exploit additional functionality related to the extension.<br />
*Extend the library whenever this proves useful or makes things easier.<br />
<br />
=====Implementing a new Extension=====<br />
In order to implement a new extension that will be correctly incorporated into the already existing library functionality, one should do the following:<br />
*Implement the ''URLElement'', ''QueryElement'' and ''QueryBuilder'' interfaces in classes whose constructors accept at least a reference to a corresponding upcast object which will be next in the chain of responsibility. All requests that cannot be handled, or that require additional processing by subsequent links in the chain, should be forwarded to the next link.<br />
*Implement the ''URLElementFactory'' and ''QueryElementFactory'' interfaces in classes whose constructors accept a single argument: a reference to an upcast factory of the same type, corresponding to the factory used to create the instances next in the chain of responsibility. An example of a ''URLElementFactory'' used for the construction of ''GeoURLElement''s, which implement Geo extension functionality, is the following:<br />
<source lang="java5"><br />
public class GeoURLElementFactory implements URLElementFactory {<br />
<br />
    URLElementFactory f;<br />
<br />
    public GeoURLElementFactory(URLElementFactory f) {<br />
        this.f = f;<br />
    }<br />
<br />
    public GeoURLElement newInstance(Element url, Map<String, String> nsPrefixes) throws Exception {<br />
        URLElement el = f.newInstance(url, nsPrefixes);<br />
        return new GeoURLElement(url, el);<br />
    }<br />
}<br />
</source><br />
*See that the ''getQueryBuilder'' method of the ''URLElement'' implementation of the new extension correctly constructs a ''QueryBuilder'' instance. For example, the ''getQueryBuilder'' method of the ''GeoURLElement'' presented above could look like this:<br />
<source lang="java5"><br />
public QueryBuilder getQueryBuilder() throws Exception {<br />
    return new GeoQueryBuilder(el.getQueryBuilder());<br />
}<br />
</source><br />
where <code>el</code> is the next ''URLElement'' in the chain.<br />
*Add a mapping for the two new factories to the set of mappings passed to the ''FactoryResolver'' utility upon initialization of the library. For example, given that ''GeoURLElementFactory'' and ''GeoQueryElementFactory'' are implemented for the Geo Extensions, one could add the mapping as follows:<br />
<source lang="java5"><br />
factoryMappings.add("http://a9.com/-/opensearch/extensions/geo/1.0/",<br />
    new FactoryClassNamePair("org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory",<br />
                             "org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory"));<br />
</source><br />
<br />
===The OpenSearch Operator===<br />
====Description====<br />
The role of the ''OpenSearch Operator'' is to provide support for querying and retrieval of search results via [http://www.opensearch.org/Home OpenSearch] from providers which expose an [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch description document]. The operator accepts a query string consisting of a set of query parameters, which may include a number of search terms, and an [[#OpenSearch Resource|OpenSearch Resource]] reference which contains the URL of an OpenSearch description document and various specifications relevant to the OpenSearch provider to be queried. After performing the number of OpenSearch queries required to obtain the desired results, it returns these results wrapped in a ResultSet.<br />
<br />
====Extensibility Points====<br />
The operator introduces and makes use of a set of functionalities beyond those of the standard OpenSearch specification. These extensions are supported by the introduction of a special [[#OpenSearch Resource|OpenSearch Resource]] structure and by the internal logic of the operator, the latter using standard OpenSearch functionality provided by the ''OpenSearch Core Library''. The extra functionalities are summarized as follows:<br />
*Both direct and brokered result processing is supported. Some OpenSearch-enabled providers diverge from the common case of returning a set of direct results and instead provide their results indirectly, by returning a set of links to other OpenSearch-enabled providers. Provided that both a transformation specification used to extract these links from the returned results and the OpenSearch resources for each of the brokered OpenSearch services are available, the operator will return the full set of results provided by the brokered OpenSearch services.<br />
*The support of a set of fixed parameters, which override the caller-provided parameters only at the level of the top provider, i.e. either the broker, or the single provider in the direct case. The purpose of these parameters is, first, to facilitate the creation of dynamic collections from results obtained by brokers, by superseding the caller's query parameters while querying the broker and using the full set of caller parameters only on lower levels and, second, to customize the behaviour of some provider to the needs of the gCube Framework (for example, to set a value for a required query parameter that the framework cannot handle). The two options can also be used in tandem, if desired.<br />
*Support for one or more security schemes is planned for a subsequent version of the ''OpenSearch Library''.<br />
<br />
====OpenSearch Resource====<br />
The purpose of an ''OpenSearch Resource'' object is to describe the specifications of an OpenSearch provider. It encapsulates the extensions described in the [[#Extensibility Points|Extensibility Points]] section. The attributes included are the following:<br />
* The name of the resource<br />
* The URL of the OpenSearch Description Document of the provider to be queried<br />
* Information about whether the provider returns direct or brokered results, used by the operator to adapt its operation to both kinds of providers.<br />
* Data transformation specifications for a subset of the MIME types of the results which the result provider returns. The data transformation consists of two or, optionally, three parts:<br />
**The ''RecordSplitXPath'' expression is used to split a page of search results into individual records. For example, for the RSS format, the <code><item></code> elements under <code>rss/channel</code> could be of interest.<br />
**The ''presentationInfo'' element, which is actually a map between field names and XPath expressions used to extract the desired values from the response.<br />
**The optional ''RecordIdXPath'' expression can be used to tag each record with a unique identifier, extracted from the record itself. If the element is found, its payload is added as a <code>DocID</code> record attribute. Otherwise, if the element is not found or is empty, no DocID attribute is added to the record.<br />
* Security specifications (planned for a future version, when the supported security specifications have been decided on). This element is optional, its absence implying the absence of a security scheme.<br />
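To illustrate how the two main transformation expressions cooperate, the following self-contained sketch splits a small RSS page into records and extracts a field from the first one. The <code>TransformationSketch</code> class and the sample RSS snippet are hypothetical, and the relative field expression is a simplification of the absolute expressions shown in the Bing example below.<br />

```java
// Illustrative sketch: splitting an RSS page into records with a RecordSplitXPath
// expression and extracting a field with a presentationInfo XPath expression.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TransformationSketch {
    static String firstTitle(String rss) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(rss)));
        XPath xp = XPathFactory.newInstance().newXPath();
        // RecordSplitXPath: isolate the individual <item> records
        NodeList records = (NodeList) xp.evaluate(
                "*[local-name()='rss']/*[local-name()='channel']/*[local-name()='item']",
                doc, XPathConstants.NODESET);
        Node first = records.item(0);
        // presentationInfo expression for a "title" field, relative to a record
        return xp.evaluate("*[local-name()='title']", first);
    }
}
```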
<br />
The serialization of an ''OpenSearch Resource'' can be easily incorporated into a Generic Resource. The default mode of operation for the ''OpenSearch Operator'' in fact obtains the necessary OpenSearch resources by retrieving the corresponding Generic Resource from the [[Information_System|IS]].<br />
The Generic Resource utilized by the ''OpenSearch Operator'' is:<br />
*The ''OpenSearchResource'' which contains the body of the OpenSearch Resource as described below<br />
<br />
<br />
Note that, solely for testing purposes, the ''OpenSearch Operator'' also supports a local mode of operation, whereby all ''OpenSearch Resources'' are loaded from the local file system. <br />
<br />
The XML Schema that all OpenSearch Resource serializations should conform to is the following:<br />
<source lang="xml"><br />
<?xml version="1.0"?><br />
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"><br />
  <xs:element name="OpenSearchResource"><br />
    <xs:complexType><br />
      <xs:sequence><br />
        <xs:element name="name" type="xs:string"/><br />
        <xs:element name="descriptionDocumentURI" type="xs:string"/><br />
        <xs:element name="brokeredResults" type="xs:boolean"/><br />
        <xs:element name="transformation" maxOccurs="unbounded"><br />
          <xs:complexType><br />
            <xs:sequence><br />
              <xs:element name="MIMEType" type="xs:string"/><br />
              <xs:element name="recordSplitXPath" type="xs:string"/><br />
              <xs:element name="recordIdXPath" type="xs:string" minOccurs="0" maxOccurs="1"/><br />
              <xs:element name="presentationInfo" maxOccurs="unbounded"><br />
                <xs:complexType><br />
                  <xs:sequence><br />
                    <xs:element name="presentable" maxOccurs="unbounded"><br />
                      <xs:complexType><br />
                        <xs:sequence><br />
                          <xs:element name="fieldName" type="xs:string"/><br />
                          <xs:element name="expression" type="xs:string"/><br />
                        </xs:sequence><br />
                      </xs:complexType><br />
                    </xs:element><br />
                  </xs:sequence><br />
                </xs:complexType><br />
              </xs:element><br />
            </xs:sequence><br />
          </xs:complexType><br />
        </xs:element><br />
        <xs:element name="security" minOccurs="0"><br />
          <xs:complexType><br />
            <xs:sequence/><br />
          </xs:complexType><br />
        </xs:element><br />
      </xs:sequence><br />
    </xs:complexType><br />
  </xs:element><br />
</xs:schema><br />
</source><br />
<br />
The transformation element can appear multiple times within an ''OpenSearch Resource''. The usual case is for a single transformation element per provider to be specified, but if transformation elements are present for more than one MIME type, the operator has the option of resorting to the next available transformation in sequence, should the result retrieval procedure fail for some reason. This strategy is only meaningful if the same amount of information can be obtained from the different result MIME types. <br />
<br />
In the case of querying providers which return brokered results, the transformation element is used to specify a data transformation that extracts the URLs of the Description Documents of the brokered OpenSearch providers from the initial results provided by the OpenSearch provider acting as a broker.<br />
<br />
An example of an ''OpenSearch Resource'' serialization describing the [http://www.bing.com/ Bing] external repository as a direct OpenSearch provider, currently in use by the gCube Framework is the following:<br />
<source lang="xml"><br />
<OpenSearchResource><br />
  <name>Bing</name><br />
  <descriptionDocumentURI>http://imarine.web.cern.ch/imarine/OpenSearch/bing.xml</descriptionDocumentURI><br />
  <brokeredResults>false</brokeredResults><br />
  <parameters><br />
    <parameter><br />
      <fieldName>allIndexes</fieldName><br />
      <qName>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms</qName><br />
    </parameter><br />
  </parameters><br />
  <transformation><br />
    <MIMEType>application/rss+xml</MIMEType><br />
    <recordSplitXPath>*[local-name()='rss']/*[local-name()='channel']/*[local-name()='item']</recordSplitXPath><br />
    <recordIdXPath>//*[local-name()='item']/*[local-name()='link']</recordIdXPath><br />
    <presentationInfo><br />
      <presentable><br />
        <fieldName>title</fieldName><br />
        <expression>//*[local-name()='item']/*[local-name()='title']</expression><br />
      </presentable><br />
      <presentable><br />
        <fieldName>link</fieldName><br />
        <expression>//*[local-name()='item']/*[local-name()='link']</expression><br />
      </presentable><br />
      <presentable><br />
        <fieldName>description</fieldName><br />
        <expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
      </presentable><br />
      <presentable><br />
        <fieldName>S</fieldName><br />
        <expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
      </presentable><br />
      <presentable><br />
        <fieldName>pubDate</fieldName><br />
        <expression>//*[local-name()='item']/*[local-name()='pubDate']</expression><br />
      </presentable><br />
    </presentationInfo><br />
  </transformation><br />
</OpenSearchResource><br />
</source><br />
<br />
In case our datasource is a broker, ''brokeredResults'' should be set to true:<br />
<br />
<source lang="xml"><br />
<brokeredResults>true</brokeredResults><br />
</source><br />
<br />
====OpenSearch Operator Logic====<br />
[[File:Opensearch_op_flowchart.png|right|Figure 1: A simplified flowchart of the operations performed by the OpenSearch operator]]<br />
The ''OpenSearch Operator'' employs the functionality provided by the [[#The OpenSearch Core Library|OpenSearch Core Library]] in order to extract the required information from the Description Document of the external provider and the ''QueryBuilder''s needed in order to perform queries, in a fashion similar to that described in the [[#Functionality|OpenSearch Library Functionality]] section. <br />
<br />
It should be noted that the ''OpenSearch Operator'' abstracts away the MIME type of the results to be obtained, treating it as low-level information which can be exploited through the way ''OpenSearch Resource''s are structured. Given that the OpenSearch Specification makes no assumptions about differences in the amount of information returned by results of different MIME types, there are two options:<br />
*If the amount of information returned from results of MIME Type A and MIME Type B are different, the desired MIME Type should be selected and an OpenSearch Resource constructed with only this MIME Type present in the transformation specifications. If needed, additional ''OpenSearch Resource''s can be constructed to exploit information returned from different MIME types. In this way, the MIME Type is abstracted away by the conceptual level of information detail obtained by the provider.<br />
*If there are multiple result MIME types exposing the same amount of information, or containing the same subset of information of interest, there exists the option of specifying more than one transformation specification, in a way which results in the uniform presentation of the data to the caller. In this way, the MIME Types are abstracted away by unifying result formats to a provider-specific schema. The option of having a single transformation specification is of course available in this case as well.<br />
<br />
In order for the proper way of constructing queries to be selected, a preprocessing step is performed by the operator, before issuing any actual queries. <br />
Given that an ''OpenSearch Resource'' can contain more than one transformation specification, and that the number of templates present in a [http://www.opensearch.org/Specifications/OpenSearch/1.1#The_.22Url.22_element URL Element] is not necessarily limited to one, there can be more than one potential template match for the caller's query. The purpose of the preprocessing step is to select the ''QueryBuilder'' whose parameters best match that query. First, all MIME types supported by the external provider but lacking an associated transformation specification in the ''OpenSearch Resource'' describing the provider are discarded. The ''QueryBuilder''s of the first available MIME type for which a transformation specification exists are then processed and reordered according to the following rules:<br />
*''QueryBuilder''s whose required parameters are not covered by the parameters of the caller's query are discarded<br />
*''QueryBuilder''s are reordered so that the first one best matches the caller's query, i.e. all of its required parameters and as many of its optional parameters as possible are covered<br />
*A ''QueryBuilder'' which lacks a parameter present in the caller's query is considered a match. In that case the extra parameter is discarded. This rule assumes that query parameters narrow the search down and is enforced in order to account for brokered providers exposing slightly different sets of parameters than the broker or their siblings.<br />
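The three rules can be sketched as a simple filter-and-sort pass; representing each candidate ''QueryBuilder'' by the names of its required and optional parameters is an assumption made only for this illustration, not the operator's actual data model.<br />

```java
// Simplified sketch of the QueryBuilder preprocessing step: discard candidates
// whose required parameters are not covered by the caller's query, then order
// the rest so that the best match (most covered parameters) comes first.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// A candidate QueryBuilder, represented by its required and optional parameter names.
class Candidate {
    final Set<String> required;
    final Set<String> optional;
    Candidate(Set<String> required, Set<String> optional) {
        this.required = required;
        this.optional = optional;
    }
    // Number of caller parameters this candidate can make use of.
    int coverage(Set<String> query) {
        int covered = 0;
        for (String p : query)
            if (required.contains(p) || optional.contains(p)) covered++;
        return covered;
    }
}

class BuilderSelection {
    static List<Candidate> order(List<Candidate> candidates, Set<String> query) {
        List<Candidate> viable = new ArrayList<>();
        for (Candidate c : candidates)
            if (query.containsAll(c.required)) viable.add(c); // rule 1: required covered
        // rule 2: best match first; extra caller parameters are simply ignored (rule 3)
        viable.sort(Comparator.comparingInt((Candidate c) -> -c.coverage(query)));
        return viable;
    }
}
```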
<br />
The most usual case is for the provider's ''OpenSearch Resource'' to be set up with a single transformation specification for the MIME type of interest. Furthermore, most URL Elements provide a single template for each MIME type. In this case only one ''QueryBuilder'' is available to construct queries, making the reordering step degenerate.<br />
<br />
The functions performed by the operator in order for a set of results to be retrieved, given that the proper ''QueryBuilder'' is selected, are summarized in the simplified diagram of Figure 1.<br />
<br />
As shown, the operator accepts a set of query terms and a set of query parameters.<br />
<br />
The operator's main course of action is to formulate and send queries requesting pages of search results for as long as there are still results to be returned and the caller's requirement on the number of results, if present, has not been met. A pager component sees that page switching is performed correctly, managing the relevant standard OpenSearch query parameters, namely <code>startPage</code> or <code>startIndex</code> and <code>count</code>. These parameters are therefore abstracted away by the OpenSearch Operator. <br />
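A possible sketch of such a pager, assuming a <code>startIndex</code>-based template with 1-based indices, is the following; the <code>Pager</code> class is illustrative, and the actual component also handles <code>startPage</code>-based templates.<br />

```java
// Hypothetical sketch of a pager that abstracts away the standard OpenSearch
// paging parameters, here using the startIndex/count pair.
class Pager {
    private final int count;       // results requested per page
    private int startIndex = 1;    // OpenSearch startIndex is 1-based by default
    Pager(int resultsPerPage) { this.count = resultsPerPage; }

    // Parameters to substitute into the query template for the next page.
    String nextPageParams() {
        String params = "count=" + count + "&startIndex=" + startIndex;
        startIndex += count;
        return params;
    }
}
```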
<br />
In the case of resources which return brokered results, the operator first retrieves the endpoints of the underlying brokered OpenSearch providers and reads their corresponding OpenSearch Resources so as to be able to retrieve the actual search results from them, either sequentially or concurrently. The extraction of brokered provider endpoints is not explicitly shown in the diagram. Furthermore, if an OpenSearch Resource structure is missing for one or more of the brokered services, the operator continues with the retrieval of results from the next available brokered service, ignoring it if it cannot obtain information for it. The same holds if all query formulation attempts for a provider fail.<br />
<br />
====Configurable Parameters====<br />
The ''OpenSearch Operator'' can be programmatically configured by passing to it a special configuration construct upon creation. The configuration parameters are the following:<br />
*The ''resultsPerPage'' parameter instructs the operator as to how many results per page should be requested when no other paging restrictions are in effect. The default value for this parameter is currently ''100''.<br />
*The ''sequentialResults'' parameter enables or disables sequential result retrieval from brokered providers. When enabled, the results are retrieved from each provider in a sequential manner, i.e. the results retrieved from different providers are not intermingled; there is, however, a negative impact on performance. The default value for this configuration parameter is currently ''false''.<br />
*The ''useLocalResource'' parameter, when enabled, permits the operator to operate in the absence of an IS. The ''OpenSearch Resources'' are instead retrieved from the local file system. It is used solely for testing purposes, which is why its default value is, and will remain, ''false''.<br />
<br />
An additional configurable element is the set of mappings from query namespaces to the corresponding factories, as described in [[#Library Extensibility|Library Extensibility]].<br />
The ''sequentialResults'' parameter can also be configured on a per-query basis, by including it in the query string as a query parameter.<br />
<br />
====Query Format====<br />
The ''OpenSearch Operator'' expects to receive all query parameters, including the search terms, in a single query string. All query parameters should be of the form<br />
<br />
<code><br />
<URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"<br />
</code><br />
<br />
and should be space-delimited.<br />
Note that the presence of a namespace is mandatory for standard OpenSearch parameters as well.<br />
Any free-text parameter value should be URL-encoded.<br />
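A query parameter in this format can be assembled, for instance, with <code>java.net.URLEncoder</code>; the <code>QueryFormat</code> helper below is purely illustrative and not part of the operator's API.<br />

```java
// Sketch of assembling a query parameter in the operator's expected format:
// <URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>", with
// free-text values URL-encoded.
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryFormat {
    static String parameter(String namespaceURI, String name, String value)
            throws UnsupportedEncodingException {
        return URLEncoder.encode(namespaceURI, "UTF-8") + ":" + name
                + "=\"" + URLEncoder.encode(value, "UTF-8") + "\"";
    }
}
```

Applied to the standard OpenSearch namespace and the <code>searchTerms</code> parameter, this reproduces the example query shown below.<br />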
<br />
The reserved keyword ''config'', when used as a parameter namespace, denotes a configuration parameter. The query configuration parameters under this special namespace are the ''sequentialResults'' parameter described in [[#Configurable Parameters|Configurable Parameters]] and the ''numOfResults'' parameter, which can be used to impose a limit on the number of retrieved results. Both query configuration parameters are optional.<br />
The following hold for the query configuration parameter values:<br />
*The ''sequentialResults'' parameter should be assigned a value equal to ''true'' or ''false''. Its absence implies the default value of the corresponding configurable parameter of the operator.<br />
*The ''numOfResults'' parameter should be assigned an integer value, its absence implying that all available results should be retrieved.<br />
<br />
Taking everything into account, an example of a legitimate query for the ''OpenSearch Operator'' could be the following:<br />
<br />
<code>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms="Hello+World" config:numOfResults="300"</code><br />
<br />
which instructs the operator to use the string <code>Hello World</code> as the value for the <code>searchTerms</code> standard OpenSearch parameter and to retrieve up to 300 results from the provider.<br />
<br />
==The OpenSearch Service==<br />
===Description===<br />
The ''OpenSearch Service'' is a stateful web service responsible for the invocation of the ''OpenSearch Operator'' in the context of the provider to be queried.<br />
It also maintains a cache per provider WS-Resource, which contains the Generic Resources relevant to the top provider, the Generic Resources of all<br />
previously queried brokered providers and the corresponding Description Documents.<br />
<br />
<br />
===Deployment Instructions===<br />
<br />
In order to deploy and run the OpenSearch Service on a node we will need the following:<br />
* opensearchdatasource-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl08.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
===WS and Generic Resource Interrelation===<br />
Provided that a Collection for the provider to be queried is available, the ''OpenSearch Service'' uses a WS-Resource for each OpenSearch provider in order to bind the Service to the Collection corresponding to the provider.<br />
<br />
An ''OpenSearch Service'' WS-Resource contains the following properties:<br />
*The ''AdaptorID'', which is unique for every WS-Resource and is used for referencing the right WS-Resource when querying (it is optional).<br />
*The ''CollectionID'' of the collection to be used.<br />
*The ID of the Generic Resource of the top-provider, where the top-provider is the broker in the brokered case or the one and only direct provider in the direct case.<br />
*The URI of the Description Document (''DescriptionDocumentURI'') of the top-provider.<br />
*A set of fields (presentables and searchables) as extracted from the ''OpenSearchGenericResource''.<br />
*A set of ''FixedParameters'', which are used in every invocation of the Operator. See also [[#Extensibility Points|Extensibility Points]].<br />
<br />
As mentioned above, the WS-Resource contains a reference only to the ''OpenSearchResource'' of the top-provider. The Generic Resources of any providers reached through a broker are retrieved through an Information System implementation and are therefore not directly referenced by the WS-Resource.<br />
<br />
Some properties are dependent on information residing in the Generic Resources describing the providers and should, therefore, be updated accordingly when these Generic Resources are modified. Examples of such properties include the Description Document URI and the templates.<br />
<br />
On WS-Resource creation, only the Metadata Collection ID and the ID of the ''OpenSearchResource'' of the top-provider need to be supplied to the ''create'' operation of the service's factory. All other properties are created internally by the service itself.<br />
<br />
===Resource Caching===<br />
For performance and reliability reasons, the ''OpenSearch Service'' maintains one cache per WS-Resource which initially contains the Generic Resource (''OpenSearchResource'') and the Description Document of the top-provider. In the brokered case, the cache is updated at run-time by the relevant ''OpenSearch Operator'' module with the Generic Resources and Description Documents of all providers reached through the broker.<br />
<br />
To account for potential updates of Description Documents and/or the Generic Resources, the cache can be refreshed either on demand or periodically, based on a configurable time interval. The periodic refresh operation can be disabled if the Description Documents and the Generic Resource configuration is considered to be stable enough.<br />
<br />
The cache refresh cycle policy used is described as follows:<br />
*The Description Document and Generic Resources of the top-provider are discarded and their updated versions are retrieved from the external site and the Information System respectively. All dependent WS-Resource properties are also updated accordingly.<br />
*The Description Documents and Generic Resources of all brokered providers, if any, are discarded. Because of the potentially large number of brokered providers, the cache is not repopulated with their updated versions to avoid locking the cache for large periods of time or using a partially updated cache. These pieces of information are re-cached at run-time instead.<br />
*In the event of failure, the previously cached version is kept.<br />
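The refresh policy above can be sketched as follows; the cache representation and the fetch callback are illustrative assumptions, not the service's actual types.<br />

```java
// Simplified sketch of the refresh policy: the top-provider's entry is
// re-fetched eagerly, brokered entries are dropped and re-cached lazily at
// run time, and on failure the previously cached versions are kept.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class ResourceCache {
    final Map<String, String> entries = new ConcurrentHashMap<>();
    final String topProviderKey;
    ResourceCache(String topProviderKey) { this.topProviderKey = topProviderKey; }

    void refresh(Function<String, String> fetch) {
        try {
            String updated = fetch.apply(topProviderKey); // eager re-fetch of top provider
            entries.keySet().removeIf(k -> !k.equals(topProviderKey)); // drop brokered entries
            entries.put(topProviderKey, updated);
        } catch (RuntimeException e) {
            // on failure, keep the previously cached versions untouched
        }
    }
}
```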
<br />
===Operations===<br />
The operations exposed by the OpenSearch Service are the following:<br />
*The ''query'' operation, with a single input message containing the query string to be sent to the operator, whose format is described in [[#Query Format|Query Format]].<br />
*The ''refreshCache'' operation, which sends a request in order to force the cache of the service to be refreshed. No refresh cycle will be initiated if a periodic refresh cycle is currently in progress.<br />
<br />
===Configurable Parameters===<br />
The service currently supports three configurable parameters, which are exposed through its deployment descriptor:<br />
*The ''clearCacheOnStartup'' parameter, of ''boolean'' type, when enabled instructs the service to discard the stored cache on startup.<br />
*The ''cacheRefreshIntervalMillis'' parameter, of integral type, defines a time interval for the periodic cache refresh operation. The value ''0'' can be used to disable periodic cache refresh cycles.<br />
*The ''openSearchLibraryFactories'' parameter, of ''string'' type, is used to supply the ''OpenSearch Core Library'' with the factory mappings for all namespaces for which there exists an implementation of a library extension. For more information on the mappings, see also the section referring to the [[#Extensibility Mechanism|extensibility mechanism]] of the library.<br />
<br />
The ''openSearchLibraryFactories'' parameter is encoded as a sequence of mappings from strings to pairs, where each mapping is enclosed in square brackets, association is denoted by the ''='' sign and each pair is enclosed in parentheses. For example, given that there are implementations for core functionality and the Geo and Time extensions, the value of this configuration parameter could be the following:<br />
<br />
<code><br />
[<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/geo/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/time/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.time.TimeURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.time.TimeQueryElementFactory)]<br />
</code><br />
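As an illustration of this encoding, the following minimal, hypothetical parser (not part of the service) extracts the namespace-to-factory-pair mappings from such a string:<br />
<br />
```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the openSearchLibraryFactories encoding:
// a sequence of [namespaceURI=(urlElementFactory,queryElementFactory)] mappings.
public class FactoryMappingParser {

    public static class Mapping {
        public final String namespace, urlElementFactory, queryElementFactory;
        public Mapping(String ns, String uf, String qf) {
            namespace = ns; urlElementFactory = uf; queryElementFactory = qf;
        }
    }

    // [ ... = ( ... , ... ) ] with no nested brackets inside the parts
    private static final Pattern MAPPING =
        Pattern.compile("\\[([^=\\]]+)=\\(([^,)]+),([^)]+)\\)\\]");

    public static List<Mapping> parse(String encoded) {
        List<Mapping> mappings = new ArrayList<>();
        Matcher m = MAPPING.matcher(encoded);
        while (m.find())
            mappings.add(new Mapping(m.group(1).trim(), m.group(2).trim(), m.group(3).trim()));
        return mappings;
    }
}
```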
<br />
==The OpenSearchDataSource Client Library==<br />
In this section some examples of usage of the ''OpenSearchDataSource Client Library'' are provided.<br />
<br />
'''Query''' example:<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext";<br />
<br />
final OpenSearchClient client = new OpenSearchClient.Builder()<br />
// .endpoint(endpoint) // you can also give a specific endpoint<br />
.scope(scope)<br />
.build();<br />
<br />
final String queryString = "((((gDocCollectionID == \"ea5b9c70-01ce-4b45-96e7-6db037ebf2bc\") and (gDocCollectionLang == \"en\"))) and (5575bbdb-6d47-4297-ad12-2259b3405ce7 = greece)) project d91f3c47-e46e-4737-9496-a0f72361a397 3e3584f0-eed3-4089-99cd-86a7def1471e";<br />
<br />
String grs2Locator = client.query(queryString);<br />
<br />
// or<br />
<br />
List<Map<String, String>> records = client.queryAndRead(queryString);<br />
<br />
</source><br />
<br />
'''Create Resource''' method example:<br />
<source lang="java"><br />
<br />
static void createResource(List<String> fieldParameters, List<String> fixedParameters, String collectionID, String openSearchResourceID, String scope) throws OpenSearchClientException {<br />
<br />
	// a Builder instance may also be obtained via dependency injection<br />
	OpenSearchFactoryClient factory = new OpenSearchFactoryClient.Builder()<br />
			.scope(scope)<br />
			.build();<br />
<br />
<br />
List<String> fieldParams = Lists.newArrayList();<br />
<br />
for (int i=0; i<fieldParameters.size(); i++) {<br />
fieldParams.add(collectionID + ":" + fieldParameters.get(i));<br />
System.out.println("Field parameter: " + (i+1) + " " + fieldParams.get(i));<br />
}<br />
<br />
Provider p = new Provider();<br />
p.setCollectionID(collectionID);<br />
p.setOpenSearchResourceID(openSearchResourceID);<br />
p.setFixedParameters(fixedParameters);<br />
<br />
List<Provider> providers = Lists.newArrayList();<br />
providers.add(p);<br />
<br />
factory.createResource(fieldParams, providers, scope);<br />
}<br />
<br />
</source><br />
<br />
==External Links==<br />
Some useful external links for further reading are provided here:<br />
*[http://www.opensearch.org/Home OpenSearch Home]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_5 OpenSearch Specification]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch Description Document ]</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=OpenSearch_Framework&diff=21692OpenSearch Framework2014-06-11T12:54:30Z<p>Alex.antoniadi: /* The OpenSearchDataSource Client Library */</p>
<hr />
<div>==Description==<br />
The role of the gCube ''OpenSearch Framework'' is to enable the gCube Framework to access external providers which publish their results through search engines conforming to the [http://www.opensearch.org/Specifications/OpenSearch OpenSearch Specification]. The framework consists of two components <br />
*The ''OpenSearch Library'', which includes a core library providing general-purpose OpenSearch functionality, and the ''OpenSearch Operator'' which utilizes functionality provided by the former.<br />
*The ''OpenSearch Service'' (also called ''OpenSearchDataSource Service''), which binds collections with provider-specific information encapsulated in generic resources and invokes the ''OpenSearch Operator''<br />
To resolve ambiguity, the name "OpenSearch Library" will be used when referring to the whole ''OpenSearch Library'' component, whereas the name "OpenSearch Core Library" will be used when referring to the library constituent of the component.<br />
<br />
A client library for the OpenSearch Service, called ''OpenSearchDataSource Client Library'', also exists to assist the programmatic use of the service.<br />
<br />
''OpenSearch Library'', ''OpenSearch Service'' and ''OpenSearchDataSource Client Library'' are available in our Maven repositories with the following coordinates:<br />
<source lang="xml"><br />
<!-- OpenSearch Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchlibrary</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearch Service --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-service</artifactId><br />
<version>...</version><br />
<br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-stubs</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearchDataSource Client Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-client-library</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==The OpenSearch Library==<br />
<br />
===The OpenSearch Core Library===<br />
====Description====<br />
The ''OpenSearch Core Library'' conforms to the latest OpenSearch specification and provides general OpenSearch-related functionality to any component which needs to query OpenSearch providers.<br />
It can optionally be extended, as described in the [[#Library Extensibility|Library Extensibility]] section, so that OpenSearch Extensions whose parameters or other elements need special handling can be supported.<br />
The ''OpenSearch Operator'', described in a [[#The OpenSearch Operator|following]] section, functions atop this library.<br />
<br />
====Functionality====<br />
The central class which can be used in order to exploit the functionality provided by the library, is the ''DescriptionDocument'' class. For reasons explained in the [[#Library Extensibility|following]] section, the ''DescriptionDocument'' class needs to be provided with a pair of ''URLElementFactory'' and ''QueryElementFactory'' factory classes. Provided that the query parameter namespaces present in the query string are extracted in some way and a namespace-to-factory mapping is available, this pair can be obtained by the ''FactoryResolver'' class, as follows:<br />
<source lang="java5"><br />
FactoryPair factories = FactoryResolver.getFactories(queryNamespaces, factoryMapping);<br />
</source><br />
The ''DescriptionDocument'' is then instantiated as follows:<br />
<source lang="java5"><br />
DescriptionDocument dd = new DescriptionDocument(descriptionDocumentXML, factories.urlElFactory, factories.queryElFactory);<br />
</source><br />
where the ''descriptionDocumentXML'' parameter corresponds to a DOM Document object containing the parsed Description Document.<br />
Properly instantiated, the ''DescriptionDocument'' class can provide any information relevant to the processed Description Document, as well as a mechanism to formulate search queries to send to the OpenSearch provider described by the Description Document. The latter is achieved by a ''QueryBuilder'' object, which can be obtained as follows:<br />
<source lang="java5"><br />
List<QueryBuilder> qbs = dd.getQueryBuilders(rel, MimeType);<br />
</source><br />
where <code>rel</code> is a rel value as described in the [http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_4#Url_rel_values OpenSearch Specification], e.g. <code>results</code> and <code>MimeType</code> is a MIME type, such as <code>application/rss+xml</code>. The returned list contains one ''QueryBuilder'' instance for each template contained in a URL Element with the specified <code>rel</code> and <code>type</code> attributes.<br />
Once the desired ''QueryBuilder'' is selected, it can be used to formulate a query by first assigning values to the parameters and then obtaining the constructed query.<br />
For example, <code>searchTerms</code> parameter can be set to some value as follows:<br />
<source lang="java5"><br />
qb.setParameter(OpenSearchConstants.searchTermsQName, searchTerms); <br />
</source><br />
Once all the required parameters are set, the constructed query can be obtained as follows:<br />
<source lang="java5"><br />
URL query;<br />
try {<br />
	query = qb.getQuery();<br />
} catch (IncompleteQueryException iqe) {<br />
	// Incomplete query exception handling<br />
} catch (MalformedQueryException mqe) {<br />
	// Malformed query exception handling<br />
}<br />
</source><br />
<br />
Once the query is properly constructed and is available, it can be sent to the search engine of the provider in order to retrieve results. The returned results should be passed to either ''HTMLResponse'' or ''XMLResponse'', depending on the MIME type of the OpenSearch response, in order for the [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements] and any other available information contained in the response to be processed.<br />
<source lang="java5"><br />
InputStream responseStream = query.openConnection().getInputStream();<br />
OpenSearchResponse response = new XMLResponse(responseStream, factories.queryElFactory, qb, outputEncoding, dd.getURIToPrefixMappings());<br />
</source><br />
<br />
The raw XML data can then be obtained by the ''OpenSearchResponse'' object as follows:<br />
<source lang="java5"><br />
response.getResponse();<br />
</source><br />
and any information available, mainly relevant to paging, can be obtained by one of the methods of the ''OpenSearchResponse'' class. For example the total number of results, as reported by the <code>totalResults</code> Response Element, if present, can be obtained as follows:<br />
<source lang="java5"><br />
response.getTotalResults();<br />
</source><br />
<br />
====Library Extensibility====<br />
=====Motivation=====<br />
The core functionality provided by the ''OpenSearch Core Library'' is not limited to the processing of only standard OpenSearch parameters. More specifically, the basic components of the library treat all extended query parameters in a uniform way, making only the assumptions holding for any OpenSearch parameter, be it standard or extended. Furthermore, any unrecognized markup or element value is simply ignored.<br />
An example of an assumption made by the ''OpenSearch Core Library'', in the form of a requirement, is that all parameter values passed to its ''QueryBuilder'' components should be URL-encoded. This requirement is in accordance with the OpenSearch Specification and causes no problems for most OpenSearch parameters. In fact, if a client failed to URL-encode free-text values, query formulation would fail in the query URL construction phase.<br />
<br />
There are, however, cases in which the previous requirement proves problematic. For example, the OpenSearch Geo Extension presents examples of parameter values in which the comma character is not URL encoded, regardless of what the RFC specifications state. Such parameter values could therefore call for an extra URL decoding preprocessing step, or otherwise the caller should be required to not URL encode the values. Furthermore, it would be quite useful if the library could be aware of the specific format and other peculiarities and rules governing the syntax of extended parameters, for the purpose of query validation and for supporting any extra functionality provided by the extension. The support of value-adding functionality provided by extended OpenSearch elements by the library could also prove useful.<br />
<br />
=====Extensibility Mechanism=====<br />
The extensibility mechanism chosen for the library focuses on extensible elements, as described in the OpenSearch Specification, namely URL Elements and Query Elements. Furthermore, ''QueryBuilder'' components are included in the extensibility mechanism as they depend on the aforementioned elements.<br />
<br />
Given that the number of available OpenSearch extensions is quite large and because of the fact that not all of these extensions are utilized by some OpenSearch provider at the same time, the extensibility mechanism should allow the easy inclusion of library extensions for specific OpenSearch extensions in a dynamic, pluggable fashion. Furthermore, it should allow extension-related functionality to be dynamically added depending on the complexity of the query of the caller.<br />
<br />
The mechanism found to best satisfy the above requirements and implemented as the extensibility pattern for the library, is the construction of a Chain of Responsibility for each extensible component. A more detailed explanation follows:<br />
*The ''URLElement'', ''QueryElement'' and ''QueryBuilder'' components are interfaces whose implementations support core or extension-related functionality.<br />
*Core functionality processing takes place in the last link of the chain of responsibility. For example, ''BasicQueryBuilder'' implements core ''QueryBuilder'' functionality.<br />
*Each component implementing extension-related functionality contains a reference to the next link in the chain of responsibility. For example if ''GeoQueryBuilder'' implements functionality related to the Geo OpenSearch Extension, it contains a reference to a ''QueryBuilder'' implementing either core functionality, or functionality related to some other extension.<br />
*Each link in the chain of responsibility should process whatever information it can handle, otherwise forward the request to the next link in the chain.<br />
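A minimal, self-contained sketch of this chain-of-responsibility pattern may clarify the idea; the interfaces below are simplified stand-ins for the library's actual ''QueryBuilder'' components, not its real API:<br />
<br />
```java
import java.util.LinkedHashMap;
import java.util.Map;

// Self-contained sketch of the chain-of-responsibility pattern described
// above (simplified; the real library's QueryBuilder interface is richer).
public class ChainSketch {

    interface QueryBuilder {
        void setParameter(String qName, String value);
        String getQuery();
    }

    // Last link in the chain: handles core OpenSearch parameters.
    static class BasicQueryBuilder implements QueryBuilder {
        final Map<String, String> params = new LinkedHashMap<>();
        public void setParameter(String qName, String value) { params.put(qName, value); }
        public String getQuery() {
            StringBuilder sb = new StringBuilder("?");
            params.forEach((k, v) -> sb.append(k).append('=').append(v).append('&'));
            return sb.substring(0, sb.length() - 1); // drop trailing '&'
        }
    }

    // Extension link: handles geo parameters itself, forwards the rest.
    static class GeoQueryBuilder implements QueryBuilder {
        final QueryBuilder next;
        GeoQueryBuilder(QueryBuilder next) { this.next = next; }
        public void setParameter(String qName, String value) {
            if (qName.startsWith("geo:"))
                next.setParameter(qName, value.replace(",", "%2C")); // geo-specific encoding
            else
                next.setParameter(qName, value); // not ours: forward unchanged
        }
        public String getQuery() { return next.getQuery(); }
    }
}
```
Here the Geo link applies its own parameter handling (encoding commas in geo values) and forwards everything else to the core link, mirroring the forwarding rule described above.<br />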
<br />
<br />
In order for a chain of responsibility to be dynamically created by the ''DescriptionDocument'' class, a similar chain of abstract factories should be implemented. The resulting factories, one for URL Elements and another for Query Elements can then be passed to the constructor of the ''DescriptionDocument'' in order for it to be able to construct the correct elements. The ''FactoryResolver'' utility is responsible for the construction of factories capable of constructing instances supporting no more than the functionality necessary to process a given query. Since the chain structure is already known when constructing ''QueryBuilder'' instances, the latter are constructed without explicitly supplying a factory to the ''DescriptionDocument'', by the ''getQueryBuilder'' method of the already constructed ''URLElement''.<br />
<br />
The ''FactoryResolver'' requires that two things be known in order to be able to construct the factories:<br />
*A set of mappings from namespace URIs to factory class names, one for each component implementing either core functionality (in this case the namespace URI being equal to the OpenSearch namespace) or extension-related functionality. An example of such a mapping could be: <code><<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>, (org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElementFactory, org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElementFactory)></code>, which declares that the factories responsible for constructing the core-functionality implementations are the ''BasicURLElementFactory'' and ''BasicQueryElementFactory'' classes.<br />
*A list of all parameter namespaces present in the query string.<br />
<br />
Having the above information, and provided that the implementations of all factories and classes are available, the ''FactoryResolver'' will be able to construct, via reflection, the factories which can be used in order for the query to be properly processed. For example, if there are implementations available for core functionality, as well as for the Geo and Time extensions, and the query contains parameters from both of these extensions, the resulting chain of responsibility of the constructed instances will contain all three implementations, the last one being the implementation which supports core functionality, and the other two appearing in the chain in any order. If the query contains only standard OpenSearch parameters, there is no need for the chain to be burdened with links that will never be used, therefore a chain consisting of only the core implementation is constructed. The same holds if, for example, no Geo parameters are present in the query; the corresponding implementation will not be included in the chain.<br />
<br />
It should be stressed again that it is not necessary for all extensions that are expected to be met to be implemented in order for the library to work. Extending the library remains a purely optional task. There are, therefore, two choices when using the library:<br />
*Do not extend the library when in need of using extended parameters, relying on the core functionality provided by the library. In that case, the caller should be careful to supply the library with the correct values and format of parameters, so that a query can be constructed, albeit without the option of query validation or the ability to exploit additional functionality related to the extension.<br />
*Extend the library whenever this proves useful or makes things easier.<br />
<br />
=====Implementing a new Extension=====<br />
In order to implement a new extension that will be correctly incorporated into the already existing library functionality, one should do the following:<br />
*Implement ''URLElement'', ''QueryElement'' and ''QueryBuilder'' interface implementations, the constructors of which accept at least a reference to a corresponding upcasted object which will be next in the chain of responsibility. All requests that cannot be handled, or that require additional processing by subsequent links in the chain, should be forwarded to the next link in the chain.<br />
*Implement ''URLElementFactory'' and ''QueryElementFactory'' interface implementations, the constructors of which accept a single argument: a reference to an upcasted factory of the same type, corresponding to the factory used to create instances next in the chain of responsibility. An example of a ''URLElementFactory'' used for the construction of ''GeoURLElement''s which implement Geo extension functionality is the following:<br />
<source lang="java5"><br />
public class GeoURLElementFactory implements URLElementFactory {<br />
<br />
URLElementFactory f;<br />
<br />
public GeoURLElementFactory(URLElementFactory f) {<br />
this.f = f;<br />
}<br />
<br />
public GeoURLElement newInstance(Element url, Map<String, String> nsPrefixes) throws Exception {<br />
URLElement el = f.newInstance(url, nsPrefixes);<br />
return new GeoURLElement(url, el);<br />
}<br />
}<br />
</source><br />
*See that the ''getQueryBuilder'' method of the ''URLElement'' implementation of the new extension correctly constructs a ''QueryBuilder'' instance. For example, the ''getQueryBuilder'' method of the ''GeoURLElement'' presented above could look like this:<br />
<source lang="java5"><br />
public QueryBuilder getQueryBuilder() throws Exception {<br />
return new GeoQueryBuilder(el.getQueryBuilder());<br />
}<br />
</source><br />
where <code>el</code> is the next ''URLElement'' in the chain.<br />
*Add a mapping for the two new factories to the set of mappings passed to the ''FactoryResolver'' utility upon initialization of the library. For example, given that ''GeoURLElementFactory'' and ''GeoQueryElementFactory'' are implemented for Geo Extensions, one could add the mapping as follows:<br />
<source lang=java5><br />
factoryMappings.add("http://a9.com/-/opensearch/extensions/geo/1.0/",<br />
	new FactoryClassNamePair("org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory", "org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory"));<br />
</source><br />
<br />
===The OpenSearch Operator===<br />
====Description====<br />
The role of the ''OpenSearch Operator'' is to provide support for querying and retrieval of search results via [http://www.opensearch.org/Home OpenSearch] from providers which expose an [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch description document]. The operator accepts a query string consisting of a set of query parameters, which may include a number of search terms, and an [[#OpenSearch Resource|OpenSearch Resource]] reference, which contains the URL of an OpenSearch description document and various specifications relevant to the OpenSearch provider to be queried. After performing the number of OpenSearch queries required to obtain the desired results, it returns these results wrapped in a ResultSet.<br />
<br />
====Extensibility Points====<br />
The operator introduces and makes use of a set of functionalities beyond those of the standard OpenSearch specification. These extensions are supported by the introduction of a special [[#OpenSearch Resource|OpenSearch Resource]] structure and by the internal logic of the operator, the latter using standard OpenSearch functionality provided by the ''OpenSearch Core Library''. The extra functionalities are summarized as follows:<br />
*Both direct and brokered result processing is supported. Some OpenSearch-enabled providers diverge from the common case of returning a set of direct results and instead provide their results indirectly, by returning a set of links to other OpenSearch-enabled providers. Provided that both a transformation specification used to extract these links from the returned results as well as the OpenSearch resources for each one of the brokered OpenSearch services are available, the operator will return the full set of results provided by the brokered OpenSearch services.<br />
*The support of a set of fixed parameters, which override the user-provided parameters only at the level of the top provider, i.e. either the broker or the single provider in the direct-provider case. The purpose of these parameters is, first, to facilitate the creation of dynamic collections from results obtained by brokers, by superseding the caller's query parameters while querying the broker and using the full set of the caller's parameters only on lower levels and, second, to customize the behaviour of some provider to the needs of the gCube Framework (for example, to set a value for a required query parameter that the framework cannot handle). These two options can be used in tandem, if desired.<br />
*Support for one or more security schemes is planned for a subsequent version of the ''OpenSearch Library''.<br />
<br />
====OpenSearch Resource====<br />
The purpose of an ''OpenSearch Resource'' object is to describe the specifications of an OpenSearch provider. It encapsulates the extensions described in the [[#Extensibility Points|Extensibility Points]] section. The attributes included are the following:<br />
* The name of the resource<br />
* The URL of the OpenSearch Description Document of the provider to be queried<br />
* Information about whether the provider returns direct or brokered results, used by the operator to adapt its operation to both kinds of providers.<br />
* Data transformation specifications for a subset of the MIME types of the results which the result provider returns. The data transformation consists of two or, optionally, three parts:<br />
**The ''recordSplitXPath'' expression is used to split a page of search results into individual records. For example, for the RSS format, the <code><item></code> elements under <code>rss/channel</code> could be of interest.<br />
**The ''presentationInfo'' element, which is actually a map between field names and XPath expressions, is used to extract the desired values from the response.<br />
**The optional ''recordIdXPath'' expression can be used to tag each record with a unique identifier, extracted from the record itself. If the element is found, its payload is added as a <code>DocID</code> record attribute. If the element is not found or is empty, no <code>DocID</code> attribute is added to the record.<br />
* Security specifications (planned for a future version, when the supported security specifications are decided on). This element is optional, its absence implying the absence of a security scheme.<br />
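To illustrate how a transformation specification is applied, the following self-contained sketch (class and method names are illustrative; it uses record-relative XPath expressions, whereas actual resources such as the Bing example below use <code>//</code>-anchored ones) splits a result page with a ''recordSplitXPath'' expression and extracts one field per record:<br />
<br />
```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch: split a result page into records with a
// recordSplitXPath expression, then evaluate a field expression per record.
public class TransformationSketch {

    public static List<String> extractField(String xml, String recordSplitXPath,
                                            String fieldXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xp = XPathFactory.newInstance().newXPath();
            // Split the page into individual record nodes
            NodeList records = (NodeList) xp.evaluate(recordSplitXPath, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                // Extract the field value relative to the record node
                values.add(xp.evaluate(fieldXPath, record));
            }
            return values;
        } catch (Exception e) {
            throw new RuntimeException("transformation failed", e);
        }
    }
}
```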
<br />
The serialization of an ''OpenSearch Resource'' can be easily incorporated into a Generic Resource. The default mode of operation for the ''OpenSearch Operator'' in fact obtains the necessary OpenSearch resources by retrieving the corresponding Generic Resource from the [[Information_System|IS]].<br />
The Generic Resource utilized by the ''OpenSearch Operator'' is:<br />
*The ''OpenSearchResource'' which contains the body of the OpenSearch Resource as described below<br />
<br />
<br />
Note that, solely for testing purposes, the ''OpenSearch Operator'' also supports a local mode of operation, whereby all ''OpenSearch Resources'' are loaded from the local file system. <br />
<br />
The XML Schema that all OpenSearch Resource serializations should conform to is the following:<br />
<source lang="xml"><br />
<?xml version="1.0"?><br />
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"><br />
<xs:element name="OpenSearchResource"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="name" type="xs:string"/><br />
<xs:element name="descriptionDocumentURI" type="xs:string"/><br />
<xs:element name="brokeredResults" type="xs:boolean"/><br />
<xs:element name="transformation" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="MIMEType" type="xs:string"/><br />
<xs:element name="recordSplitXPath" type="xs:string"/><br />
<xs:element name="recordIdXPath" type="xs:string" minOccurs="0" maxOccurs="1"/><br />
          <xs:element name="presentationInfo" maxOccurs="unbounded"><br />
            <xs:complexType><br />
              <xs:sequence><br />
                <xs:element name="presentable" maxOccurs="unbounded"><br />
                  <xs:complexType><br />
                    <xs:sequence><br />
                      <xs:element name="fieldName" type="xs:string"/><br />
                      <xs:element name="expression" type="xs:string"/><br />
                    </xs:sequence><br />
                  </xs:complexType><br />
                </xs:element><br />
              </xs:sequence><br />
            </xs:complexType><br />
          </xs:element><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
<xs:element name="security" minOccurs="0"><br />
<xs:complexType><br />
<xs:sequence><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element> <br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:schema><br />
</source><br />
<br />
The transformation element can appear multiple times within an ''OpenSearch Resource''. The usual case is for a single transformation element per provider to be specified, but if transformation elements are present for more than one MIME type, the operator has the alternative of resorting to the next available transformation in sequence, if the result retrieval procedure fails for some reason. This strategy can only be meaningful if the same amount of information can be obtained from different result MIME types. <br />
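This fallback strategy can be sketched as a simple loop over the available transformations; the helper below is illustrative, not the operator's actual code:<br />
<br />
```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Sketch of the fallback strategy described above: transformations are tried
// in the order they appear in the resource, and the first one for which
// result retrieval succeeds wins.
public class TransformationFallback {

    public static <T, R> Optional<R> firstSuccessful(List<T> transformations,
                                                     Function<T, R> retrieve) {
        for (T t : transformations) {
            try {
                return Optional.of(retrieve.apply(t));
            } catch (RuntimeException e) {
                // Retrieval with this transformation failed; resort to the next one.
            }
        }
        return Optional.empty();
    }
}
```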
<br />
In the case of querying providers which return brokered results, the transformation element is used to specify a data transformation that extracts the URLs of the Description Documents of the brokered OpenSearch providers from the initial results provided by the OpenSearch provider acting as a broker.<br />
<br />
An example of an ''OpenSearch Resource'' serialization describing the [http://www.bing.com/ Bing] external repository as a direct OpenSearch provider, currently in use by the gCube Framework is the following:<br />
<source lang="xml"><br />
<OpenSearchResource><br />
<name>Bing</name><br />
<descriptionDocumentURI>http://imarine.web.cern.ch/imarine/OpenSearch/bing.xml</descriptionDocumentURI><br />
<brokeredResults>false</brokeredResults><br />
<parameters><br />
<parameter><br />
<fieldName>allIndexes</fieldName><br />
<qName>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms</qName><br />
</parameter><br />
</parameters><br />
<transformation><br />
<MIMEType>application/rss+xml</MIMEType><br />
<recordSplitXPath>*[local-name()='rss']/*[local-name()='channel']/*[local-name()='item']</recordSplitXPath><br />
<recordIdXPath>//*[local-name()='item']/*[local-name()='link']</recordIdXPath><br />
<presentationInfo><br />
<presentable><br />
<fieldName>title</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='title']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>link</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='link']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>description</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>S</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>pubDate</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='pubDate']</expression><br />
</presentable><br />
</presentationInfo><br />
</transformation><br />
</OpenSearchResource><br />
</source><br />
<br />
In case the datasource is a broker, the ''brokeredResults'' element should be set to ''true'':<br />
<br />
<source lang="xml"><br />
<brokeredResults>true</brokeredResults><br />
</source><br />
<br />
====OpenSearch Operator Logic====<br />
[[File:Opensearch_op_flowchart.png|right|Figure 1: A simplified flowchart of the operations performed by the OpenSearch operator]]<br />
The ''OpenSearch Operator'' employs the functionality provided by the [[#The OpenSearch Core Library|OpenSearch Core Library]] in order to extract the required information from the Description Document of the external provider and the ''QueryBuilder''s needed in order to perform queries, in a fashion similar to that described in the [[#Functionality|OpenSearch Library Functionality]] section. <br />
<br />
It should be noted that the ''OpenSearch Operator'' abstracts away the MIME type of the results that are to be obtained, treating it as low-level information which can be exploited by the way ''OpenSearch Resource''s are structured. Given that the OpenSearch Specification makes no assumptions about differences in the amount of information returned by results of different MIME types, there are two options:<br />
*If the amount of information returned from results of MIME Type A and MIME Type B are different, the desired MIME Type should be selected and an OpenSearch Resource constructed with only this MIME Type present in the transformation specifications. If needed, additional ''OpenSearch Resource''s can be constructed to exploit information returned from different MIME types. In this way, the MIME Type is abstracted away by the conceptual level of information detail obtained by the provider.<br />
*If there is more than one result MIME type exposing the same amount of information, or containing the same subset of information of interest, there exists the option of specifying more than one transformation specification, in a way which will result in the uniform presentation of the data to the caller. In this way, the MIME types are abstracted away by unifying result formats into a provider-specific schema. The option of having a single transformation specification is of course available in this case as well.<br />
<br />
In order for the proper way of constructing queries to be selected, a preprocessing step is performed by the operator, before issuing any actual queries. <br />
Given that an ''OpenSearch Resource'' can contain more than one transformation specification and that the number of templates present in a [http://www.opensearch.org/Specifications/OpenSearch/1.1#The_.22Url.22_element URL Element] is not necessarily limited to one, there can be more than one potential template match for the caller's query. The purpose of the preprocessing step is to select the ''QueryBuilder'' whose parameters best match that query. First, all MIME types supported by the external provider but lacking an associated transformation specification in the ''OpenSearch Resource'' describing the provider are discarded. The ''QueryBuilder''s of the first available MIME type for which a transformation specification exists are then processed and reordered according to the following rules:<br />
*''QueryBuilder''s whose required parameters are not covered by the parameters of the caller's query are discarded<br />
*''QueryBuilder''s are reordered so that the first one best matches the caller's query, i.e. all of its required parameters and as many of its optional parameters as possible are covered<br />
*A ''QueryBuilder'' which lacks a parameter present in the caller's query is considered a match. In that case the extra parameter is discarded. This rule assumes that query parameters narrow the search down and is enforced in order to account for brokered providers exposing slightly different sets of parameters than the broker or their siblings.<br />
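The selection and reordering rules above can be sketched roughly as follows. The ''BuilderTemplate'' class and its fields are illustrative stand-ins for the library's templates, not the actual API:<br />

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the QueryBuilder preprocessing step described above.
public class BuilderSelection {

    public static class BuilderTemplate {
        final Set<String> required;
        final Set<String> optional;
        public BuilderTemplate(Set<String> required, Set<String> optional) {
            this.required = required;
            this.optional = optional;
        }
    }

    // Discard templates whose required parameters are not covered by the
    // caller's query, then order the rest so that the best match comes first
    // (more covered optional parameters rank higher). Extra caller parameters
    // are tolerated, in line with the third rule above.
    public static List<BuilderTemplate> select(List<BuilderTemplate> templates,
                                               Set<String> queryParams) {
        return templates.stream()
            .filter(t -> queryParams.containsAll(t.required))
            .sorted(Comparator.comparingLong((BuilderTemplate t) ->
                        t.optional.stream().filter(queryParams::contains).count())
                    .reversed())
            .collect(Collectors.toList());
    }
}
```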
<br />
The most usual case is for the provider's ''OpenSearch Resource'' to be set up with a single transformation specification for the MIME type of interest. Furthermore, most URL Elements provide a single template for each MIME type. This results in only one ''QueryBuilder'' being available to construct queries, and thereby in a degenerate reordering step.<br />
<br />
The functions performed by the operator in order for a set of results to be retrieved, given that the proper ''QueryBuilder'' is selected, are summarized in the simplified diagram of Figure 1.<br />
<br />
As shown, the operator accepts a set of query terms and a set of query parameters.<br />
<br />
The operator's main course of action is to formulate and send queries requesting pages of search results as long as there still are results to be returned and the caller's requirement on the number of results, if present, is not met. A pager component ensures that page switching is performed correctly, managing the relevant standard OpenSearch query parameters, namely <code>startPage</code> or <code>startIndex</code> and <code>count</code>. These parameters are therefore abstracted away by the OpenSearch Operator. <br />
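The pager's bookkeeping can be illustrated with a small sketch; the ''Pager'' class and its method are hypothetical names, not the operator's internals:<br />

```java
// Illustrative sketch of the pager logic: request pages of `count` results,
// advancing startIndex until either the provider is exhausted or the caller's
// optional numOfResults limit is met.
public class Pager {
    // Returns the startIndex values of the pages that would be requested.
    public static java.util.List<Integer> pageStarts(int totalAvailable,
                                                     int count,
                                                     Integer numOfResults) {
        java.util.List<Integer> starts = new java.util.ArrayList<>();
        int wanted = (numOfResults == null)
                ? totalAvailable : Math.min(numOfResults, totalAvailable);
        // OpenSearch startIndex is 1-based by default
        for (int start = 1; start <= wanted; start += count)
            starts.add(start);
        return starts;
    }
}
```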
<br />
In the case of resources which return brokered results, the operator first retrieves the endpoints of the underlying brokered OpenSearch providers and reads their corresponding OpenSearch Resources so as to be able to retrieve the actual search results from them, either sequentially or concurrently. The extraction of brokered provider endpoints is not explicitly shown in the diagram. Furthermore, if an OpenSearch Resource structure is missing for one or more of the brokered services, the operator continues with the retrieval of results from the next available brokered service, ignoring it if it cannot obtain information for it. The same holds if all query formulation attempts for a provider fail.<br />
<br />
====Configurable Parameters====<br />
The ''OpenSearch Operator'' can be programmatically configured by passing to it a special configuration construct upon creation. The configuration parameters are the following:<br />
*The ''resultsPerPage'' parameter instructs the operator as to how many results per page should be requested when no other paging restrictions are in effect. The default value for this parameter currently is ''100''.<br />
*The ''sequentialResults'' parameter enables or disables sequential (single-threaded) result retrieval from brokered providers. When enabled, the results are retrieved from each provider in a sequential manner, i.e. the results retrieved from different providers are not intermingled. There is, however, a negative impact on performance. The default value for this configuration parameter currently is ''false''.<br />
*The ''useLocalResource'' parameter, when enabled, permits the operator to operate in the absence of an IS. The ''OpenSearch Resources'' are instead retrieved from the local file system. It is used solely for testing reasons, which is why its default value is, and will remain equal to ''false''.<br />
<br />
An additional configurable element is the set of mappings from query namespaces to the corresponding factories, as described in [[#Library Extensibility|Library Extensibility]].<br />
The ''sequentialResults'' parameter can also be configured in a per-query manner, by including it in the query string as a query parameter.<br />
<br />
====Query Format====<br />
The ''OpenSearch Operator'' expects to receive all query parameters, including the search terms, in a single query string. All query parameters should be of the form<br />
<br />
<code><br />
<URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"<br />
</code><br />
<br />
and should be space-delimited.<br />
Note that the presence of a namespace is mandatory for standard OpenSearch parameters as well.<br />
Any free-text parameter value should be URL-encoded.<br />
<br />
The reserved keyword ''config'', when used as a parameter namespace, denotes a configuration parameter. The query configuration parameters under this special namespace include the<br />
''sequentialResults'' parameter described in [[#Configurable Parameters|Configurable Parameters]], plus the ''numOfResults'' parameter, which can be used to impose a limit on the number of retrieved results. Both query configuration parameters are optional.<br />
The following hold for the query configuration parameter values:<br />
*The ''sequentialResults'' parameter should be assigned a value equal to ''true'' or ''false''. Its absence implies the default value of the corresponding configurable parameter of the operator.<br />
*The ''numOfResults'' parameter should be assigned an integer value, its absence implying that all available results should be retrieved.<br />
<br />
Taking everything into account, an example of a legitimate query for the ''OpenSearch Operator'' could be the following:<br />
<br />
<code>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms="Hello+World" config:numOfResults="300"</code><br />
<br />
which instructs the operator to use the string <code>Hello World</code> as the value of the <code>searchTerms</code> standard OpenSearch parameter and to retrieve up to 300 results from the provider.<br />
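A minimal parser for this query format might look as follows; the class and method names are illustrative, not the operator's actual parsing code:<br />

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.*;
import java.util.regex.*;

// Sketch of parsing the space-delimited <encodedNamespace>:<name>="<value>"
// query format described above. The namespace URI is URL-encoded, so it
// contains no raw ':' characters and the split is unambiguous.
public class QueryStringParser {
    private static final Pattern PARAM =
        Pattern.compile("([^\\s:]+):([^\\s=]+)=\"([^\"]*)\"");

    // Returns a map from "<decodedNamespace>:<name>" to the raw value.
    public static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        Matcher m = PARAM.matcher(query);
        while (m.find()) {
            try {
                String ns = URLDecoder.decode(m.group(1), "UTF-8");
                params.put(ns + ":" + m.group(2), m.group(3));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e); // UTF-8 is always available
            }
        }
        return params;
    }
}
```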
<br />
==The OpenSearch Service==<br />
===Description===<br />
The ''OpenSearch Service'' is a stateful web service responsible for the invocation of the ''OpenSearch Operator'' in the context of the provider to be queried.<br />
It also maintains a cache per provider WS-Resource, which contains the Generic Resources relevant to the top provider, the Generic Resources of all<br />
previously queried brokered providers and the corresponding Description Documents.<br />
<br />
<br />
===Deployment Instructions===<br />
<br />
In order to deploy and run the OpenSearch Service on a node we will need the following:<br />
* opensearchdatasource-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
<br />
===WS and Generic Resource Interrelation===<br />
Provided that a Collection for the provider to be queried is available, the ''OpenSearch Service'' uses a WS-Resource for each OpenSearch provider in order to bind the Service to the Collection corresponding to the provider.<br />
<br />
An ''OpenSearch Service'' WS-Resource contains the following properties:<br />
*The ''AdaptorID'', which is unique for every WS-Resource and is used for referencing the right WS-Resource when querying (it is optional).<br />
*The ''CollectionID'' of the collection to be used.<br />
*The ID of the Generic Resource of the top-provider, where the top-provider is the broker in the brokered case or the one and only direct provider in the direct case.<br />
*The URI of the Description Document (''DescriptionDocumentURI'') of the top-provider.<br />
*A set of fields (presentables and searchables) as extracted from the ''OpenSearchGenericResource''.<br />
*A set of ''FixedParameters'', which are used in every invocation of the Operator. See also [[#Extensibility Points|Extensibility Points]].<br />
<br />
As mentioned above, the WS-Resource contains a reference only to the ''OpenSearchResource'' of the top-provider. The Generic Resources of any providers reached through a broker are retrieved through an Information System implementation and are therefore not directly referenced by the WS-Resource.<br />
<br />
Some properties are dependent on information residing in the Generic Resources describing the providers and should, therefore, be updated accordingly when these Generic Resources are modified. Examples of such properties include the Description Document URI and the templates.<br />
<br />
On WS-Resource creation, only the Metadata Collection ID and the ID of the ''OpenSearchResource'' of the top-provider need to be supplied to the ''create'' operation of the service's factory. All other properties are created internally by the service itself.<br />
<br />
===Resource Caching===<br />
For performance and reliability reasons, the ''OpenSearch Service'' maintains one cache per WS-Resource which initially contains the Generic Resource (''OpenSearchResource'') and the Description Document of the top-provider. In the brokered case, the cache is updated at run-time by the relevant ''OpenSearch Operator'' module with the Generic Resources and Description Documents of all providers reached through the broker.<br />
<br />
To account for potential updates of Description Documents and/or the Generic Resources, the cache can be refreshed either on demand or periodically, based on a configurable time interval. The periodic refresh operation can be disabled if the Description Documents and the Generic Resource configuration is considered to be stable enough.<br />
<br />
The cache refresh cycle policy used is described as follows:<br />
*The Description Document and Generic Resources of the top-provider are discarded and their updated versions are retrieved from the external site and the Information System respectively. All dependent WS-Resource properties are also updated accordingly.<br />
*The Description Documents and Generic Resources of all brokered providers, if any, are discarded. Because of the potentially large number of brokered providers, the cache is not repopulated with their updated versions to avoid locking the cache for large periods of time or using a partially updated cache. These pieces of information are re-cached at run-time instead.<br />
*In the event of failure, the previously cached version is kept.<br />
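The keep-on-failure rule of the refresh cycle can be illustrated with a small sketch; the ''Fetcher'' interface is a hypothetical stand-in for retrieving a Description Document or Generic Resource, not the service's real API:<br />

```java
// Minimal sketch of the refresh-cycle rule "on failure, keep the previously
// cached version": the cached entry is replaced only after the new version
// has been fetched successfully.
public class CacheRefresh {
    public interface Fetcher { String fetch() throws Exception; }

    // Returns the refreshed value, or the old one if fetching fails.
    public static String refresh(String cached, Fetcher fetcher) {
        try {
            return fetcher.fetch();
        } catch (Exception e) {
            return cached; // keep the previously cached version on failure
        }
    }
}
```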
<br />
===Operations===<br />
The operations exposed by the OpenSearch Service are the following:<br />
*The ''query'' operation, with a single input message containing the query string to be sent to the operator, whose format is described in [[#Query Format|Query Format]].<br />
*The ''refreshCache'' operation, which sends a request in order to force the cache of the service to be refreshed. No refresh cycle will be initiated if a periodic refresh cycle is currently in progress.<br />
<br />
===Configurable Parameters===<br />
The Service currently supports three configurable parameters, which are exposed in its deployment descriptor:<br />
*The ''clearCacheOnStartup'' parameter, of ''boolean'' type, when enabled instructs the service to discard the stored cache on startup.<br />
*The ''cacheRefreshIntervalMillis'' parameter, of integral type, defines a time interval for the periodic cache refresh operation. The value ''0'' can be used to disable periodic cache refresh cycles.<br />
* The ''openSearchLibraryFactories'' parameter, of ''string'' type, is used to supply the ''OpenSearch Core Library'' with the factory mappings for all namespaces for which there exists an implementation of a library extension. For more information on the mappings, see also the section referring to the [[#Extensibility Mechanism|extensibility mechanism]] of the library.<br />
<br />
The ''openSearchLibraryFactories'' parameter is encoded as a sequence of mappings from strings to pairs, where each mapping is enclosed in braces, association is denoted by the ''='' sign and each pair is enclosed in parentheses. For example, given that there are implementations for core functionality, Geo and Time extensions, the value of this configuration parameter could be the following:<br />
<br />
<code><br />
[<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/geo/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/time/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.time.TimeURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.time.TimeQueryElementFactory)]<br />
</code><br />
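Decoding this encoding could be sketched as follows. The parser below is illustrative: it keeps the factory class names as plain strings, whereas the service would instantiate them via reflection:<br />

```java
import java.util.*;
import java.util.regex.*;

// Sketch of decoding the openSearchLibraryFactories parameter: a sequence of
// [namespaceURI=(urlElementFactoryClass,queryElementFactoryClass)] mappings.
public class FactoryMappingParser {
    private static final Pattern MAPPING =
        Pattern.compile("\\[([^=\\]]+)=\\(([^,)]+),([^)]+)\\)\\]");

    // Maps each namespace URI to its {URLElementFactory, QueryElementFactory}
    // class-name pair.
    public static Map<String, String[]> parse(String encoded) {
        Map<String, String[]> mappings = new LinkedHashMap<>();
        Matcher m = MAPPING.matcher(encoded);
        while (m.find())
            mappings.put(m.group(1).trim(),
                         new String[]{ m.group(2).trim(), m.group(3).trim() });
        return mappings;
    }
}
```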
<br />
==The OpenSearchDataSource Client Library==<br />
In this section some examples of usage of the ''OpenSearchDataSource Client Library'' are provided.<br />
<br />
'''Query''' example:<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext";<br />
<br />
final OpenSearchClient client = new OpenSearchClient.Builder()<br />
// .endpoint(endpoint) // you can also give a specific endpoint<br />
.scope(scope)<br />
.build();<br />
<br />
final String queryString = "((((gDocCollectionID == \"ea5b9c70-01ce-4b45-96e7-6db037ebf2bc\") and (gDocCollectionLang == \"en\"))) and (5575bbdb-6d47-4297-ad12-2259b3405ce7 = greece)) project d91f3c47-e46e-4737-9496-a0f72361a397 3e3584f0-eed3-4089-99cd-86a7def1471e";<br />
<br />
String grs2Locator = client.query(queryString);<br />
<br />
// or<br />
<br />
List<Map<String, String>> records = client.queryAndRead(queryString);<br />
<br />
</source><br />
<br />
'''Create Resource''' method example:<br />
<source lang="java"><br />
<br />
static void createResource(List<String> fieldParameters, List<String> fixedParameters, String collectionID, String openSearchResourceID, String scope) throws OpenSearchClientException {<br />
<br />
// "injector" is assumed to be a pre-configured dependency-injection container (e.g. a Guice Injector)<br />
OpenSearchFactoryClient factory = injector.getInstance(OpenSearchFactoryClient.Builder.class)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
List<String> fieldParams = Lists.newArrayList();<br />
<br />
for (int i=0; i<fieldParameters.size(); i++) {<br />
fieldParams.add(collectionID + ":" + fieldParameters.get(i));<br />
System.out.println("Field parameter: " + (i+1) + " " + fieldParams.get(i));<br />
}<br />
<br />
Provider p = new Provider();<br />
p.setCollectionID(collectionID);<br />
p.setOpenSearchResourceID(openSearchResourceID);<br />
p.setFixedParameters(fixedParameters);<br />
<br />
List<Provider> providers = Lists.newArrayList();<br />
providers.add(p);<br />
<br />
factory.createResource(fieldParams, providers, scope);<br />
}<br />
<br />
</source><br />
<br />
==External Links==<br />
Some useful external links for further reading are provided here:<br />
*[http://www.opensearch.org/Home OpenSearch Home]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_5 OpenSearch Specification]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch Description Document ]</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=OpenSearch_Framework&diff=21691OpenSearch Framework2014-06-11T12:53:55Z<p>Alex.antoniadi: /* The OpenSearchDataSource Client Library */</p>
<hr />
<div>==Description==<br />
The role of the gCube ''OpenSearch Framework'' is to enable the gCube Framework to access external providers which publish their results through search engines conforming to the [http://www.opensearch.org/Specifications/OpenSearch OpenSearch Specification]. The framework consists of two components <br />
*The ''OpenSearch Library'', which includes a core library providing general-purpose OpenSearch functionality, and the ''OpenSearch Operator'' which utilizes functionality provided by the former.<br />
*The ''OpenSearch Service'' (also called ''OpenSearchDataSource Service''), which binds collections with provider-specific information encapsulated in generic resources and invokes the ''OpenSearch Operator''<br />
To resolve ambiguity, the name "OpenSearch Library" will be used when referring to the whole ''OpenSearch Library'' component, whereas the name "OpenSearch Core Library" will be used when referring to the library constituent of the component.<br />
<br />
A client library for the OpenSearch Service, called the ''OpenSearchDataSource Client Library'', also exists in order to assist the programmatic use of the service. <br />
<br />
''OpenSearch Library'', ''OpenSearch Service'' and ''OpenSearchDataSource Client Library'' are available in our Maven repositories with the following coordinates:<br />
<source lang="xml"><br />
<!-- OpenSearch Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchlibrary</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearch Service --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-service</artifactId><br />
<version>...</version><br />
<br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-stubs</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearchDataSource Client Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-client-library</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==The OpenSearch Library==<br />
<br />
===The OpenSearch Core Library===<br />
====Description====<br />
The ''OpenSearch Core Library'' conforms to the latest OpenSearch specification and provides general OpenSearch-related functionality to any component which needs to query OpenSearch providers.<br />
It can be optionally extended, as described in the [[#Extensibility|Extensibility]] section, in order for OpenSearch Extensions whose parameters or other elements need special handling to be supported.<br />
The ''OpenSearch Operator'', described in a [[#The OpenSearch Operator|following]] section, functions atop this library.<br />
<br />
====Functionality====<br />
The central class which can be used in order to exploit the functionality provided by the library is the ''DescriptionDocument'' class. For reasons explained in the [[#Library Extensibility|following]] section, the ''DescriptionDocument'' class needs to be provided with a pair of ''URLElementFactory'' and ''QueryElementFactory'' factory classes. Provided that the query parameter namespaces present in the query string are extracted in some way and a namespace-to-factory mapping is available, this pair can be obtained from the ''FactoryResolver'' class, as follows:<br />
<source lang="java5"><br />
FactoryPair factories = FactoryResolver.getFactories(queryNamespaces, factoryMapping);<br />
</source><br />
The ''DescriptionDocument'' is then instantiated as follows:<br />
<source lang="java5"><br />
DescriptionDocument dd = new DescriptionDocument(descriptionDocumentXML, factories.urlElFactory, factories.queryElFactory);<br />
</source><br />
where the ''descriptionDocumentXML'' parameter corresponds to a DOM Document object containing the parsed Description Document.<br />
Properly instantiated, the ''DescriptionDocument'' class can provide any information relevant to the processed Description Document, as well as a mechanism to formulate search queries to send to the OpenSearch provider described by the Description Document. The latter is achieved by a ''QueryBuilder'' object, which can be obtained as follows:<br />
<source lang="java5"><br />
List<QueryBuilder> qbs = dd.getQueryBuilders(rel, MimeType);<br />
</source><br />
where <code>rel</code> is a rel value as described in the [http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_4#Url_rel_values OpenSearch Specification], e.g. <code>results</code> and <code>MimeType</code> is a MIME type, such as <code>application/rss+xml</code>. The returned list contains one ''QueryBuilder'' instance for each template contained in a URL Element with the specified <code>rel</code> and <code>type</code> attributes.<br />
Once the desired ''QueryBuilder'' is selected, it can be used to formulate a query by first assigning values to the parameters and then obtaining the constructed query.<br />
For example, <code>searchTerms</code> parameter can be set to some value as follows:<br />
<source lang="java5"><br />
qb.setParameter(OpenSearchConstants.searchTermsQName, searchTerms); <br />
</source><br />
Once all the required parameters are set, the constructed query can be obtained as follows:<br />
<source lang="java5"><br />
URL query;<br />
try {<br />
query = qb.getQuery();<br />
}catch(IncompleteQueryException iqe) {<br />
//Incomplete query exception handling<br />
}catch(MalformedQueryException mqe) {<br />
//Malformed query exception handling<br />
}<br />
</source><br />
<br />
Once the query is properly constructed and is available, it can be sent to the search engine of the provider in order to retrieve results. The returned results should be passed to either ''HTMLResponse'' or ''XMLResponse'', depending on the MIME type of the OpenSearch response, in order for the [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements] and any other available information contained in the response to be processed.<br />
<source lang="java5"><br />
InputStream responseStream = query.openConnection().getInputStream();<br />
OpenSearchResponse response = new XMLResponse(responseStream, factories.queryElFactory, qb, outputEncoding, dd.getURIToPrefixMappings());<br />
</source><br />
<br />
The raw XML data can then be obtained by the ''OpenSearchResponse'' object as follows:<br />
<source lang="java5"><br />
response.getResponse();<br />
</source><br />
and any information available, mainly relevant to paging, can be obtained by one of the methods of the ''OpenSearchResponse'' class. For example the total number of results, as reported by the <code>totalResults</code> Response Element, if present, can be obtained as follows:<br />
<source lang="java5"><br />
response.getTotalResults();<br />
</source><br />
<br />
====Library Extensibility====<br />
=====Motivation=====<br />
The core functionality provided by the ''OpenSearch Core Library'' is not limited to the processing of only standard OpenSearch parameters. More specifically, the basic components of the library treat all extended query parameters in a uniform way, making only the assumptions holding for any OpenSearch parameter, be it standard or extended. Furthermore, any unrecognized markup or element value is simply ignored.<br />
An example of an assumption made by the ''OpenSearch Core Library'', in the form of a requirement, is that all parameter values passed to its ''QueryBuilder'' components should be URL-encoded. This requirement is in accordance with the OpenSearch Specification and causes no problems for most OpenSearch parameters. In fact, if a client failed to URL-encode free-text values, query formulation would fail in the query URL construction phase.<br />
<br />
There are, however, cases in which the previous requirement proves problematic. For example, the OpenSearch Geo Extension presents examples of parameter values in which the comma character is not URL encoded, regardless of what the RFC specifications state. Such parameter values could therefore call for an extra URL decoding preprocessing step, or otherwise the caller should be required to not URL encode the values. Furthermore, it would be quite useful if the library could be aware of the specific format and other peculiarities and rules governing the syntax of extended parameters, for the purpose of query validation and for supporting any extra functionality provided by the extension. The support of value-adding functionality provided by extended OpenSearch elements by the library could also prove useful.<br />
<br />
=====Extensibility Mechanism=====<br />
The extensibility mechanism chosen for the library focuses on extensible elements, as described in the OpenSearch Specification, namely URL Elements and Query Elements. Furthermore, ''QueryBuilder'' components are included in the extensibility mechanism as they depend on the aforementioned elements.<br />
<br />
Given that the number of available OpenSearch extensions is quite large and because of the fact that not all of these extensions are utilized by some OpenSearch provider at the same time, the extensibility mechanism should allow the easy inclusion of library extensions for specific OpenSearch extensions in a dynamic, pluggable fashion. Furthermore, it should allow extension-related functionality to be dynamically added depending on the complexity of the query of the caller.<br />
<br />
The mechanism found to best satisfy the above requirements and implemented as the extensibility pattern for the library, is the construction of a Chain of Responsibility for each extensible component. A more detailed explanation follows:<br />
*The ''URLElement'', ''QueryElement'' and ''QueryBuilder'' components are interfaces whose implementations support core or extension-related functionality.<br />
*Core functionality processing takes place in the last link of the chain of responsibility. For example, ''BasicQueryBuilder'' implements core ''QueryBuilder'' functionality.<br />
*Each component implementing extension-related functionality contains a reference to the next link in the chain of responsibility. For example if ''GeoQueryBuilder'' implements functionality related to the Geo OpenSearch Extension, it contains a reference to a ''QueryBuilder'' implementing either core functionality, or functionality related to some other extension.<br />
*Each link in the chain of responsibility should process whatever information it can handle, otherwise forward the request to the next link in the chain.<br />
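A stripped-down sketch of such a chain for ''QueryBuilder''s might look as follows. The simplified interface, the class internals and the Geo comma-encoding behaviour shown are assumptions for illustration, not the library's actual implementation:<br />

```java
// Chain-of-responsibility sketch: GeoQueryBuilder handles Geo-extension
// parameters and forwards everything else to the next link in the chain;
// BasicQueryBuilder terminates the chain with core behaviour.
public class ChainSketch {
    public interface QueryBuilder {
        void setParameter(String namespace, String name, String value);
    }

    public static class BasicQueryBuilder implements QueryBuilder {
        public final java.util.Map<String, String> params = new java.util.HashMap<>();
        public void setParameter(String ns, String name, String value) {
            // Core behaviour: values are expected to be URL-encoded already
            params.put(ns + ":" + name, value);
        }
    }

    public static class GeoQueryBuilder implements QueryBuilder {
        public static final String GEO_NS = "http://a9.com/-/opensearch/extensions/geo/1.0/";
        private final QueryBuilder next; // next link in the chain

        public GeoQueryBuilder(QueryBuilder next) { this.next = next; }

        public void setParameter(String ns, String name, String value) {
            if (GEO_NS.equals(ns)) {
                // Extension-specific handling: Geo values such as "box" use
                // unencoded commas, so encode them before forwarding
                next.setParameter(ns, name, value.replace(",", "%2C"));
            } else {
                next.setParameter(ns, name, value); // not ours: forward as-is
            }
        }
    }
}
```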
<br />
<br />
In order for a chain of responsibility to be dynamically created by the ''DescriptionDocument'' class, a similar chain of abstract factories should be implemented. The resulting factories, one for URL Elements and another for Query Elements can then be passed to the constructor of the ''DescriptionDocument'' in order for it to be able to construct the correct elements. The ''FactoryResolver'' utility is responsible for the construction of factories capable of constructing instances supporting no more than the functionality necessary to process a given query. Since the chain structure is already known when constructing ''QueryBuilder'' instances, the latter are constructed without explicitly supplying a factory to the ''DescriptionDocument'', by the ''getQueryBuilder'' method of the already constructed ''URLElement''.<br />
<br />
The ''FactoryResolver'' requires that two things be known in order to be able to construct the factories:<br />
*A set of mappings from namespace URIs to factory class names, one for each component implementing either core functionality (in this case the namespace URI being equal to the OpenSearch namespace) or extension-related functionality. An example of such a mapping could be: <code><<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>, (org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElement, org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElement)></code>, which declares that the implementations responsible for providing core functionality are the ''BasicURLElement'' and ''BasicQueryElement'' classes.<br />
*A list of all parameter namespaces present in the query string.<br />
<br />
Having the above information, and provided that the implementations of all factories and classes are available, the ''FactoryResolver'' is able to construct, via reflection, the factories which can be used in order for the query to be properly processed. For example, if there are implementations available for core functionality as well as for the Geo and Time extensions, and the query contains parameters from both of these extensions, the resulting chain of responsibility of the constructed instances will contain all three implementations, the last one being the implementation which supports core functionality, and the other two appearing in the chain in any order. If the query contains only standard OpenSearch parameters, there is no need for the chain to be burdened with links that will never be used, so a chain consisting of only the core implementation is constructed. The same holds if, for example, there are no Geo parameters present in the query; the corresponding implementation will not be included in the chain.<br />
<br />
It should be stressed again that it is not necessary for all extensions that are expected to be encountered to be implemented in order for the library to work. Extending the library remains a purely optional task. There are, therefore, two choices when using the library:<br />
*Do not extend the library when in need of using extended parameters, relying on the core functionality provided by the library. In that case, the caller should be careful to supply the library with the correct values and formats of parameters, so that a query can be constructed, albeit without the option of query validation or the ability to exploit additional functionality related to the extension.<br />
*Extend the library whenever this proves useful or makes things easier.<br />
<br />
=====Implementing a new Extension=====<br />
In order to implement a new extension that will be correctly incorporated into the already existing library functionality, one should do the following:<br />
*Implement the ''URLElement'', ''QueryElement'' and ''QueryBuilder'' interfaces, with constructors that accept at least a reference to a corresponding upcasted object which will be next in the chain of responsibility. All requests that cannot be handled, or require additional processing by subsequent links in the chain, should be forwarded to the next link in the chain.<br />
*Implement the ''URLElementFactory'' and ''QueryElementFactory'' interfaces, with constructors that accept a single argument: a reference to an upcasted factory of the same type, corresponding to the factory used to create instances next in the chain of responsibility. An example of a ''URLElementFactory'' used for the construction of ''GeoURLElement''s which implement Geo extension functionality is the following:<br />
<source lang="java5"><br />
public class GeoURLElementFactory implements URLElementFactory {<br />
<br />
URLElementFactory f;<br />
<br />
public GeoURLElementFactory(URLElementFactory f) {<br />
this.f = f;<br />
}<br />
<br />
public GeoURLElement newInstance(Element url, Map<String, String> nsPrefixes) throws Exception {<br />
URLElement el = f.newInstance(url, nsPrefixes);<br />
return new GeoURLElement(url, el);<br />
}<br />
}<br />
</source><br />
*See that the ''getQueryBuilder'' method of the ''URLElement'' implementation of the new extension correctly constructs a ''QueryBuilder'' instance. For example, the ''getQueryBuilder'' method of the ''GeoURLElement'' presented above could look like this:<br />
<source lang="java5"><br />
public QueryBuilder getQueryBuilder() throws Exception {<br />
return new GeoQueryBuilder(el.getQueryBuilder());<br />
}<br />
</source><br />
where <code>el</code> is the next ''URLElement'' in the chain.<br />
*Add a mapping for the two new factories to the set of mappings passed to the ''FactoryResolver'' utility upon initialization of the library. For example, given that ''GeoURLElementFactory'' and ''GeoQueryElementFactory'' are implemented for Geo Extensions, one could add the mapping as follows:<br />
<source lang=java5><br />
factoryMappings.add("http://a9.com/-/opensearch/extensions/geo/1.0/",<br />
new FactoryClassNamePair("org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory", "org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory"));<br />
</source><br />
<br />
===The OpenSearch Operator===<br />
====Description====<br />
The role of the ''OpenSearch Operator'' is to provide support for querying and retrieval of search results via [http://www.opensearch.org/Home OpenSearch] from providers which expose an [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch description document]. The operator accepts a query string consisting of a set of query parameters, which may include a number of search terms, and an [[#OpenSearch Resource|OpenSearch Resource]] reference which contains the URL of an OpenSearch description document and various specifications relevant to the OpenSearch provider to be queried. After performing the number of OpenSearch queries required to obtain the desired results, it returns these results wrapped in a ResultSet.<br />
<br />
====Extensibility Points====<br />
The operator introduces and makes use of a set of functionalities beyond those of the standard OpenSearch specification. These extensions are supported by the introduction of a special [[#OpenSearch Resource|OpenSearch Resource]] structure and by the internal logic of the operator, the latter using standard OpenSearch functionality provided by the ''OpenSearch Core Library''. The extra functionalities are summarized as follows:<br />
*Both direct and brokered result processing is supported. Some OpenSearch-enabled providers diverge from the common case of returning a set of direct results and instead provide their results indirectly, by returning a set of links to other OpenSearch-enabled providers. Provided that both a transformation specification used to extract these links from the returned results and the OpenSearch resources for each one of the brokered OpenSearch services are available, the operator will return the full set of results provided by the brokered OpenSearch services.<br />
*The support of a set of fixed parameters, which override the user-provided parameters only at the level of the top provider, i.e. either the broker or the single direct provider in the direct case. The purpose of these parameters is, first, to facilitate the creation of dynamic collections from results obtained by brokers, by superseding the caller's query parameters while querying the broker and using the full set of caller parameters only on lower levels, and, second, to customize the behaviour of a provider to the needs of the gCube Framework (for example, to set a value for a required query parameter that the framework cannot handle). These options can also be used in tandem, if desired.<br />
*Support for one or more security schemes is planned for a subsequent version of the ''OpenSearch Library''.<br />
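The fixed-parameter behaviour described above can be sketched with a small, hedged example; the helper and its names are hypothetical illustrations, not part of the operator's API. Fixed parameters supersede the caller's values only when the top provider is being queried:<br />
<source lang="java">
import java.util.HashMap;
import java.util.Map;

public class FixedParams {
    // Hypothetical helper: fixed parameters override the caller's parameters
    // only at the top-provider level (the broker, or the single direct provider).
    public static Map<String, String> effectiveParams(Map<String, String> callerParams,
                                                      Map<String, String> fixedParams,
                                                      boolean topProvider) {
        Map<String, String> merged = new HashMap<>(callerParams);
        if (topProvider) {
            merged.putAll(fixedParams); // fixed values supersede the caller's
        }
        return merged;
    }
}
</source>
Lower-level (brokered) providers would be queried with <code>topProvider</code> set to <code>false</code>, i.e. with the caller's full parameter set.<br />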
<br />
====OpenSearch Resource====<br />
The purpose of an ''OpenSearch Resource'' object is to describe the specifications of an OpenSearch provider. It encapsulates the extensions described in the [[#Extensibility Points|Extensibility Points]] section. The attributes included are the following:<br />
* The name of the resource<br />
* The URL of the OpenSearch Description Document of the provider to be queried<br />
* Information about whether the provider returns direct or brokered results, used by the operator to adapt its operation to both kinds of providers.<br />
* Data transformation specifications for a subset of the MIME types of the results which the result provider returns. The data transformation consists of two or, optionally, three parts:<br />
**The ''RecordSplitXPath'' expression is used to split a page of search results into individual records. For example, for the RSS format, the <code><item></code> elements under <code>rss/channel</code> could be of interest.<br />
**The ''presentationInfo'' element, which is actually a map between field names and XPath expressions that are used to extract the desired values from the response.<br />
**The optional ''RecordIdXPath'' expression can be used to tag each record with a unique identifier, extracted from the record itself. If the element is found, its payload is added as a <code>DocID</code> record attribute. Otherwise, if the element is not found or is empty, no DocID attribute is added to the record.<br />
* Security specifications (planned for a future version, when the supported security specifications are decided on). This element is optional, its absence implying the absence of a security scheme.<br />
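The record-splitting and field-extraction steps can be sketched with the standard Java XPath facilities. The class below is purely illustrative, and the per-record expressions are simplified, record-relative variants of the examples given above:<br />
<source lang="java">
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of the two-step transformation: split the
// response page into records, then evaluate a field expression per record.
public class RecordExtraction {
    static final String SPLIT_XPATH =
        "*[local-name()='rss']/*[local-name()='channel']/*[local-name()='item']";

    public static List<String> extract(String rssPage, String fieldXPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssPage)));
        XPath xp = XPathFactory.newInstance().newXPath();
        // RecordSplitXPath: isolate the individual records
        NodeList items = (NodeList) xp.evaluate(SPLIT_XPATH, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            // a presentationInfo-style expression, evaluated record by record
            values.add(xp.evaluate(fieldXPath, items.item(i)));
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String page = "<rss><channel>"
                + "<item><title>first</title></item>"
                + "<item><title>second</title></item>"
                + "</channel></rss>";
        System.out.println(extract(page, "*[local-name()='title']"));
    }
}
</source>
A ''RecordIdXPath'' would be handled the same way, with the extracted string attached to the record as its <code>DocID</code> attribute.<br />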
<br />
The serialization of an ''OpenSearch Resource'' can be easily incorporated into a Generic Resource. The default mode of operation for the ''OpenSearch Operator'' in fact obtains the necessary OpenSearch resources by retrieving the corresponding Generic Resource from the [[Information_System|IS]].<br />
The Generic Resource utilized by the ''OpenSearch Operator'' is:<br />
*The ''OpenSearchResource'' which contains the body of the OpenSearch Resource as described below<br />
<br />
<br />
Note that, solely for testing purposes, the ''OpenSearch Operator'' also supports a local mode of operation, whereby all ''OpenSearch Resources'' are loaded from the local file system. <br />
<br />
The XML Schema that all OpenSearch Resource serializations should conform to is the following:<br />
<source lang="xml"><br />
<?xml version="1.0"?><br />
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"><br />
<xs:element name="OpenSearchResource"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="name" type="xs:string"/><br />
<xs:element name="descriptionDocumentURI" type="xs:string"/><br />
<xs:element name="brokeredResults" type="xs:boolean"/><br />
<xs:element name="transformation" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="MIMEType" type="xs:string"/><br />
<xs:element name="recordSplitXPath" type="xs:string"/><br />
<xs:element name="recordIdXPath" type="xs:string" minOccurs="0" maxOccurs="1"/><br />
<xs:element name="presentationInfo" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="presentable" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="fieldName" type="xs:string"/><br />
<xs:element name="expression" type="xs:string"/><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
<xs:element name="security" minOccurs="0"><br />
<xs:complexType><br />
<xs:sequence><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element> <br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:schema><br />
</source><br />
<br />
The transformation element can appear multiple times within an ''OpenSearch Resource''. The usual case is for a single transformation element per provider to be specified, but if transformation elements are present for more than one MIME type, the operator has the alternative of resorting to the next available transformation in sequence, if the result retrieval procedure fails for some reason. This strategy can only be meaningful if the same amount of information can be obtained from different result MIME types. <br />
<br />
In the case of querying providers which return brokered results, the transformation element is used to specify a data transformation that extracts the URLs of the Description Documents of the brokered OpenSearch providers from the initial results provided by the OpenSearch provider acting as a broker.<br />
<br />
An example of an ''OpenSearch Resource'' serialization describing the [http://www.bing.com/ Bing] external repository as a direct OpenSearch provider, currently in use by the gCube Framework is the following:<br />
<source lang="xml"><br />
<OpenSearchResource><br />
<name>Bing</name><br />
<descriptionDocumentURI>http://imarine.web.cern.ch/imarine/OpenSearch/bing.xml</descriptionDocumentURI><br />
<brokeredResults>false</brokeredResults><br />
<parameters><br />
<parameter><br />
<fieldName>allIndexes</fieldName><br />
<qName>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms</qName><br />
</parameter><br />
</parameters><br />
<transformation><br />
<MIMEType>application/rss+xml</MIMEType><br />
<recordSplitXPath>*[local-name()='rss']/*[local-name()='channel']/*[local-name()='item']</recordSplitXPath><br />
<recordIdXPath>//*[local-name()='item']/*[local-name()='link']</recordIdXPath><br />
<presentationInfo><br />
<presentable><br />
<fieldName>title</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='title']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>link</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='link']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>description</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>S</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>pubDate</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='pubDate']</expression><br />
</presentable><br />
</presentationInfo><br />
</transformation><br />
</OpenSearchResource><br />
</source><br />
<br />
In case our datasource is a broker, ''brokeredResults'' should be set to ''true'':<br />
<br />
<source lang="xml"><br />
<brokeredResults>true</brokeredResults><br />
</source><br />
<br />
====OpenSearch Operator Logic====<br />
[[File:Opensearch_op_flowchart.png|right|Figure 1: A simplified flowchart of the operations performed by the OpenSearch operator]]<br />
The ''OpenSearch Operator'' employs the functionality provided by the [[#The OpenSearch Core Library|OpenSearch Core Library]] in order to extract the required information from the Description Document of the external provider and the ''QueryBuilder''s needed in order to perform queries, in a fashion similar to that described in the [[#Functionality|OpenSearch Library Functionality]] section. <br />
<br />
It should be noted that the ''OpenSearch Operator'' abstracts away the MIME type of the results that are to be obtained, treating it as low-level information which can be exploited through the way ''OpenSearch Resource''s are structured. Given that the OpenSearch Specification makes no assumptions about differences in the amount of information returned by results of different MIME types, there are two options:<br />
*If the amount of information returned from results of MIME type A and MIME type B differs, the desired MIME type should be selected and an OpenSearch Resource constructed with only this MIME type present in the transformation specifications. If needed, additional ''OpenSearch Resource''s can be constructed to exploit information returned by other MIME types. In this way, the MIME type is abstracted away behind the conceptual level of information detail obtained from the provider.<br />
*If there is more than one result MIME type exposing the same amount of information, or containing the same subset of information of interest, there exists the option of specifying more than one transformation specification, in a way which will result in the uniform presentation of the data to the caller. In this way, the MIME types are abstracted away by unifying result formats to a provider-specific schema. The option of having a single transformation specification is of course available in this case as well.<br />
<br />
In order for the proper way of constructing queries to be selected, a preprocessing step is performed by the operator, before issuing any actual queries. <br />
Given that an ''OpenSearch Resource'' can contain more than one transformation specification and that the number of templates present in a [http://www.opensearch.org/Specifications/OpenSearch/1.1#The_.22Url.22_element URL Element] is not necessarily limited to one, there can be more than one potential template match for the caller's query. The purpose of the preprocessing step is to select the ''QueryBuilder'' whose parameters best match the caller's query. First, all MIME types supported by the external provider, but lacking an associated transformation specification in the ''OpenSearch Resource'' describing the provider, are discarded. The ''QueryBuilder''s of the first available MIME type for which a transformation specification exists are processed and reordered according to the following rules:<br />
*''QueryBuilder''s whose required parameters are not covered by the parameters of the caller's query are discarded<br />
*''QueryBuilder''s are reordered so that the first one best matches the caller's query, i.e. all of its required parameters and as many of its optional parameters as possible are covered<br />
*A ''QueryBuilder'' which lacks a parameter present in the caller's query is considered a match. In that case the extra parameter is discarded. This rule assumes that query parameters narrow the search down and is enforced in order to account for brokered providers exposing slightly different sets of parameters than the broker or their siblings.<br />
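The reordering rules above can be sketched as follows. The types and names are hypothetical and only model the described logic, not the operator's actual classes:<br />
<source lang="java">
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the preprocessing rules: builders whose required
// parameters are not covered are discarded; the rest are ordered by how
// many of their parameters the caller's query covers.
public class BuilderSelection {
    public static class TemplateInfo {
        public final Set<String> required;
        public final Set<String> optional;
        public TemplateInfo(Set<String> required, Set<String> optional) {
            this.required = required;
            this.optional = optional;
        }
    }

    public static List<TemplateInfo> order(List<TemplateInfo> templates, Set<String> queryParams) {
        List<TemplateInfo> eligible = new ArrayList<>();
        for (TemplateInfo t : templates) {
            if (queryParams.containsAll(t.required)) { // rule 1: required params must be covered
                eligible.add(t);
            }
        }
        // rule 2: best coverage of the caller's parameters first.
        // rule 3: query parameters absent from a template are simply ignored,
        // so a partial match is still a match.
        eligible.sort(Comparator.comparingInt((TemplateInfo t) -> {
            Set<String> covered = new HashSet<>(t.required);
            covered.addAll(t.optional);
            covered.retainAll(queryParams);
            return -covered.size(); // negate: more coverage sorts first
        }));
        return eligible;
    }
}
</source>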
<br />
The most usual case is for the provider's ''OpenSearch Resource'' to be set up with a single transformation specification for the MIME type of interest. Furthermore, most URL Elements provide a single template for each MIME type. This case results in only one ''QueryBuilder'' being available to construct queries, thereby reducing the reordering step to a trivial one.<br />
<br />
The functions performed by the operator in order for a set of results to be retrieved, given that the proper ''QueryBuilder'' is selected, are summarized in the simplified diagram of Figure 1.<br />
<br />
As shown, the operator accepts a set of query terms and a set of query parameters.<br />
<br />
The operator's main course of action is to formulate and send queries requesting pages of search results as long as there still are results to be returned and the caller requirement of the number of results, if present, is not met. A pager component sees that page switching is performed correctly, managing the relevant standard OpenSearch query parameters, namely <code>startPage</code> or <code>startIndex</code> and <code>count</code>. These parameters are therefore abstracted away by the OpenSearch Operator. <br />
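A minimal sketch of such a pager follows, assuming the 1-based indexing of the OpenSearch specification; the class is illustrative, not the operator's implementation:<br />
<source lang="java">
// Hypothetical pager: translates a running result offset into the
// standard OpenSearch paging parameters (count and startIndex).
public class Pager {
    private final int pageSize;
    private int nextIndex; // 1-based, as in the OpenSearch specification

    public Pager(int pageSize, int firstIndex) {
        this.pageSize = pageSize;
        this.nextIndex = firstIndex;
    }

    /** Paging parameters for the next page request, then advance. */
    public String nextPageParams() {
        String params = "count=" + pageSize + "&startIndex=" + nextIndex;
        nextIndex += pageSize;
        return params;
    }
}
</source>
The caller keeps requesting pages until the provider returns no further results or the requested number of results is reached.<br />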
<br />
In the case of resources which return brokered results, the operator first retrieves the endpoints of the underlying brokered OpenSearch providers and reads their corresponding OpenSearch Resources so as to be able to retrieve the actual search results from them, either sequentially or concurrently. The extraction of brokered provider endpoints is not explicitly shown in the diagram. Furthermore, if an OpenSearch Resource structure is missing for one or more of the brokered services, the operator continues with the retrieval of results from the next available brokered service, ignoring any service for which it cannot obtain information. The same holds if all query formulation attempts for a provider fail.<br />
<br />
====Configurable Parameters====<br />
The ''OpenSearch Operator'' can be programmatically configured by passing to it a special configuration construct upon creation. The configuration parameters are the following:<br />
*The ''resultsPerPage'' parameter instructs the operator as to how many results per page should be requested when no other paging restrictions are in effect. The default value for this parameter currently is ''100''.<br />
*The ''sequentialResults'' parameter controls whether results are retrieved from brokered providers sequentially or in a multi-threaded fashion. When enabled, the results are retrieved from each provider in a sequential manner, i.e. the results retrieved from different providers are not intermingled; there is, however, a negative impact on performance. The default value for this configuration parameter is currently ''false''.<br />
*The ''useLocalResource'' parameter, when enabled, permits the operator to operate in the absence of an IS. The ''OpenSearch Resources'' are instead retrieved from the local file system. It is used solely for testing reasons, which is why its default value is, and will remain equal to ''false''.<br />
<br />
An additional configurable element is the set of mappings from query namespaces to the corresponding factories, as described in [[#Library Extensibility|Library Extensibility]].<br />
The ''sequentialResults'' parameter can also be configured on a per-query basis, by including it in the query string as a query parameter.<br />
<br />
====Query Format====<br />
The ''OpenSearch Operator'' expects to receive all query parameters, including the search terms, in a single query string. All query parameters should be of the form<br />
<br />
<code><br />
<URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"<br />
</code><br />
<br />
and should be space-delimited.<br />
Note that the presence of a namespace is mandatory for standard OpenSearch parameters as well.<br />
Any free-text parameter value should be URL-encoded.<br />
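A parameter in this format can be produced as shown below; the helper is a hypothetical illustration of the encoding rules, not part of the operator's API:<br />
<source lang="java">
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: build a single query parameter in the
// <URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>" form,
// URL-encoding the namespace URI and the free-text value.
public class QueryParam {
    public static String format(String namespaceUri, String name, String value) {
        try {
            return URLEncoder.encode(namespaceUri, "UTF-8") + ":" + name
                    + "=\"" + URLEncoder.encode(value, "UTF-8") + "\"";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
</source>
Multiple parameters produced this way are then joined with single spaces to form the complete query string.<br />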
<br />
The reserved keyword ''config'', when used as a parameter namespace, denotes a configuration parameter. The query configuration parameters under this special configuration namespace include the<br />
''sequentialResults'' parameter described in [[#Configurable Parameters|Configurable Parameters]], plus the ''numOfResults'' parameter, which can be used to impose a limit on the number of retrieved results. Both query configuration parameters are optional.<br />
The following hold for the query configuration parameter values:<br />
*The ''sequentialResults'' parameter should be assigned a value equal to ''true'' or ''false''. Its absence implies the default value of the corresponding configurable parameter of the operator.<br />
*The ''numOfResults'' parameter should be assigned an integer value, its absence implying that all available results should be retrieved.<br />
<br />
Taking everything into account, an example of a legitimate query for the ''OpenSearch Operator'' could be the following:<br />
<br />
<code>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms="Hello+World" config:numOfResults="300"</code><br />
<br />
which instructs the operator to use the string <code>Hello World</code> as the value for the <code>searchTerms</code> standard OpenSearch parameter and to retrieve up to 300 results from the provider.<br />
<br />
==The OpenSearch Service==<br />
===Description===<br />
The ''OpenSearch Service'' is a stateful web service responsible for the invocation of the ''OpenSearch Operator'' in the context of the provider to be queried.<br />
It also maintains a cache per provider WS-Resource, which contains the Generic Resources relevant to the top provider, the Generic Resources of all<br />
previously queried brokered providers and the corresponding Description Documents.<br />
<br />
<br />
===Deployment Instructions===<br />
<br />
In order to deploy and run the OpenSearch Service on a node we will need the following:<br />
* opensearchdatasource-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
<br />
===WS and Generic Resource Interrelation===<br />
Provided that a Collection for the provider to be queried is available, the ''OpenSearch Service'' uses a WS-Resource for each OpenSearch provider in order to bind the Service to the Collection corresponding to the provider.<br />
<br />
An ''OpenSearch Service'' WS-Resource contains the following properties:<br />
*The ''AdaptorID'', which is unique for every WS-Resource and is used for referencing the right WS-Resource when querying (it is optional).<br />
*The ''CollectionID'' of the collection to be used.<br />
*The ID of the Generic Resource of the top-provider, where the top-provider is the broker in the brokered case or the one and only direct provider in the direct case.<br />
*The URI of the Description Document (''DescriptionDocumentURI'') of the top-provider.<br />
*A set of fields (presentables and searchables) as extracted from the ''OpenSearchGenericResource''.<br />
*A set of ''FixedParameters'', which are used in every invocation of the Operator. See also [[#Extensibility Points|Extensibility Points]].<br />
<br />
As mentioned above, the WS-Resource contains a reference only to the ''OpenSearchResource'' of the top-provider. The Generic Resources of any providers reached through a broker are retrieved through an Information System implementation and are therefore not directly referenced by the WS-Resource.<br />
<br />
Some properties are dependent on information residing in the Generic Resources describing the providers and should, therefore, be updated accordingly when these Generic Resources are modified. Examples of such properties include the Description Document URI and the templates.<br />
<br />
On WS-Resource creation, only the Metadata Collection ID and the ID of the ''OpenSearchResource'' of the top-provider need to be supplied to the ''create'' operation of the service's factory. All other properties are created internally by the service itself.<br />
<br />
===Resource Caching===<br />
For performance and reliability reasons, the ''OpenSearch Service'' maintains one cache per WS-Resource which initially contains the Generic Resource (''OpenSearchResource'') and the Description Document of the top-provider. In the brokered case, the cache is updated at run-time by the relevant ''OpenSearch Operator'' module with the Generic Resources and Description Documents of all providers reached through the broker.<br />
<br />
To account for potential updates of Description Documents and/or the Generic Resources, the cache can be refreshed either on demand or periodically, based on a configurable time interval. The periodic refresh operation can be disabled if the Description Documents and the Generic Resource configuration is considered to be stable enough.<br />
<br />
The cache refresh cycle policy used is described as follows:<br />
*The Description Document and Generic Resources of the top-provider are discarded and their updated versions are retrieved from the external site and the Information System respectively. All dependent WS-Resource properties are also updated accordingly.<br />
*The Description Documents and Generic Resources of all brokered providers, if any, are discarded. Because of the potentially large number of brokered providers, the cache is not repopulated with their updated versions to avoid locking the cache for large periods of time or using a partially updated cache. These pieces of information are re-cached at run-time instead.<br />
*In the event of failure, the previously cached version is kept.<br />
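The keep-the-old-copy-on-failure behaviour can be sketched as follows, using a hypothetical, simplified cache entry rather than the service's actual implementation:<br />
<source lang="java">
import java.util.concurrent.Callable;

// Hypothetical cache entry: attempt to fetch an updated copy and fall
// back to the previously cached version if the fetch fails.
public class CachedEntry<T> {
    private T value;

    public CachedEntry(T initial) { this.value = initial; }

    public T get() { return value; }

    public void refresh(Callable<T> fetcher) {
        try {
            value = fetcher.call(); // replace the cached copy only on success
        } catch (Exception e) {
            // keep the previously cached version
        }
    }
}
</source>
In the service, <code>fetcher</code> would stand for the retrieval of a Description Document from the external site or of a Generic Resource from the Information System.<br />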
<br />
===Operations===<br />
The operations exposed by the OpenSearch Service are the following:<br />
*The ''query'' operation, with a single input message containing the query string to be sent to the operator, whose format is described in [[#Query Format|Query Format]].<br />
*The ''refreshCache'' operation, which sends a request in order to force the cache of the service to be refreshed. No refresh cycle will be initiated if a periodic refresh cycle is currently in progress.<br />
<br />
===Configurable Parameters===<br />
The Service currently supports three configurable parameters, which are exposed through its deployment descriptor:<br />
*The ''clearCacheOnStartup'' parameter, of ''boolean'' type, when enabled instructs the service to discard the stored cache on startup.<br />
*The ''cacheRefreshIntervalMillis'' parameter, of integral type, defines a time interval for the periodic cache refresh operation. The value ''0'' can be used to disable periodic cache refresh cycles.<br />
* The ''openSearchLibraryFactories'' parameter, of ''string'' type is used to supply the ''OpenSearch Core Library'' with the factory mappings for all namespaces for which there exists an implementation of a library extension. For more information on the mappings, see also the section referring to the [[#Extensibility Mechanism|extensibility mechanism]] of the library.<br />
<br />
The ''openSearchLibraryFactories'' parameter is encoded as a sequence of mappings from strings to pairs, where each mapping is enclosed in braces, association is denoted by the ''='' sign and each pair is enclosed in parentheses. For example, given that there are implementations for core functionality, Geo and Time extensions, the value of this configuration parameter could be the following:<br />
<br />
<code><br />
[<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/geo/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/time/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.time.TimeURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.time.TimeQueryElementFactory)]<br />
</code><br />
<br />
==The OpenSearchDataSource Client Library==<br />
In this section some examples of usage of the ''OpenSearchDataSource Client Library'' are provided.<br />
<br />
'''Query''' example:<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext";<br />
<br />
final OpenSearchClient client = new OpenSearchClient.Builder()<br />
// .endpoint(endpoint) // you can also give a specific endpoint<br />
.scope(scope)<br />
.build();<br />
<br />
final String queryString = "((((gDocCollectionID == \"ea5b9c70-01ce-4b45-96e7-6db037ebf2bc\") and (gDocCollectionLang == \"en\"))) and (5575bbdb-6d47-4297-ad12-2259b3405ce7 = greece)) project d91f3c47-e46e-4737-9496-a0f72361a397 3e3584f0-eed3-4089-99cd-86a7def1471e";<br />
<br />
String grs2Locator = client.query(queryString);<br />
<br />
// or<br />
<br />
List<Map<String, String>> records = client.queryAndRead(queryString);<br />
<br />
</source><br />
<br />
'''Create Resource''' method example:<br />
<source lang="java"><br />
<br />
static void createResource(List<String> fieldParameters, List<String> fixedParameters, String collectionID, String openSearchResourceID, String scope) throws OpenSearchClientException {<br />
<br />
// 'injector' is assumed to be a dependency injection container (e.g. Guice) configured with the client library's modules<br />
OpenSearchFactoryClient factory = injector.getInstance(OpenSearchFactoryClient.Builder.class)<br />
.scope(scope)<br />
.build();<br />
<br />
<br />
List<String> fieldParams = Lists.newArrayList();<br />
<br />
for (int i=0; i<fieldParameters.size(); i++) {<br />
fieldParams.add(collectionID + ":" + fieldParameters.get(i));<br />
System.out.println("Field parameter: " + (i+1) + " " + fieldParams.get(i));<br />
}<br />
<br />
Provider p = new Provider();<br />
p.setCollectionID(collectionID);<br />
p.setOpenSearchResourceID(openSearchResourceID);<br />
p.setFixedParameters(fixedParameters);<br />
<br />
List<Provider> providers = Lists.newArrayList();<br />
providers.add(p);<br />
<br />
factory.createResource(fieldParams, providers, scope);<br />
}<br />
<br />
</source><br />
<br />
==External Links==<br />
Some useful external links for further reading are provided here:<br />
*[http://www.opensearch.org/Home OpenSearch Home]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_5 OpenSearch Specification]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch Description Document]</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=OpenSearch_Framework&diff=21690OpenSearch Framework2014-06-11T12:43:58Z<p>Alex.antoniadi: /* The OpenSearch Service */</p>
<hr />
<div>==Description==<br />
The role of the gCube ''OpenSearch Framework'' is to enable the gCube Framework to access external providers which publish their results through search engines conforming to the [http://www.opensearch.org/Specifications/OpenSearch OpenSearch Specification]. The framework consists of two components: <br />
*The ''OpenSearch Library'', which includes a core library providing general-purpose OpenSearch functionality, and the ''OpenSearch Operator'' which utilizes functionality provided by the former.<br />
*The ''OpenSearch Service'' (also called ''OpenSearchDataSource Service''), which binds collections with provider-specific information encapsulated in generic resources and invokes the ''OpenSearch Operator''.<br />
To resolve ambiguity, the name "OpenSearch Library" will be used when referring to the whole ''OpenSearch Library'' component, whereas the name "OpenSearch Core Library" will be used when referring to the library constituent of the component.<br />
<br />
A client library for the OpenSearch Service, called ''OpenSearchDataSource Client Library'', also exists in order to assist the programmatic use of the service. <br />
<br />
''OpenSearch Library'', ''OpenSearch Service'' and ''OpenSearchDataSource Client Library'' are available in our Maven repositories with the following coordinates:<br />
<source lang="xml"><br />
<!-- OpenSearch Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchlibrary</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearch Service --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-service</artifactId><br />
<version>...</version><br />
<br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-stubs</artifactId><br />
<version>...</version><br />
<br />
<!-- OpenSearchDataSource Client Library --><br />
<groupId>org.gcube.opensearch</groupId><br />
<artifactId>opensearchdatasource-client-library</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==The OpenSearch Library==<br />
<br />
===The OpenSearch Core Library===<br />
====Description====<br />
The ''OpenSearch Core Library'' conforms to the latest OpenSearch specification and provides general OpenSearch-related functionality to any component which needs to query OpenSearch providers.<br />
It can be optionally extended, as described in the [[#Extensibility|Extensibility]] section, in order for OpenSearch Extensions whose parameters or other elements need special handling to be supported.<br />
The ''OpenSearch Operator'', described in a [[#The OpenSearch Operator|following]] section, functions atop this library.<br />
<br />
====Functionality====<br />
The central class which can be used in order to exploit the functionality provided by the library, is the ''DescriptionDocument'' class. For reasons explained in the [[#Library Extensibility|following]] section, the ''DescriptionDocument'' class needs to be provided with a pair of ''URLElementFactory'' and ''QueryElementFactory'' factory classes. Provided that the query parameter namespaces present in the query string are extracted in some way and a namespace-to-factory mapping is available, this pair can be obtained by the ''FactoryResolver'' class, as follows:<br />
<source lang="java5"><br />
FactoryPair factories = FactoryResolver.getFactories(queryNamespaces, factoryMapping);<br />
</source><br />
The ''DescriptionDocument'' is then instantiated as follows:<br />
<source lang="java5"><br />
DescriptionDocument dd = new DescriptionDocument(descriptionDocumentXML, factories.urlElFactory, factories.queryElFactory);<br />
</source><br />
where the ''descriptionDocumentXML'' parameter corresponds to a DOM Document object containing the parsed Description Document.<br />
Properly instantiated, the ''DescriptionDocument'' class can provide any information relevant to the processed Description Document, as well as a mechanism to formulate search queries to send to the OpenSearch provider described by the Description Document. The latter is achieved by a ''QueryBuilder'' object, which can be obtained as follows:<br />
<source lang="java5"><br />
List<QueryBuilder> qbs = dd.getQueryBuilders(rel, MimeType);<br />
</source><br />
where <code>rel</code> is a rel value as described in the [http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_4#Url_rel_values OpenSearch Specification], e.g. <code>results</code> and <code>MimeType</code> is a MIME type, such as <code>application/rss+xml</code>. The returned list contains one ''QueryBuilder'' instance for each template contained in a URL Element with the specified <code>rel</code> and <code>type</code> attributes.<br />
Once the desired ''QueryBuilder'' is selected, it can be used to formulate a query by first assigning values to the parameters and then obtaining the constructed query.<br />
For example, <code>searchTerms</code> parameter can be set to some value as follows:<br />
<source lang="java5"><br />
qb.setParameter(OpenSearchConstants.searchTermsQName, searchTerms); <br />
</source><br />
Once all the required parameters are set, the constructed query can be obtained as follows:<br />
<source lang="java5"><br />
URL query;<br />
try {<br />
query = qb.getQuery();<br />
} catch (IncompleteQueryException iqe) {<br />
	//Incomplete query exception handling<br />
} catch (MalformedQueryException mqe) {<br />
	//Malformed query exception handling<br />
}<br />
</source><br />
<br />
Once the query is properly constructed and is available, it can be sent to the search engine of the provider in order to retrieve results. The returned results should be passed to either ''HTMLResponse'' or ''XMLResponse'', depending on the MIME type of the OpenSearch response, in order for the [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements] and any other available information contained in the response to be processed.<br />
<source lang="java5"><br />
InputStream responseStream = query.openConnection().getInputStream();<br />
OpenSearchResponse response = new XMLResponse(responseStream, factories.queryElFactory, qb, outputEncoding, dd.getURIToPrefixMappings());<br />
</source><br />
<br />
The raw XML data can then be obtained by the ''OpenSearchResponse'' object as follows:<br />
<source lang="java5"><br />
response.getResponse();<br />
</source><br />
and any information available, mainly relevant to paging, can be obtained by one of the methods of the ''OpenSearchResponse'' class. For example the total number of results, as reported by the <code>totalResults</code> Response Element, if present, can be obtained as follows:<br />
<source lang="java5"><br />
response.getTotalResults();<br />
</source><br />
<br />
====Library Extensibility====<br />
=====Motivation=====<br />
The core functionality provided by the ''OpenSearch Core Library'' is not limited to the processing of only standard OpenSearch parameters. More specifically, the basic components of the library treat all extended query parameters in a uniform way, making only the assumptions holding for any OpenSearch parameter, be it standard or extended. Furthermore, any unrecognized markup or element value is simply ignored.<br />
An example of an assumption made by the ''OpenSearch Core Library'', in the form of a requirement, is that all parameter values passed to its ''QueryBuilder'' components should be URL-encoded. This requirement is in accordance with the OpenSearch Specification and causes no problems for most OpenSearch parameters. In fact, if a client failed to URL-encode free-text values, query formulation would fail in the query URL construction phase.<br />
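For example, a caller relying only on the core functionality is expected to URL-encode free-text values before handing them to a ''QueryBuilder''. A minimal illustration using the standard Java API (the parameter value is arbitrary):<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeExample {
    public static void main(String[] args) {
        // Free-text values must be URL-encoded before being passed to a QueryBuilder,
        // as the OpenSearch Specification requires for template parameter values
        String raw = "tuna fisheries, Aegean";
        String encoded = URLEncoder.encode(raw, StandardCharsets.UTF_8);
        System.out.println(encoded); // tuna+fisheries%2C+Aegean
    }
}
```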
<br />
There are, however, cases in which the previous requirement proves problematic. For example, the OpenSearch Geo Extension presents examples of parameter values in which the comma character is not URL encoded, regardless of what the RFC specifications state. Such parameter values could therefore call for an extra URL decoding preprocessing step, or otherwise the caller should be required to not URL encode the values. Furthermore, it would be quite useful if the library could be aware of the specific format and other peculiarities and rules governing the syntax of extended parameters, for the purpose of query validation and for supporting any extra functionality provided by the extension. The support of value-adding functionality provided by extended OpenSearch elements by the library could also prove useful.<br />
<br />
=====Extensibility Mechanism=====<br />
The extensibility mechanism chosen for the library focuses on extensible elements, as described in the OpenSearch Specification, namely URL Elements and Query Elements. Furthermore, ''QueryBuilder'' components are included in the extensibility mechanism as they depend on the aforementioned elements.<br />
<br />
Given that the number of available OpenSearch extensions is quite large and because of the fact that not all of these extensions are utilized by some OpenSearch provider at the same time, the extensibility mechanism should allow the easy inclusion of library extensions for specific OpenSearch extensions in a dynamic, pluggable fashion. Furthermore, it should allow extension-related functionality to be dynamically added depending on the complexity of the query of the caller.<br />
<br />
The mechanism found to best satisfy the above requirements and implemented as the extensibility pattern for the library, is the construction of a Chain of Responsibility for each extensible component. A more detailed explanation follows:<br />
*The ''URLElement'', ''QueryElement'' and ''QueryBuilder'' components are interfaces whose implementations support core or extension-related functionality.<br />
*Core functionality processing takes place in the last link of the chain of responsibility. For example, ''BasicQueryBuilder'' implements core ''QueryBuilder'' functionality.<br />
*Each component implementing extension-related functionality contains a reference to the next link in the chain of responsibility. For example if ''GeoQueryBuilder'' implements functionality related to the Geo OpenSearch Extension, it contains a reference to a ''QueryBuilder'' implementing either core functionality, or functionality related to some other extension.<br />
*Each link in the chain of responsibility should process whatever information it can handle, otherwise forward the request to the next link in the chain.<br />
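The pattern can be illustrated with a self-contained sketch; the interface, class and method names below are invented for illustration and greatly simplify the actual library interfaces:<br />

```java
import java.util.HashMap;
import java.util.Map;

public class ChainSketch {
    // Simplified stand-in for the library's QueryBuilder interface
    interface QueryBuilder {
        void setParameter(String namespaceURI, String name, String value);
        Map<String, String> parameters();
    }

    // Last link in the chain: core OpenSearch functionality (cf. BasicQueryBuilder)
    static class BasicQueryBuilder implements QueryBuilder {
        private final Map<String, String> params = new HashMap<>();
        public void setParameter(String ns, String name, String value) {
            params.put(name, value);
        }
        public Map<String, String> parameters() { return params; }
    }

    // Extension link: handles Geo parameters itself, forwards everything else
    static class GeoQueryBuilder implements QueryBuilder {
        static final String GEO_NS = "http://a9.com/-/opensearch/extensions/geo/1.0/";
        private final QueryBuilder next; // next link in the chain of responsibility
        GeoQueryBuilder(QueryBuilder next) { this.next = next; }
        public void setParameter(String ns, String name, String value) {
            if (GEO_NS.equals(ns)) {
                // extension-specific handling: e.g. a crude check of the "box" value syntax
                if (name.equals("box") && value.split(",").length != 4)
                    throw new IllegalArgumentException("box expects west,south,east,north");
                next.setParameter(ns, name, value);
            } else {
                next.setParameter(ns, name, value); // not ours: forward down the chain unchanged
            }
        }
        public Map<String, String> parameters() { return next.parameters(); }
    }

    public static void main(String[] args) {
        QueryBuilder qb = new GeoQueryBuilder(new BasicQueryBuilder());
        qb.setParameter("http://a9.com/-/spec/opensearch/1.1/", "searchTerms", "tuna");
        qb.setParameter(GeoQueryBuilder.GEO_NS, "box", "22.0,37.0,26.0,41.0");
        System.out.println(qb.parameters());
    }
}
```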
<br />
<br />
In order for a chain of responsibility to be dynamically created by the ''DescriptionDocument'' class, a similar chain of abstract factories should be implemented. The resulting factories, one for URL Elements and another for Query Elements can then be passed to the constructor of the ''DescriptionDocument'' in order for it to be able to construct the correct elements. The ''FactoryResolver'' utility is responsible for the construction of factories capable of constructing instances supporting no more than the functionality necessary to process a given query. Since the chain structure is already known when constructing ''QueryBuilder'' instances, the latter are constructed without explicitly supplying a factory to the ''DescriptionDocument'', by the ''getQueryBuilder'' method of the already constructed ''URLElement''.<br />
<br />
The ''FactoryResolver'' requires that two things be known in order to be able to construct the factories:<br />
*A set of mappings from namespace URIs to factory class names, one for each component implementing either core functionality (in this case the namespace URI being equal to the OpenSearch namespace) or extension-related functionality. An example of such a mapping could be: <code><<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>, (org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElement, org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElement)></code>, which declares that the implementations responsible for providing core functionality are the ''BasicURLElement'' and ''BasicQueryElement'' classes.<br />
*A list of all parameter namespaces present in the query string.<br />
<br />
Having the above information, and provided that the implementations of all factories and classes are available, the ''FactoryResolver'' will be able to construct, via reflection, the factories which can be used in order for the query to be properly processed. For example, if implementations are available for core functionality as well as for the Geo and Time extensions, and the query contains parameters from both extensions, the resulting chain of responsibility of the constructed instances will contain all three implementations, the last one being the implementation which supports core functionality, and the other two appearing in the chain in any order. If the query contains only standard OpenSearch parameters, there is no need for the chain to be burdened with links that will never be used, so a chain consisting of only the core implementation is constructed. The same holds if, for example, no Geo parameters are present in the query; the corresponding implementation will not be included in the chain.<br />
<br />
It should be stressed again that it is not necessary for all extensions that are expected to be met to be implemented in order for the library to work. Extending the library remains a purely optional task. There are, therefore, two choices when using the library:<br />
*Do not extend the library when in need of using extended parameters, relying on the core functionality provided by the library. In that case, the caller should be careful to supply the library with the correct values and formats of parameters so that a query can be constructed, albeit without the option of query validation or the ability to exploit additional functionality related to the extension.<br />
*Extend the library whenever this proves useful or makes things easier.<br />
<br />
=====Implementing a new Extension=====<br />
In order to implement a new extension that will be correctly incorporated into the already existing library functionality, one should do the following:<br />
*Implement ''URLElement'', ''QueryElement'' and ''QueryBuilder'' interface implementations, the constructor of which accepts at least a reference to a corresponding upcasted object which will be next in the chain of responsibility. All requests that cannot be handled, or require additional processing by subsequent links in the chain, should be forwarded to the next link in the chain.<br />
*Implement a ''URLElementFactory'' and a ''QueryElementFactory'' interface implementations, the constructors of which accept a single argument, which is a reference to an upcasted factory of the same type corresponding to the factory used to create instances next in the chain of responsibility. An example of a ''URLElementFactory'' used for the construction of ''GeoURLElement''s which implement Geo extension functionality is the following:<br />
<source lang="java5"><br />
public class GeoURLElementFactory implements URLElementFactory {<br />
<br />
URLElementFactory f;<br />
<br />
public GeoURLElementFactory(URLElementFactory f) {<br />
this.f = f;<br />
}<br />
<br />
public GeoURLElement newInstance(Element url, Map<String, String> nsPrefixes) throws Exception {<br />
URLElement el = f.newInstance(url, nsPrefixes);<br />
return new GeoURLElement(url, el);<br />
}<br />
}<br />
</source><br />
*See that the ''getQueryBuilder'' method of the ''URLElement'' implementation of the new extension correctly constructs a ''QueryBuilder'' instance. For example, the ''getQueryBuilder'' method of the ''GeoURLElement'' constructed above could look like this:<br />
<source lang="java5"><br />
public QueryBuilder getQueryBuilder() throws Exception {<br />
return new GeoQueryBuilder(el.getQueryBuilder());<br />
}<br />
</source><br />
where <code>el</code> is the next ''URLElement'' in the chain.<br />
*Add a mapping for the two new factories to the set of mappings passed to the ''FactoryResolver'' utility upon initialization of the library. For example, given that ''GeoURLElementFactory'' and ''GeoQueryElementFactory'' are implemented for Geo Extensions, one could add the mapping as follows:<br />
<source lang=java5><br />
factoryMappings.add("http://a9.com/-/opensearch/extensions/geo/1.0/",<br />
		new FactoryClassNamePair("org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory", "org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory"));<br />
</source><br />
<br />
===The OpenSearch Operator===<br />
====Description====<br />
The role of the ''OpenSearch Operator'' is to provide support for querying and retrieval of search results via [http://www.opensearch.org/Home OpenSearch] from providers which expose an [http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch description document]. The operator accepts a query string consisting of a set of query parameters, which may include a number of search terms, and an [[#OpenSearch Resource|OpenSearch Resource]] reference which contains the URL of an OpenSearch description document and various specifications relevant to the OpenSearch provider to be queried. After performing the number of OpenSearch queries required to obtain the desired results, it returns these results wrapped in a ResultSet.<br />
<br />
====Extensibility Points====<br />
The operator introduces and makes use of a set of functionalities beyond those of the standard OpenSearch specification. These extensions are supported by the introduction of a special [[#OpenSearch Resource|OpenSearch Resource]] structure and by the internal logic of the operator, the latter using standard OpenSearch functionality provided by the ''OpenSearch Core Library''. The extra functionalities are summarized as follows:<br />
*Both direct and brokered result processing is supported. Some OpenSearch-enabled providers diverge from the common case of returning a set of direct results and instead provide their results indirectly, by returning a set of links to other OpenSearch-enabled providers. Provided that both a transformation specification used to extract these links from the returned results as well as the OpenSearch resources for each one of the brokered OpenSearch services are available, the operator will return the full set of results provided by the brokered OpenSearch services.<br />
*The support of a set of fixed parameters, which override the user-provided parameters only at the level of the top provider, i.e. either the broker or the single direct provider in the direct-provider case. The purpose of these parameters is, first, to facilitate the creation of dynamic collections from results obtained by brokers, by superseding the caller's query parameters while querying the broker and using the full set of the caller's parameters only on lower levels and, second, to customize the behaviour of some provider to the needs of the gCube Framework (for example, to set a value for a required query parameter that the framework cannot handle). These options can also be used in tandem, if desired.<br />
*Support for one or more security schemes is planned for a subsequent version of the ''OpenSearch Library''.<br />
<br />
====OpenSearch Resource====<br />
The purpose of an ''OpenSearch Resource'' object is to describe the specifications of an OpenSearch provider. It encapsulates the extensions described in the [[#Extensibility Points|Extensibility Points]] section. The attributes included are the following:<br />
* The name of the resource<br />
* The URL of the OpenSearch Description Document of the provider to be queried<br />
* Information about whether the provider returns direct or brokered results, used by the operator to adapt its operation to both kinds of providers.<br />
* Data transformation specifications for a subset of the MIME types of the results which the result provider returns. The data transformation consists of two or, optionally, three parts:<br />
**The ''RecordSplitXPath'' expression is used to split a page of search results into individual records. For example for the rss format, the <code><item></code> elements under <code>rss/channel</code> could be of interest<br />
**The ''presentationInfo'' expression which is actually a map between fieldnames and XPath expressions that are used to extract the desired value from the response.<br />
**The optional ''RecordIdXPath'' expression can be used to tag each record with a unique identifier, extracted from the record itself. If the element is found, its payload is added as a <code>DocID</code> record attribute. Otherwise, if the element is not found or is empty, no DocID attribute is added to the record.<br />
* Security specifications (planned for a future version, when the supported security specifications are decided on). This element is optional, its absence implying the absence of a security scheme.<br />
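To make the role of the transformation expressions concrete, the following self-contained sketch applies a ''RecordSplitXPath'' and a ''presentationInfo''-style expression to a toy RSS page using the standard javax.xml APIs; the RSS snippet and the field extracted are invented for illustration:<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class TransformExample {
    // recordSplitXPath: isolates the individual records of a result page
    static final String RECORD_SPLIT =
        "*[local-name()='rss']/*[local-name()='channel']/*[local-name()='item']";
    // a presentationInfo-style expression, evaluated relative to each record
    static final String TITLE_EXPR = "*[local-name()='title']";

    static List<String> titles(String resultPage) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(resultPage.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();
        // split the page into records...
        NodeList items = (NodeList) xp.evaluate(RECORD_SPLIT, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        // ...then extract the desired field value from each record
        for (int i = 0; i < items.getLength(); i++)
            out.add(xp.evaluate(TITLE_EXPR, items.item(i)));
        return out;
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss><channel>"
            + "<item><title>First</title><link>http://example.org/1</link></item>"
            + "<item><title>Second</title><link>http://example.org/2</link></item>"
            + "</channel></rss>";
        System.out.println(titles(rss)); // [First, Second]
    }
}
```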
<br />
The serialization of an ''OpenSearch Resource'' can be easily incorporated into a Generic Resource. The default mode of operation for the ''OpenSearch Operator'' in fact obtains the necessary OpenSearch resources by retrieving the corresponding Generic Resource from the [[Information_System|IS]].<br />
The Generic Resource utilized by the ''OpenSearch Operator'' is:<br />
*The ''OpenSearchResource'' which contains the body of the OpenSearch Resource as described below<br />
<br />
<br />
Note that, solely for testing purposes, the ''OpenSearch Operator'' also supports a local mode of operation, whereby all ''OpenSearch Resources'' are loaded from the local file system. <br />
<br />
The XML Schema that all OpenSearch Resource serializations should conform to is the following:<br />
<source lang="xml"><br />
<?xml version="1.0"?><br />
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"><br />
<xs:element name="OpenSearchResource"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="name" type="xs:string"/><br />
<xs:element name="descriptionDocumentURI" type="xs:string"/><br />
<xs:element name="brokeredResults" type="xs:boolean"/><br />
<xs:element name="parameters" minOccurs="0"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="parameter" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="fieldName" type="xs:string"/><br />
<xs:element name="qName" type="xs:string"/><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
<xs:element name="transformation" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="MIMEType" type="xs:string"/><br />
<xs:element name="recordSplitXPath" type="xs:string"/><br />
<xs:element name="recordIdXPath" type="xs:string" minOccurs="0" maxOccurs="1"/><br />
<xs:element name="presentationInfo" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="presentable" maxOccurs="unbounded"><br />
<xs:complexType><br />
<xs:sequence><br />
<xs:element name="fieldName" type="xs:string"/><br />
<xs:element name="expression" type="xs:string"/><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
<xs:element name="security" minOccurs="0"><br />
<xs:complexType><br />
<xs:sequence><br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element> <br />
</xs:sequence><br />
</xs:complexType><br />
</xs:element><br />
</xs:schema><br />
</source><br />
<br />
The transformation element can appear multiple times within an ''OpenSearch Resource''. The usual case is for a single transformation element per provider to be specified, but if transformation elements are present for more than one MIME type, the operator has the alternative of resorting to the next available transformation in sequence, if the result retrieval procedure fails for some reason. This strategy can only be meaningful if the same amount of information can be obtained from different result MIME types. <br />
<br />
In the case of querying providers which return brokered results, the transformation element is used to specify a data transformation that extracts the URLs of the Description Documents of the brokered OpenSearch providers from the initial results provided by the OpenSearch provider acting as a broker.<br />
<br />
An example of an ''OpenSearch Resource'' serialization describing the [http://www.bing.com/ Bing] external repository as a direct OpenSearch provider, currently in use by the gCube Framework, is the following:<br />
<source lang="xml"><br />
<OpenSearchResource><br />
<name>Bing</name><br />
<descriptionDocumentURI>http://imarine.web.cern.ch/imarine/OpenSearch/bing.xml</descriptionDocumentURI><br />
<brokeredResults>false</brokeredResults><br />
<parameters><br />
<parameter><br />
<fieldName>allIndexes</fieldName><br />
<qName>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms</qName><br />
</parameter><br />
</parameters><br />
<transformation><br />
<MIMEType>application/rss+xml</MIMEType><br />
<recordSplitXPath>*[local-name()='rss']/*[local-name()='channel']/*[local-name()='item']</recordSplitXPath><br />
<recordIdXPath>//*[local-name()='item']/*[local-name()='link']</recordIdXPath><br />
<presentationInfo><br />
<presentable><br />
<fieldName>title</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='title']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>link</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='link']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>description</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>S</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='description']</expression><br />
</presentable><br />
<presentable><br />
<fieldName>pubDate</fieldName><br />
<expression>//*[local-name()='item']/*[local-name()='pubDate']</expression><br />
</presentable><br />
</presentationInfo><br />
</transformation><br />
</OpenSearchResource><br />
</source><br />
<br />
If the datasource acts as a broker, the ''brokeredResults'' element should be set to true:<br />
<br />
<source lang="xml"><br />
<brokeredResults>true</brokeredResults><br />
</source><br />
<br />
====OpenSearch Operator Logic====<br />
[[File:Opensearch_op_flowchart.png|right|Figure 1: A simplified flowchart of the operations performed by the OpenSearch operator]]<br />
The ''OpenSearch Operator'' employs the functionality provided by the [[#The OpenSearch Core Library|OpenSearch Core Library]] in order to extract the required information from the Description Document of the external provider and the ''QueryBuilder''s needed in order to perform queries, in a fashion similar to that described in the [[#Functionality|OpenSearch Library Functionality]] section. <br />
<br />
It should be noted that the ''OpenSearch Operator'' abstracts away the MIME type of the results that are to be obtained, treating it as low-level information which can be exploited by the way ''OpenSearch Resource''s are structured. Given that the OpenSearch Specification makes no assumptions about differences in the amount of information returned by results of different MIME types, there are two options:<br />
*If the amount of information returned in results of MIME type A and MIME type B differs, the desired MIME type should be selected and an OpenSearch Resource constructed with only this MIME type present in the transformation specifications. If needed, additional ''OpenSearch Resource''s can be constructed to exploit information returned in different MIME types. In this way, the MIME type is abstracted away by the conceptual level of information detail obtained from the provider.<br />
*If more than one result MIME type exposes the same amount of information, or contains the same subset of information of interest, there exists the option of specifying more than one transformation specification, in a way which will result in the uniform presentation of the data to the caller. In this way, the MIME types are abstracted away by unifying result formats to a provider-specific schema. The option of having a single transformation specification is, of course, available in this case as well.<br />
<br />
In order for the proper way of constructing queries to be selected, a preprocessing step is performed by the operator, before issuing any actual queries. <br />
Given that an ''OpenSearch Resource'' can contain more than one transformation specification and that the number of templates present in a [http://www.opensearch.org/Specifications/OpenSearch/1.1#The_.22Url.22_element URL Element] is not necessarily limited to one, there can be more than one potential template match for the caller's query. The purpose of the preprocessing step is to select the ''QueryBuilder'' whose parameters best match the caller's query. First, all MIME types supported by the external provider but lacking an associated transformation specification in the ''OpenSearch Resource'' describing the provider are discarded. The ''QueryBuilder''s of the first available MIME type for which a transformation specification exists are processed and reordered according to the following rules:<br />
*''QueryBuilder''s whose required parameters are not covered by the parameters of the caller's query are discarded<br />
*''QueryBuilder''s are reordered so that the first one best matches the caller's query, i.e. all of its required parameters and as many of its optional parameters as possible are covered<br />
*A ''QueryBuilder'' which lacks a parameter present in the caller's query is considered a match. In that case the extra parameter is discarded. This rule assumes that query parameters narrow the search down and is enforced in order to account for brokered providers exposing slightly different sets of parameters than the broker or their siblings.<br />
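The reordering rules above can be summarized in a small sketch; the method names and the representation of a builder's template as a set of parameter names are invented for illustration:<br />

```java
import java.util.HashSet;
import java.util.Set;

public class BuilderSelectionExample {
    // Rule 1: a QueryBuilder is a candidate only if the caller's query covers
    // all of its required parameters
    static boolean isCandidate(Set<String> required, Set<String> queryParams) {
        return queryParams.containsAll(required);
    }

    // Rules 2 and 3: among candidates, prefer the builder covering the largest
    // part of the caller's query; query parameters a builder lacks are discarded
    static int coverage(Set<String> builderParams, Set<String> queryParams) {
        Set<String> covered = new HashSet<>(queryParams);
        covered.retainAll(builderParams);
        return covered.size();
    }

    public static void main(String[] args) {
        Set<String> query = Set.of("searchTerms", "count", "box");
        Set<String> basic = Set.of("searchTerms", "count");        // core template
        Set<String> geo   = Set.of("searchTerms", "count", "box"); // geo-enabled template
        System.out.println(isCandidate(Set.of("startPage"), query)); // false: required parameter not covered
        System.out.println(coverage(geo, query));                    // 3: best match, ordered first
        System.out.println(coverage(basic, query));                  // 2: "box" would be discarded
    }
}
```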
<br />
The most usual case is for the provider's ''OpenSearch Resource'' to be set up with a single transformation specification for the MIME type of interest. Furthermore, most URL Elements provide a single template for each MIME type. This case results in only one ''QueryBuilder'' being available to construct queries, thereby resulting in a degenerate reordering step.<br />
<br />
The functions performed by the operator in order for a set of results to be retrieved, given that the proper ''QueryBuilder'' is selected, are summarized in the simplified diagram of Figure 1.<br />
<br />
As shown, the operator accepts a set of query terms and a set of query parameters.<br />
<br />
The operator's main course of action is to formulate and send queries requesting pages of search results as long as there still are results to be returned and the caller's requirement on the number of results, if present, is not met. A pager component ensures that page switching is performed correctly, managing the relevant standard OpenSearch query parameters, namely <code>startPage</code> or <code>startIndex</code> and <code>count</code>. These parameters are therefore abstracted away by the OpenSearch Operator. <br />
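The paging bookkeeping can be sketched as follows; the class is purely illustrative (cf. the ''resultsPerPage'' configuration parameter) and assumes the 1-based <code>startIndex</code> default of the OpenSearch Specification:<br />

```java
public class PagerExample {
    // startIndex for the next page request, given how many results were already fetched
    static int startIndexFor(int fetched) {
        return 1 + fetched; // OpenSearch startIndex values are 1-based by default
    }

    public static void main(String[] args) {
        int count = 100;  // results requested per page (cf. resultsPerPage, default 100)
        int wanted = 250; // the caller's requirement on the total number of results
        for (int fetched = 0; fetched < wanted; fetched += count) {
            // never request more results than are still needed
            int pageSize = Math.min(count, wanted - fetched);
            System.out.println("startIndex=" + startIndexFor(fetched) + " count=" + pageSize);
        }
    }
}
```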
<br />
In the case of resources which return brokered results, the operator first retrieves the endpoints of the underlying brokered OpenSearch providers and reads their corresponding OpenSearch Resources so as to be able to retrieve the actual search results from them, either sequentially or concurrently. The extraction of brokered provider endpoints is not explicitly shown in the diagram. Furthermore, if an OpenSearch Resource structure is missing for one or more of the brokered services, the operator continues with the retrieval of results from the next available brokered service, ignoring it if it cannot obtain information for it. The same holds if all query formulation attempts for a provider fail.<br />
<br />
====Configurable Parameters====<br />
The ''OpenSearch Operator'' can be programmatically configured by passing to it a special configuration construct upon creation. The configuration parameters are the following:<br />
*The ''resultsPerPage'' parameter instructs the operator as to how many results per page should be requested when no other paging restrictions are in effect. The default value for this parameter currently is ''100''.<br />
*The ''sequentialResults'' parameter disables or enables multi-threaded result retrieval from brokered providers. When enabled, the results are retrieved from each provider sequentially, i.e. the results retrieved from different providers are not intermingled. There is, however, a negative impact on performance. The default value for this configuration parameter is currently ''false''.<br />
*The ''useLocalResource'' parameter, when enabled, permits the operator to operate in the absence of an IS. The ''OpenSearch Resources'' are instead retrieved from the local file system. It is used solely for testing purposes, which is why its default value is, and will remain, ''false''.<br />
<br />
An additional configurable element is the set of mappings from query namespaces to the corresponding factories, as described in [[#Library Extensibility|Library Extensibility]].<br />
The ''sequentialResults'' parameter can also be configured on a per-query basis, by including it in the query string as a query parameter.<br />
<br />
====Query Format====<br />
The ''OpenSearch Operator'' expects to receive all query parameters, including the search terms, in a single query string. All query parameters should be of the form<br />
<br />
<code><br />
<URL-Encoded_Namespace_URI>:<Parameter_Name>="<Parameter_Value>"<br />
</code><br />
<br />
and should be space-delimited.<br />
Note that the presence of a namespace is mandatory for standard OpenSearch parameters as well.<br />
Any free-text parameter value should be URL-encoded.<br />
<br />
The reserved keyword ''config'', when used as a parameter namespace, denotes a configuration parameter. The query configuration parameters under this special namespace include the<br />
''sequentialResults'' parameter described in [[#Configurable Parameters|Configurable Parameters]], plus the ''numOfResults'' parameter, which can be used to impose a limit on the number of retrieved results. Both query configuration parameters are optional.<br />
The following hold for the query configuration parameter values:<br />
*The ''sequentialResults'' parameter should be assigned a value equal to ''true'' or ''false''. Its absence implies the default value of the corresponding configurable parameter of the operator.<br />
*The ''numOfResults'' parameter should be assigned an integer value, its absence implying that all available results should be retrieved.<br />
<br />
Taking everything into account, an example of a legitimate query for the ''OpenSearch Operator'' could be the following:<br />
<br />
<code>http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:searchTerms="Hello+World" config:numOfResults="300"</code><br />
<br />
which instructs the operator to use the string <code>Hello World</code> as the value for the <code>SearchTerms</code> standard OpenSearch parameter and to retrieve up to 300 results from the provider.<br />
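For illustration, the example above could be assembled programmatically as follows. This is a minimal sketch based solely on the format rules stated in this section; the class and method names are hypothetical:<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QueryStringSketch {
    // Builds one <URL-encoded namespace>:<name>="<URL-encoded value>" parameter,
    // following the format rules above. The "config" namespace is a reserved
    // keyword and would be used literally, without URL-encoding.
    static String param(String namespaceUri, String name, String value) {
        return URLEncoder.encode(namespaceUri, StandardCharsets.UTF_8) + ":" + name
             + "=\"" + URLEncoder.encode(value, StandardCharsets.UTF_8) + "\"";
    }

    // Parameters are space-delimited in the final query string.
    static String query(String... params) {
        return String.join(" ", params);
    }
}
```

Calling <code>param("http://a9.com/-/spec/opensearch/1.1/", "searchTerms", "Hello World")</code> reproduces the first parameter of the example query shown above.<br />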
<br />
==The OpenSearch Service==<br />
===Description===<br />
The ''OpenSearch Service'' is a stateful web service responsible for the invocation of the ''OpenSearch Operator'' in the context of the provider to be queried.<br />
It also maintains a cache per provider WS-Resource, which contains the Generic Resources relevant to the top provider, the Generic Resources of all<br />
previously queried brokered providers and the corresponding Description Documents.<br />
<br />
<br />
===Deployment Instructions===<br />
<br />
In order to deploy and run the OpenSearch Service on a node we will need the following:<br />
* opensearchdatasource-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
<br />
===WS and Generic Resource Interrelation===<br />
Provided that a Collection for the provider to be queried is available, the ''OpenSearch Service'' uses a WS-Resource for each OpenSearch provider in order to bind the Service to the Collection corresponding to the provider.<br />
<br />
An ''OpenSearch Service'' WS-Resource contains the following properties:<br />
*The ''AdaptorID'', which is unique for every WS-Resource and is used for referencing the right WS-Resource when querying (it is optional).<br />
*The ''CollectionID'' of the collection to be used.<br />
*The ID of the Generic Resource of the top-provider, where the top-provider is the broker in the brokered case or the one and only direct provider in the direct case.<br />
*The URI of the Description Document (''DescriptionDocumentURI'') of the top-provider.<br />
*A set of fields (presentables and searchables) as extracted from the ''OpenSearchGenericResource''.<br />
*A set of ''FixedParameters'', which are used in every invocation of the Operator. See also [[#Extensibility Points|Extensibility Points]].<br />
<br />
As mentioned above, the WS-Resource contains a reference only to the ''OpenSearchResource'' of the top-provider. The Generic Resources of any providers reached through a broker are retrieved through an Information System implementation and are therefore not directly referenced by the WS-Resource.<br />
<br />
Some properties are dependent on information residing in the Generic Resources describing the providers and should, therefore, be updated accordingly when these Generic Resources are modified. Examples of such properties include the Description Document URI and the templates.<br />
<br />
On WS-Resource creation, only the Metadata Collection ID and the ID of the ''OpenSearchResource'' of the top-provider need to be supplied to the ''create'' operation of the service's factory. All other properties are created internally by the service itself.<br />
<br />
===Resource Caching===<br />
For performance and reliability reasons, the ''OpenSearch Service'' maintains one cache per WS-Resource which initially contains the Generic Resource (''OpenSearchResource'') and the Description Document of the top-provider. In the brokered case, the cache is updated at run-time by the relevant ''OpenSearch Operator'' module with the Generic Resources and Description Documents of all providers reached through the broker.<br />
<br />
To account for potential updates of Description Documents and/or the Generic Resources, the cache can be refreshed either on demand or periodically, based on a configurable time interval. The periodic refresh operation can be disabled if the Description Documents and the Generic Resource configuration is considered to be stable enough.<br />
<br />
The cache refresh cycle policy used is described as follows:<br />
*The Description Document and Generic Resources of the top-provider are discarded and their updated versions are retrieved from the external site and the Information System respectively. All dependent WS-Resource properties are also updated accordingly.<br />
*The Description Documents and Generic Resources of all brokered providers, if any, are discarded. Because of the potentially large number of brokered providers, the cache is not repopulated with their updated versions to avoid locking the cache for large periods of time or using a partially updated cache. These pieces of information are re-cached at run-time instead.<br />
*In the event of failure, the previously cached version is kept.<br />
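The refresh policy above can be sketched as follows. The ''CacheRefreshSketch'' class and its ''Fetcher'' interface are hypothetical illustrations of the policy, not the service's actual cache implementation:<br />

```java
import java.util.Map;

public class CacheRefreshSketch {
    interface Fetcher {
        // Hypothetical fetcher for an updated document; returns null on failure.
        String fetchLatest(String key);
    }

    // Sketch of the refresh cycle described above: the top-provider entry is
    // re-fetched (keeping the previous version on failure), while all brokered
    // entries are simply dropped, to be re-cached lazily at run-time.
    static void refresh(Map<String, String> cache, String topProviderKey, Fetcher fetcher) {
        String updated = fetcher.fetchLatest(topProviderKey);
        if (updated != null)
            cache.put(topProviderKey, updated); // replace with the updated version
        // on failure (updated == null) the previously cached version is kept
        cache.keySet().removeIf(k -> !k.equals(topProviderKey)); // drop brokered entries
    }
}
```

Dropping rather than eagerly re-fetching the brokered entries reflects the rationale stated above: it avoids locking the cache for long periods or serving a partially updated cache.<br />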
<br />
===Operations===<br />
The operations exposed by the OpenSearch Service are the following:<br />
*The ''query'' operation, with a single input message containing the query string to be sent to the operator, whose format is described in [[#Query Format|Query Format]].<br />
*The ''refreshCache'' operation, which sends a request in order to force the cache of the service to be refreshed. No refresh cycle will be initiated if a periodic refresh cycle is currently in progress.<br />
<br />
===Configurable Parameters===<br />
The Service currently supports three configurable parameters, which are exposed through its deployment descriptor:<br />
*The ''clearCacheOnStartup'' parameter, of ''boolean'' type, when enabled instructs the service to discard the stored cache on startup.<br />
*The ''cacheRefreshIntervalMillis'' parameter, of integral type, defines a time interval for the periodic cache refresh operation. The value ''0'' can be used to disable periodic cache refresh cycles.<br />
* The ''openSearchLibraryFactories'' parameter, of ''string'' type, is used to supply the ''OpenSearch Core Library'' with the factory mappings for all namespaces for which there exists an implementation of a library extension. For more information on the mappings, see also the section referring to the [[#Extensibility Mechanism|extensibility mechanism]] of the library.<br />
<br />
The ''openSearchLibraryFactories'' parameter is encoded as a sequence of mappings from strings to pairs, where each mapping is enclosed in braces, association is denoted by the ''='' sign and each pair is enclosed in parentheses. For example, given that there are implementations for core functionality, Geo and Time extensions, the value of this configuration parameter could be the following:<br />
<br />
<code><br />
[<nowiki>http://a9.com/-/spec/opensearch/1.1/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.BasicURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.BasicQueryElementFactory)]<br />
[&lt;nowiki&gt;http://a9.com/-/opensearch/extensions/geo/1.0/&lt;/nowiki&gt;=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.geo.GeoURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.geo.GeoQueryElementFactory)]<br />
[<nowiki>http://a9.com/-/opensearch/extensions/time/1.0/</nowiki>=(org.gcube.opensearch.opensearchlibrary.urlelements.extensions.time.TimeURLElementFactory,org.gcube.opensearch.opensearchlibrary.queryelements.extensions.time.TimeQueryElementFactory)]<br />
</code><br />
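A sketch of one way such a value could be parsed is given below. This illustrates only the encoding described above (braces around each mapping, ''='' for association, a parenthesized pair of factory class names) and is not the parser actually used by the service:<br />

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FactoryMappingSketch {
    // Parses the "[namespace=(urlElementFactory,queryElementFactory)]" encoding
    // into a map from namespace URI to the pair of factory class names.
    static Map<String, String[]> parse(String config) {
        Map<String, String[]> mappings = new LinkedHashMap<String, String[]>();
        // One bracketed mapping: [ <namespace> = ( <factory1> , <factory2> ) ]
        Matcher m = Pattern.compile("\\[([^=\\]]+)=\\(([^,)]+),([^)]+)\\)\\]").matcher(config);
        while (m.find())
            mappings.put(m.group(1), new String[] { m.group(2), m.group(3) });
        return mappings;
    }
}
```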
<br />
==The OpenSearchDataSource Client Library==<br />
In this section some examples of usage of the ''OpenSearchDataSource Client Library'' are provided.<br />
<br />
'''Query''' example:<br />
<source lang="java"><br />
final String scope = "/gcube/devNext";<br />
final String adaptorID = "deb7b8a0-f54f-11e2-8f30-fcb98a083f0c";<br />
final String query =<br />
"((((gDocCollectionID == \"8afa8050-7b7c-11e2-b8a4-e3f7b403b9a5\") and (gDocCollectionLang == \"en\"))) and (b3f89be0-0870-4d96-917c-cabf3a1e30b1 = \"tuna\")) project 4e52c3ad-b891-403c-a6fa-26930e6262fb f4956f3f-9f35-4f16-99c3-76976643fe6b b91e2b65-77d8-4a98-be20-cd8da3f5e95d e4988234-5ee1-4286-bbe5-e3cb1eeb9742 9c2d7794-8faf-471d-9ddf-eee05f56d795 20d533c6-167a-46fc-b564-d1d08d435364 ";<br />
<br />
<br />
ScopeProvider.instance.set(scope);<br />
StatefulQuery q = OpenSearchDataSourceDSL.getSource().withAdaptorID(adaptorID).build();<br />
List<EndpointReference> refs = q.fire();<br />
OpenSearchDataSourceCLProxyI proxyRandom = OpenSearchDataSourceDSL.getOpenSearchDataSourceProxyBuilder().at((W3CEndpointReference)refs.get(0)).build();<br />
Stream<GenericRecord> records = proxyRandom.search(query);<br />
</source><br />
<br />
'''Create Resource''' method example:<br />
<source lang="java"><br />
void createOpenSearchResource(String scope, List<String> fieldParameters, List<String> fixedParameters, String collectionID, String openSearchResourceID, String resourceToAppend) throws Exception {<br />
ScopeProvider.instance.set(scope);<br />
OpenSearchDataSourceFactoryCLProxyI proxyRandomf = OpenSearchDataSourceFactoryDSL.getOpenSearchDataSourceFactoryProxyBuilder().build();<br />
CreateResourceParams createResource = new CreateResourceParams();<br />
<br />
try {<br />
List<String> fieldParamsArray = new ArrayList<String>();<br />
for (int i=0; i<fieldParameters.size(); i++) {<br />
fieldParamsArray.add(collectionID + ":" + fieldParameters.get(i));<br />
}<br />
<br />
StringArray fieldsArray = new StringArray();<br />
// Set the field parameters<br />
if (fieldParamsArray != null && !fieldParamsArray.isEmpty())<br />
fieldsArray.array = fieldParamsArray;<br />
else<br />
fieldsArray.array = null;<br />
<br />
createResource.Fields = fieldsArray;<br />
createResource.resourceKey = resourceToAppend;<br />
<br />
Provider p = new Provider();<br />
p.collectionID = collectionID;<br />
p.OpenSearchResourceID = openSearchResourceID;<br />
StringArray paramFixedParams = new StringArray();<br />
if (fixedParameters != null)<br />
paramFixedParams.array = fixedParameters;<br />
else<br />
paramFixedParams.array = null;<br />
p.fixedParameters = paramFixedParams;<br />
<br />
List<Provider> providers = new ArrayList<Provider>();<br />
providers.add(p);<br />
createResource.Providers = providers;<br />
<br />
proxyRandomf.createResource(createResource); <br />
} catch (OpenSearchDataSourceException e) {<br />
throw new Exception("Failed to create the open search resource", e);<br />
}<br />
}<br />
</source><br />
<br />
==External Links==<br />
Some useful external links for further reading are provided here:<br />
*[http://www.opensearch.org/Home OpenSearch Home]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_5 OpenSearch Specification]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_response_elements OpenSearch Response Elements]<br />
*[http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document OpenSearch Description Document ]</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21689Index Management Framework2014-06-11T12:42:42Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, &gt;, &gt;=, &lt;=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used by any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
<br />
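As a sketch, the query URL above could be assembled as follows before being issued with any HTTP client. The helper below is hypothetical and assumes only the URL layout shown above; the CQL query must be URL-encoded when placed in the ''queryString'' parameter:<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class IndexQueryUrlSketch {
    // Assembles the query URL shown above; the service path
    // (index-service-1.0.0-SNAPSHOT) is taken from the example.
    static String buildQueryUrl(String host, String resourceID, String cql) {
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/" + resourceID
             + "/query?queryString=" + URLEncoder.encode(cql, StandardCharsets.UTF_8);
    }
}
```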
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and update operations on the node. It consolidates the three services that were used in the old Full Text Index.<br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) in each container that makes up the cluster.<br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file.<br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (as it was in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the content matching the query in the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index is stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder in which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy.properties''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
The Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
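The transformation rules in the table above can be sketched as a simple translation function for a single triple. The code below is illustrative only, covering the relations listed in the table; it is not the actual gCube CQL-to-Lucene translator:<br />

```java
public class CqlToLuceneSketch {
    // Translates one CQL Index-Relation-Term triple to the Lucene syntax
    // shown in the table above.
    static String translate(String index, String relation, String term) {
        if (relation.equals("adj"))
            return index + ":\"" + term + "\"";              // phrase query
        if (relation.equals("fuzzy"))
            return index + ":" + term + "~";                 // fuzzy query
        if (relation.equals("proximity")) {
            // term is "<distance> <words...>", e.g. "5 sun up"
            int sp = term.indexOf(' ');
            return index + ":\"" + term.substring(sp + 1) + "\"~" + term.substring(0, sp);
        }
        if (relation.equals("within")) {
            String[] bounds = term.split(" ");               // "2005 2008"
            return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
        }
        return index + ":" + term;                           // '=' and similar relations
    }
}
```

A complete CQL query would then be handled by translating each triple this way and connecting the results with AND, OR and NOT, as described above.<br />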
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
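A minimal sketch of emitting such a ROWSET programmatically is shown below. The helper names are hypothetical, and field names and values are assumed to be already XML-escaped; the mandatory ObjectID field and the idxType/colID attributes follow the rules stated above:<br />

```java
public class RowsetSketch {
    // Emits one ROW element; ObjectID is mandatory for every document.
    static String row(String objectID, String[][] fields) {
        StringBuilder sb = new StringBuilder("<ROW><FIELD name=\"ObjectID\">")
            .append(objectID).append("</FIELD>");
        for (String[] f : fields)
            sb.append("<FIELD name=\"").append(f[0]).append("\">")
              .append(f[1]).append("</FIELD>");
        return sb.append("</ROW>").toString();
    }

    // Emits the enclosing ROWSET element; idxType and colID are required,
    // lang is optional and may be passed as null.
    static String rowset(String idxType, String colID, String lang, String... rows) {
        StringBuilder sb = new StringBuilder("<ROWSET idxType=\"").append(idxType)
            .append("\" colID=\"").append(colID).append("\"");
        if (lang != null) sb.append(" lang=\"").append(lang).append("\"");
        sb.append(">");
        for (String r : rows) sb.append(r);
        return sb.append("</ROWSET>").toString();
    }
}
```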
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run Index Service on a node we will need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.<br />
<br />
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and the IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource(&quot;myClusterID&quot;, scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT equivalently.<br />
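As an illustration, the triple-by-triple mapping in the table above can be sketched in plain Java. This is only a sketch: the CqlToLucene class and its transform method are hypothetical helper names, not part of the service API.<br />

```java
// Hypothetical sketch: mapping a CQL Index-Relation-Term triple to the
// equivalent Lucene query syntax, following the table above.
public class CqlToLucene {

    // Transforms a single CQL triple into a Lucene query string.
    public static String transform(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy term query
                return index + ":" + term + "~";
            case "within":    // range query; term is "low high"
                String[] range = term.split("\\s+");
                return index + ":[" + range[0] + " TO " + range[1] + "]";
            case "proximity": // term is "distance word1 word2 ..."
                String[] parts = term.split("\\s+", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            default:          // plain equality match
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("title", "adj", "sun is up"));      // title:"sun is up"
        System.out.println(transform("date", "within", "2005 2008"));    // date:[2005 TO 2008]
        System.out.println(transform("title", "proximity", "5 sun up")); // title:"sun up"~5
    }
}
```

A full transformer would then join the per-triple results with AND, OR and NOT, as described above.<br />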
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that each document must contain an "ObjectID" field that specifies its unique identifier.<br />
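A ROWSET like the one above can be assembled programmatically before feeding. The following sketch is illustrative only; the RowSetBuilder class is a hypothetical helper, not part of the service API.<br />

```java
// Hypothetical helper sketching how a single-ROW ROWSET document could be
// assembled; element and attribute names follow the schema shown above.
import java.util.LinkedHashMap;
import java.util.Map;

public class RowSetBuilder {

    // Builds a one-document ROWSET; the required "ObjectID" field is enforced.
    public static String buildRowSet(String idxType, String colID, String lang,
                                     Map<String, String> fields) {
        if (!fields.containsKey("ObjectID")) {
            throw new IllegalArgumentException("ObjectID field is required");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colID)
          .append("\" lang=\"").append(lang).append("\">\n");
        sb.append("  <ROW>\n");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append("    <FIELD name=\"").append(f.getKey()).append("\">")
              .append(f.getValue()).append("</FIELD>\n");
        }
        sb.append("  </ROW>\n</ROWSET>");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("ObjectID", "doc1");
        fields.put("title", "How to create an Index");
        fields.put("contents", "Just read the WIKI");
        System.out.println(buildRowSet("IndexTypeName", "colA", "en", fields));
    }
}
```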
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "return" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have their "store" element set to "yes" can have their "return" property altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see "install a linguistics license" below).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
 Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified name of the plugin class to be loaded and instantiated.<br />
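The reflective loading described above can be sketched as follows. The interface and plugin class here are illustrative stand-ins for the real LanguageIdPlugin classes; the sketch uses getDeclaredConstructor().newInstance() in place of the deprecated Class.newInstance().<br />

```java
// Minimal sketch of the factory mechanism described above: a plugin class is
// loaded by its fully qualified name via reflection. All names here are
// illustrative stand-ins for the real LanguageIdPlugin hierarchy.
public class PluginFactoryDemo {

    // Stand-in for the LanguageIdPlugin base class/interface.
    public interface LanguageIdPlugin {
        String identifyLanguage(String text);
    }

    // Stand-in for the dummy plugin that always returns "nolang".
    public static class DummyLangidPlugin implements LanguageIdPlugin {
        public String identifyLanguage(String text) { return "nolang"; }
    }

    // Loads and instantiates the plugin named by pluginName, as the factories do.
    public static LanguageIdPlugin load(String pluginName) throws Exception {
        return (LanguageIdPlugin) Class.forName(pluginName)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        LanguageIdPlugin plugin = load("PluginFactoryDemo$DummyLangidPlugin");
        System.out.println(plugin.identifyLanguage("some text")); // prints "nolang"
    }
}
```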
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created are:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization (language identification) tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library, and when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may specify the language per field. If present, this language is used when indexing the document; in this case the language id module is not used. <br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
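The query forms listed above can be composed with a small helper. This is a sketch only: the LangQuery class and its method names are hypothetical; only the _querylang_ and _lang_ prefixes come from the text above.<br />

```java
// Hypothetical helpers for composing the language-aware query forms
// described above.
public class LangQuery {

    // Restricts every term of the query to the given language.
    public static String queryLang(String lang, String query) {
        return "_querylang_" + lang + ": " + query;
    }

    // Restricts a single term (optionally field-qualified) to the given language.
    public static String termLang(String lang, String field, String term) {
        return field == null
                ? "_lang_" + lang + ":" + term
                : "_lang_" + lang + "_" + field + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(queryLang("en", "car OR bus OR plane"));            // _querylang_en: car OR bus OR plane
        System.out.println("car OR " + termLang("en", null, "bus") + " OR plane");
        System.out.println("car OR " + termLang("en", "title", "bus") + " OR plane");
    }
}
```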
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available in addition to the dummy plugin that always returns "" (empty string).<br />
<br />
The plugin implementation is selected from the following when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used. The terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library, and when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the env variable GLOBUS_LOCATION. If this path is wrong, the Java VM will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
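Query-side expansion as described can be sketched as follows. The tiny in-memory dictionary stands in for the .aut lemmatization resources, and all class and method names are illustrative.<br />

```java
// Illustrative sketch of query-side expansion: a query term is replaced by
// an OR of all its known word forms. The in-memory map stands in for the
// .aut lemmatization resources.
import java.util.List;
import java.util.Map;

public class ExpansionDemo {

    private static final Map<String, List<String>> FORMS = Map.of(
            "car", List.of("car", "cars"),
            "run", List.of("run", "runs", "running", "ran"));

    // Expands a single query term into "(form1 OR form2 OR ...)";
    // unknown terms are left as-is.
    public static String expand(String term) {
        List<String> forms = FORMS.getOrDefault(term, List.of(term));
        return "(" + String.join(" OR ", forms) + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("car")); // (car OR cars)
        System.out.println(expand("sky")); // (sky)
    }
}
```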
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
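The per-collection, per-language partitioning described above can be sketched with a plain map keyed by collection and language. This is only an illustrative assumption about the layout; a bare Object stands in for the real R-tree structure.<br />

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: one spatial structure per (collection, language)
// pair, so a query naming both touches exactly one tree. A bare Object
// stands in for the real R-tree here.
public class PartitionDemo {
    static final Map<String, Object> trees = new HashMap<>();

    static Object treeFor(String colID, String lang) {
        // One tree per composite key; created lazily on first use.
        return trees.computeIfAbsent(colID + "|" + lang, k -> new Object());
    }

    public static void main(String[] args) {
        Object enTree = treeFor("colA", "en");
        Object frTree = treeFor("colA", "fr");
        System.out.println(enTree == treeFor("colA", "en")); // same tree reused
        System.out.println(enTree == frTree);                // separate tree per language
    }
}
```

A query that names both a collection and a language can then be routed to a single tree instead of scanning all of them.<br />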
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results and a ranker for computing a score for each result. All of the modifiers are optional. Let's see the following example in order to understand better the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output for this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the higher score must be placed at the end). There are three inclusion types. 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
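As a rough illustration of the inclusion modifier, the following sketch uses java.awt.geom as a stand-in geometry library (an assumption; the index's actual geometry code is not shown on this page) to contrast inclusion 0 ("intersects") with inclusion 1 ("contains"):<br />

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

public class InclusionDemo {
    // Query polygon from the example above: the square with corners
    // (1,1), (1,10), (10,10), (10,1).
    static Path2D.Double queryPolygon() {
        Path2D.Double q = new Path2D.Double();
        q.moveTo(1, 1);
        q.lineTo(1, 10);
        q.lineTo(10, 10);
        q.lineTo(10, 1);
        q.closePath();
        return q;
    }

    // inclusion=0 keeps any overlap; inclusion=1 requires the document's
    // bounding box to lie fully inside the query polygon.
    static boolean matches(Rectangle2D docMbr, int inclusion) {
        Path2D.Double q = queryPolygon();
        if (inclusion == 0) return q.intersects(docMbr);
        return q.contains(docMbr);
    }

    public static void main(String[] args) {
        Rectangle2D partial = new Rectangle2D.Double(8, 8, 5, 5); // sticks out of the square
        Rectangle2D inside  = new Rectangle2D.Double(2, 2, 3, 3); // fully inside
        System.out.println(matches(partial, 0)); // true
        System.out.println(matches(partial, 1)); // false
        System.out.println(matches(inside, 1));  // true
    }
}
```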
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that cannot produce any results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
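The cut-off reasoning above can be sketched as a small predicate. The method name and the use of java.awt.geom.Area are assumptions made for illustration; this is not the service's actual code.<br />

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Sketch of the cut-off rule for a conjunction of two geosearch
// criteria: if the collections differ, or the query regions do not
// intersect, the subquery can never produce a result.
public class CutOffDemo {
    static boolean canProduceResults(String colA, Area regionA,
                                     String colB, Area regionB) {
        if (!colA.equals(colB)) {
            return false;              // a document belongs to exactly one collection
        }
        Area overlap = new Area(regionA);
        overlap.intersect(regionB);    // destructive intersection
        return !overlap.isEmpty();     // disjoint regions admit no document
    }

    public static void main(String[] args) {
        Area p1 = new Area(new Rectangle2D.Double(0, 0, 10, 10));
        Area p2 = new Area(new Rectangle2D.Double(5, 5, 10, 10));
        Area p3 = new Area(new Rectangle2D.Double(100, 100, 5, 5));

        System.out.println(canProduceResults("colA", p1, "colA", p2)); // true
        System.out.println(canProduceResults("colA", p1, "colB", p2)); // false: different collections
        System.out.println(canProduceResults("colA", p1, "colA", p3)); // false: disjoint polygons
    }
}
```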
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
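The union performed by the MergeSorter can be pictured as an ordinary merge of per-R-tree result streams that are already sorted by score. The sketch below uses a hypothetical merge helper over plain score lists and omits flow control and worker synchronization entirely.<br />

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeSorterDemo {
    // Merge two score streams that are each already sorted in
    // descending order, emitting the globally best score first.
    static List<Double> merge(List<Double> a, List<Double> b) {
        List<Double> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i) >= b.get(j)) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        // Results of two independent R-tree operations, ordered by score.
        System.out.println(merge(Arrays.asList(0.9, 0.4), Arrays.asList(0.7, 0.1)));
        // prints [0.9, 0.7, 0.4, 0.1]
    }
}
```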
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory, while x2 is set equal to x1 if not provided) as well as start and end Y coordinates (y1 is mandatory, while y2 is set equal to y1 if not provided), and any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
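A sketch of how a consumer could read such a ROWSET and apply the defaulting rules for x2 and y2. The class and method names are illustrative, not part of the service.<br />

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class GeoRowSetDemo {
    // Returns id to {x1, x2, y1, y2}, applying the defaulting rule
    // x2 = x1 and y2 = y1 when the attribute is absent.
    static Map<String, double[]> parse(String rowset) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowset.getBytes("UTF-8")));
        NodeList rows = doc.getElementsByTagName("ROW");
        Map<String, double[]> out = new LinkedHashMap<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            double x1 = Double.parseDouble(row.getAttribute("x1"));
            double y1 = Double.parseDouble(row.getAttribute("y1"));
            double x2 = row.hasAttribute("x2") ? Double.parseDouble(row.getAttribute("x2")) : x1;
            double y2 = row.hasAttribute("y2") ? Double.parseDouble(row.getAttribute("y2")) : y1;
            out.put(row.getAttribute("id"), new double[]{x1, x2, y1, y2});
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String rowset = "<ROWSET colID=\"colA\" lang=\"en\">"
                + "<ROW id=\"doc1\" x1=\"4321\" y1=\"1234\"/>"
                + "<ROW id=\"doc2\" x1=\"1337\" x2=\"4123\" y1=\"1337\" y2=\"6534\"/>"
                + "</ROWSET>";
        for (Map.Entry<String, double[]> e : parse(rowset).entrySet()) {
            System.out.println(e.getKey() + " x:[" + e.getValue()[0] + "," + e.getValue()[1]
                    + "] y:[" + e.getValue()[2] + "," + e.getValue()[3] + "]");
        }
    }
}
```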
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list enumerating all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of roughly 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
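Only the year being mandatory in the DATE format suggests a normalization step before parsing. The sketch below shows one possible approach, padding a partial value from a default template; the template and method are assumptions, not the index's actual implementation.<br />

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateFieldDemo {
    // Default components used to complete a partial DATE value
    // (illustrative assumption).
    private static final String TEMPLATE = "0001-01-01T00:00:00.000";

    static Date parseDateField(String value) throws ParseException {
        // Complete the value with defaults taken from the template.
        String full = value + TEMPLATE.substring(value.length());
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        fmt.setLenient(false);
        return fmt.parse(full);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseDateField("2001"));        // year only
        System.out.println(parseDateField("2001-05-27"));  // date only
        System.out.println(parseDateField("2001-05-27T14:35:25.523"));
    }
}
```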
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. Firstly, a very efficient filtering step collects all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object is within a query or not.<br />
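The two-step processing can be sketched as follows; the query method and the use of java.awt.geom are illustrative stand-ins for the R-tree filtering step and the Refiner plugin.<br />

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Sketch of two-step query processing: a cheap filtering step keeps
// every entry whose MBR intersects the MBR of the query region (this
// may include false hits), and a refinement step re-checks candidates
// against the exact query region. Names here are illustrative.
public class TwoStepDemo {
    static List<Rectangle2D> query(Path2D queryRegion, List<Rectangle2D> entries) {
        Rectangle2D queryMbr = queryRegion.getBounds2D();
        List<Rectangle2D> candidates = new ArrayList<>();
        for (Rectangle2D mbr : entries) {          // filtering step (R-tree in the real index)
            if (queryMbr.intersects(mbr)) {
                candidates.add(mbr);
            }
        }
        List<Rectangle2D> hits = new ArrayList<>();
        for (Rectangle2D mbr : candidates) {       // refinement step (plugin in the real index)
            if (queryRegion.intersects(mbr)) {
                hits.add(mbr);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A triangular query region whose MBR is the square [0,10] x [0,10].
        Path2D.Double tri = new Path2D.Double();
        tri.moveTo(0, 0); tri.lineTo(10, 0); tri.lineTo(0, 10); tri.closePath();

        List<Rectangle2D> entries = new ArrayList<>();
        entries.add(new Rectangle2D.Double(1, 1, 2, 2));   // true hit
        entries.add(new Rectangle2D.Double(8, 8, 1, 1));   // false hit: inside MBR, outside triangle
        entries.add(new Rectangle2D.Double(50, 50, 1, 1)); // filtered out immediately

        System.out.println(query(tri, entries).size()); // 1
    }
}
```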
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1);<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Refiner|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a period longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, however we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource( geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design is similar to the Full Text Index Service design. The forward index supports the following schema for each key-value pair:<br />
key: integer, value: string<br />
key: float, value: string<br />
key: string, value: string<br />
key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to be able to build the indices with correct type for each field. The Objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; instead, it is best to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, allows the number of replicas to be configured for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server, and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately the first time after Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21688Index Management Framework2014-06-11T12:39:13Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used by any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
<br />
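As a sketch of how such a call can be assembled, the snippet below URL-encodes a CQL query and builds the lookup URL shown above; the host, resource ID and query string used here are placeholder values for illustration, not real infrastructure identifiers.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class IndexQueryUrl {

    // Builds the lookup URL of the Index Service REST API shown above.
    // The CQL query is URL-encoded so it can travel as a query parameter.
    public static String build(String host, String resourceId, String cql) {
        String encoded = URLEncoder.encode(cql, StandardCharsets.UTF_8);
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/"
                + resourceId + "/query?queryString=" + encoded;
    }

    public static void main(String[] args) {
        // Placeholder host and resource ID, for illustration only
        String url = build("localhost:8080", "my-resource-id", "(title = tuna)");
        System.out.println(url);
    }
}
```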
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; instead, it is best to have one resource (one node) on each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', the underlying technology of the new Index Service, allows the number of replicas and shards to be configured for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old one). If highlighting is enabled, the index returns a snippet for each result showing where the query matched the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
To configure whether to use the Resource Registry (for translating field IDs to field names), change the value of the variable ''useRRAdaptor'' in ''deploy.properties''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
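As a minimal sketch, the ''deploy.properties'' entries above can be read with the standard java.util.Properties parser; this is an illustration under assumed values, not the service's actual configuration loader.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DeployConfig {

    // Parses a deploy.properties fragment in the standard
    // key=value format used by the examples above.
    public static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) { // cannot happen for a StringReader
            throw new RuntimeException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        String text = "defaultSameCluster=true\n"
                + "noReplicas=1\nnoShards=2\n"
                + "hostname=localhost\nport=8080\nscope=/gcube/devNext\n";
        Properties cfg = load(text);
        // clusterID follows the rule described earlier: the scope when the
        // whole scope shares one cluster, the indexID otherwise
        boolean sameCluster = Boolean.parseBoolean(cfg.getProperty("defaultSameCluster"));
        String clusterId = sameCluster ? cfg.getProperty("scope") : "some-index-id";
        System.out.println(clusterId + " shards=" + cfg.getProperty("noShards"));
    }
}
```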
<br />
===CQL capabilities implementation===<br />
The Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR and NOT equivalently.<br />
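The triple translation in the table above can be sketched as follows; this is an illustrative mapping only, not the service's actual implementation.

```java
public class CqlToLucene {

    // Translates a single CQL (index, relation, term) triple to its
    // Lucene equivalent, following the table above. Relations not
    // listed here fall through to a plain field:term query.
    public static String triple(String index, String relation, String term) {
        switch (relation) {
            case "adj":   // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy": // fuzzy match on a single word
                return index + ":" + term + "~";
            case "proximity": {
                // CQL proximity term is "<distance> <words...>"
                int sp = term.indexOf(' ');
                String distance = term.substring(0, sp);
                String words = term.substring(sp + 1);
                return index + ":\"" + words + "\"~" + distance;
            }
            case "within": {
                // CQL within term is "<lower> <upper>"
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:      // '=', '==' and the comparison relations
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(triple("title", "adj", "sun is up"));
        System.out.println(triple("date", "within", "2005 2008"));
    }
}
```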
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) consists of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify, respectively, the Index Type with which the Index was created and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field that specifies its unique identifier.<br />
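A ROWSET like the example above can be assembled with simple string building; in the sketch below the idxType, colID and field values are illustrative placeholders only.

```java
public class RowsetBuilder {

    // Builds one <ROW> with the required ObjectID field plus
    // a title and contents field, mirroring the example above.
    public static String row(String objectId, String title, String contents) {
        return "  <ROW>\n"
             + "    <FIELD name=\"ObjectID\">" + objectId + "</FIELD>\n"
             + "    <FIELD name=\"title\">" + title + "</FIELD>\n"
             + "    <FIELD name=\"contents\">" + contents + "</FIELD>\n"
             + "  </ROW>\n";
    }

    // Wraps already-built rows in a <ROWSET> element with the
    // required idxType and colID attributes and the optional lang.
    public static String rowset(String idxType, String colId, String lang, String rows) {
        return "<ROWSET idxType=\"" + idxType + "\" colID=\"" + colId
             + "\" lang=\"" + lang + "\">\n" + rows + "</ROWSET>";
    }

    public static void main(String[] args) {
        System.out.println(rowset("MyIndexType", "colA", "en",
                row("doc1", "How to create an Index", "Just read the WIKI")));
    }
}
```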
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType: an XML document conforming to the IndexType schema. An IndexType contains a field list of all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each field should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled; apart from "boost", they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node, we need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and make it discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
The hostname of the node, as well as the port and the scope the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:<br />
<br />
<pre><br />
clientMode=false<br />
</pre><br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) may contain any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify, respectively, the Index Type that the Index must have been created with and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that every document must contain an "ObjectID" field, which specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled; apart from "boost", they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, i.e. in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle", in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and for the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see how to install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factory uses the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may contain the language per field. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin. <br />
Since language is assigned at the collection level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' must therefore be include in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for a new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType));<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
FullTextIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new FullTextIndexUpdaterFactoryServiceAddressingLocator();<br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
FullTextIndexUpdaterServiceAddressingLocator updaterInstanceLocator = new FullTextIndexUpdaterServiceAddressingLocator();<br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
FullTextIndexLookupServiceAddressingLocator lookupInstanceLocator = new FullTextIndexLookupServiceAddressingLocator();<br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the higher scores must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections provide the details for the refiners and rankers.<br />
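The inclusion types boil down to an intersection test and the two directions of a containment test. The following is a minimal, self-contained sketch of these spatial predicates over axis-aligned bounding rectangles (the real index evaluates them against arbitrary polygons through Geotools; the rectangle simplification and all class and method names here are ours):<br />

```java
import java.awt.geom.Rectangle2D;

public class InclusionDemo {

    // inclusion type 0: the document region intersects the query region
    static boolean intersects(Rectangle2D doc, Rectangle2D query) {
        return doc.intersects(query);
    }

    // containment of the document region in the query region
    static boolean docInQuery(Rectangle2D doc, Rectangle2D query) {
        return query.contains(doc);
    }

    // containment of the query region in the document region
    static boolean queryInDoc(Rectangle2D doc, Rectangle2D query) {
        return doc.contains(query);
    }

    public static void main(String[] args) {
        // the query polygon (1,1) (1,10) (10,1) (10,10) from the example above, as an MBR
        Rectangle2D query = new Rectangle2D.Double(1, 1, 9, 9);
        Rectangle2D doc = new Rectangle2D.Double(5, 5, 2, 2);
        System.out.println(intersects(doc, query));  // true
        System.out.println(docInQuery(doc, query));  // true
        System.out.println(queryInDoc(doc, query));  // false
    }
}
```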
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that would produce zero results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect then there is no area in which the documents should be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
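The cut-off rule above can be sketched in a few lines. The Criterion class and method names below are hypothetical, and query regions are reduced to their MBRs for simplicity:<br />

```java
import java.awt.geom.Rectangle2D;

public class CutOffDemo {

    // A sketch of one geosearch criterion: a collection ID plus the query region's MBR
    static class Criterion {
        final String collectionID;
        final Rectangle2D mbr;
        Criterion(String collectionID, Rectangle2D mbr) {
            this.collectionID = collectionID;
            this.mbr = mbr;
        }
    }

    // A conjunction of two containment criteria can be eliminated ("cut off")
    // when it provably has zero results: either the collections differ
    // (each document belongs to exactly one collection), or the two query
    // regions do not intersect (no common area for a document to lie in).
    static boolean canEliminate(Criterion a, Criterion b) {
        if (!a.collectionID.equals(b.collectionID)) {
            return true;
        }
        return !a.mbr.intersects(b.mbr);
    }
}
```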
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
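The union performed by the MergeSorter can be pictured as a k-way merge of score-sorted streams, one per R-tree operation. The sketch below shows the core idea with a priority queue; all class names are ours, and the flow control the real component performs is omitted:<br />

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSorterSketch {

    // One scored result produced by a single R-tree operation
    static class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    // A cursor into one stream, already sorted by descending score
    static class Cursor {
        final List<Hit> stream;
        int pos = 0;
        Cursor(List<Hit> stream) { this.stream = stream; }
        Hit head() { return stream.get(pos); }
    }

    // Merge k score-sorted streams into one globally score-sorted list:
    // repeatedly take the stream whose head has the highest score
    static List<Hit> merge(List<List<Hit>> streams) {
        PriorityQueue<Cursor> heads = new PriorityQueue<>(
                Comparator.comparingDouble((Cursor c) -> -c.head().score));
        for (List<Hit> s : streams) {
            if (!s.isEmpty()) heads.add(new Cursor(s));
        }
        List<Hit> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            Cursor best = heads.poll();
            merged.add(best.head());
            best.pos++;
            if (best.pos < best.stream.size()) heads.add(best);
        }
        return merged;
    }
}
```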
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, which is mandatory, and x2, which is set equal to x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided). In addition, a ROW may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
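The x2/y2 defaulting rule can be sketched with the standard DOM parser. The helper below is ours and only illustrates how the schema defaults (x2 falls back to x1, y2 falls back to y1) would be applied when reading a ROW:<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RowSetBoundsDemo {

    // Parse a ROWSET document and return its first ROW element
    static Element firstRow(String rowsetXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rowsetXml.getBytes(StandardCharsets.UTF_8)));
            return (Element) doc.getElementsByTagName("ROW").item(0);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Extract {x1, y1, x2, y2}, applying the schema defaults:
    // x2 falls back to x1 and y2 falls back to y1 when absent
    static double[] bounds(Element row) {
        double x1 = Double.parseDouble(row.getAttribute("x1"));
        double y1 = Double.parseDouble(row.getAttribute("y1"));
        double x2 = row.hasAttribute("x2") ? Double.parseDouble(row.getAttribute("x2")) : x1;
        double y2 = row.hasAttribute("y2") ? Double.parseDouble(row.getAttribute("y2")) : y1;
        return new double[] { x1, y1, x2, y2 };
    }
}
```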
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of 100 (or so...)<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
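The DATE rule, where only yyyy is mandatory, can be implemented by padding a partial value with default components before parsing. The sketch below is ours; it also shows the conversion to seconds since the Epoch, matching the long representation the index is described to use internally:<br />

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class DateFieldDemo {

    // Pad a partial, zero-padded date of the form yyyy[-MM[-dd[THH[:mm[:ss[.SSS]]]]]]
    // with default components, then convert it to seconds since the Epoch (UTC assumed)
    static long toEpochSeconds(String value) {
        String template = "1970-01-01T00:00:00.000";
        String full = value + template.substring(value.length());
        return LocalDateTime.parse(full).toEpochSecond(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(toEpochSeconds("1970"));                    // 0
        System.out.println(toEpochSeconds("2001-05-27T14:35:25.523"));
    }
}
```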
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. Firstly, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object is within a query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing it with any arguments provided in the query. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with time spans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least, we need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the time span and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the ResultSet page size. This means that a somewhat computationally heavy operation can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds since the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1);<span style="color:green">//use floating-point division, or the rank is 0 for any span > 0</span><br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing it with any arguments provided in the query. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time bigger than a maxSpanSize value, avoiding those ridiculous everlasting objects. The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However, in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element, which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it into a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument list and compare it to the maxSpanSize value. If it is smaller or equal, we'll add the entry to the results list.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design is similar to the Full Text Index Service design. The forward index supports the following schema for each key-value pair:<br />
key: integer, value: string<br />
key: float, value: string<br />
key: string, value: string<br />
key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase Node. For this reason the creation of multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) at each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, allows the number of replicas for each index to be configured. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml in order for the service to communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' of each bucket-index can also be specified in the deploy-jndi-config.xml file.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once after the Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design is similar to the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key-value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances get the index from Content Management and establish a local copy of the index on the file system, which is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client.<br />
The ForwardIndexClient uses a property file, ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The Host and Port properties must be edited to point to the VO of interest.<br />
<br />
The test client gets the EPRs of the Factory services and uses them<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that together constitute the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-Resource of each type, inserts some data using the updater resource, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
processResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data into chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21687Index Management Framework2014-06-11T12:36:43Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
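<br />
As an illustration of how these relations are combined, a CQL query against the Index Service joins such criteria with boolean operators. The field names below are placeholders, not real index fields:<br />
<br />
<pre><br />
((title adj "tuna fisheries") and (year within "2000 2010"))<br />
</pre><br />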
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used by any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
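<br />
Since the API is plain HTTP, any client can assemble such a call. The sketch below shows a minimal Java helper that builds the query URL, URL-encoding the CQL string; the host, resource ID and field ID used in the example are placeholders, not real identifiers:<br />

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class IndexQueryUrlBuilder {

    // Builds the /query URL of the index service for a given host and
    // resource ID, URL-encoding the CQL query string. The webapp path
    // follows the GET call shown above.
    public static String buildQueryUrl(String host, String resourceId, String queryString)
            throws UnsupportedEncodingException {
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/" + resourceId
                + "/query?queryString=" + URLEncoder.encode(queryString, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Placeholder host, resource ID and field ID, for illustration only
        System.out.println(buildQueryUrl("localhost:8080", "my-resource-id",
                "((fieldID = tuna))"));
    }
}
```

The resulting URL can then be fetched with any HTTP client (for example java.net.HttpURLConnection) to obtain the query results.<br />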
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason creation of multiple resources of IndexNode service is discouraged, instead the best case is to have one resource (one node) at each container that consists the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, can configure the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file.<br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet for each result of a query performed on the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments that will be used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder in which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy.properties''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT accordingly.<br />
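For instance, a complete CQL query combining two of the triples above would be transformed as follows (an illustrative example using the field names from the table):

<pre>
CQL:    (title adj "sun is up") and (date within "2005 2008")
Lucene: title:"sun is up" AND date:[2005 TO 2008]
</pre>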
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document must contain an "ObjectID" field, which specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed with each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node we need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and make it discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
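As a rough sketch (assuming a standard Tomcat installation with SmartGears already installed; the exact paths are illustrative), deployment amounts to dropping the war into the application server and starting it:

<pre>
cp index-service-{version}.war $CATALINA_HOME/webapps/
$CATALINA_HOME/bin/startup.sh
</pre>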
<br />
Before starting the application service we should provide the configuration needed by [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry].<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'' and the variables<br />
that need to be set are: ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false).<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexClient and IndexFactoryClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resource with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT accordingly.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document must contain an "ObjectID" field, which specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except references. Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed with each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, setfan.debald@fast.no) and saved in the appropriate configuration file (see how to install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module ''common''. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin.<br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factory uses the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations.<br />
The parameter pluginName is the fully qualified name of the plugin class to be loaded and instantiated.<br />
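The reflective loading described above can be sketched with a minimal, self-contained example (the interface and class names here are simplified, illustrative stand-ins for the actual plugin classes):

<source lang="java">
// Illustrative sketch of the plugin factory mechanism; names are stand-ins
// for the LanguageIdPlugin/LemmatizerPlugin classes described above.
interface LanguageIdPlugin {
    String identifyLanguage(String text);
}

// Analogous to the dummy plugin, which always answers "nolang"
class DummyLangidPlugin implements LanguageIdPlugin {
    public String identifyLanguage(String text) {
        return "nolang";
    }
}

class LanguageIdFactory {
    // Loads and instantiates a plugin by its fully qualified class name,
    // the same mechanism as Class.forName(pluginName).newInstance()
    static LanguageIdPlugin loadPlugin(String pluginName) {
        try {
            return (LanguageIdPlugin) Class.forName(pluginName)
                                           .getDeclaredConstructor()
                                           .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Could not load plugin: " + pluginName, e);
        }
    }
}
</source>

The real factories work the same way: the caller passes the plugin class name at resource-creation time, and the factory instantiates it without compile-time knowledge of the implementation.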
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created are:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the fully qualified name of the implementation class to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may contain the language per field. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available in addition to the dummy plugin that always returns "" (empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used. The terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++, and the C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerIdFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory for RHE3 and one for RHE4 shared objects (.so). The etc directory contains the configuration files. The license string is contained in the configuration file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so.<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
'''$GLOBUS_LOCATION/lib''' must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is therefore important to know the language of the query in order to know which words to expand it with. Currently, the same methods used to specify the language for a language-aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language-aware query) will be made available shortly.<br />
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefand.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content that did not fit on the first resource. Partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As described in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example, the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the highest scores must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
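The geosearch relation is plain text, so a client can assemble it with simple string concatenation. The following is a minimal sketch (the GeoSearchQuery helper is hypothetical; only the relation and modifier syntax come from the example above), omitting the ranker and refiner modifiers:<br />

```java
// Hypothetical helper that assembles a geosearch CQL triple; not a gCube class.
public class GeoSearchQuery {

    static String build(String colID, String lang, int inclusion, String polygon) {
        StringBuilder sb = new StringBuilder("geo geosearch");
        if (colID != null) sb.append("/colID=\"").append(colID).append("\"");
        if (lang != null) sb.append("/lang=\"").append(lang).append("\"");
        sb.append("/inclusion=\"").append(inclusion).append("\"");
        sb.append(" \"").append(polygon).append("\""); // the query polygon's points
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("colA", "en", 0, "1 1 1 10 10 1 10 10"));
    }
}
```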
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that would produce zero results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, we could eliminate this subquery, since it could not produce any result (each document belongs to exactly one collection). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
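The cut-off test itself only needs the collection IDs and the geometry of the two criteria. A toy sketch, using java.awt.geom rectangles in place of the index's real polygon type (the CutOffRule helper is hypothetical):<br />

```java
import java.awt.geom.Rectangle2D;

// Hypothetical sketch of the cut-off rule: a conjunction of two
// "contains" criteria can be eliminated if the criteria refer to
// different collections, or if their regions do not intersect.
public class CutOffRule {

    static boolean canEliminate(String colA, Rectangle2D p1,
                                String colB, Rectangle2D p2) {
        if (!colA.equals(colB)) return true; // each document belongs to one collection
        return !p1.intersects(p2);           // disjoint regions: nothing fits both
    }

    public static void main(String[] args) {
        Rectangle2D r1 = new Rectangle2D.Double(0, 0, 10, 10);
        Rectangle2D r2 = new Rectangle2D.Double(20, 20, 5, 5);
        System.out.println(canEliminate("colA", r1, "colA", r2)); // disjoint: eliminable
    }
}
```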
<br />
The transformation of an initial CQL query into a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
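The union performed by the MergeSorter can be pictured as a k-way merge of score streams that are each already sorted in descending order. This is an illustrative, self-contained sketch (not the actual gCube MergeSorter API, which also handles flow control and worker synchronization):<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the MergeSorter idea: merge several per-R-tree
// result streams, each already sorted by descending score, into one stream.
public class MergeSorterSketch {

    static List<Double> merge(List<List<Double>> streams) {
        // Heap entries are {score, streamIndex, positionInStream}; max-heap by score.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>((a, b) -> Double.compare(b[0], a[0]));
        for (int s = 0; s < streams.size(); s++)
            if (!streams.get(s).isEmpty())
                heap.add(new double[]{streams.get(s).get(0), s, 0});
        List<Double> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            double[] top = heap.poll();
            merged.add(top[0]);
            int s = (int) top[1], next = (int) top[2] + 1;
            if (next < streams.get(s).size())
                heap.add(new double[]{streams.get(s).get(next), s, next});
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(List.of(0.9, 0.5), List.of(0.8, 0.1))));
    }
}
```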
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set equal to x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided), in addition to any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. As with full-text indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first is required, while the second is optional.<br />
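The x2/y2 defaulting rule described above can be made explicit when emitting a ROW element. A hedged sketch (the RowSetWriter helper is hypothetical, and FIELD children are omitted for brevity):<br />

```java
// Hypothetical helper that emits a minimal GeoROWSET row; only the attribute
// names and the "x2/y2 default to x1/y1" rule come from the schema description.
public class RowSetWriter {

    static String row(String id, long x1, Long x2, long y1, Long y2) {
        long rx2 = (x2 != null) ? x2 : x1; // x2 defaults to x1
        long ry2 = (y2 != null) ? y2 : y1; // y2 defaults to y1
        return "<ROW id=\"" + id + "\" x1=\"" + x1 + "\" x2=\"" + rx2
             + "\" y1=\"" + y1 + "\" y2=\"" + ry2 + "\"/>";
    }

    public static void main(String[] args) {
        System.out.println(row("doc1", 4321, null, 1234, null));
    }
}
```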
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, are specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
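Since the index internally represents a DATE as a long holding seconds since the Epoch (see the RankEvaluator example below), a feeder has to normalize partial dates, as only yyyy is mandatory. A hedged sketch of such a conversion (the DateField helper is hypothetical; missing components are assumed to default to January 1st, midnight UTC):<br />

```java
import java.util.Calendar;
import java.util.TimeZone;

// Hypothetical normalizer for partial yyyy-MM-dd'T'HH:mm:ss.s values;
// pads missing components with defaults and converts to epoch seconds.
public class DateField {

    static long toEpochSeconds(String value) {
        String defaults = "0000-01-01T00:00:00.0";
        // Append the tail of the default string to complete a partial date.
        String full = value + defaults.substring(Math.min(value.length(), defaults.length()));
        Calendar c = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        c.clear();
        c.set(Integer.parseInt(full.substring(0, 4)),        // year
              Integer.parseInt(full.substring(5, 7)) - 1,    // month (0-based)
              Integer.parseInt(full.substring(8, 10)),       // day
              Integer.parseInt(full.substring(11, 13)),      // hour
              Integer.parseInt(full.substring(14, 16)),      // minute
              Integer.parseInt(full.substring(17, 19)));     // second
        return c.getTimeInMillis() / 1000L;
    }

    public static void main(String[] args) {
        System.out.println(toEpochSeconds("2001-05-27T14:35:25.523"));
        System.out.println(toEpochSeconds("2001")); // only the mandatory part
    }
}
```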
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. First, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
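The two steps can be illustrated with plain java.awt geometry standing in for the R-tree and the plugin (a toy sketch under those assumptions, not the real Geotools-based implementation):<br />

```java
import java.awt.Polygon;
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of two-step processing: a cheap MBR filter
// (the R-tree's job) followed by an exact check against the query polygon
// (the refiner's job).
public class TwoStepSketch {

    static List<Rectangle> query(Polygon queryRegion, List<Rectangle> objects) {
        Rectangle mbr = queryRegion.getBounds();
        List<Rectangle> hits = new ArrayList<>();
        for (Rectangle o : objects) {
            if (!mbr.intersects(o)) continue;           // filtering step: MBR only
            if (queryRegion.intersects(o)) hits.add(o); // refinement step: exact geometry
        }
        return hits;
    }

    public static void main(String[] args) {
        Polygon tri = new Polygon(new int[]{0, 10, 0}, new int[]{0, 0, 10}, 3);
        System.out.println(query(tri, List.of(new Rectangle(1, 1, 2, 2),
                                              new Rectangle(20, 20, 1, 1))));
    }
}
```

Note how an object like (8,8)-(9,9) passes the MBR filter of the triangle above but is discarded by the exact refinement, which is precisely the "false hit" the second step exists to remove.<br />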
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the RankEvaluator plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span a long period of time are of less interest than entries which span a short period of time. Since we're dealing with time spans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least, we need to implement the rank() method. This is the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the time span and divide 1 by this number (plus one) in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-tree filtering step, but only for a subset roughly fitting the ResultSet page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds since the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
        return 1.0/(spanSize + 1);<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
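To see the formula behave, here is a self-contained numeric illustration of the same rank computation (the values are made up; note the floating-point division, so shorter spans score closer to 1):<br />

```java
// Standalone illustration of the "shorter span ranks higher" formula used
// by the SpanSizeRanker example above; the epoch values are invented.
public class RankDemo {

    static double rank(long startEpoch, long endEpoch) {
        long spanSize = endEpoch - startEpoch;
        return 1.0 / (spanSize + 1); // floating-point division, never divides by zero
    }

    public static void main(String[] args) {
        System.out.println(rank(100, 100)); // zero-length span: rank 1.0
        System.out.println(rank(100, 199)); // 99-second span: rank 0.01
    }
}
```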
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a RankEvaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects whose time span exceeds a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However, in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element: a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, representing the maxSpanSize in seconds (matching the seconds-from-the-Epoch representation of the date fields).<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller than or equal to maxSpanSize, we'll add the entry to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
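For a quick sanity check of the filtering behaviour, here is a standalone sketch (plain Java, invented names, no gCube dependencies) that reproduces the two steps of refine(): normalizing swapped start/end times and keeping only the entries whose span does not exceed maxSpanSize.<br />

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the SpanSizeRefiner filtering logic,
// operating on (start, end) pairs instead of R-Tree entries.
public class SpanFilter {

    // Keep only the spans whose length is <= maxSpanSize,
    // tolerating pairs where start and end are swapped.
    public static List<long[]> refine(List<long[]> spans, long maxSpanSize) {
        List<long[]> result = new ArrayList<>();
        for (long[] span : spans) {
            long start = Math.min(span[0], span[1]);
            long end = Math.max(span[0], span[1]);
            if (end - start <= maxSpanSize) {
                result.add(span);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<long[]> spans = new ArrayList<>();
        spans.add(new long[]{0, 50});    // span 50: kept
        spans.add(new long[]{120, 100}); // swapped, span 20: kept
        spans.add(new long[]{0, 5000});  // span 5000: dropped
        System.out.println(refine(spans, 100).size()); // 2
    }
}
```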
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
 geoManagementFactory = geoManagementFactoryLocator<br />
		.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
 org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
	new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
	geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
 indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
 geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
 org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
	.createResource(geoUpdaterCreateArguments);<br />
 geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
 String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
 geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
 geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
 geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving the documents whose key values fall within the corresponding range. The Forward Index Service design pattern is similar to the Full Text Index Service design. The forward index supports the following schemas for key-value pairs:<br />
<pre><br />
key: integer, value: string<br />
key: float, value: string<br />
key: string, value: string<br />
key: date, value: string<br />
</pre><br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|ForwardIndexNode Service]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
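The rule described above boils down to a single decision. The sketch below (plain Java, illustrative names only, not actual service code) shows how the clusterID would be derived from the '''useClusterId''' flag.<br />

```java
// Sketch of the cluster naming rule: the clusterID is either the
// indexID or the scope, depending on the useClusterId flag from
// deploy-jndi-config.xml. Illustration only, not service code.
public class ClusterIdRule {

    public static String clusterId(boolean useClusterId, String indexID, String scope) {
        // useClusterId = true  -> one cluster per index (indexID)
        // useClusterId = false -> one cluster per scope
        return useClusterId ? indexID : scope;
    }

    public static void main(String[] args) {
        System.out.println(clusterId(true, "idx-42", "/gcube/devsec"));
        System.out.println(clusterId(false, "idx-42", "/gcube/devsec"));
    }
}
```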
<br />
<br />
Couchbase, the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' quota of each bucket (index) can also be specified in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that it is possible for a ForwardIndexNode to stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately the first time after Couchbase server installation, so it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
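To illustrate the idea, the sketch below (plain Java; the data model is invented for illustration and is not the Forward Index API) represents a transformed query as a union of conjunctions of per-key ranges, and applies one simple cut-off rule: a conjunction containing an empty range can never match any document, so it is dropped before execution.<br />

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the "union of range queries" form plus a simple cut-off
// rule. A conjunction maps key -> [low, high] (inclusive); a query is
// a list of conjunctions (their union). Invented model, illustration only.
public class RangeQueryCutoff {

    // Drop every conjunction that contains an empty range (low > high):
    // such a branch of the union can never produce results.
    public static List<Map<String, long[]>> cutOff(List<Map<String, long[]>> union) {
        List<Map<String, long[]>> kept = new ArrayList<>();
        for (Map<String, long[]> conjunction : union) {
            boolean satisfiable = true;
            for (long[] range : conjunction.values()) {
                if (range[0] > range[1]) { satisfiable = false; break; }
            }
            if (satisfiable) kept.add(conjunction);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, long[]> ok = new HashMap<>();
        ok.put("year", new long[]{2000, 2010});

        Map<String, long[]> empty = new HashMap<>();
        empty.put("year", new long[]{2010, 2000}); // low > high: unsatisfiable

        List<Map<String, long[]>> union = new ArrayList<>();
        union.add(ok);
        union.add(empty);
        System.out.println(cutOff(union).size()); // 1
    }
}
```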
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
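As an illustration of that split, the following sketch (plain Java string building, purely hypothetical; a real feeder would use a proper XML library) assembles a one-tuple ROWSET in which every pair appears both as a <KEY> element (indexable) and as a <FIELD> under <VALUE> (presentable).<br />

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build a one-tuple ROWSET where every pair is both indexable
// (a <KEY> element) and presentable (a <FIELD> under <VALUE>).
public class RowSetBuilder {

    public static String build(Map<String, String> pairs) {
        StringBuilder keys = new StringBuilder();
        StringBuilder fields = new StringBuilder();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            keys.append("<KEY><KEYNAME>").append(e.getKey())
                .append("</KEYNAME><KEYVALUE>").append(e.getValue())
                .append("</KEYVALUE></KEY>");
            fields.append("<FIELD name=\"").append(e.getKey())
                  .append("\">").append(e.getValue()).append("</FIELD>");
        }
        return "<ROWSET><INSERT><TUPLE>" + keys
                + "<VALUE>" + fields + "</VALUE>"
                + "</TUPLE></INSERT></ROWSET>";
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("title", "sun is up");
        System.out.println(build(pairs));
    }
}
```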
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving the documents whose key values fall within the corresponding range. The '''Forward Index Service''' design pattern is similar to the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing two keys (it implements java.util.Comparator).<br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index, and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances get the index from Content Management, and establish a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client.<br />
The ForwardIndexClient uses a property file, ForwardIndex.properties.<br />
<br />
The property file contains the following properties:<br />
<pre><br />
ForwardIndexManagementFactoryResource=/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
</pre><br />
<br />
The properties Host and Port must be edited to point to the VO of interest.<br />
<br />
The test client gets the EPRs of the factory services and uses them to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that together constitute the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS resource of each type, inserts some data using the updater WS resource, and queries<br />
the data using the lookup WS resource.<br />
<br />
Inserting and deleting tuples:<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) - simple methods to insert / delete single tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
processResultSet - method to insert / delete a series of tuples in a rowset wrapped in a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21686Index Management Framework2014-06-11T12:36:08Z<p>Alex.antoniadi: /* Services */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service which provides both FullText Index and Forward Index capabilities. All of them are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports, depends on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used by any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
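As a sketch of how such a call can be prepared programmatically, the following Java snippet builds the query URL and URL-encodes the CQL query string; the host and resource ID used here are placeholders, not real endpoints:<br />

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds the HTTP GET query URL following the path layout shown above:
// {host}/index-service-1.0.0-SNAPSHOT/{resourceID}/query?queryString=...
// The class and method names are illustrative, not part of the service.
public class IndexQueryUrl {

    public static String build(String host, String resourceId, String cql) {
        try {
            // the CQL expression must be URL-encoded before it is sent
            return "http://" + host + "/index-service-1.0.0-SNAPSHOT/"
                    + resourceId + "/query?queryString="
                    + URLEncoder.encode(cql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String url = build("localhost:8080", "my-resource-id",
                "((title = tuna) and (gDocCollectionID == colA))");
        System.out.println(url);
    }
}
```

Any REST-capable HTTP client can then perform a GET on the resulting URL.<br />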
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and update operations on the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; the recommended setup is one resource (one node) on each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
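The clusterID rule described above can be sketched as follows; the class and method names are illustrative and not part of the service's actual API:<br />

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sketch of the cluster-membership rule: when defaultSameCluster=true the
// clusterID equals the indexID, otherwise the scope is used, so all nodes
// deployed in a scope join one large cluster.
public class ClusterIdRule {

    public static String clusterId(Properties deployProps, String indexId, String scope) {
        boolean sameCluster = Boolean.parseBoolean(
                deployProps.getProperty("defaultSameCluster", "false"));
        return sameCluster ? indexId : scope;
    }

    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader("defaultSameCluster=false"));
        // with defaultSameCluster=false the scope identifies the cluster
        System.out.println(clusterId(p, "myIndexID", "/gcube/devNext"));
    }
}
```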
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file.<br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the content matching the query, built from the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
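To illustrate the effect of these two variables, the following sketch bounds a snippet to at most ''maxFragmentCnt'' fragments of at most ''maxFragmentSize'' characters each. It is a simplification of the behaviour described above, not the service's actual code:<br />

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: a highlight snippet is a concatenation of at most
// maxFragmentCnt matching fragments, each truncated to maxFragmentSize
// characters. The joining separator is an assumption for readability.
public class SnippetBounds {

    public static String snippet(List<String> fragments, int maxFragmentCnt, int maxFragmentSize) {
        return fragments.stream()
                .limit(maxFragmentCnt)
                .map(f -> f.length() <= maxFragmentSize ? f : f.substring(0, maxFragmentSize))
                .collect(Collectors.joining(" ... "));
    }

    public static void main(String[] args) {
        System.out.println(snippet(Arrays.asList(
                "tuna fishing in the Atlantic",
                "canned tuna exports"), 5, 80));
    }
}
```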
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy.properties''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
port=8080<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
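The triple transformation in the table can be sketched for a single Index-Relation-Term triple as follows. Relations other than those shown are simplified to a plain field:term clause, and the allIndexes expansion across fields is omitted; the class name is illustrative:<br />

```java
// Maps a CQL Index-Relation-Term triple to its Lucene equivalent,
// following the examples in the table above.
public class CqlToLucene {

    public static String toLucene(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy term query
                return index + ":" + term + "~";
            case "proximity": {
                // term is "<distance> <words...>", e.g. "5 sun up"
                int split = term.indexOf(' ');
                String distance = term.substring(0, split);
                String words = term.substring(split + 1);
                return index + ":\"" + words + "\"~" + distance;
            }
            case "within": {
                // term is "<low> <high>", e.g. "2005 2008"
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:          // "=", "==" and similar equality relations
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(toLucene("title", "adj", "sun is up"));
        System.out.println(toLucene("date", "within", "2005 2008"));
    }
}
```

A full query is then obtained by connecting such clauses with AND, OR and NOT, as explained above.<br />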
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field, which specifies its unique identifier.<br />
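A minimal, illustrative check of these constraints (required idxType/colID attributes on <ROWSET> and a per-ROW ObjectID field) could look like the following; it is a sketch, not the service's actual validator:<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Validates the two ROWSET constraints described above. Returns false on
// malformed XML as well, since such input cannot be fed to the index.
public class RowsetCheck {

    public static boolean isValid(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element rowset = doc.getDocumentElement();
            // idxType and colID are required attributes
            if (rowset.getAttribute("idxType").isEmpty()
                    || rowset.getAttribute("colID").isEmpty()) return false;
            NodeList rows = rowset.getElementsByTagName("ROW");
            for (int i = 0; i < rows.getLength(); i++) {
                boolean hasObjectId = false;
                NodeList fields = ((Element) rows.item(i)).getElementsByTagName("FIELD");
                for (int j = 0; j < fields.getLength(); j++)
                    if ("ObjectID".equals(((Element) fields.item(j)).getAttribute("name")))
                        hasObjectId = true;
                if (!hasObjectId) return false; // every ROW needs an ObjectID
            }
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ok = "<ROWSET idxType=\"t\" colID=\"c\"><ROW>"
                + "<FIELD name=\"ObjectID\">doc1</FIELD></ROW></ROWSET>";
        System.out.println(isValid(ok));
    }
}
```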
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
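The skipping rule above (fields present in the ROWSET but absent from the IndexType are ignored) can be sketched as follows, with illustrative names:<br />

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of field handling: only fields declared in the IndexType with
// <index>yes</index> are indexed; any other ROWSET field is skipped.
public class FieldHandling {

    public static List<String> indexedFields(Map<String, Boolean> indexTypeIndexFlags,
                                             List<String> rowsetFields) {
        return rowsetFields.stream()
                .filter(f -> Boolean.TRUE.equals(indexTypeIndexFlags.get(f)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        flags.put("title", true);    // <index>yes</index>
        flags.put("contents", true); // <index>yes</index>
        // "unknownField" is not declared in the IndexType, so it is skipped
        System.out.println(indexedFields(flags,
                Arrays.asList("title", "unknownField", "contents")));
    }
}
```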
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node we will need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application service we should provide the configuration needed by [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry].<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'' and the variables<br />
that need to be set are: ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false)<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running on, have to be set in the variables ''hostname'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resource with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except references. Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, returning always "nolang" when called). The lemmatizer module contains one real implementation (one plugin) (no suitable alternative was found to make a second plugin), and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, setfan.debald@fast.no), and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
Both factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading an implementation. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
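The loading mechanism can be sketched as follows. The plugin base class and the dummy implementation below are simplified stand-ins for the real gCube classes; only the reflective Class.forName(pluginName).newInstance() call mirrors the actual factory code.<br />

```java
// Sketch of the factory mechanism: the plugin is addressed by its fully
// qualified class name and instantiated reflectively, mirroring
// Class.forName(pluginName).newInstance() from the text. The base class and
// dummy plugin below are simplified stand-ins, not the real gCube classes.
public class PluginFactorySketch {

    // Stand-in for org.gcube...langidplugin.LanguageIdPlugin
    public static abstract class LanguageIdPlugin {
        public abstract String identify(String text);
    }

    // Stand-in for the dummy plugin, which always answers "nolang"
    public static class DummyLangidPlugin extends LanguageIdPlugin {
        public String identify(String text) { return "nolang"; }
    }

    // The factory only knows the class name it is asked to load
    public static LanguageIdPlugin load(String pluginName) throws Exception {
        return (LanguageIdPlugin) Class.forName(pluginName).newInstance();
    }
}
```

Because the concrete class is chosen by name at runtime, new plugins can be added without changing the factory, which is exactly what makes dummy and vendor implementations interchangeable.<br />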
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The following plugin implementations can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization tool written in Java. It implements the N-Gram-Based Text Categorization algorithm described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory with the RHE3 and one with the RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
Documents being fed may specify the language per field. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using that plugin. <br />
Since language is assigned at the collection level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
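A small helper shows how such language markers are assembled; the helper itself is hypothetical, but the produced strings follow the syntax of the examples above.<br />

```java
// Hypothetical helper that assembles the _lang_ / _querylang_ markers
// shown in the examples above. Only the marker syntax comes from the text;
// the class and method names are illustrative.
public class LangQuerySketch {

    // Restrict a single term (optionally field-qualified) to one language
    public static String langTerm(String lang, String field, String term) {
        return field == null
                ? "_lang_" + lang + ":" + term
                : "_lang_" + lang + "_" + field + ":" + term;
    }

    // Restrict a whole query to one language
    public static String langQuery(String lang, String query) {
        return "_querylang_" + lang + ": " + query;
    }
}
```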
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for each language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch and Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory with the RHE3 and one with the RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
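Since a wrong path brings down the whole JVM, it can be worth verifying the resolved path before the native code is initialized. The following sketch (class and method names are illustrative) shows the resolution rule: the configured path is taken relative to $GLOBUS_LOCATION.<br />

```java
import java.io.File;

// Sketch of the path resolution described above: paths from
// LemmatizerConfigQueryExpansion.xml are resolved against $GLOBUS_LOCATION.
// Checking existence up front is a cheap guard against the native crash;
// the class and method names here are illustrative.
public class AutPathCheckSketch {

    public static File resolve(String globusLocation, String relativePath) {
        File f = new File(globusLocation, relativePath);
        if (!f.isFile()) {
            // fail in Java instead of letting the native library core dump
            throw new IllegalStateException("lemmatizer resource missing: " + f);
        }
        return f;
    }
}
```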
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same method used to specify the language for a language aware query is used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
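Conceptually, the expansion step looks like the following sketch. The dictionary is a toy stand-in; in the real module the word forms come from the Fast .aut files.<br />

```java
import java.util.*;

// Sketch of lemmatization by query expansion: each query term is replaced
// by an OR of its known surface forms. The tiny dictionary is illustrative
// only; the real module reads its forms from the Fast .aut files.
public class ExpansionSketch {

    static final Map<String, List<String>> FORMS = new HashMap<>();
    static {
        FORMS.put("car", Arrays.asList("car", "cars"));
        FORMS.put("mouse", Arrays.asList("mouse", "mice"));
    }

    public static String expand(String term) {
        List<String> forms = FORMS.getOrDefault(term, Collections.singletonList(term));
        return forms.size() == 1 ? forms.get(0) : "(" + String.join(" OR ", forms) + ")";
    }
}
```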
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer modules is valid through March 2008.<br />
<br />
If a new license is required, please contact Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed indices, partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
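The behaviour can be sketched as follows: updates that no longer fit in the current partition open a new one, while a query transparently fans out over all partitions. The capacity model and all names here are illustrative, not the service's actual implementation.<br />

```java
import java.util.*;

// Illustrative sketch of lookup-side partitioning: an overflowing update
// creates a new partition, and queries merge results from every partition.
// Capacity is modelled as a document count purely for demonstration.
public class PartitionSketch {

    static class Partition {
        final int capacity;
        final List<String> docs = new ArrayList<>();
        Partition(int capacity) { this.capacity = capacity; }
        boolean hasRoom() { return docs.size() < capacity; }
    }

    final List<Partition> partitions = new ArrayList<>();
    final int capacity;

    PartitionSketch(int capacity) { this.capacity = capacity; }

    void add(String doc) {
        Partition last = partitions.isEmpty() ? null : partitions.get(partitions.size() - 1);
        if (last == null || !last.hasRoom()) {   // overflow -> open a new partition
            last = new Partition(capacity);
            partitions.add(last);
        }
        last.docs.add(doc);
    }

    // Queries are transparent to partitioning: ask every partition and merge
    List<String> query(String term) {
        List<String> hits = new ArrayList<>();
        for (Partition p : partitions)
            for (String d : p.docs)
                if (d.contains(term)) hits.add(d);
        return hits;
    }
}
```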
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
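The semantics of the inclusion modifier can be sketched with axis-aligned rectangles (the real index evaluates polygons through Geotools). Treating inclusion type 1 as "the document region contains the query polygon" is an assumption based on the usual spatial predicates; all class and method names below are illustrative.<br />

```java
// Sketch of the three inclusion types (0 intersects, 1 contains, 2 inside)
// using axis-aligned rectangles. The real index works on polygons via
// Geotools; reading type 1 as "document contains the query region" is an
// assumption, and all names here are illustrative.
public class InclusionSketch {

    static class Rect {
        final double x1, y1, x2, y2;
        Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean intersects(Rect o) {
            return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
        }
        boolean contains(Rect o) {
            return x1 <= o.x1 && o.x2 <= x2 && y1 <= o.y1 && o.y2 <= y2;
        }
    }

    // doc = the indexed object's region, query = the polygon from the query
    static boolean matches(int inclusion, Rect doc, Rect query) {
        switch (inclusion) {
            case 0: return doc.intersects(query); // "intersects"
            case 1: return doc.contains(query);   // "contains" (assumed reading)
            case 2: return query.contains(doc);   // "inside"
            default: throw new IllegalArgumentException("unknown inclusion: " + inclusion);
        }
    }
}
```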
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries, is to apply boolean algebra rules, and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then<br />
there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
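The cut-off test for a conjunction of two geosearch criteria can be sketched as follows; bounding boxes stand in for the polygons, and all names are illustrative.<br />

```java
// Sketch of the cut-off rule: a conjunction of two geosearch criteria can
// be pruned without touching any R-tree when the criteria name different
// collections, or when their regions (bounding boxes here, as stand-ins
// for the polygons) do not intersect. Names are illustrative.
public class CutoffSketch {

    // boxes given as {x1, y1, x2, y2}
    static boolean disjoint(double[] a, double[] b) {
        return a[2] < b[0] || b[2] < a[0] || a[3] < b[1] || b[3] < a[1];
    }

    static boolean canPruneAnd(String colA, double[] boxA, String colB, double[] boxB) {
        if (!colA.equals(colB)) return true; // a document belongs to one collection only
        return disjoint(boxA, boxB);         // no common area to be contained in
    }
}
```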
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
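The union step can be sketched with a priority queue over the heads of the per-R-tree result streams, each already sorted by descending score. Flow control and the pausing of workers are omitted, and all names are illustrative.<br />

```java
import java.util.*;

// Sketch of the MergeSorter's union step: each worker delivers hits already
// ordered by descending score, and a priority queue repeatedly emits the
// best head among the streams. Flow control is omitted; names illustrative.
public class MergeSorterSketch {

    static class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    // Each input list is one R-tree worker's output, sorted by descending score.
    static List<Hit> merge(List<List<Hit>> streams) {
        // queue entry: {stream index, position within stream}, best score first
        PriorityQueue<int[]> pq = new PriorityQueue<>(
                (a, b) -> Double.compare(streams.get(b[0]).get(b[1]).score,
                                         streams.get(a[0]).get(a[1]).score));
        for (int s = 0; s < streams.size(); s++)
            if (!streams.get(s).isEmpty()) pq.add(new int[]{s, 0});
        List<Hit> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            out.add(streams.get(top[0]).get(top[1]));
            if (top[1] + 1 < streams.get(top[0]).size())
                pq.add(new int[]{top[0], top[1] + 1});
        }
        return out;
    }
}
```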
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set equal to x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided). In addition, a ROW may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to full text indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
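A field declaration like the ones above could drive parsing of FIELD values roughly as follows; the parser is a hypothetical sketch, not the index's actual code.<br />

```java
// Hypothetical sketch of how a GeoIndexType field declaration could drive
// parsing/validation of FIELD values from a ROWSET. The type names follow
// the list above; the class itself is illustrative.
public class FieldTypeSketch {

    static Object parse(String type, String value) {
        switch (type.toUpperCase()) {
            case "SHORT":  return Short.parseShort(value);
            case "INT":    return Integer.parseInt(value);
            case "LONG":   return Long.parseLong(value);
            case "FLOAT":  return Float.parseFloat(value);
            case "DOUBLE": return Double.parseDouble(value);
            case "STRING": return value;
            case "DATE":   return value; // yyyy-MM-dd'T'HH:mm:ss.s; only yyyy mandatory
            default: throw new IllegalArgumentException("unknown field type: " + type);
        }
    }
}
```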
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. First, a very efficient filtering step retrieves all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to determine whether an object matches a query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the RankEvaluator plugin, passing it any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than entries which span over a short period of time. Since we're dealing with time spans, we'll assume that the data stored in the index has a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and the GeoIndexField.DataType (date, double, float, int, long, short or string) of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least, we need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the time span and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds since the Epoch. If we wanted to evaluate the Minimum Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0/(spanSize + 1); // use a double literal: integer division would always yield 0<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, passing it any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index has a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last: the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument list and compare it to the maxSpanSize value. If it is smaller or equal, we'll add the entry to the results list.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to several keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design closely follows the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities of the service used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|ForwardIndexNode Service]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
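The effect of the '''useClusterId''' flag can be summarised with a small sketch (a hypothetical helper, not part of the service code): ''true'' yields one cluster per indexID, ''false'' one cluster per scope.<br />

```java
// Hypothetical illustration of the clusterID rule described above; the real
// service reads useClusterId from deploy-jndi-config.xml.
class ClusterIdResolver {
    static String resolve(boolean useClusterId, String indexId, String scope) {
        // true  -> one cluster per index (clusterID == indexID)
        // false -> one cluster per scope (clusterID == scope)
        return useClusterId ? indexId : scope;
    }
}
```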
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can configure the number of replicas for each index. This is done by setting the variables '''noReplicas''' in the '''deploy-jndi-config.xml''' file<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect with each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (unlike ElasticSearch) means that it is possible for a ForwardIndexNode to stop while the respective Couchbase server is still running. Also, the initialization of a Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, the first time after the Couchbase server installation, so it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibility to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
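The rewriting step can be illustrated with a small Java sketch (''Node'', ''Criterion'', ''And'' and ''Or'' are hypothetical types, not the service code): an AND over an OR is distributed so that the query becomes a union of conjunctions of single-key range criteria.<br />

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: rewriting a boolean criteria tree into a union (OR) of
// conjunctions (AND) of single range criteria. These types are hypothetical
// and not part of the service.
class DnfRewrite {
    interface Node {}
    static class Criterion implements Node {
        final String key, range;
        Criterion(String key, String range) { this.key = key; this.range = range; }
    }
    static class And implements Node {
        final Node left, right;
        And(Node left, Node right) { this.left = left; this.right = right; }
    }
    static class Or implements Node {
        final Node left, right;
        Or(Node left, Node right) { this.left = left; this.right = right; }
    }

    // Returns the query as a union of conjunctions, each conjunction being a
    // list of single-key criteria.
    static List<List<Criterion>> toDnf(Node n) {
        List<List<Criterion>> out = new ArrayList<>();
        if (n instanceof Criterion) {
            List<Criterion> single = new ArrayList<>();
            single.add((Criterion) n);
            out.add(single);
        } else if (n instanceof Or) { // union: concatenate the two sides
            out.addAll(toDnf(((Or) n).left));
            out.addAll(toDnf(((Or) n).right));
        } else {                      // AND: distribute over both unions
            And a = (And) n;
            for (List<Criterion> l : toDnf(a.left)) {
                for (List<Criterion> r : toDnf(a.right)) {
                    List<Criterion> both = new ArrayList<>(l);
                    both.addAll(r);
                    out.add(both);
                }
            }
        }
        return out;
    }
}
```

For example, ''year >= 2000 AND (lang = en OR lang = es)'' becomes the union of the two range queries ''year >= 2000 AND lang = en'' and ''year >= 2000 AND lang = es'', each of which the index can execute directly.<br />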
<br />
===RowSet===<br />
The content to be fed into an index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
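As a sketch of how such a tuple breaks down (illustration only, using the JDK's DOM parser; ''RowsetTuple'' is a hypothetical helper, not part of the service), the indexable <KEY> pairs and the presentable <FIELD> values can be read like this:<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustration only: RowsetTuple is a hypothetical helper that separates the
// indexable <KEY> pairs from the presentable <FIELD> values of a ROWSET tuple.
class RowsetTuple {
    final Map<String, String> keys = new LinkedHashMap<>();   // indexable
    final Map<String, String> fields = new LinkedHashMap<>(); // presentable

    static RowsetTuple parse(String rowsetXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rowsetXml.getBytes(StandardCharsets.UTF_8)));
            RowsetTuple tuple = new RowsetTuple();
            NodeList keyNodes = doc.getElementsByTagName("KEY");
            for (int i = 0; i < keyNodes.getLength(); i++) {
                Element key = (Element) keyNodes.item(i);
                tuple.keys.put(
                        key.getElementsByTagName("KEYNAME").item(0).getTextContent(),
                        key.getElementsByTagName("KEYVALUE").item(0).getTextContent());
            }
            NodeList fieldNodes = doc.getElementsByTagName("FIELD");
            for (int i = 0; i < fieldNodes.getLength(); i++) {
                Element field = (Element) fieldNodes.item(i);
                tuple.fields.put(field.getAttribute("name"), field.getTextContent());
            }
            return tuple;
        } catch (Exception e) {
            throw new IllegalStateException("malformed ROWSET", e);
        }
    }
}
```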
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to several keys) range queries, retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design closely follows the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing two keys (implementing java.util.Comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
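A minimal sketch of such a comparator choice (hypothetical helper, not the service code; the type names ''integer'', ''float'', ''string'' and ''date'' follow the schema listed above, and dates are assumed to arrive as ISO yyyy-mm-dd strings):<br />

```java
import java.util.Comparator;

// Illustration only: choosing a java.util.Comparator for raw key values
// according to the declared key type of the forward index schema.
class KeyComparators {
    static Comparator<String> forType(String keyType) {
        switch (keyType) {
            case "integer": return Comparator.comparingLong(Long::parseLong);
            case "float":   return Comparator.comparingDouble(Double::parseDouble);
            // assumption: dates arrive as ISO yyyy-mm-dd strings
            case "date":    return Comparator.comparingLong(s -> java.sql.Date.valueOf(s).getTime());
            case "string":  return Comparator.naturalOrder();
            default: throw new IllegalArgumentException("unknown key type: " + keyType);
        }
    }
}
```

Note how "9" sorts before "10" under the integer comparator but after it under the string comparator, which is exactly why the schema must be known before the index is built.<br />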
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of the '''ForwardIndexManagement Service''' represents an index and manages it. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key-value pairs. A '''ForwardIndexUpdater Service''' resource updates a single index; one index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' instance is created with the EPR of the '''ForwardIndexManagement''' resource connected to the index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed into the index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index and returning responses that match those queries. When created, a '''ForwardIndexLookup''' instance gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, and it can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. Each '''ForwardIndexLookup''' instance gets the index from Content Management and establishes a local copy of the index on the file system, which is the copy that is queried. The local copy is kept up to date by subscribing to index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource executes internally is a conjunction of single range criteria, each of which refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
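The per-key index and the intersection step can be sketched in-memory as follows (an illustrative stand-in using sorted maps, not the actual BerkeleyDB-based implementation):<br />

```java
import java.util.*;

// Illustrative sketch of the design above: one sorted index ("B+tree") per
// key, and a conjunctive range query answered by intersecting the documents
// returned by each per-key range scan. Not the gCube implementation.
public class RangeIntersection {

    // key name -> (key value -> ids of documents carrying that value)
    static Map<String, TreeMap<Integer, Set<String>>> indices = new HashMap<>();

    static void add(String key, int value, String docId) {
        indices.computeIfAbsent(key, k -> new TreeMap<>())
               .computeIfAbsent(value, v -> new HashSet<>())
               .add(docId);
    }

    // Documents with lo <= key <= hi, scanned from the per-key sorted index.
    static Set<String> range(String key, int lo, int hi) {
        Set<String> result = new HashSet<>();
        for (Set<String> docs : indices.get(key).subMap(lo, true, hi, true).values())
            result.addAll(docs);
        return result;
    }

    public static void main(String[] args) {
        add("year", 2008, "d1"); add("year", 2010, "d2");
        add("size", 5, "d1");    add("size", 50, "d2");
        // conjunction: 2007 <= year <= 2011 AND 0 <= size <= 10
        Set<String> hits = range("year", 2007, 2011);
        hits.retainAll(range("size", 0, 10));   // the intersection step
        System.out.println(hits);               // [d1]
    }
}
```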
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner to the Geo-Spatial Index Lookup, the Forward Index Lookup exploits any opportunity to eliminate parts of the query that cannot produce results by applying "cut-off" rules. The merge operation on the range queries' results is performed internally by the Forward Index Lookup.<br />
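As a hypothetical sketch of this transformation (the notation below is illustrative only, not the service's internal representation), a query mixing disjunction and conjunction is normalized into a union of conjunctive range queries:<br />

<pre>
(date >= "2000" and date <= "2010") or title == "sun is up"

is rewritten as the union of two range queries:

Q1: "2000" <= date <= "2010"     (a conjunction of two single criteria)
Q2: title == "sun is up"         (a single criterion)

result = docs(Q1) UNION docs(Q2), where each Qi is evaluated as the
intersection, over the per-key indices, of the criteria it contains
</pre>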
<br />
===RowSet===<br />
The content to be fed into an index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
An example row set conforming to the ROWSET schema:<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is located in the SVN module test/client.<br />
The ForwardIndexClient uses a property file, ForwardIndex.properties.<br />
<br />
The property file contains the following properties:<br />
<pre>
ForwardIndexManagementFactoryResource=
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService
Host=dili02.osl.fast.no
ForwardIndexUpdaterFactoryResource=
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService
ForwardIndexLookupFactoryResource=
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService
geoManagementFactoryResource=
/wsrf/services/gcube/index/GeoIndexManagementFactoryService
Port=8080
Create-ForwardIndexManagementFactory=true
Create-ForwardIndexLookupFactory=true
Create-ForwardIndexUpdaterFactory=true
</pre>
<br />
The properties Host and Port must be edited to point to the VO of interest.<br />
<br />
The test client obtains the EPRs of the factory services and uses them to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that together constitute the index. The service also relays notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for evaluating queries and returning the answers.<br />
<br />
The test client creates one WS-Resource of each type, inserts some data through the updater resource, and queries the data using the lookup WS-Resource.<br />
<br />
Inserting data and deleting tuples:<br />
Tuples can be inserted and deleted by:<br />
* insertingPair(key, value) / deletingPair(key) - simple methods to insert / delete single tuples.<br />
* process(rowSet) - method to insert / delete a series of tuples.<br />
* processResultSet - method to insert / delete a series of tuples from rowsets contained in a ResultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by:<br />
* getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
* getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
* getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
* getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
* getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
* getGTandLT_int(keyGT, keyLT), getGTandLT_float(keyGT, keyLT), getGTandLT_string(keyGT, keyLT), getGTandLT_date(keyGT, keyLT)<br />
* getGEandLT_int(keyGE, keyLT), getGEandLT_float(keyGE, keyLT), getGEandLT_string(keyGE, keyLT), getGEandLT_date(keyGE, keyLT)<br />
* getGTandLE_int(keyGT, keyLE), getGTandLE_float(keyGT, keyLE), getGTandLE_string(keyGT, keyLE), getGTandLE_date(keyGT, keyLE)<br />
* getGEandLE_int(keyGE, keyLE), getGEandLE_float(keyGE, keyLE), getGEandLE_string(keyGE, keyLE), getGEandLE_date(keyGE, keyLE)<br />
* getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
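Purely as a hypothetical, in-memory stand-in for the updater and lookup operations listed above (the real methods operate on WS-Resources and ResultSets, which are omitted here), their semantics can be sketched as:<br />

```java
import java.util.*;

// Hypothetical sketch mirroring the method names above (insertingPair,
// deletingPair, getEQ_string, getGEandLE_string) over a plain sorted map.
// Illustration of the semantics only - not the gCube client or service code.
public class ForwardIndexSketch {
    static NavigableMap<String, String> index = new TreeMap<>();

    static void insertingPair(String key, String value) { index.put(key, value); }
    static void deletingPair(String key) { index.remove(key); }
    static String getEQ_string(String key) { return index.get(key); }
    // keyGE <= key <= keyLE, both bounds inclusive
    static Collection<String> getGEandLE_string(String keyGE, String keyLE) {
        return index.subMap(keyGE, true, keyLE, true).values();
    }

    public static void main(String[] args) {
        insertingPair("title:aardvark", "doc1");
        insertingPair("title:zebra", "doc2");
        System.out.println(getEQ_string("title:aardvark"));           // doc1
        System.out.println(getGEandLE_string("title:a", "title:b"));  // [doc1]
        deletingPair("title:aardvark");
        System.out.println(getEQ_string("title:aardvark"));           // null
    }
}
```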
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Creating_Indices_at_the_VO_Level&diff=21391Creating Indices at the VO Level2014-05-09T16:04:23Z<p>Alex.antoniadi: /* FtsRowset_Transformer */</p>
<hr />
<div>[[Category:Administrator's Guide]]<br />
==Indexing Procedure==<br />
<br />
The Indexing procedure refers to the creation of indices for the collections [[ Content Import | imported ]] in a Virtual Organization. It consists of three steps:<br />
<br />
* Creation of the [[ Index Management Framework | Rowset XSLT ]] generic resources, which transform collection data into data that can be fed to an Index.<br />
* Creation of the [[ Index Management Framework | Index type]] generic resources, which define the Index configuration.<br />
* Definition of an [[ IR Bootstrapper | IRBootstrapper]] job that performs the steps required to create the Indices.<br />
<br />
In the first two steps we create generic resources for the Rowset XSLTs and Index Types through the [[ Resource Management | Resource Management portlet ]]. You can find detailed descriptions for the Rowset data (the output of the Rowset XSLT transformation) in the following section:<br />
<br />
* [[ Index_Management_Framework#RowSet| Full Text Index Rowset ]]<br />
<br />
You can find detailed descriptions for the Index Type definition here:<br />
<br />
* [[ Index_Management_Framework#IndexType | Full Text Index Type ]]<br />
<br />
For the third step, a definition of an IRBootstrapper job is required. You can find the details for defining such a job in the [[ IR Bootstrapper ]] section. To complete the Index creation, the administrator must go to the IRBootstrapper and run the job. The two examples that follow will clarify the three steps.<br />
<br />
==Creating an Index for an OAI-DC collection==<br />
<br />
=== DataTransformation Programs ===<br />
<br />
====FtsRowset_Transformer====<br />
The following transformation program is called for fulltext rowset creation. The transformation unit with id="6" takes multiple XSLTs and applies the final XSLT at the end.<br />
<br />
[[File:FtsRowset_Transformer.xml]]<br />
<br />
=== Index Types ===<br />
In this section we present the required IndexType for (FullText) Index.<br />
<br />
====FullTextIndexType====<br />
In order to extract the fields from the OAI-DC payload and build the FullText Index the following FullTextIndexType is required:<br />
<br />
<source lang="xml"><br />
<Name>IndexType_ft_oai_dc_1.0</Name><br />
<SecondaryType>FullTextIndexType</SecondaryType><br />
<Description>Definition of the fulltext index type for the 'oai dc' schema</Description><br />
<Body><br />
<index-type name="default"><br />
<field-list sort-xnear-stop-word-threshold="2E8"><br />
<field name="contributor"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="coverage"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="creator"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="date"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="description"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="format"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="identifier"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="language"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="publisher"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="relation"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="rights"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="source"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="subject"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="type"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="ObjectID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="S"><br />
<index>no</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>no</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</Body><br />
</source><br />
<br />
<br />
=== Bootstrapper Configuration ===<br />
The IRBootstrapper portlet requires a Generic Resource to be available on the IS with name: '''IRBootstrapperConfiguration''' and secondary type: '''IRBootstrapperConfig'''.<br />
For more information please refer to this section [https://gcube.wiki.gcube-system.org/gcube/index.php/IR_Bootstrapper#Bootstrapper_Static_Configuration IRBoostrapperConfiguration Generic Resource]<br />
<br />
An example of the configuration is the following:<br />
<br />
[[File:Bootstrapper_Configuration.xml]]<br />
<br />
=== Metadata Broker XSLT ===<br />
<br />
*BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage<br />
The following XSLT transforms data elements with oai-dc schema to fulltext rowsets:<br />
<br />
[[File:BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage.xml]]</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Creating_Indices_at_the_VO_Level&diff=21389Creating Indices at the VO Level2014-05-09T16:02:30Z<p>Alex.antoniadi: /* Indexing Procedure */</p>
<hr />
<div>[[Category:Administrator's Guide]]<br />
==Indexing Procedure==<br />
<br />
The Indexing procedure refers to the creation of indices for the collections [[ Content Import | imported ]] in a Virtual Organization. It consists of three steps:<br />
<br />
* Creation of the [[ Index Management Framework | Rowset XSLT ]] generic resources, which transform collection data into data that can be fed to an Index.<br />
* Creation of the [[ Index Management Framework | Index type]] generic resources, which define the Index configuration.<br />
* Definition of an [[ IR Bootstrapper | IRBootstrapper]] job that performs the steps required to create the Indices.<br />
<br />
In the first two steps we create generic resources for the Rowset XSLTs and Index Types through the [[ Resource Management | Resource Management portlet ]]. You can find detailed descriptions for the Rowset data (the output of the Rowset XSLT transformation) in the following section:<br />
<br />
* [[ Index_Management_Framework#RowSet| Full Text Index Rowset ]]<br />
<br />
You can find detailed descriptions for the Index Type definition here:<br />
<br />
* [[ Index_Management_Framework#IndexType | Full Text Index Type ]]<br />
<br />
For the third step, a definition of an IRBootstrapper job is required. You can find the details for defining such a job in the [[ IR Bootstrapper ]] section. To complete the Index creation, the administrator must go to the IRBootstrapper and run the job. The two examples that follow will clarify the three steps.<br />
<br />
==Creating a Full Text and a Forward Index for an OAI-DC collection==<br />
<br />
=== DataTransformation Programs ===<br />
<br />
====FtsRowset_Transformer====<br />
The following transformation program is called for fulltext rowset creation. The transformation unit with id="6" takes multiple XSLTs and applies the final XSLT at the end.<br />
<br />
[[File:FtsRowset_Transformer.xml]]<br />
<br />
====FwRowset_Transformer====<br />
The following transformation program is called for forward rowset creation. The transformation unit with id="1" takes multiple XSLTs and applies the final XSLT at the end.<br />
<br />
[[File:FwRowset_Transformer.xml]]<br />
<br />
=== Index Types ===<br />
In this section we present the required IndexTypes for both FullText and Forward Indices.<br />
<br />
====FullTextIndexType====<br />
In order to extract the fields from the OAI-DC payload and build the FullText Index the following FullTextIndexType is required:<br />
<br />
<source lang="xml"><br />
<Name>IndexType_ft_oai_dc_1.0</Name><br />
<SecondaryType>FullTextIndexType</SecondaryType><br />
<Description>Definition of the fulltext index type for the 'oai dc' schema</Description><br />
<Body><br />
<index-type name="default"><br />
<field-list sort-xnear-stop-word-threshold="2E8"><br />
<field name="contributor"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="coverage"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="creator"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="date"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="description"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="format"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="identifier"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="language"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="publisher"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="relation"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="rights"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="source"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="subject"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="type"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="ObjectID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="S"><br />
<index>no</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>no</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</Body><br />
</source><br />
<br />
====ForwardIndexType====<br />
In OAI-DC many fields, such as "title" and "creator", have string values, so we just need to create a ForwardIndexType for string-string key-value pairs in order to <br />
be able to create the Forward Index:<br />
<br />
<source lang="xml"><br />
<SecondaryType>ForwardIndexType</SecondaryType><br />
<Name>IndexType_fwd_string_string</Name><br />
<Description>Definition of the index type 'string_string' for the forward index</Description><br />
<Body><br />
<field-list> <br />
<field name="key"> <br />
<type>string</type><br />
<sort>ascending</sort><br />
</field><br />
<field name="value"><br />
<type>string</type><br />
</field><br />
</field-list><br />
</Body><br />
</source><br />
<br />
Note that, in contrast to the FullTextIndexType, the ForwardIndexType contains no field-to-datatype mapping but only a declaration of the datatypes supported by the index.<br />
<br />
=== Bootstrapper Configuration ===<br />
The IRBootstrapper portlet requires a Generic Resource to be available on the IS with name: '''IRBootstrapperConfiguration''' and secondary type: '''IRBootstrapperConfig'''.<br />
For more information please refer to this section [https://gcube.wiki.gcube-system.org/gcube/index.php/IR_Bootstrapper#Bootstrapper_Static_Configuration IRBoostrapperConfiguration Generic Resource]<br />
<br />
An example of the configuration is the following:<br />
<br />
[[File:Bootstrapper_Configuration.xml]]<br />
<br />
=== Metadata Broker XSLT ===<br />
<br />
*BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage<br />
The following XSLT transforms data elements with oai-dc schema to fulltext rowsets:<br />
<br />
[[File:BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage.xml]]</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21364Index Management Framework2014-04-25T16:07:06Z<p>Alex.antoniadi: /* Services */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used from any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
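For illustration, the query URL above can be assembled with the standard Java URL encoder; the host and resourceID below are placeholders, and the CQL string is shortened:<br />

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class IndexQueryUrl {
    // Builds a query URL of the shape shown above; host and resourceId
    // are placeholders, and the CQL string is URL-encoded.
    static String buildQueryUrl(String host, String resourceId, String cql)
            throws UnsupportedEncodingException {
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/" + resourceId
                + "/query?queryString=" + URLEncoder.encode(cql, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildQueryUrl("localhost:8080", "my-resource-id",
                "((title = tuna))"));
    }
}
```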
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) on each container that constitutes the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file.<br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet for each matching document, built from the presentable fields. This snippet is usually a concatenation of a number of fragments from those fields that match the query. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
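The effect of these two limits can be sketched as follows; this is illustrative only, not the service's actual highlighting code, and the method name is hypothetical:<br />

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SnippetSketch {
    // Illustrative only: joins up to maxFragmentCnt fragments, each truncated
    // to maxFragmentSize characters, mirroring the deploy.properties limits.
    static String buildSnippet(List<String> fragments, int maxFragmentCnt, int maxFragmentSize) {
        return fragments.stream()
                .limit(maxFragmentCnt)
                .map(f -> f.length() > maxFragmentSize ? f.substring(0, maxFragmentSize) : f)
                .collect(Collectors.joining(" ... "));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("first matching fragment",
                "second matching fragment", "third matching fragment");
        // with maxFragmentCnt=2, only the first two fragments are kept
        System.out.println(buildSnippet(hits, 2, 80));
    }
}
```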
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder in which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in ''deploy.properties''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node as well as the scope that the node is running on have to be set in the variables ''hostname'' and ''scope'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
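All of the settings above are plain key-value pairs, so they can be read with the standard ''java.util.Properties'' API. A minimal sketch, with the file contents inlined as a string for illustration; in the service the keys would come from the ''deploy.properties'' file on disk:<br />

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DeployConfig {
    // Loads deploy.properties-style text; in the service these keys would
    // be loaded from the deploy.properties file on disk.
    static Properties load(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return p;
    }

    public static void main(String[] args) throws IOException {
        Properties p = load("defaultSameCluster=true\nnoReplicas=1\nnoShards=2\n"
                + "dataDir=./data\nhostname=dl015.madgik.di.uoa.gr\nscope=/gcube/devNext\n");
        boolean sameCluster = Boolean.parseBoolean(p.getProperty("defaultSameCluster", "false"));
        int noShards = Integer.parseInt(p.getProperty("noShards", "1"));
        // per the text above, an unset dataDir defaults to the container's folder
        String dataDir = p.getProperty("dataDir", ".");
        System.out.println(sameCluster + " " + noShards + " " + dataDir);
    }
}
```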
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
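The triple transformations from the table above can be sketched as a simple mapping; this is illustrative only, not the service's actual CQL compiler:<br />

```java
public class CqlToLucene {
    // Illustrative mapping of the CQL relations from the table above to
    // their Lucene query-string equivalents; not the service's actual code.
    static String triple(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy match on a single term
                return index + ":" + term + "~";
            case "within": {  // term is "low high"
                String[] range = term.split(" ");
                return index + ":[" + range[0] + " TO " + range[1] + "]";
            }
            case "proximity": { // term is "distance word1 word2 ..."
                String[] parts = term.split(" ", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            }
            default:          // e.g. "=" / "=="
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(triple("title", "adj", "sun is up"));      // title:"sun is up"
        System.out.println(triple("date", "within", "2005 2008"));    // date:[2005 TO 2008]
        System.out.println(triple("title", "proximity", "5 sun up")); // title:"sun up"~5
    }
}
```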
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
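Because the ROWSET format is plain XML, it can be read with the standard DOM API. A minimal sketch that extracts the required "ObjectID" field of the first document; class and method names are illustrative:<br />

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RowsetCheck {
    // Returns the ObjectID of the first ROW, or null if none is present;
    // a minimal sketch of reading the ROWSET format described above.
    static String firstObjectId(String rowsetXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowsetXml.getBytes("UTF-8")));
        NodeList fields = doc.getElementsByTagName("FIELD");
        for (int i = 0; i < fields.getLength(); i++) {
            Element f = (Element) fields.item(i);
            if ("ObjectID".equals(f.getAttribute("name"))) {
                return f.getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String rowset = "<ROWSET idxType=\"IndexTypeName\" colID=\"colA\" lang=\"en\">"
                + "<ROW><FIELD name=\"ObjectID\">doc1</FIELD>"
                + "<FIELD name=\"title\">How to create an Index</FIELD></ROW></ROWSET>";
        System.out.println(firstObjectId(rowset));
    }
}
```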
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node we will need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application service we should provide the configuration needed by [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry].<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are: ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false)<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
Finally, the hostname of the node as well as the scope that the node is running on have to be set in the variables ''hostname'' and ''scope'' in ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a factory client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient factoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
factoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try {<br />
    indexClient.feedLocator(locator);<br />
    indexClient.query(query);<br />
} catch (IndexException e) {<br />
    // handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, returning always "nolang" when called). The lemmatizer module contains one real implementation (one plugin) (no suitable alternative was found to make a second plugin), and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, setfan.debald@fast.no), and saved in the appropriate configuration file (see install a lingustics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implements the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implements the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factory use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config - or bigram files since all the statistical data about the languages are contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call (see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may contain the language per field in the document. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
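The query forms above can also be composed programmatically. The following sketch simply builds the strings shown in the bullets; the helper names are made up for illustration:<br />

```java
// Builds the language aware query strings described above.
public class LangQuerySketch {
    // Whole-query restriction, e.g. "_querylang_en: car OR bus OR plane"
    static String queryLang(String lang, String query) {
        return "_querylang_" + lang + ": " + query;
    }

    // Per-term restriction on a field, e.g. "_lang_en_title:bus"
    static String termLang(String lang, String field, String term) {
        return "_lang_" + lang + "_" + field + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(queryLang("en", "car OR bus OR plane"));
        System.out.println("car OR " + termLang("en", "title", "bus") + " OR plane");
    }
}
```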
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but only expansion is used; the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are also used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact Stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
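The swap described above can be scripted; the following hedged sketch replaces the 32-character LICSTR value in the config.txt contents (file reading and writing are omitted, and the method operates on the contents as a String):<br />

```java
// Sketch: replace the 32-character license string on the LICSTR line.
public class LicenseUpdateSketch {
    static String replaceLicense(String configText, String newKey) {
        if (!newKey.matches("[A-Z0-9]{32}"))
            throw new IllegalArgumentException("license key must be 32 characters");
        // (?m) makes ^ and $ match per line, so only the LICSTR line is touched
        return configText.replaceAll("(?m)^LICSTR=\\w{32}$", "LICSTR=" + newKey);
    }

    public static void main(String[] args) {
        String config = "// The license key\nLICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG\n";
        System.out.println(replaceLicense(config, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"));
    }
}
```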
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resources with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the higher scores must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
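To make the modifier syntax concrete, here is an illustrative sketch (not the actual gCube parser) that splits a geosearch relation into its optional modifiers; the modifier names follow the example above:<br />

```java
import java.util.HashMap;
import java.util.Map;

public class GeosearchModifierSketch {
    // Parse e.g. geosearch/colID="colA"/lang="en"/inclusion="0" into a map.
    // Assumes modifier values contain no '/' characters.
    static Map<String, String> parseModifiers(String relation) {
        Map<String, String> mods = new HashMap<>();
        String[] parts = relation.split("/");
        for (int i = 1; i < parts.length; i++) {              // parts[0] is "geosearch"
            String[] kv = parts[i].split("=", 2);
            mods.put(kv[0], kv[1].replaceAll("^\"|\"$", "")); // strip the quotes
        }
        return mods;
    }

    public static void main(String[] args) {
        Map<String, String> m =
            parseModifiers("geosearch/colID=\"colA\"/lang=\"en\"/inclusion=\"0\"");
        System.out.println(m.get("colID") + " " + m.get("inclusion")); // prints "colA 0"
    }
}
```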
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that have zero results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
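The cut-off rule can be sketched in plain Java. The example below uses axis-aligned bounding boxes instead of the real polygon geometry (the service uses proper polygon intersection; rectangles keep the sketch short), and the class names are illustrative:<br />

```java
// Sketch of the "cut-off" rule: a conjunction of two "contains" criteria
// can be eliminated when the criteria name different collections, or when
// their query regions are disjoint.
public class CutOffSketch {
    static final class Box {
        final double x1, y1, x2, y2;
        Box(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean intersects(Box o) {
            return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
        }
    }

    static boolean canCutOff(String colA, Box p1, String colB, Box p2) {
        return !colA.equals(colB) || !p1.intersects(p2);
    }

    public static void main(String[] args) {
        Box p1 = new Box(0, 0, 10, 10);
        Box p2 = new Box(20, 20, 30, 30);                      // disjoint from p1
        System.out.println(canCutOff("colA", p1, "colA", p2)); // prints "true"
    }
}
```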
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set to equal x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set to equal y1 if not provided), in addition to any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
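The coordinate defaulting rule above (x2 and y2 fall back to x1 and y1 for point objects) can be sketched as follows; the helper is illustrative and emits only the ROW element, without FIELD children:<br />

```java
// Sketch of emitting a GeoROWSET ROW with the defaulting rule described
// above: x2 and y2 fall back to x1 and y1 when not provided.
public class GeoRowSketch {
    static String row(String id, double x1, Double x2, double y1, Double y2) {
        double ex2 = (x2 != null) ? x2 : x1;   // x2 defaults to x1
        double ey2 = (y2 != null) ? y2 : y1;   // y2 defaults to y1
        return String.format("<ROW id=\"%s\" x1=\"%s\" x2=\"%s\" y1=\"%s\" y2=\"%s\"/>",
                id, x1, ex2, y1, ey2);
    }

    public static void main(String[] args) {
        // A point object: only x1 and y1 are given
        System.out.println(row("doc1", 4321, null, 1234, null));
    }
}
```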
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of 100 (or so...)<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
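A small sketch of validating field values against the declared data types listed above; the enum mirrors the accepted "type" values, and the date check only enforces the "only yyyy is mandatory" rule, not full calendar validity:<br />

```java
// Sketch: check whether a string value fits the declared IndexType data type.
public class FieldTypeSketch {
    enum DataType { SHORT, INT, LONG, DATE, FLOAT, DOUBLE, STRING }

    static boolean isValid(DataType type, String value) {
        try {
            switch (type) {
                case SHORT:  Short.parseShort(value);   return true;
                case INT:    Integer.parseInt(value);   return true;
                case LONG:   Long.parseLong(value);     return true;
                case FLOAT:  Float.parseFloat(value);   return true;
                case DOUBLE: Double.parseDouble(value); return true;
                // yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory
                case DATE:   return value.matches("\\d{4}(-\\d{2}(-\\d{2}(T.*)?)?)?");
                case STRING: return value.length() <= 100;
            }
        } catch (NumberFormatException e) {
            return false;       // value does not fit the numeric type
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValid(DataType.SHORT, "70000")); // prints "false"
        System.out.println(isValid(DataType.DATE, "2001"));   // prints "true"
    }
}
```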
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to process a query. Firstly, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a refiner plugin. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to determine whether an object is within a query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any plugin specific arguments. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than entries which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the ResultSet page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimum Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0/(spanSize + 1); // floating-point division; integer 1/(spanSize + 1) would truncate to 0<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
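Since the GeoIndex represents dates as seconds from the Epoch, the span-based rank arithmetic can be exercised in isolation from the gCube classes. The following standalone sketch (the class name and methods are purely illustrative and not part of any gCube API) reproduces the same computation:<br />

```java
// Illustrative only: mirrors the arithmetic of SpanSizeRanker.rank(),
// with times given as seconds from the Epoch.
public class SpanRankMath {

    public static double rank(long startTime, long endTime) {
        long spanSize = endTime - startTime;
        // 1.0 forces floating-point division; a plain integer
        // 1/(spanSize + 1) would truncate to 0 for any span > 0.
        return 1.0 / (spanSize + 1);
    }

    public static void main(String[] args) {
        System.out.println(rank(0L, 9L));   // short span, high rank
        System.out.println(rank(0L, 99L));  // long span, low rank
    }
}
```

A shorter span thus always yields a strictly higher rank, which is exactly the "shorter is better" behaviour intended above.<br />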
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.diligentproject.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, providing it with any arguments supplied by the caller. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
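The keep-or-drop decision inside refine() is plain arithmetic and can likewise be checked outside the service. Here is a standalone sketch (class and method names are illustrative only), including the defensive swap of reversed start/end values:<br />

```java
// Illustrative only: mirrors the filtering predicate of SpanSizeRefiner.refine(),
// with times given as seconds from the Epoch.
public class SpanFilterMath {

    public static boolean keep(long startTime, long endTime, long maxSpanSize) {
        if (endTime < startTime) {   // tolerate entries with reversed fields
            long temp = endTime;
            endTime = startTime;
            startTime = temp;
        }
        return endTime - startTime <= maxSpanSize;
    }

    public static void main(String[] args) {
        System.out.println(keep(0L, 50L, 100L));   // within the limit
        System.out.println(keep(50L, 0L, 100L));   // reversed, but still within
        System.out.println(keep(0L, 500L, 100L));  // an everlasting object, dropped
    }
}
```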
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to several keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design pattern is similar to the Full Text Index Service design. The forward index supports the following schemata for key-value pairs:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It is a consolidation of the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|ForwardIndexNode Service]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example:<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can be configured with a number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server, and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once, after Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
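As an illustration of this transformation (the field names and the query below are hypothetical, not taken from an actual index), a CQL query such as<br />
<pre><br />
(year > "2000" or year < "1990") and title = "sun"<br />
</pre><br />
could be rewritten into a union of range queries, each a conjunction of single-key criteria:<br />
<pre><br />
(year within "2001 9999" and title = "sun")<br />
OR<br />
(year within "0 1989" and title = "sun")<br />
</pre><br />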
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
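A tuple of this shape can be assembled with plain string handling. The helper below is purely illustrative (neither the class nor its methods belong to any gCube library), but it shows how each indexed key is mirrored by a presentable <FIELD> under <VALUE>:<br />

```java
// Illustrative only: builds one ROWSET tuple of the shape shown above.
public class RowsetTupleBuilder {

    // The indexable part: one <KEY> element per key-value pair.
    public static String keyElement(String name, String value) {
        return "<KEY><KEYNAME>" + name + "</KEYNAME>"
             + "<KEYVALUE>" + value + "</KEYVALUE></KEY>";
    }

    // A full <TUPLE>: all <KEY> elements, then the presentable <VALUE> block.
    public static String tuple(String[][] keys) {
        StringBuilder tuple = new StringBuilder("<TUPLE>");
        StringBuilder value = new StringBuilder("<VALUE>");
        for (String[] kv : keys) {
            tuple.append(keyElement(kv[0], kv[1]));
            value.append("<FIELD name=\"").append(kv[0]).append("\">")
                 .append(kv[1]).append("</FIELD>");
        }
        return tuple.append(value).append("</VALUE></TUPLE>").toString();
    }

    public static void main(String[] args) {
        System.out.println("<ROWSET><INSERT>"
            + tuple(new String[][]{{"title", "sun is up"}})
            + "</INSERT></ROWSET>");
    }
}
```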
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The properties Host and Port must be edited to point to the VO of interest.<br />
<br />
The test client creates the Factory services (i.e. gets their EPRs) and uses the factory services<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that together constitute the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS resource of each type, inserts some data using the updater resource, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples:<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) - simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
processResultSet - method to insert / delete a series of tuples in a rowset wrapped in a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21363Index Management Framework2014-04-25T16:05:22Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used through any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
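As a minimal illustration (not part of any gCube client library), such a call could be assembled with plain Java as follows; the host, resource ID and CQL string are placeholder values:<br />

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: assemble the query URL of the Index Service REST API
// shown above. Host, resource ID and CQL string are placeholders.
public class IndexQueryUrl {

    public static String build(String host, String resourceId, String cql) {
        // The CQL expression goes into the queryString parameter, URL-encoded.
        String encoded = URLEncoder.encode(cql, StandardCharsets.UTF_8);
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/"
                + resourceId + "/query?queryString=" + encoded;
    }
}
```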
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; instead, the recommended setup is one resource (one node) on each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
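The rule above can be condensed into a one-liner; the class and method names below are illustrative, not the service's actual internals:<br />

```java
// Illustrative sketch of the cluster-naming rule described above:
// defaultSameCluster=true means the clusterID equals the indexID,
// defaultSameCluster=false means it equals the scope.
public class ClusterIdRule {

    public static String clusterId(boolean defaultSameCluster, String indexID, String scope) {
        return defaultSameCluster ? indexID : scope;
    }
}
```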
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns, for each hit, a snippet produced from the presentable fields that match the query. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
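To illustrate how these two limits interact, the following sketch approximates the snippet construction described above; the separator and all names are assumptions, not the service's actual highlighting code:<br />

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative approximation of snippet construction: keep at most
// maxFragmentCnt matching fragments, truncate each to maxFragmentSize
// characters, and join them. The " ... " separator is an assumption.
public class SnippetSketch {

    public static String snippet(List<String> fragments, int maxFragmentCnt, int maxFragmentSize) {
        return fragments.stream()
                .limit(maxFragmentCnt)
                .map(f -> f.length() > maxFragmentSize ? f.substring(0, maxFragmentSize) : f)
                .collect(Collectors.joining(" ... "));
    }
}
```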
<br />
<br />
<br />
The folder where the data of the index is stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running on, have to be set in the variables ''hostname'' and ''scope'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT equivalently.<br />
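A minimal sketch of the triple transformation in the table above (it covers only the listed relations; the class and method names are illustrative, not the Index Service's internals):<br />

```java
// Illustrative sketch of the CQL-triple-to-Lucene transformations shown in
// the table above. Only the listed relations are covered.
public class CqlToLucene {

    public static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":   return index + ":\"" + term + "\"";
            case "fuzzy": return index + ":" + term + "~";
            case "proximity": {
                // term is "<distance> <words...>", e.g. "5 sun up"
                int sp = term.indexOf(' ');
                String distance = term.substring(0, sp);
                String words = term.substring(sp + 1);
                return index + ":\"" + words + "\"~" + distance;
            }
            case "within": {
                // term is "<low> <high>", e.g. "2005 2008"
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default: return index + ":" + term; // "=" and "==" on a single term
        }
    }
}
```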
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required; they specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that for each document the "ObjectID" field is required; it specifies the document's unique identifier.<br />
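For illustration, a ROWSET like the one above could be assembled programmatically as follows (the builder is a hypothetical helper, not part of the gCube libraries; real content must always include the "ObjectID" field per document):<br />

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper that builds a minimal ROWSET document like the one
// shown above. Field names and values are placeholders.
public class RowSetBuilder {

    public static String rowset(String idxType, String colID, String lang,
                                List<Map<String, String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colID)
          .append("\" lang=\"").append(lang).append("\">");
        for (Map<String, String> row : rows) {
            sb.append("<ROW>");
            for (Map.Entry<String, String> f : row.entrySet()) {
                sb.append("<FIELD name=\"").append(f.getKey()).append("\">")
                  .append(f.getValue()).append("</FIELD>");
            }
            sb.append("</ROW>");
        }
        return sb.append("</ROWSET>").toString();
    }
}
```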
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if none is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
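The interaction of these flags can be summarised in a small model (a hypothetical class, not part of the IndexType schema; here highlightable is modelled as an explicit flag, whereas in the schema it defaults to yes for returned fields):<br />

```java
// Hypothetical model of the per-field IndexType flags described above,
// encoding the rule that a field must be stored in order to be returned,
// and that only returned fields can join the highlight snippet.
public class FieldFlags {

    public final boolean index, store, returned, highlightable;

    public FieldFlags(boolean index, boolean store, boolean returned, boolean highlightable) {
        this.index = index;
        this.store = store;
        this.returned = returned;
        this.highlightable = highlightable;
    }

    /** A field appears in query results only if it is both stored and marked return=yes. */
    public boolean appearsInResults() {
        return store && returned;
    }

    /** A field contributes to the highlight snippet only if it is returned and highlightable. */
    public boolean appearsInSnippet() {
        return appearsInResults() && highlightable;
    }
}
```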
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run Index Service on a node we will need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application server, we should provide the configuration needed by [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry].<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false).<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running on, have to be set in the variables ''hostname'' and ''scope'' in the ''deploy.properties''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and the IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try {<br />
    indexClient.feedLocator(locator);<br />
    indexClient.query(query);<br />
} catch (IndexException e) {<br />
    // handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if none is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "return" property can be changed at any time in order to control which fields are returned. Keep in mind that only fields whose "store" element is set to "yes" can have their "return" property altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
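As a rough illustration of this translation (the field names here are hypothetical, and the exact mapping depends on the index configuration), a CQL boolean query maps onto Lucene query syntax along these lines:<br />
<pre><br />
CQL:    title = "dolphin" and contents = "ocean"<br />
Lucene: +title:dolphin +contents:ocean<br />
</pre><br />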
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, always returning "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see the Linguistics Licenses section).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common, in the packages:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified name of the plugin class to be loaded and instantiated.<br />
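The reflective loading described above can be sketched as follows. This is a self-contained illustration of the factory mechanism only; the class names are illustrative, not the actual gCube classes.<br />

```java
// Sketch of the reflective plugin loading used by the factories.
// Only the Class.forName(...).newInstance() mechanism mirrors the text above;
// the names here are made up for illustration.
public class PluginFactoryDemo {
    public interface LemmatizerPlugin {
        String lemmatize(String term);
    }

    // Dummy plugin: always returns the empty string, as described above.
    public static class DummyLemmatizerPlugin implements LemmatizerPlugin {
        public String lemmatize(String term) { return ""; }
    }

    // The factory receives the fully qualified class name and instantiates it.
    public static LemmatizerPlugin load(String pluginName) {
        try {
            return (LemmatizerPlugin) Class.forName(pluginName).newInstance();
        } catch (Exception e) {
            throw new RuntimeException("Could not load plugin: " + pluginName, e);
        }
    }

    public static void main(String[] args) {
        LemmatizerPlugin plugin = load("PluginFactoryDemo$DummyLemmatizerPlugin");
        System.out.println("lemma: '" + plugin.lemmatize("cars") + "'");
    }
}
```

Swapping the class name string is all that is needed to switch between the real and the dummy implementation.<br />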
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The following plugin implementations can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization language tool written in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
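The character n-gram profiling idea behind the algorithm can be sketched as follows. This is a minimal trigram counter, not JTextCat's actual API; identification then ranks n-grams by frequency and compares rank distances between the document profile and stored per-language profiles.<br />

```java
import java.util.*;

// Minimal sketch of the character-trigram profiling step of
// N-Gram-Based Text Categorization; not JTextCat's API.
public class NGramProfile {
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Pad with '_' so word boundaries contribute n-grams too.
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        for (int i = 0; i + 3 <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("the cat"));
    }
}
```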
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory with the RHE3 and one with the RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call (see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The documents being fed may specify the language per field. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
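The per-field language decision described above can be sketched as follows (a hedged illustration; the method and interface names are made up, not the actual FullTextIndexUpdater internals):<br />

```java
// Sketch of the per-field language decision: a language declared in the
// document wins; otherwise a loaded langid plugin is consulted; otherwise
// the field is indexed as "unknown". Names are illustrative only.
public class FieldLanguage {
    interface LanguageIdPlugin { String identify(String text); }

    static String resolveLang(String declaredLang, String fieldText, LanguageIdPlugin plugin) {
        if (declaredLang != null && !declaredLang.isEmpty()) return declaredLang; // document wins
        if (plugin != null) return plugin.identify(fieldText); // fall back to langid plugin
        return "unknown";
    }

    public static void main(String[] args) {
        LanguageIdPlugin dummy = text -> "nolang"; // behaves like the dummy plugin
        System.out.println(resolveLang("en", "some text", dummy)); // en
        System.out.println(resolveLang(null, "some text", dummy)); // nolang
        System.out.println(resolveLang(null, "some text", null));  // unknown
    }
}
```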
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory with the RHE3 and one with the RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
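A defensive check can avoid that crash by verifying the .aut path before the lemmatizer is initialized. This is a hedged sketch (the helper class is made up; the relative path below is the example from the config fragment above):<br />

```java
import java.io.File;

// Sketch: verify that a configured .aut file exists under GLOBUS_LOCATION
// before instantiating the lemmatizer, instead of letting the JVM core dump.
public class LemmatizerConfigCheck {
    public static boolean autFileExists(String globusLocation, String relativePath) {
        return new File(globusLocation, relativePath).isFile();
    }

    public static void main(String[] args) {
        String globus = System.getenv("GLOBUS_LOCATION");
        String aut = "etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut";
        if (globus == null || !autFileExists(globus, aut)) {
            System.err.println("Refusing to initialize lemmatizer: missing " + aut);
        }
    }
}
```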
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is therefore important to know the language of the query in order to know which words to expand the query with. Currently the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
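Query-side expansion can be sketched like this; the expansion map below is made up for illustration, whereas the real forms come from the Fast .aut dictionaries:<br />

```java
import java.util.*;

// Sketch of query-side lemma expansion: a query term is rewritten as a
// disjunction of its known forms. The word list is illustrative only.
public class LemmaExpansion {
    static String expand(String term, Map<String, List<String>> lemmas) {
        List<String> forms = lemmas.getOrDefault(term, Collections.emptyList());
        if (forms.isEmpty()) return term; // unknown term: leave unchanged
        StringBuilder sb = new StringBuilder("(").append(term);
        for (String f : forms) sb.append(" OR ").append(f);
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> lemmas = new HashMap<>();
        lemmas.put("car", Arrays.asList("cars"));
        System.out.println(expand("car", lemmas)); // (car OR cars)
    }
}
```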
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer modules is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resources with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
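The three inclusion codes can be captured in a small enum (illustrative only, not an actual gCube type):<br />

```java
// Illustrative mapping of the geosearch "inclusion" modifier values;
// the comments mirror the definitions in the text above.
public enum Inclusion {
    INTERSECTS(0), // documents that intersect the specified polygon
    CONTAINS(1),   // documents that are contained in the specified polygon
    INSIDE(2);     // documents that are inside the specified polygon

    public final int code;
    Inclusion(int code) { this.code = code; }

    public static Inclusion fromCode(int code) {
        for (Inclusion i : values()) if (i.code == code) return i;
        throw new IllegalArgumentException("Unknown inclusion code: " + code);
    }

    public static void main(String[] args) {
        System.out.println(fromCode(1)); // CONTAINS
    }
}
```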
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries, is to apply boolean algebra rules, and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
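The bounding-box version of this cut-off check can be sketched as follows (a simplification: the real check operates on polygons, and the class name is made up):<br />

```java
// Sketch of the cut-off rule: if the two query regions (simplified here to
// axis-aligned bounding boxes) do not intersect, the conjunctive subquery
// can be eliminated without ever touching an R-tree.
public class CutOffRule {
    static boolean intersects(double ax1, double ay1, double ax2, double ay2,
                              double bx1, double by1, double bx2, double by2) {
        return ax1 <= bx2 && bx1 <= ax2 && ay1 <= by2 && by1 <= ay2;
    }

    public static void main(String[] args) {
        // P1 = (1,1)-(10,10), P2 = (20,20)-(30,30): disjoint, so prune the subquery.
        System.out.println(intersects(1, 1, 10, 10, 20, 20, 30, 30)); // false
    }
}
```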
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set to equal x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set to equal y1 if not provided). In addition, it may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to full text indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list with all the fields that can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
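As a concrete illustration of the DATE format above (where only yyyy is mandatory), and of the fact, noted later in this page, that the index represents dates internally as a number of seconds since the Epoch, a partial date value could be normalized as follows. This is a hedged sketch: the padding defaults for missing components are an assumption for illustration, not the service's actual parsing code.<br />

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical normalization of a DATE field value in the
// yyyy-MM-dd'T'HH:mm:ss.s format, where only yyyy is mandatory.
public class DateFieldSketch {

    public static long toEpochSeconds(String value) {
        // Pad the value out to a full timestamp, defaulting missing parts
        // (assumed defaults: January, day 1, midnight).
        String full = value;
        if (full.length() == 4)  full += "-01";          // yyyy        -> yyyy-01
        if (full.length() == 7)  full += "-01";          // yyyy-MM     -> yyyy-MM-01
        if (full.length() == 10) full += "T00:00:00.0";  // yyyy-MM-dd  -> midnight
        return LocalDateTime.parse(full).toEpochSecond(ZoneOffset.UTC);
    }
}
```

For example, the "2001-06-27" value from the GeoROWSET above would be treated as midnight (UTC, by assumption) of that day.<br />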
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to answer a query. First, a very efficient filtering step retrieves all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate the false hits. While the filtering step is handled internally by the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object is within a query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the RankEvaluator plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than entries which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty since the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the ResultSet page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds since the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0/(spanSize + 1); // use a double literal to avoid integer division<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Refiner|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a period longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However, in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in seconds (matching the index's internal representation of dates as seconds since the Epoch).<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design pattern is similar to the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through a single service, following the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and update operations on the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
Clusters are distinguished by a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can configure the number of replicas for each index. This is done by setting the variables '''noReplicas''' in the '''deploy-jndi-config.xml''' file<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deploying the service, it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they know how to connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once, after the Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
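One example of such a "cut-off" rule can be sketched as follows: when a conjunction contains two range criteria on the same key, their ranges can be intersected up front, and if the intersection is empty the whole range query is dropped without ever touching the index. The Range type and the rule shown here are illustrative assumptions, not the actual Forward Index implementation.<br />

```java
// Hypothetical illustration of a cut-off rule for conjunctions of range
// criteria on the same key.
public class CutOffSketch {

    public static class Range {
        public final double low, high;
        public Range(double low, double high) { this.low = low; this.high = high; }
    }

    // Returns the intersection of two ranges, or null when the
    // conjunction can never match (the query part can be cut off).
    public static Range intersect(Range a, Range b) {
        double lo = Math.max(a.low, b.low);
        double hi = Math.min(a.high, b.high);
        return lo <= hi ? new Range(lo, hi) : null;
    }
}
```

For instance, "key &gt;= 0 AND key &lt;= 1" combined with "key &gt;= 2 AND key &lt;= 3" yields an empty intersection, so that branch of the union produces no range query at all.<br />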
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
//StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design is similar to the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of the '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' resource is created with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index, and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances get the index from Content Management, and establish a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21342Index Management Framework2014-04-11T11:15:58Z<p>Alex.antoniadi: /* Index Service */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service exposes a REST API, so it can be used by any general-purpose library that supports REST.<br />
For example, the following HTTP GET call is used in order to query the index:<br />
<br />
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302<br />
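Since the endpoint is plain HTTP, a caller only has to assemble the URL and URL-encode the CQL expression (spaces and parentheses are not URL-safe). The sketch below shows this with standard Java only; the class and method names are our own, and the webapp path is taken verbatim from the example above:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for building the query URL shown above.
public class IndexQueryUrl {

    public static String build(String host, String resourceId, String cql) {
        try {
            // The CQL expression must be URL-encoded before it is appended.
            return "http://" + host + "/index-service-1.0.0-SNAPSHOT/"
                    + resourceId + "/query?queryString="
                    + URLEncoder.encode(cql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("localhost:8080", "my-resource-id",
                "(e24f6285-46a2-4395-a402-99330b326fad = tuna)"));
    }
}
```

The resulting URL can then be fetched with any HTTP client.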
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for the management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the recommended setup is one resource (one node) in each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager was created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
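The cluster-naming rule described above can be sketched as follows; ''ClusterIdRule'' and its method are invented names for illustration, not part of the service API:

```java
import java.util.Properties;

// Hypothetical sketch of the rule: clusterID is the indexID when
// defaultSameCluster=true, otherwise it is the scope.
public class ClusterIdRule {

    public static String clusterId(Properties deployProps,
                                   String indexId, String scope) {
        boolean sameCluster = Boolean.parseBoolean(
                deployProps.getProperty("defaultSameCluster", "true"));
        return sameCluster ? indexId : scope;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("defaultSameCluster", "false");
        // With scope-wide clustering, all nodes in the scope join one cluster.
        System.out.println(clusterId(p, "idx-42", "/gcube/devNext")); // /gcube/devNext
    }
}
```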
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet for each match of the query, built from the presentable fields. This snippet is usually a concatenation of a number of fragments of those fields that match the query. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index is stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running in, have to be set in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
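The table above can be read as a small triple-by-triple translation function. The sketch below is an illustration only: the real service parses complete CQL queries, while the class and method names here are invented, and only the relations listed in the table are covered:

```java
// Hypothetical sketch of the CQL-triple to Lucene-syntax mapping above.
public class CqlToLucene {

    public static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy term query
                return index + ":" + term + "~";
            case "within": {  // term is "low high" -> range query
                String[] bounds = term.split("\\s+");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "proximity": { // term is "distance word1 word2 ..." -> slop query
                String[] parts = term.split("\\s+", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            }
            default:          // "=" and "==" map to a plain term query
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(translate("title", "adj", "sun is up"));   // title:"sun is up"
        System.out.println(translate("date", "within", "2005 2008")); // date:[2005 TO 2008]
    }
}
```

Connecting translated triples with AND, OR and NOT then yields the complete Lucene query.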
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required, and specify the Index Type that the Index must have been created with and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field that specifies its unique identifier.<br />
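A feeder producing this ROWSET form could be sketched as below; ''RowSetBuilder'' is a hypothetical helper and not part of the gCube API, and a real implementation would also escape XML special characters in the field values:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: serialise documents (field name -> text) into the
// ROWSET form shown above.
public class RowSetBuilder {

    public static String rowset(String idxType, String colId, String lang,
                                Iterable<Map<String, String>> docs) {
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colId)
          .append("\" lang=\"").append(lang).append("\">");
        for (Map<String, String> doc : docs) {           // one <ROW> per document
            sb.append("<ROW>");
            for (Map.Entry<String, String> f : doc.entrySet()) {
                sb.append("<FIELD name=\"").append(f.getKey()).append("\">")
                  .append(f.getValue()).append("</FIELD>");
            }
            sb.append("</ROW>");
        }
        return sb.append("</ROWSET>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("ObjectID", "doc1");                     // required unique identifier
        doc.put("title", "How to create an Index");
        System.out.println(rowset("IndexTypeName", "colA", "en", Arrays.asList(doc)));
    }
}
```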
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
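Two of the rules above (fields absent from the IndexType are skipped, and a field must be stored to be returned) can be illustrated with a toy lookup; ''FieldPolicy'', its flag table and its labels are all invented here for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of applying the per-field IndexType flags above.
public class FieldPolicy {

    // Flags per field: {index, store, return}, as in the IndexType example.
    static final Map<String, boolean[]> TYPE = new HashMap<>();
    static {
        TYPE.put("title",      new boolean[]{true, true, true});
        TYPE.put("references", new boolean[]{true, false, false});
    }

    /** Returns a label describing how a ROWSET field would be handled. */
    public static String handle(String field) {
        boolean[] f = TYPE.get(field);
        if (f == null) return "skipped";       // not declared in the IndexType
        boolean returned = f[1] && f[2];       // must be stored to be returned
        return f[0] ? (returned ? "indexed+returned" : "indexed") : "not-indexed";
    }
}
```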
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run Index Service on a node we will need the following:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application service we should provide the configuration needed by [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry].<br />
This configuration should be placed in the folder ''$CATALINA/conf/infrastructure.properties'' and the variables<br />
that need to be set are: ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false)<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and the IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with the equivalent AND, OR and NOT operators.<br />
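The transformations in the table above can be sketched as plain string rewriting. This is an illustration only, not the actual Full Text Index code; the method names and the simplified triple representation are assumptions.<br />

```java
public class CqlToLuceneSketch {

    // Translate a single CQL Index-Relation-Term triple into a Lucene query string.
    static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase: title adj "sun is up" -> title:"sun is up"
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy: title fuzzy "invorvement" -> title:invorvement~
                return index + ":" + term + "~";
            case "within":    // range: date within "2005 2008" -> date:[2005 TO 2008]
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            case "proximity": // proximity: title proximity "5 sun up" -> title:"sun up"~5
                String[] parts = term.split(" ", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            default:          // plain term match: title = "italy" -> title:italy
                return index + ":" + term;
        }
    }

    // "allIndexes" expands to an OR over all fields of the document.
    static String translateAllIndexes(String term, String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(" OR ");
            sb.append(fields[i]).append(":").append(term);
        }
        return sb.toString();
    }
}
```

A complete query is then obtained by joining translated triples with AND, OR and NOT.<br />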
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
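A client can assemble such a ROWSET with plain string building, as sketched below; the helper names are illustrative, and a real client might use proper XML tooling (and escaping) instead.<br />

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowSetBuilder {

    // Build one ROW element from an ordered field-name -> text map.
    static String buildRow(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<ROW>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<FIELD name=\"").append(e.getKey()).append("\">")
              .append(e.getValue()).append("</FIELD>");
        }
        return sb.append("</ROW>").toString();
    }

    // Wrap rows in a ROWSET with the required idxType and colID attributes
    // and the optional lang attribute.
    static String buildRowSet(String idxType, String colID, String lang, String... rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colID)
          .append("\" lang=\"").append(lang).append("\">");
        for (String row : rows) sb.append(row);
        return sb.append("</ROWSET>").toString();
    }
}
```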
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if none is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
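The constraint stated above, that a field must be stored in order to be returned, can be expressed as a small validity check. This is an illustrative sketch, not part of the index service:<br />

```java
public class FieldOptionsCheck {

    // Validate one field's index/store/return options from an IndexType.
    static boolean valid(boolean index, boolean store, boolean returned) {
        // "return" only makes sense for stored fields.
        if (returned && !store) return false;
        // A field that is neither indexed nor stored contributes nothing.
        return index || store;
    }
}
```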
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
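The sub-field behaviour just described amounts to a recursive expansion of a queried field into itself plus all its descendants. A sketch under that assumption (the tree representation is illustrative, not the service's internal model):<br />

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubFieldExpansion {

    // parent field -> its direct sub-fields, mirroring the nested IndexType above
    static Map<String, List<String>> children = new LinkedHashMap<>();

    // A query on a field hits the field itself and, recursively, all sub-fields.
    static List<String> expand(String field) {
        List<String> result = new ArrayList<>();
        result.add(field);
        for (String child : children.getOrDefault(field, List.of())) {
            result.addAll(expand(child));
        }
        return result;
    }
}
```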
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and for the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see below). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see how to install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
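The factory pattern described above can be sketched as plain Java reflection. The helper below is illustrative (it is not the gCube LanguageIdFactory/LemmatizerFactory code), and it uses the non-deprecated getDeclaredConstructor().newInstance() form of the call the text names:<br />

```java
public class PluginLoader {

    // Load and instantiate a plugin class by its fully qualified class name,
    // as the LanguageIdFactory and LemmatizerFactory do.
    static Object load(String pluginClassName) throws Exception {
        return Class.forName(pluginClassName)
                    .getDeclaredConstructor()
                    .newInstance();
    }
}
```

The caller then casts the returned instance to the expected plugin base class (LanguageIdPlugin or LemmatizerPlugin).<br />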
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one directory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may specify the language per field in the document. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
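The feeding-time decision described above (use the field's own language if given, otherwise fall back to the language identifier plugin) can be sketched as follows; this is an illustration, not the FullTextIndexUpdater code:<br />

```java
public class LanguageChoice {

    // Decide which language to index a field under.
    static String languageFor(String fieldLang, String identifiedLang) {
        if (fieldLang != null && !fieldLang.isEmpty()) {
            return fieldLang;      // the document already specifies the language
        }
        if (identifiedLang != null) {
            return identifiedLang; // result of the language id plugin
        }
        return "nolang";           // the dummy plugin's answer
    }
}
```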
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
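Constructing the language-aware query forms listed above can be sketched with a small helper; this is illustrative only and not part of the index API:<br />

```java
public class LangQuery {

    // Limit a whole query to one language: "_querylang_en: car OR bus OR plane"
    static String queryLang(String lang, String query) {
        return "_querylang_" + lang + ": " + query;
    }

    // Limit a single term, optionally on a field, to one language:
    // "_lang_en:bus" or "_lang_en_title:bus"
    static String termLang(String lang, String field, String term) {
        if (field == null) return "_lang_" + lang + ":" + term;
        return "_lang_" + lang + "_" + field + ":" + term;
    }
}
```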
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one directory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java VM will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
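Query expansion of this kind replaces a term with an OR over its known forms. The sketch below illustrates the idea; the form lists are made-up examples, not output of the Fast lemmatizer:<br />

```java
import java.util.List;
import java.util.Map;

public class LemmaExpansion {

    // Expand one query term into an OR over its known word forms.
    static String expand(String term, Map<String, List<String>> forms) {
        List<String> variants = forms.get(term);
        if (variants == null || variants.isEmpty()) {
            return term; // no known forms: leave the term unchanged
        }
        StringBuilder sb = new StringBuilder("(").append(term);
        for (String v : variants) sb.append(" OR ").append(v);
        return sb.append(")").toString();
    }
}
```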
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required, please contact Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
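Transparency to queries means a lookup can fan a query out over every partition and concatenate the hits. A toy sketch of that idea, with documents as plain strings and "query" as substring matching (both assumptions for illustration):<br />

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionedLookup {

    // Each partition holds a slice of the indexed documents; the caller
    // sees one combined hit list regardless of how many partitions exist.
    static List<String> query(List<List<String>> partitions, String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> partition : partitions) {
            for (String doc : partition) {
                if (doc.contains(term)) hits.add(doc);
            }
        }
        return hits;
    }
}
```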
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resources with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and for exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output for this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the higher score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries, is to apply boolean algebra rules, and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
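The cut-off test described above can be sketched in plain Java. This is an illustrative approximation, not gCube code: the query regions are reduced to their minimal bounding rectangles (MBRs), and a conjunction of two "contains" criteria is eliminated when the collections differ or the rectangles are disjoint. All class and method names below are hypothetical.<br />

```java
// Hypothetical sketch of the "cut-off" rule: two geosearch criteria over
// the same collection can only produce results if their query regions
// overlap. Regions are approximated here by axis-aligned MBRs.
public class CutOffRule {

    // Axis-aligned MBR: (x1, y1) lower-left corner, (x2, y2) upper-right corner.
    public static final class Mbr {
        public final double x1, y1, x2, y2;
        public Mbr(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    // True if the two rectangles share at least one point.
    public static boolean intersects(Mbr a, Mbr b) {
        return a.x1 <= b.x2 && b.x1 <= a.x2
            && a.y1 <= b.y2 && b.y1 <= a.y2;
    }

    // A conjunction of two "contains" criteria can be eliminated when the
    // collections differ or the query regions do not intersect.
    public static boolean canEliminate(String colA, Mbr p1, String colB, Mbr p2) {
        return !colA.equals(colB) || !intersects(p1, p2);
    }

    public static void main(String[] args) {
        Mbr p1 = new Mbr(0, 0, 10, 10);
        Mbr p2 = new Mbr(20, 20, 30, 30);
        System.out.println(canEliminate("colA", p1, "colA", p2)); // disjoint regions: prints true
    }
}
```

A full implementation would test the exact polygons rather than their MBRs; disjoint MBRs are merely a sufficient condition for disjoint polygons.<br />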
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
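The union step can be illustrated with a small k-way merge. The sketch below uses hypothetical names and omits flow control; it merges several per-R-tree result lists, each already sorted by descending score, into one globally sorted stream, which is essentially what the MergeSorter does before pausing and synchronization are layered on top.<br />

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge of per-R-tree result streams by score.
public class MergeSorterSketch {

    public static final class Hit {
        public final String docId;
        public final double score;
        public Hit(String docId, double score) { this.docId = docId; this.score = score; }
    }

    // Merge several per-R-tree result lists (each sorted by descending score)
    // into one list sorted by descending score.
    public static List<Hit> merge(List<List<Hit>> perTreeResults) {
        // Heap entries: {treeIndex, offsetWithinThatTree}; smallest negated
        // score first, i.e. highest score on top.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingDouble((int[] e) -> -perTreeResults.get(e[0]).get(e[1]).score));
        for (int t = 0; t < perTreeResults.size(); t++) {
            if (!perTreeResults.get(t).isEmpty()) heap.add(new int[]{t, 0});
        }
        List<Hit> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<Hit> src = perTreeResults.get(top[0]);
            out.add(src.get(top[1]));
            // Advance the stream we just consumed from.
            if (top[1] + 1 < src.size()) heap.add(new int[]{top[0], top[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<Hit> treeA = Arrays.asList(new Hit("a1", 0.9), new Hit("a2", 0.4));
        List<Hit> treeB = Arrays.asList(new Hit("b1", 0.7));
        for (Hit h : merge(Arrays.asList(treeA, treeB))) System.out.println(h.docId + " " + h.score);
    }
}
```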
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory; x2 is set equal to x1 if not provided) as well as start and end Y coordinates (y1 is mandatory; y2 is set equal to y1 if not provided). In addition, a ROW may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
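The coordinate-defaulting rule above can be sketched with the JDK's DOM parser. The GeoRow class and parseRow method below are illustrative, not part of the gCube feeding code; they only show how x2 and y2 fall back to x1 and y1 when absent, turning a point into a degenerate rectangle.<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Illustrative sketch of reading one GeoROWSET ROW element.
public class GeoRowParser {

    public static final class GeoRow {
        public final String id;
        public final double x1, x2, y1, y2;
        GeoRow(String id, double x1, double x2, double y1, double y2) {
            this.id = id; this.x1 = x1; this.x2 = x2; this.y1 = y1; this.y2 = y2;
        }
    }

    public static GeoRow parseRow(String rowXml) {
        try {
            Element row = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            double x1 = Double.parseDouble(row.getAttribute("x1")); // mandatory
            double y1 = Double.parseDouble(row.getAttribute("y1")); // mandatory
            // Optional end coordinates fall back to the mandatory start ones.
            double x2 = row.hasAttribute("x2") ? Double.parseDouble(row.getAttribute("x2")) : x1;
            double y2 = row.hasAttribute("y2") ? Double.parseDouble(row.getAttribute("y2")) : y1;
            return new GeoRow(row.getAttribute("id"), x1, x2, y1, y2);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        GeoRow r = parseRow("<ROW id=\"doc1\" x1=\"4321\" y1=\"1234\"/>");
        System.out.println(r.id + " x2=" + r.x2 + " y2=" + r.y2);
    }
}
```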
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list enumerating all the fields that can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of about 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. First, a very efficient filtering step retrieves all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object is within a query or not.<br />
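As an illustration of the two-step idea (again with hypothetical names, and with objects simplified to points), the sketch below first filters candidates against the MBR of the query polygon and then refines the survivors with an exact ray-casting point-in-polygon test:<br />

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative two-step query processing: cheap MBR filter, exact refine.
public class TwoStepQuery {

    // polygon: vertices as {x, y} pairs, in order.
    public static boolean filterByMbr(double[][] polygon, double px, double py) {
        double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
        double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (double[] v : polygon) {
            minX = Math.min(minX, v[0]); maxX = Math.max(maxX, v[0]);
            minY = Math.min(minY, v[1]); maxY = Math.max(maxY, v[1]);
        }
        return px >= minX && px <= maxX && py >= minY && py <= maxY;
    }

    // Exact test: ray casting, toggling on each edge crossing.
    public static boolean refine(double[][] polygon, double px, double py) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double xi = polygon[i][0], yi = polygon[i][1];
            double xj = polygon[j][0], yj = polygon[j][1];
            if ((yi > py) != (yj > py)
                    && px < (xj - xi) * (py - yi) / (yj - yi) + xi) {
                inside = !inside;
            }
        }
        return inside;
    }

    public static List<double[]> query(double[][] polygon, List<double[]> points) {
        List<double[]> hits = new ArrayList<>();
        for (double[] p : points) {
            if (filterByMbr(polygon, p[0], p[1])      // step 1: cheap filter
                    && refine(polygon, p[0], p[1])) { // step 2: exact refine
                hits.add(p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        double[][] tri = {{0, 0}, {10, 0}, {0, 10}};
        System.out.println(filterByMbr(tri, 9, 9)); // inside the MBR: true
        System.out.println(refine(tri, 9, 9));      // outside the triangle: false
    }
}
```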
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0/(spanSize + 1); //a double literal avoids integer division, which would truncate the rank to 0<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time bigger than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID);<br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference();<br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, for retrieving documents with key values within the corresponding range. The Forward Index Service design is similar to the Full Text Index Service design. The forward index supports the following schema for each key-value pair:<br />
key: integer, value: string<br />
key: float, value: string<br />
key: string, value: string<br />
key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to be able to build the indices with correct type for each field. The Objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup, and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase Node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters, and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false, respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml in order for the service to communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server, and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path, etc.) needs to be done separately, once, after Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, where a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibility to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, for retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design is similar to the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, where a single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner to the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibility to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21341Index Management Framework2014-04-11T11:02:18Z<p>Alex.antoniadi: /* Deployment Instructions */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full-text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
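For instance, pulling the client library into a Maven project takes a dependency declaration like the following (the version element is left open on purpose; use the release published in the gCube Maven repositories):

```xml
<dependency>
  <groupId>org.gcube.index</groupId>
  <artifactId>index-service-client-library</artifactId>
  <version>...</version>
</dependency>
```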
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup, and updating of the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) at each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters, and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false, respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the matching query that is performed on the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields that match the query. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file, respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy.properties'' file.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties'' file.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running in, have to be set in the variables ''hostname'' and ''scope'' in the ''deploy.properties'' file.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, and NOT equivalently.<br />
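As an illustration only (this is a hedged sketch, not the service's actual implementation; the class and method names are invented for this example), the table's mapping can be expressed as a small helper that builds the Lucene query string for a single CQL triple:

```java
// Hypothetical helper mirroring the CQL-to-Lucene table above.
public class CqlToLucene {

    // Builds the Lucene query string for one CQL Index-Relation-Term triple.
    public static String toLucene(String index, String relation, String term) {
        switch (relation) {
            case "adj":
                // phrase query: title adj "sun is up" -> title:"sun is up"
                return index + ":\"" + term + "\"";
            case "fuzzy":
                // fuzzy match: title fuzzy "invorvement" -> title:invorvement~
                return index + ":" + term + "~";
            case "proximity": {
                // term is "<slop> word1 word2 ...", e.g. "5 sun up"
                int space = term.indexOf(' ');
                String slop = term.substring(0, space);
                String words = term.substring(space + 1);
                return index + ":\"" + words + "\"~" + slop;
            }
            case "within": {
                // term is "<low> <high>", e.g. "2005 2008" -> date:[2005 TO 2008]
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:
                // "=" and "==" map to a plain term query
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(toLucene("title", "adj", "sun is up"));      // title:"sun is up"
        System.out.println(toLucene("date", "within", "2005 2008"));    // date:[2005 TO 2008]
        System.out.println(toLucene("title", "proximity", "5 sun up")); // title:"sun up"~5
    }
}
```

A full query transformer would apply this per triple and then join the results with the boolean connectives described above.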
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that the "ObjectID" field is required for each document and specifies its unique identifier.<br />
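A client producing or inspecting such documents can use the JDK's DOM parser directly. The following is a minimal sketch (the class and method names are illustrative, not part of the Index Service API) that collects the field names of a ROWSET document:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class RowSetReader {

    // Parses a ROWSET document and returns the name attribute of every FIELD element.
    public static List<String> fieldNames(String rowsetXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowsetXml.getBytes("UTF-8")));
        List<String> names = new ArrayList<>();
        NodeList fields = doc.getElementsByTagName("FIELD");
        for (int i = 0; i < fields.getLength(); i++) {
            names.add(((Element) fields.item(i)).getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ROWSET idxType=\"index-type-default-1.0\" colID=\"colA\" lang=\"en\">"
                   + "<ROW><FIELD name=\"ObjectID\">doc1</FIELD>"
                   + "<FIELD name=\"title\">How to create an Index</FIELD></ROW></ROWSET>";
        System.out.println(fieldNames(xml)); // [ObjectID, title]
    }
}
```

This also makes the "ObjectID" requirement easy to check programmatically before feeding a ResultSet to the index.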
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType: an XML document conforming to the IndexType schema. An IndexType contains a field list of all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each field should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed with any of them, as long as the ROWSET follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
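As a rough, purely illustrative sketch of this rewriting (the translator below is hypothetical; the service's actual implementation is more complete), a single CQL index-relation-term triple can be mapped onto Lucene query syntax along these lines:<br />

```java
public class CqlToLucene {

    /** Translates one CQL index/relation/term triple into Lucene query syntax.
        Covers only a handful of relations; a real translator handles many more. */
    public static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":       return index + ":\"" + term + "\"";          // phrase match
            case "fuzzy":     return index + ":" + term + "~";             // fuzzy match
            case "within": {                                               // range, e.g. "2005 2008"
                String[] bounds = term.split("\\s+");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "proximity": {                                            // e.g. "5 sun up"
                String[] parts = term.split("\\s+", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            }
            default:          return index + ":" + term;                   // plain equality
        }
    }

    public static void main(String[] args) {
        System.out.println(translate("title", "adj", "sun is up"));   // title:"sun is up"
        System.out.println(translate("date", "within", "2005 2008")); // date:[2005 TO 2008]
    }
}
```

In a full query, the translated triples are then joined with the boolean operators (AND, OR, NOT) that Lucene supports directly.<br />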
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node, the following are needed:<br />
* index-service-''{version}''.war<br />
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)<br />
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation<br />
* an application server (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application server we should provide the configuration needed by the [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry].<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are ''infrastructure'', ''scopes'' and ''clientMode'' (''clientMode'' should be set to false).<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try {<br />
    indexClient.feedLocator(locator);<br />
    indexClient.query(query);<br />
} catch (IndexException e) {<br />
    // handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resource with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT accordingly.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify, respectively, the IndexType that the Index must have been created with and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document must contain an "ObjectID" field specifying its unique identifier.<br />
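A minimal sketch of these constraints, assuming standard DOM parsing (the RowsetCheck helper is hypothetical, not part of the service): it rejects ROWSETs missing the required idxType/colID attributes or a per-ROW ObjectID field:<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RowsetCheck {

    /** Returns null if the ROWSET satisfies the constraints above, otherwise a message. */
    public static String check(String xml) throws Exception {
        Element rowset = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
        if (rowset.getAttribute("idxType").isEmpty()) return "missing idxType";
        if (rowset.getAttribute("colID").isEmpty())  return "missing colID";
        NodeList rows = rowset.getElementsByTagName("ROW");
        for (int i = 0; i < rows.getLength(); i++) {
            boolean hasId = false;
            NodeList fields = ((Element) rows.item(i)).getElementsByTagName("FIELD");
            for (int j = 0; j < fields.getLength(); j++)
                if ("ObjectID".equals(((Element) fields.item(j)).getAttribute("name"))) hasId = true;
            if (!hasId) return "ROW " + i + " missing ObjectID";
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String ok = "<ROWSET idxType=\"index-type-FT-1.0\" colID=\"colA\"><ROW>"
                  + "<FIELD name=\"ObjectID\">doc1</FIELD></ROW></ROWSET>";
        System.out.println(check(ok)); // null means the ROWSET is acceptable
    }
}
```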
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled; apart from "boost", which takes a number, they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, i.e. in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin that does nothing, always returning "nolang" when called. The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin that always returns an empty String "".<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see how to install a linguistics license below).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin.<br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations.<br />
The parameter pluginName is the fully qualified class name of the plugin to be loaded and instantiated.<br />
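A self-contained sketch of this reflective loading scheme (the interface and classes here are simplified stand-ins for the real gCube ones; on modern JVMs, Class.newInstance() is deprecated in favour of getDeclaredConstructor().newInstance(), which is used below):<br />

```java
public class PluginLoading {

    /** Minimal stand-in for the LanguageIdPlugin contract (illustrative, not the real interface). */
    public interface LanguageIdPlugin {
        String identify(String text);
    }

    /** Dummy plugin mirroring the behaviour described above: it always answers "nolang". */
    public static class DummyLangidPlugin implements LanguageIdPlugin {
        public String identify(String text) { return "nolang"; }
    }

    /** Loads a plugin by fully qualified class name, as the factories do. */
    public static LanguageIdPlugin load(String pluginName) throws Exception {
        return (LanguageIdPlugin) Class.forName(pluginName)
            .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        LanguageIdPlugin plugin = load("PluginLoading$DummyLangidPlugin");
        System.out.println(plugin.identify("some text")); // nolang
    }
}
```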
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization tool in Java. It implements the N-Gram-Based Text Categorization algorithm described here:<br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may specify the language per field. If present, this language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using that plugin. <br />
Since language is assigned at the collection level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used. The terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
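A toy illustration of this query-side expansion, with a hard-coded lemma table (the real module derives word forms from Fast's .aut dictionaries; all names here are hypothetical):<br />

```java
import java.util.List;
import java.util.Map;

public class LemmatizerExpansion {

    // Toy lemma table; the real module derives forms from dictionary files.
    private static final Map<String, List<String>> FORMS = Map.of(
        "car", List.of("car", "cars"),
        "city", List.of("city", "cities"));

    /** Expands a single query term into an OR of all known forms of the word. */
    public static String expand(String field, String term) {
        List<String> forms = FORMS.getOrDefault(term, List.of(term));
        StringBuilder q = new StringBuilder("(");
        for (int i = 0; i < forms.size(); i++) {
            if (i > 0) q.append(" OR ");
            q.append(field).append(':').append(forms.get(i));
        }
        return q.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("title", "city")); // (title:city OR title:cities)
    }
}
```

Unknown terms pass through unchanged, which matches the fallback behaviour one would expect from a dummy lemmatizer.<br />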
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the env variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType); <span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
FullTextIndexLookupServiceAddressingLocator lookupInstanceLocator = new FullTextIndexLookupServiceAddressingLocator();<br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
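As a rough sketch of this design, the lookup can be thought of as routing each (collection, language) pair to its own R-tree. The class and method names below are illustrative only; the actual service uses the Geotools R-tree, not the stub shown here:<br />

```java
import java.util.HashMap;
import java.util.Map;

//Sketch only: "RTree" is a stand-in for the Geotools R-tree the service actually uses.
public class RTreeRegistry {

    static class RTree {
        final String name;
        RTree(String name) { this.name = name; }
    }

    //One R-tree per (collectionID, language) pair.
    private final Map<String, RTree> trees = new HashMap<>();

    public RTree getOrCreate(String colID, String lang) {
        //Composite key: queries that specify colID and lang go straight to one tree.
        return trees.computeIfAbsent(colID + "/" + lang, RTree::new);
    }

    public static void main(String[] args) {
        RTreeRegistry registry = new RTreeRegistry();
        RTree enTree = registry.getOrCreate("colA", "en");
        //The same pair always maps to the same tree; another language gets its own tree.
        System.out.println(enTree == registry.getOrCreate("colA", "en"));
        System.out.println(enTree == registry.getOrCreate("colA", "fr"));
    }
}
```

A CQL geosearch triple carrying colID and lang modifiers then needs to consult only one tree, which is what makes multi-collection, multi-language queries decomposable into independent per-tree operations.<br />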
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication in the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections provide the details of the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries, is to apply boolean algebra rules, and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
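The union step described above can be sketched as a k-way merge over the per-tree result streams, each already ordered by descending score. This is an illustrative sketch, not the actual MergeSorter code; the Hit and merge names are invented for the example:<br />

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

//Illustrative sketch of the MergeSorter's score-based union; not the actual gCube classes.
public class MergeSorterSketch {

    public record Hit(String docID, double score) {}

    //Cursor into one worker's result list, which is already sorted by descending score.
    private record Cursor(List<Hit> hits, int pos) {}

    public static List<Hit> merge(List<List<Hit>> perTreeResults) {
        //Heap ordered so that the cursor with the highest-scoring head comes first.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Cursor c) -> -c.hits().get(c.pos()).score()));
        for (List<Hit> hits : perTreeResults) {
            if (!hits.isEmpty()) heap.add(new Cursor(hits, 0));
        }
        List<Hit> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.hits().get(c.pos()));
            if (c.pos() + 1 < c.hits().size()) heap.add(new Cursor(c.hits(), c.pos() + 1));
        }
        return merged;
    }

    public static String demo() {
        List<Hit> treeA = Arrays.asList(new Hit("d1", 0.9), new Hit("d3", 0.4));
        List<Hit> treeB = Arrays.asList(new Hit("d2", 0.7), new Hit("d4", 0.1));
        StringBuilder ids = new StringBuilder();
        for (Hit h : merge(Arrays.asList(treeA, treeB))) {
            if (ids.length() > 0) ids.append(",");
            ids.append(h.docID());
        }
        return ids.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); //d1,d2,d3,d4
    }
}
```

The real component additionally pauses and synchronizes the workers when the client falls behind, but that flow control is orthogonal to the merge itself.<br />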
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory, and x2 is set to equal x1 if not provided) as well as start and end Y coordinates (y1 is mandatory, and y2 is set to equal y1 if not provided). In addition, a ROW may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of 100 (or so...)<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
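To illustrate the DATE handling, the sketch below pads a partial value (only yyyy is mandatory) with defaults and converts it to seconds from the Epoch, which is the internal representation mentioned in the plugin examples later on this page. The padding template, the UTC time zone and the use of 'S' (milliseconds) for the fractional part are assumptions of this sketch, not the service's actual code:<br />

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

//Sketch only: how a partial DATE value might be normalized and stored as a long.
public class DateFieldSketch {

    //Defaults used to complete a partial value such as "2001" or "2001-05-27".
    private static final String TEMPLATE = "0001-01-01T00:00:00.0";

    public static long toEpochSeconds(String value) throws ParseException {
        String padded = value + TEMPLATE.substring(value.length());
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        //The index internally represents dates as seconds from the Epoch.
        return fmt.parse(padded).getTime() / 1000L;
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toEpochSeconds("2001"));                  //978307200
        System.out.println(toEpochSeconds("2001-05-27T14:35:25.5"));
    }
}
```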
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to process a query. Firstly, a very efficient filtering step retrieves all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
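The two steps can be sketched as follows, using java.awt.geom stand-ins for the Geotools types the service actually uses (the Doc type and query() method are invented for the example):<br />

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

//Sketch of TwoStep processing: cheap MBR filtering first, exact refinement second.
public class TwoStepSketch {

    public record Doc(String id, Rectangle2D mbr) {}

    public static List<Doc> query(Path2D queryPolygon, List<Doc> docs) {
        Rectangle2D queryMbr = queryPolygon.getBounds2D();
        List<Doc> results = new ArrayList<>();
        for (Doc doc : docs) {
            //Filtering step: reject anything whose MBR misses the query MBR.
            if (!queryMbr.intersects(doc.mbr())) continue;
            //Refinement step: exact test against the polygon itself,
            //which removes the false hits the MBR test let through.
            if (queryPolygon.intersects(doc.mbr())) results.add(doc);
        }
        return results;
    }

    public static String demo() {
        //Triangle (0,0)-(10,0)-(0,10): its MBR is the 10x10 square.
        Path2D.Double triangle = new Path2D.Double();
        triangle.moveTo(0, 0);
        triangle.lineTo(10, 0);
        triangle.lineTo(0, 10);
        triangle.closePath();
        List<Doc> docs = Arrays.asList(
                new Doc("inside", new Rectangle2D.Double(1, 1, 1, 1)),  //true hit
                new Doc("corner", new Rectangle2D.Double(8, 8, 1, 1)),  //false hit: in MBR, outside triangle
                new Doc("far", new Rectangle2D.Double(20, 20, 1, 1))); //rejected by MBR filter
        StringBuilder ids = new StringBuilder();
        for (Doc d : query(triangle, docs)) {
            if (ids.length() > 0) ids.append(",");
            ids.append(d.id());
        }
        return ids.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```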
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments provided in the code. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining if the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the time span and divide 1 by this number (plus one, to avoid division by zero) in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the ResultSet page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1); <span style="color:green">//floating-point division; 1/(spanSize + 1) would truncate to 0</span><br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments provided in the code. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Refiner|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in seconds (matching the index's internal representation of dates as seconds from the Epoch).<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, however we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design follows the same pattern as the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
key: integer, value: string<br />
key: float, value: string<br />
key: string, value: string<br />
key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to be able to build the indices with correct type for each field. The Objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is essentially a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) at each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service, it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect with each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' quota of each bucket-index can also be specified in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (unlike ElasticSearch) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of a Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once, after the Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibility to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The '''Forward Index Service''' design follows the same pattern as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key: integer, value: string<br />
<br />
key: float, value: string<br />
<br />
key: string, value: string<br />
<br />
key: date, value: string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key-value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. Each '''ForwardIndexLookup''' instance gets the index from Content Management and establishes a local copy of the index on the file system, which is then queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The properties Host and Port must be edited to point to the VO of interest.<br />
<br />
The test client gets the EPRs of the Factory services and uses them<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-Resource of each type, inserts some data using the updater resource, and queries<br />
the data using the lookup WS-Resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) - simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
processResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21340Index Management Framework2014-04-11T10:57:25Z<p>Alex.antoniadi: </p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each index type supports depend on the underlying technologies. The mechanisms for answering CQL queries, based on the internal design and technologies, are described below for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
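For illustration, the following CQL queries use some of these relations. The field names (''title'', ''abstract'', ''price'') are examples only and depend on the schema of the indexed collection:<br />
<br />
<pre><br />
title adj "sun is up"<br />
title fuzzy "invorvement"<br />
price within "10 100"<br />
title = "italy" and abstract = "rome"<br />
</pre><br />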
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service consists of a few components that are available in our Maven repositories under the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
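For example, a client that wants to invoke the service programmatically can declare the client library in its ''pom.xml''. The version shown below is illustrative; use the one published in the gCube Maven repositories:<br />
<br />
<source lang="xml"><br />
<dependency><br />
  <groupId>org.gcube.index</groupId><br />
  <artifactId>index-service-client-library</artifactId><br />
  <version>1.0.0</version><br />
</dependency><br />
</source><br />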
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented as a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is essentially a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) at each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file: <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the content matching a query, built from the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' respectively in the ''deploy.properties'' file:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
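As an illustration, with the settings above a snippet is assembled from at most 5 fragments of roughly 80 characters each. The output below is purely hypothetical; it only shows the shape of a snippet for a query matching the word ''sun'' (ElasticSearch wraps matching terms in <em> tags by default):<br />
<br />
<pre><br />
...the <em>sun</em> is up over the sea... ...later observations of the <em>sun</em> showed that...<br />
</pre><br />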
<br />
<br />
<br />
The folder where the index data is stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running on, have to be set in the variables ''hostname'' and ''scope'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT accordingly.<br />
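The mapping in the table can be sketched in Java as a simple translation function. This is illustrative code covering only the documented examples; the real translator inside the Full Text Index handles many more cases:<br />

```java
// Sketch of the CQL-triple to Lucene mapping shown in the table above.
public class CqlToLucene {
    static String toLucene(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy match on a single word
                return index + ":" + term + "~";
            case "proximity": // term is "<distance> <words...>"
                int sp = term.indexOf(' ');
                return index + ":\"" + term.substring(sp + 1) + "\"~"
                        + term.substring(0, sp);
            case "within":    // term is "<from> <to>", becomes a range query
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            default:          // plain match on a single term
                return index + ":" + term;
        }
    }
}
```

A complete CQL query is then assembled by translating each triple and joining the results with AND, OR, NOT.<br />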
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field that specifies its unique identifier.<br />
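For illustration, a ROWSET like the one above can be assembled with plain string building. The helper names below are hypothetical, and real field values should be XML-escaped before being embedded:<br />

```java
// Hypothetical sketch that assembles a ROWSET document like the example above.
public class RowsetBuilder {
    static String row(String objectId, String title, String contents) {
        // ObjectID is required for every document; it is its unique identifier
        return "  <ROW>\n"
             + "    <FIELD name=\"ObjectID\">" + objectId + "</FIELD>\n"
             + "    <FIELD name=\"title\">" + title + "</FIELD>\n"
             + "    <FIELD name=\"contents\">" + contents + "</FIELD>\n"
             + "  </ROW>\n";
    }
    static String rowset(String idxType, String colId, String lang, String... rows) {
        // idxType and colID are required attributes; lang is optional
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colId)
          .append("\" lang=\"").append(lang).append("\">\n");
        for (String r : rows) sb.append(r);
        sb.append("</ROWSET>");
        return sb.toString();
    }
}
```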
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType: an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node we need the following:<br />
* index-service-....war<br />
* smartgears (to publish the running instance of the service on the IS and make it discoverable)<br />
* an application container (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the service we should provide the configuration needed by the Resource Registry.<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are: ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false).<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexClient and the IndexFactoryClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except references. Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, returning always "nolang" when called). The lemmatizer module contains one real implementation (one plugin) (no suitable alternative was found to make a second plugin), and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no), and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the package name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the package name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may contain the language per field in the document. If present, this specified language is used when indexing the document; in this case the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available in addition to the dummy plugin that always returns "" (empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used. The terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerIdFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' must therefore be include in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the env variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
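The expansion step can be pictured as follows. This is an illustrative sketch only: the real expansion is driven by the Fast lemmatizer's .aut dictionaries, while here a hypothetical in-memory map (QueryExpansionSketch, LEMMAS) stands in for them:<br />
<br />
```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of query-time lemma expansion: each query term is
// replaced by a disjunction of its known word forms. The LEMMAS map is a
// hypothetical stand-in for the lemmatizer's dictionaries.
public class QueryExpansionSketch {
    static final Map<String, List<String>> LEMMAS = Map.of(
        "ran", List.of("run", "running", "ran"),
        "mice", List.of("mouse", "mice"));

    // Expand a single query term into an OR of its known word forms;
    // unknown terms are passed through unchanged.
    public static String expand(String term) {
        return String.join(" OR ", LEMMAS.getOrDefault(term, List.of(term)));
    }
}
```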
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefand.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
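The partition-selection rule described above can be sketched as follows (the class PartitionSketch and its capacity numbers are hypothetical illustrations of the rule, not the service's actual bookkeeping):<br />
<br />
```java
import java.util.ArrayList;
import java.util.List;

// Sketch: an update goes to the first partition with enough free space;
// if none fits, a new partition (a new resource) is created.
public class PartitionSketch {
    final List<long[]> partitions = new ArrayList<>(); // {used, capacity}
    final long capacity;

    PartitionSketch(long capacity) { this.capacity = capacity; }

    // Returns the index of the partition that received the update.
    int place(long updateSize) {
        for (int i = 0; i < partitions.size(); i++) {
            long[] p = partitions.get(i);
            if (p[1] - p[0] >= updateSize) { p[0] += updateSize; return i; }
        }
        partitions.add(new long[]{updateSize, capacity}); // new resource
        return partitions.size() - 1;
    }
}
```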
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
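The per-collection, per-language dispatch can be sketched as a map keyed by the (collection, language) pair (IndexRegistry and RTreeStub are hypothetical names; the real implementation uses Geotools R-trees):<br />
<br />
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the design described above: one spatial index per
// (collection, language) pair, selected via a composite key.
public class IndexRegistry {
    // Stand-in for the real Geotools R-tree.
    static class RTreeStub { final List<String> docs = new ArrayList<>(); }

    private final Map<String, RTreeStub> trees = new HashMap<>();

    static String key(String collectionId, String language) {
        return collectionId + "/" + language;
    }

    // Lazily creates the tree for a given collection and language.
    RTreeStub forCollection(String collectionId, String language) {
        return trees.computeIfAbsent(key(collectionId, language),
                                     k -> new RTreeStub());
    }

    int treeCount() { return trees.size(); }
}
```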
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output for this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the higher score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
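The polygon argument of the geosearch relation is a whitespace-separated coordinate list. A minimal sketch of parsing it into (x, y) points (PolygonParser is a hypothetical name, not part of the service):<br />
<br />
```java
import java.util.ArrayList;
import java.util.List;

// Sketch: parse the "x1 y1 x2 y2 ..." coordinate list used by the
// geosearch relation into a list of {x, y} points.
public class PolygonParser {
    public static List<double[]> parse(String coords) {
        String[] parts = coords.trim().split("\\s+");
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i + 1 < parts.length; i += 2) {
            points.add(new double[]{Double.parseDouble(parts[i]),
                                    Double.parseDouble(parts[i + 1])});
        }
        return points;
    }
}
```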
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries, is to apply boolean algebra rules, and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
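The cut-off decision for the conjunction above can be sketched as follows, with axis-aligned rectangles standing in for general polygons (CutOffRule is a hypothetical name, not the query planner's actual code):<br />
<br />
```java
// Sketch of the cut-off rule: a conjunction of two containment criteria
// can be pruned when the criteria name different collections, or when
// their query regions do not intersect.
public class CutOffRule {
    static class Rect {
        final double x1, y1, x2, y2;
        Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean intersects(Rect o) {
            return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
        }
    }

    // True when the subquery can never produce a result and is eliminated.
    static boolean canPrune(String colA, Rect p1, String colB, Rect p2) {
        return !colA.equals(colB) || !p1.intersects(p2);
    }
}
```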
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
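The union performed by the MergeSorter can be sketched as a merge of score-ordered streams (MergeSorterSketch is a hypothetical stand-in; the real component also handles flow control and worker synchronization, which are omitted here):<br />
<br />
```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: merge result streams, each already sorted by descending score,
// into one globally ordered stream using a max-heap over the stream heads.
public class MergeSorterSketch {
    public static List<Double> merge(List<List<Double>> streams) {
        // Heap entries: {score, streamIndex, positionInStream}.
        PriorityQueue<double[]> heads =
            new PriorityQueue<>((a, b) -> Double.compare(b[0], a[0]));
        for (int s = 0; s < streams.size(); s++)
            if (!streams.get(s).isEmpty())
                heads.add(new double[]{streams.get(s).get(0), s, 0});
        List<Double> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            double[] h = heads.poll();
            out.add(h[0]);
            List<Double> src = streams.get((int) h[1]);
            int next = (int) h[2] + 1;
            if (next < src.size())
                heads.add(new double[]{src.get(next), h[1], next});
        }
        return out;
    }
}
```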
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set equal to x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided). In addition, it may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only some of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
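The coordinate defaulting described above (x2 and y2 falling back to x1 and y1 for point objects) can be sketched as follows (GeoRowSketch and its XML assembly are illustrative, not the feeder's actual code):<br />
<br />
```java
// Sketch of GeoROWSET coordinate defaulting: x1 and y1 are mandatory,
// while a missing x2 or y2 falls back to x1 or y1 (a point, not a region).
public class GeoRowSketch {
    public static String row(String id, double x1, Double x2,
                             double y1, Double y2) {
        double ex2 = (x2 == null) ? x1 : x2; // default x2 to x1
        double ey2 = (y2 == null) ? y1 : y2; // default y2 to y1
        return String.format(
            "<ROW id=\"%s\" x1=\"%s\" x2=\"%s\" y1=\"%s\" y2=\"%s\"/>",
            id, x1, ex2, y1, ey2);
    }
}
```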
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of 100 (or so...)<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
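The type list above can be read as a validation rule for field values; a sketch (FieldTypeCheck is a hypothetical name; DATE is omitted since its partial-format parsing is more involved):<br />
<br />
```java
// Sketch: check whether a raw field value fits the declared IndexType
// field type. SHORT, INT and LONG differ only in their numeric range.
public class FieldTypeCheck {
    public static boolean fits(String type, String value) {
        try {
            switch (type) {
                case "SHORT":  Short.parseShort(value);   return true;
                case "INT":    Integer.parseInt(value);   return true;
                case "LONG":   Long.parseLong(value);     return true;
                case "FLOAT":  Float.parseFloat(value);   return true;
                case "DOUBLE": Double.parseDouble(value); return true;
                case "STRING": return value.length() <= 100;
                default:       return false; // DATE handling omitted here
            }
        } catch (NumberFormatException e) {
            return false; // value does not fit the declared type
        }
    }
}
```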
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. Firstly, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to determine whether an object actually matches the query or not.<br />
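The two-step flow can be sketched as follows, using points and a circular query region for simplicity (TwoStepSketch is a hypothetical illustration, not the GeoIndex implementation):<br />
<br />
```java
import java.util.ArrayList;
import java.util.List;

// Sketch of two-step query processing: a cheap MBR filter selects
// candidates (possibly with false hits), then an exact predicate refines
// them. Here the exact predicate is "inside a circle of radius r".
public class TwoStepSketch {
    static class Doc {
        final String id; final double x, y;
        Doc(String id, double x, double y) { this.id = id; this.x = x; this.y = y; }
    }

    public static List<String> query(List<Doc> docs, double cx, double cy, double r) {
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            // Step 1: filter by the query's minimal bounding rectangle.
            boolean inMbr = Math.abs(d.x - cx) <= r && Math.abs(d.y - cy) <= r;
            if (!inMbr) continue;
            // Step 2: refine with the exact (more costly) predicate.
            double dx = d.x - cx, dy = d.y - cy;
            if (dx * dx + dy * dy <= r * r) hits.add(d.id);
        }
        return hits;
    }
}
```
Note how the document at the MBR corner passes the filtering step but is removed by refinement, exactly the kind of false hit the second step exists to eliminate.<br />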
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments provided in the query. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1);<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments provided in the query. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However, in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last: the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, but we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we add the entry to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
<!--<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to many keys) range queries, retrieving documents with key values within the corresponding range. The Forward Index Service design is similar to the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
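The four supported key types above can be illustrated with a minimal, self-contained sketch (the class and method names here are purely illustrative and are not part of the service API): each raw key value is parsed into a comparable Java representation according to the declared schema type, which is what makes typed range comparisons possible.

```java
public class KeyTypeDemo {
    // Parse a raw key value according to its declared schema type
    // ("integer", "float", "date" or "string").
    static Comparable<?> parseKey(String type, String raw) {
        switch (type) {
            case "integer": return Long.valueOf(raw);
            case "float":   return Double.valueOf(raw);
            case "date":    return java.time.LocalDate.parse(raw); // ISO yyyy-MM-dd assumed
            default:        return raw; // string keys compare lexicographically
        }
    }

    public static void main(String[] args) {
        System.out.println(parseKey("integer", "42"));      // 42
        System.out.println(parseKey("date", "2014-04-11")); // 2014-04-11
    }
}
```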
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason the creation of multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
Clusters are distinguished through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between the two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example:<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
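The clusterID selection described above reduces to a single switch. The following sketch (illustrative names, not the service source code) mirrors that behaviour:

```java
public class ClusterIdDemo {
    // Mirrors the deploy-jndi-config.xml switch: when useClusterId is true
    // the cluster is identified by the indexID, otherwise by the scope.
    static String clusterId(boolean useClusterId, String indexId, String scope) {
        return useClusterId ? indexId : scope;
    }

    public static void main(String[] args) {
        System.out.println(clusterId(true, "myIndex", "/gcube/devsec"));  // myIndex
        System.out.println(clusterId(false, "myIndex", "/gcube/devsec")); // /gcube/devsec
    }
}
```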
<br />
<br />
Couchbase, the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. All Couchbase servers within the cluster must share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
// should be the same for all nodes in the cluster <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' quota of each bucket-index can be specified as well in '''deploy-jndi-config.xml'''.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of a Couchbase server (setting credentials, port, data_path etc.) needs to be done separately, once, after the Couchbase server installation, before it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
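One such cut-off rule can be sketched with a minimal example (an assumed representation for illustration, not the service's internal classes): a conjunction of range criteria whose lower bound exceeds its upper bound on any key can never match any document, so it can be dropped before execution.

```java
import java.util.Arrays;
import java.util.List;

public class CutOffDemo {
    // A single range criterion on one indexed key: low <= value <= high.
    static class Range {
        final String key; final double low; final double high;
        Range(String key, double low, double high) {
            this.key = key; this.low = low; this.high = high;
        }
    }

    // A conjunction of criteria is satisfiable only if every range is non-empty.
    static boolean satisfiable(List<Range> conjunction) {
        return conjunction.stream().allMatch(r -> r.low <= r.high);
    }

    public static void main(String[] args) {
        List<Range> q1 = Arrays.asList(new Range("price", 10, 20), new Range("year", 2000, 2010));
        List<Range> q2 = Arrays.asList(new Range("price", 30, 20)); // empty range: cut off
        System.out.println(satisfiable(q1)); // true
        System.out.println(satisfiable(q2)); // false
    }
}
```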
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
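A minimal sketch, using only the JDK's DOM parser, of how a client might extract the indexable KEYNAME/KEYVALUE pairs from such a ROWSET (illustrative only; the class and method names are not part of the service API):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RowsetDemo {
    // Collect KEYNAME to KEYVALUE pairs from every KEY element of a ROWSET.
    static Map<String, String> indexableKeys(String rowsetXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rowsetXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> keys = new LinkedHashMap<>();
            NodeList keyNodes = doc.getElementsByTagName("KEY");
            for (int i = 0; i < keyNodes.getLength(); i++) {
                Element key = (Element) keyNodes.item(i);
                keys.put(key.getElementsByTagName("KEYNAME").item(0).getTextContent(),
                         key.getElementsByTagName("KEYVALUE").item(0).getTextContent());
            }
            return keys;
        } catch (Exception e) {
            throw new RuntimeException("ROWSET parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ROWSET><INSERT><TUPLE>"
                + "<KEY><KEYNAME>title</KEYNAME><KEYVALUE>sun is up</KEYVALUE></KEY>"
                + "</TUPLE></INSERT></ROWSET>";
        System.out.println(indexableKeys(xml)); // {title=sun is up}
    }
}
```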
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<br />
--><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manages the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created with the EPR of the '''ForwardIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index, and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances get the index from Content Management, and establish a local copy of the index on the file system, which is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
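The final intersection step can be sketched as follows (a simplified illustration over sorted document-ID lists, not the actual BerkeleyDB code): each single-key criterion yields an ascending list of matching document IDs, and a linear merge keeps only the IDs present in both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IntersectDemo {
    // Merge-intersect two ascending lists of document IDs, as done when
    // combining the results of two single-key range criteria.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(intersect(Arrays.asList(1, 3, 5, 7), Arrays.asList(3, 4, 5, 8))); // [3, 5]
    }
}
```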
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-resource of each type, inserts some data using the updater resource, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples:<br />
Tuples can be inserted and deleted by:<br />
*insertingPair(key,value) / deletingPair(key) - simple methods to insert / delete tuples.<br />
*process(rowSet) - method to insert / delete a series of tuples.<br />
*procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21339Index Management Framework2014-04-11T10:52:37Z<p>Alex.antoniadi: /* IndexType */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is instead to have one resource (one node) in each container that constitutes the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
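The cluster-selection rule described above can be sketched as a small, hypothetical helper (not the service's actual code): the ''defaultSameCluster'' flag simply decides whether the clusterID is taken from the indexID or from the scope.<br />

```java
import java.util.Properties;

/** Hypothetical sketch of the clusterID selection described in the text:
 *  defaultSameCluster=true  -> clusterID is the indexID
 *  defaultSameCluster=false -> clusterID is the scope            */
public class ClusterIdResolver {

    static String resolveClusterId(Properties deployProps, String indexId, String scope) {
        // read the flag from deploy.properties; default to false if absent
        boolean sameCluster = Boolean.parseBoolean(
                deployProps.getProperty("defaultSameCluster", "false"));
        return sameCluster ? indexId : scope;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("defaultSameCluster", "true");
        System.out.println(resolveClusterId(props, "myIndexID", "/gcube/devNext"));
        props.setProperty("defaultSameCluster", "false");
        System.out.println(resolveClusterId(props, "myIndexID", "/gcube/devNext"));
    }
}
```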
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file.<br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the parts of the presentable fields that match the query. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
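The role of these two variables can be illustrated with a small, hypothetical sketch of snippet assembly (the actual highlighting is performed by ElasticSearch and is considerably more involved):<br />

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: concatenate up to maxFragmentCnt matching fragments,
 *  each truncated to maxFragmentSize characters, into a single snippet.   */
public class SnippetBuilder {

    static String buildSnippet(List<String> fragments, int maxFragmentCnt, int maxFragmentSize) {
        StringBuilder snippet = new StringBuilder();
        int used = 0;
        for (String fragment : fragments) {
            if (used == maxFragmentCnt) break;          // cap the number of fragments
            String piece = fragment.length() > maxFragmentSize
                    ? fragment.substring(0, maxFragmentSize)  // cap each fragment's size
                    : fragment;
            if (snippet.length() > 0) snippet.append(" ... ");
            snippet.append(piece);
            used++;
        }
        return snippet.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildSnippet(
                Arrays.asList("first matching fragment", "second matching fragment"), 2, 80));
    }
}
```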
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry or not (for the translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy-jndi-config.xml'' file.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running on, have to be set in the ''deploy-jndi-config.xml'' file.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
The Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a Lucene query, we first transform the CQL triples and then connect them with AND, OR and NOT accordingly.<br />
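The transformations shown in the table above can be sketched as a small, hypothetical helper (assuming the Index-Relation-Term triple has already been parsed out of the CQL query; this is not the service's actual translator):<br />

```java
/** Hypothetical sketch of the CQL-triple-to-Lucene mapping in the table above.
 *  Only a few of the supported relations are covered.                        */
public class CqlToLucene {

    static String toLucene(String index, String relation, String term) {
        switch (relation) {
            case "adj":                            // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":                          // fuzzy term query
                return index + ":" + term + "~";
            case "=":
            case "==":                             // plain term query
                return index + ":" + term;
            case "within": {                       // term is "low high", e.g. "2005 2008"
                String[] bounds = term.split("\\s+");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "proximity": {                    // term is "distance words...", e.g. "5 sun up"
                String[] parts = term.split("\\s+", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            }
            default:
                throw new IllegalArgumentException("unsupported relation: " + relation);
        }
    }

    public static void main(String[] args) {
        System.out.println(toLucene("title", "adj", "sun is up"));
        System.out.println(toLucene("date", "within", "2005 2008"));
    }
}
```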
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that the "ObjectID" field is required for each document; it specifies the document's unique identifier.<br />
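A minimal, hypothetical sketch of building such a ROWSET by string assembly follows (a real feeder should use an XML library with proper escaping; the helper name is ours, not part of the service API):<br />

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.LinkedHashMap;

/** Hypothetical sketch: assemble a ROWSET document like the example above.
 *  Each row is a field-name -> text map; ObjectID is enforced as required. */
public class RowSetBuilder {

    static String buildRowSet(String idxType, String colID, String lang,
                              List<Map<String, String>> rows) {
        StringBuilder xml = new StringBuilder();
        xml.append("<ROWSET idxType=\"").append(idxType)
           .append("\" colID=\"").append(colID)
           .append("\" lang=\"").append(lang).append("\">");
        for (Map<String, String> row : rows) {
            if (!row.containsKey("ObjectID"))
                throw new IllegalArgumentException("ObjectID is required for every ROW");
            xml.append("<ROW>");
            for (Map.Entry<String, String> field : row.entrySet()) {
                xml.append("<FIELD name=\"").append(field.getKey()).append("\">")
                   .append(field.getValue()).append("</FIELD>");
            }
            xml.append("</ROW>");
        }
        xml.append("</ROWSET>");
        return xml.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("ObjectID", "doc1");
        row.put("title", "How to create an Index");
        System.out.println(buildRowSet("IndexTypeName", "colA", "en", Arrays.asList(row)));
    }
}
```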
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
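As an illustration of the flags above, the per-field settings of an IndexType document can be read with the JDK DOM parser. This is a hypothetical sketch for inspecting the XML, not part of the service API:<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: extract, for each field in an IndexType document,
 *  whether its <index> flag is "yes".                                    */
public class IndexTypeReader {

    static Map<String, Boolean> indexedFields(String indexTypeXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(indexTypeXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, Boolean> flags = new LinkedHashMap<>();
        NodeList fields = doc.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            // the field's own <index> element is its first such descendant
            String indexed = field.getElementsByTagName("index").item(0).getTextContent();
            flags.put(field.getAttribute("name"), "yes".equals(indexed.trim()));
        }
        return flags;
    }
}
```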
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run Index Service on a node we will need the following:<br />
* index-service-....war<br />
* smartgears (to publish the running instance of the service on the IS and be discoverable)<br />
* an application container (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application service, we should provide the configuration needed by the Resource Registry.<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false).<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexClient and IndexFactoryClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, returning always "nolang" when called). The lemmatizer module contains one real implementation (one plugin) (no suitable alternative was found to make a second plugin), and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, setfan.debald@fast.no), and saved in the appropriate configuration file (see install a lingustics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implements the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implements the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factory uses the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the package name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTexCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config - or bigram files since all the statistical data about the languages are contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library, and when init is called, instantiate the native C++ objects that identifies the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one directory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the package name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may contain the language per field in the document. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
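The term- and query-level language restrictions above are plain string prefixes, so composing them can be sketched as follows. Only the "_lang_" / "_querylang_" syntax comes from this page; the helper itself is hypothetical:<br />

```java
// Hypothetical helper for composing the language-aware query syntax shown
// above. The prefix syntax is from this page; the class is illustrative.
public class LangQuery {

    // Restrict a single term to one language:
    //   bus       -> _lang_en:bus
    //   title:bus -> _lang_en_title:bus
    static String termInLanguage(String term, String lang) {
        int colon = term.indexOf(':');
        if (colon < 0) return "_lang_" + lang + ":" + term;
        return "_lang_" + lang + "_" + term.substring(0, colon)
                + ":" + term.substring(colon + 1);
    }

    // Restrict the whole query to one language.
    static String queryInLanguage(String query, String lang) {
        return "_querylang_" + lang + ": " + query;
    }

    public static void main(String[] args) {
        System.out.println("car OR " + termInLanguage("title:bus", "en") + " OR plane");
        System.out.println(queryInLanguage("car OR bus OR plane", "en"));
    }
}
```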
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++, and the C++ code is loaded as a shared library object. The Fast lemmatizer plugin provides a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one directory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
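Query expansion itself can be sketched as follows: each known term is replaced by an OR-group of all its known forms. The tiny in-memory dictionary stands in for the .aut files; everything here is illustrative, not the Fast lemmatizer's actual code:<br />

```java
import java.util.*;

// Sketch of lemmatization by query expansion. The dictionary below is a toy
// stand-in for the .aut files used by the Fast lemmatizer; all names are
// hypothetical.
public class QueryExpander {

    static final Map<String, List<String>> FORMS = Map.of(
            "car", List.of("car", "cars"),
            "bus", List.of("bus", "buses"));

    static String expand(String query) {
        StringBuilder out = new StringBuilder();
        for (String term : query.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            List<String> forms = FORMS.get(term.toLowerCase());
            if (forms == null) {
                out.append(term);  // operators and unknown words pass through
            } else {
                out.append('(').append(String.join(" OR ", forms)).append(')');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "car OR bus" expands to "(car OR cars) OR (bus OR buses)"
        System.out.println(expand("car OR bus"));
    }
}
```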
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefand.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content which did not fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
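The partitioning behaviour described above amounts to overflowing into a fresh resource when an addition does not fit, and fanning queries out over all partitions. A minimal sketch, with hypothetical sizes and names:<br />

```java
import java.util.*;

// Sketch of the partitioning idea described above: when an addition does not
// fit in the current partition, a new one is opened transparently. The
// capacity handling and names are hypothetical, not the service's code.
public class PartitionedIndex {
    final long capacityBytes;
    final List<List<String>> partitions = new ArrayList<>();
    long usedInCurrent = 0;

    PartitionedIndex(long capacityBytes) {
        this.capacityBytes = capacityBytes;
        partitions.add(new ArrayList<>());
    }

    void add(String document) {
        long size = document.getBytes().length;
        if (usedInCurrent + size > capacityBytes) {  // does not fit: new partition
            partitions.add(new ArrayList<>());
            usedInCurrent = 0;
        }
        partitions.get(partitions.size() - 1).add(document);
        usedInCurrent += size;
    }

    // A query fans out over all partitions, so partitioning stays transparent.
    List<String> query(String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> partition : partitions)
            for (String doc : partition)
                if (doc.contains(term)) hits.add(doc);
        return hits;
    }
}
```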
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resources with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers, that specify the collection and language of the results, the inclusion type of the query, a refiner for filtering further the results and a ranker for computing a score for each result. All of the modifiers are optional. Let's see the following example in order to understand better the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the higher score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries, is to apply boolean algebra rules, and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want the documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
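At the level of minimum bounding rectangles, the cut-off rule above reduces to a collection check plus a rectangle intersection test. A minimal sketch (illustrative only; the real service works on arbitrary polygons):<br />

```java
// Sketch of the "cut-off" rule described above, on minimum bounding
// rectangles: a conjunction of two containment criteria can be eliminated
// without touching any R-tree when it provably has no results. Illustrative
// only; names are hypothetical.
public class CutOff {

    // A rectangle is given as {x1, y1, x2, y2}.
    static boolean intersects(double[] p1, double[] p2) {
        return p1[0] <= p2[2] && p2[0] <= p1[2]   // overlap on the x axis
            && p1[1] <= p2[3] && p2[1] <= p1[3];  // overlap on the y axis
    }

    // True when "(colA contained-in P1) AND (colB contained-in P2)" can be
    // dropped from the query plan.
    static boolean canEliminate(String colA, double[] p1, String colB, double[] p2) {
        if (!colA.equals(colB)) return true;  // a document belongs to one collection only
        return !intersects(p1, p2);           // disjoint regions: no document fits both
    }
}
```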
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory and x2 is set equal to x1 if not provided) as well as start and end Y coordinates (y1 is mandatory and y2 is set equal to y1 if not provided). In addition, it may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of 100 (or so...)<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
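For the DATE type only yyyy is mandatory, so missing components must be filled with defaults before the value can be indexed. As noted later on this page, the GeoIndex represents dates internally as a long number of seconds since the Epoch; the normalization below is an assumption for illustration, not the service's actual parser:<br />

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Illustrative normalization of a DATE field (yyyy-MM-dd'T'HH:mm:ss.s, only
// yyyy mandatory) to seconds since the Epoch. Hypothetical helper, not the
// GeoIndex's actual code; the fractional part is truncated.
public class DateField {

    static long toEpochSeconds(String value) {
        // Drop the optional fraction, then pad the remaining optional parts.
        int dot = value.indexOf('.');
        String v = dot < 0 ? value : value.substring(0, dot);
        String defaults = "1970-01-01T00:00:00";
        if (v.length() < defaults.length()) {
            v = v + defaults.substring(v.length());
        }
        return LocalDateTime.parse(v).toEpochSecond(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(toEpochSeconds("2001-05-27T14:35:25.523"));
    }
}
```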
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to process a query. Firstly, a very efficient filtering step will find all possible hits (along with some false hits) using the minimal bounding rectangle (mbr) of the query region. Then, a more costly refinement step will use additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a refiner plugin. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to determine whether an object is within a query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the RankEvaluator plugin, providing the plugin with any arguments provided in the query. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining if the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimum Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0/(spanSize + 1); // floating-point division; 1/(spanSize + 1) would truncate to 0<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
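The arithmetic behind the rank can be checked in isolation. The following standalone sketch (plain Java, independent of the index service classes) reproduces the 1/(spanSize + 1) formula over epoch-second timestamps; note the 1.0 literal, since plain long division would truncate every positive span's rank to zero.<br />

```java
// Standalone illustration of the SpanSizeRanker formula (not service code).
public class SpanRankDemo {

    // Rank an entry by its time span, given start/end as seconds since the Epoch.
    // The 1.0 literal forces floating-point division; 1 / (spanSize + 1) with
    // longs would truncate to 0 for any span longer than zero seconds.
    static double rank(long startSeconds, long endSeconds) {
        long spanSize = endSeconds - startSeconds;
        return 1.0 / (spanSize + 1);
    }

    public static void main(String[] args) {
        double shortSpan = rank(1000, 1009);   // 9-second span
        double longSpan  = rank(1000, 100000); // 99000-second span
        System.out.println(shortSpan > longSpan); // shorter spans rank higher
    }
}
```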
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, providing the plugin with any arguments provided in the code. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a RankEvaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, though we have little use for that in this example. We will simply calculate the time span of each entry in the argument list and compare it to the maxSpanSize value. If it is smaller or equal, we'll add the entry to the results list.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional range queries (that refer to one key only) or multi-dimensional range queries (that refer to many keys), retrieving the documents whose key values fall within the corresponding range. The Forward Index Service design is similar to the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service, following the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities of the service used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason the creation of multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example:<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can be configured with a number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and also the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they know how to connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that there is a possibility for a ForwardIndexNode to stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately the first time after Couchbase server installation, so it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
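The "cut-off" idea can be illustrated with a small self-contained sketch (an assumed model, not the service code): a query in disjunctive normal form is a union of conjunctions, each conjunction maps a key to one or more inclusive ranges, and any conjunction whose ranges intersect to an empty interval is eliminated before execution.<br />

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the "cut-off" rule (not the service implementation).
public class CutOffSketch {

    // The union (outer list) holds conjunctions; each conjunction maps a key
    // name to one or more inclusive [lo, hi] criteria on that key. Conjunctions
    // whose criteria intersect to an empty interval can never match a document,
    // so they are removed before the range queries are executed.
    public static List<Map<String, int[]>> cutOff(List<Map<String, List<int[]>>> union) {
        List<Map<String, int[]>> survivors = new ArrayList<>();
        for (Map<String, List<int[]>> conjunction : union) {
            Map<String, int[]> merged = new HashMap<>();
            boolean empty = false;
            for (Map.Entry<String, List<int[]>> criterion : conjunction.entrySet()) {
                int lo = Integer.MIN_VALUE, hi = Integer.MAX_VALUE;
                for (int[] r : criterion.getValue()) {
                    lo = Math.max(lo, r[0]);   // intersect the ranges on this key
                    hi = Math.min(hi, r[1]);
                }
                if (lo > hi) {                 // cut-off: empty intersection
                    empty = true;
                    break;
                }
                merged.put(criterion.getKey(), new int[]{lo, hi});
            }
            if (!empty) {
                survivors.add(merged);
            }
        }
        return survivors;
    }
}
```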
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
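For illustration, the indexable <KEY> pairs of such a ROWSET can be extracted with a few lines of standard DOM parsing. This is a hypothetical reader for the example above, not part of the service (the real updater consumes ROWSETs through the ResultSet framework).<br />

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical ROWSET reader (illustration only).
public class RowsetReader {

    // Collects the <KEYNAME>/<KEYVALUE> pairs of every <KEY> element.
    public static Map<String, String> readKeys(String rowsetXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowsetXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> keys = new LinkedHashMap<>();
        NodeList keyNodes = doc.getElementsByTagName("KEY");
        for (int i = 0; i < keyNodes.getLength(); i++) {
            Element key = (Element) keyNodes.item(i);
            keys.put(key.getElementsByTagName("KEYNAME").item(0).getTextContent(),
                     key.getElementsByTagName("KEYVALUE").item(0).getTextContent());
        }
        return keys;
    }
}
```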
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. A initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client.<br />
The ForwardIndexClient uses a property file, ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client obtains the EPRs of the Factory services and uses the factory services<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS resource of each type, inserts some data using the updater, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples:<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) - simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
processResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21338Index Management Framework2014-04-11T10:51:03Z<p>Alex.antoniadi: /* IndexType */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup, and updating of the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Index Service information flow]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason the creation of multiple resources of the IndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
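<br />
The selection rule above can be sketched as follows. This is only an illustration of the described behaviour; the class and method names are hypothetical and not part of the service code:<br />
<br />
```java
// Hypothetical sketch (not service code) of how the clusterID is chosen
// from the defaultSameCluster setting described above.
public class ClusterIdSelector {

    // defaultSameCluster=true  -> the clusterID equals the indexID
    // defaultSameCluster=false -> the clusterID equals the scope
    static String clusterId(boolean defaultSameCluster, String indexID, String scope) {
        return defaultSameCluster ? indexID : scope;
    }

    public static void main(String[] args) {
        System.out.println(clusterId(true, "myIndexID", "/gcube/devNext"));  // myIndexID
        System.out.println(clusterId(false, "myIndexID", "/gcube/devNext")); // /gcube/devNext
    }
}
```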
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the presentable fields that match the query. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
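<br />
To illustrate how these two settings interact, here is a minimal sketch of fragment-based snippet assembly. The class is hypothetical and not part of the service; the actual fragment extraction is done by the underlying engine:<br />
<br />
```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: concatenate at most maxFragmentCnt fragments,
// each truncated to maxFragmentSize characters, into one highlight snippet.
public class SnippetBuilder {

    static String snippet(List<String> fragments, int maxFragmentCnt, int maxFragmentSize) {
        StringBuilder sb = new StringBuilder();
        int used = 0;
        for (String f : fragments) {
            if (used == maxFragmentCnt) break;           // honour maxFragmentCnt
            if (sb.length() > 0) sb.append(" ... ");     // separator between fragments
            sb.append(f.length() > maxFragmentSize       // honour maxFragmentSize
                      ? f.substring(0, maxFragmentSize)
                      : f);
            used++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> frags = Arrays.asList("the sun is up", "up and away", "sunset");
        System.out.println(snippet(frags, 2, 80)); // the sun is up ... up and away
    }
}
```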
<br />
<br />
<br />
The folder where the data of the index is stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use Resource Registry or not (for translation of field ids to field names) we can change the value of the variable ''useRRAdaptor'' in the ''deploy-jndi-config.xml''<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running in, have to be set in the ''hostname'' and ''scope'' variables in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! Explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, and NOT accordingly.<br />
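<br />
The triple transformation in the table above can be sketched as follows. This is a simplified illustration, not the service's actual translator; real CQL parsing, escaping, and the allIndexes expansion are omitted:<br />
<br />
```java
// Illustrative sketch: map a CQL Index-Relation-Term triple to the Lucene
// query syntax shown in the table above.
public class CqlToLucene {

    static String triple(String index, String relation, String term) {
        switch (relation) {
            case "adj":                                   // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":                                 // fuzzy query
                return index + ":" + term + "~";
            case "proximity": {                           // term is "<distance> <words...>"
                int sp = term.indexOf(' ');
                String dist = term.substring(0, sp);
                String words = term.substring(sp + 1);
                return index + ":\"" + words + "\"~" + dist;
            }
            case "within": {                              // term is "<low> <high>"
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:                                      // "=" and "==" -> plain term
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(triple("title", "adj", "sun is up"));      // title:"sun is up"
        System.out.println(triple("title", "proximity", "5 sun up")); // title:"sun up"~5
        System.out.println(triple("date", "within", "2005 2008"));    // date:[2005 TO 2008]
    }
}
```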
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that the "ObjectID" field is required for each document, as it specifies the document's unique identifier.<br />
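<br />
A ROWSET like the one above can be assembled programmatically. The following is a simplified sketch only: the helper class is hypothetical, and the plain string concatenation performs no XML escaping:<br />
<br />
```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: build a minimal single-row ROWSET payload
// following the schema described above (no XML escaping).
public class RowSetBuilder {

    static String rowSet(String idxType, String colID, String lang, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colID)
          .append("\" lang=\"").append(lang).append("\">\n  <ROW>\n");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append("    <FIELD name=\"").append(f.getKey()).append("\">")
              .append(f.getValue()).append("</FIELD>\n");
        }
        sb.append("  </ROW>\n</ROWSET>");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("ObjectID", "doc1"); // required unique identifier
        fields.put("title", "How to create an Index");
        System.out.println(rowSet("IndexTypeName", "colA", "en", fields));
    }
}
```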
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<source lang="xml"><br />
<index-type><br />
<field-list><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>yes</highlightable><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<highlightable>no</highlightable> <!-- will not be included in the highlight snippet --><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</source><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''highlightable'''<br />
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.<br />
*'''tokenize'''<br />
:Not used<br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
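<br />
The flag dependency called out above (a field must have been stored to be returned) can be sketched with a small validation helper. The class is hypothetical and not part of the IndexType machinery:<br />
<br />
```java
// Hypothetical sketch of the per-field IndexType flags described above,
// with the one dependency the text calls out: "return" implies "store".
public class FieldSpec {

    final String name;
    final boolean index, store, ret, highlightable;

    FieldSpec(String name, boolean index, boolean store, boolean ret, boolean highlightable) {
        this.name = name; this.index = index; this.store = store;
        this.ret = ret; this.highlightable = highlightable;
    }

    // A field may only be returned in query results if it was stored.
    boolean valid() {
        return !ret || store;
    }

    public static void main(String[] args) {
        FieldSpec title = new FieldSpec("title", true, true, true, true);
        FieldSpec bad   = new FieldSpec("references", true, false, true, false);
        System.out.println(title.valid()); // true
        System.out.println(bad.valid());   // false
    }
}
```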
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki><!-- not a subfield --></nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
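<br />
The sub-field behaviour above can be sketched as a recursive expansion of a queried field into itself plus all of its descendants. This is an illustrative helper, not the index implementation:<br />
<br />
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: expand a queried field into itself plus all
// sub-fields, mirroring the IndexType example above.
public class SubFieldExpansion {

    // parent field -> its direct sub-fields (from the example IndexType)
    static final Map<String, List<String>> SUBFIELDS = new HashMap<>();
    static {
        SUBFIELDS.put("contents", Arrays.asList("title", "foreword", "startChapter", "endChapter"));
        SUBFIELDS.put("title", Arrays.asList("bookTitle", "chapterTitle"));
    }

    static List<String> expand(String field) {
        List<String> out = new ArrayList<>();
        out.add(field);
        for (String sub : SUBFIELDS.getOrDefault(field, Collections.emptyList()))
            out.addAll(expand(sub)); // recurse into nested sub-fields
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("title"));        // [title, bookTitle, chapterTitle]
        System.out.println(expand("startChapter")); // [startChapter]
    }
}
```
Note that expanding "contents" covers every field of the example except "references", matching the hit behaviour described above.<br />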
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node we will need the following:<br />
* index-service-....war<br />
* smartgears (to publish the running instance of the service on the IS and be discoverable)<br />
* an application container (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application service we should provide the configuration needed by the Resource Registry.<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false).<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, that is, in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed with each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that all documents in an index must be processed according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Note that queries using wildcards will not return usable query statistics.<br />
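As an illustration, a fielded CQL term search roughly corresponds to a fielded Lucene term query. The exact mapping is internal to the service; the examples below are indicative only:<br />
<br />
<pre><br />
CQL:    title = "marine"        ->   Lucene: title:marine<br />
CQL:    contents = "biodiv*"    ->   Lucene: contents:biodiv*   (wildcard: no usable statistics)<br />
</pre><br />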
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin that does nothing and always returns "nolang" when called. The lemmatizer module has one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin that always returns an empty String ("").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and for the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see how to install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
When loading an implementation, the factories use the method:<br />
 Class.forName(pluginName).newInstance();<br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
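The loading mechanism can be sketched as follows (the interface and class names below are illustrative stand-ins, not the actual gCube classes):<br />

```java
// Illustrative sketch of the reflective plugin loading described above;
// these are hypothetical stand-ins, not the actual gCube interfaces.
interface LanguageIdPlugin {
    String detectLanguage(String text);
}

// A dummy plugin in the spirit of the one described below: always "nolang".
class DummyLangidPlugin implements LanguageIdPlugin {
    public String detectLanguage(String text) { return "nolang"; }
}

class LanguageIdFactory {
    // Load and instantiate a plugin from its fully qualified class name,
    // exactly as the factories do with Class.forName(...).newInstance().
    static LanguageIdPlugin load(String pluginName) {
        try {
            return (LanguageIdPlugin) Class.forName(pluginName).newInstance();
        } catch (Exception e) {
            throw new RuntimeException("Could not load plugin " + pluginName, e);
        }
    }
}
```

In the real services the pluginName parameter arrives through the create resource call, so the concrete plugin can be chosen per resource without recompiling anything.<br />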
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created are:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization tool in Java, implementing the N-Gram-Based Text Categorization algorithm described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
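The N-gram technique can be sketched as follows: a text is reduced to its character n-grams, and their frequency profile is compared against precomputed per-language profiles. The helper below is illustrative only, not JTextCat's actual API:<br />

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    // Extract the character n-grams of length n from a text; language
    // profiles in N-gram-based categorization are rankings of the most
    // frequent such n-grams, compared by rank distance.
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }
}
```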
<br />
JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
JTextCat needs no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library object and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The documents being fed may specify the language per field. If present, this language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using that plugin. <br />
Since language is assigned at the collection level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
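The selection logic above amounts to something like the following sketch (a hypothetical helper, not the actual FullTextIndexUpdater code):<br />

```java
import java.util.Map;
import java.util.function.Function;

class LangResolver {
    // Pick the language for a field: prefer the "lang" attribute carried
    // by the document, fall back to the loaded identifier plugin, and
    // default to "unknown" when neither is available.
    static String resolve(Map<String, String> fieldAttributes,
                          Function<String, String> identifierPlugin,
                          String fieldText) {
        String lang = fieldAttributes.get("lang");
        if (lang != null) {
            return lang;                              // language supplied with the document
        }
        if (identifierPlugin != null) {
            return identifierPlugin.apply(fieldText); // run language identification
        }
        return "unknown";                             // no plugin loaded, no attribute
    }
}
```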
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++, and the C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library object and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the GLOBUS_LOCATION environment variable. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization: a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand it with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
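Expansion can be sketched like this (illustrative only; the real module expands through Fast's .aut dictionaries):<br />

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

class ExpansionSketch {
    // Replace a query term by the OR of all its known forms, so the
    // index is searched for every inflection of the word.
    static String expand(String term, Map<String, List<String>> knownForms) {
        List<String> forms =
            knownForms.getOrDefault(term, Collections.singletonList(term));
        return "(" + String.join(" OR ", forms) + ")";
    }
}
```

A term with no known forms is left unchanged, which matches the behavior of the dummy lemmatizer plugin.<br />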
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer modules is valid through March 2008.<br />
<br />
If a new license is required, please contact stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for a new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service: in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content that did not fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed indices partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and for exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication in the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup may contain many CQL geosearch triples, connected with AND, OR and NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that cannot produce any results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then<br />
there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
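For axis-aligned rectangles like the ones in these examples, the polygon-intersection test behind this "cut-off" rule reduces to a simple bounding-box check (a sketch under that assumption; the service itself handles general polygons through Geotools):<br />

```java
class CutOffSketch {
    // Intersection test for two axis-aligned rectangles
    // [ax1,ax2] x [ay1,ay2] and [bx1,bx2] x [by1,by2].
    static boolean intersects(double ax1, double ay1, double ax2, double ay2,
                              double bx1, double by1, double bx2, double by2) {
        // Two rectangles intersect unless one lies strictly to the
        // side of (or above/below) the other.
        return ax1 <= bx2 && bx1 <= ax2 && ay1 <= by2 && by1 <= ay2;
    }
}
```

If this test returns false for P1 and P2, the whole conjunction can be cut off without touching any R-tree.<br />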
<br />
The transformation of an initial CQL query into a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. The MergeSorter component also supports flow control, in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
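The union performed by the MergeSorter can be sketched as a k-way merge of score-ordered result streams (an illustrative fragment; the real component also implements the flow control described above):<br />

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class MergeSorterSketch {
    // Merge several result streams, each already sorted by descending
    // score, into one stream sorted by descending score.
    static List<Double> merge(List<List<Double>> streams) {
        // Each queue entry is a triple: {score, streamIndex, positionInStream}.
        PriorityQueue<double[]> queue =
            new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> -e[0]));
        for (int s = 0; s < streams.size(); s++) {
            if (!streams.get(s).isEmpty()) {
                queue.add(new double[]{streams.get(s).get(0), s, 0});
            }
        }
        List<Double> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            double[] top = queue.poll();
            out.add(top[0]);
            int s = (int) top[1], next = (int) top[2] + 1;
            if (next < streams.get(s).size()) {
                queue.add(new double[]{streams.get(s).get(next), s, next});
            }
        }
        return out;
    }
}
```

Because only the head of each stream is inspected at a time, the merge can pause whenever the consuming client stops reading, which is where the flow control hooks in.<br />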
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory, and x2 is set equal to x1 if not provided) as well as start and end Y coordinates (y1 is mandatory, and y2 is set equal to y1 if not provided). In addition, a ROW may contain any number of FIELD elements, each with a name attribute and information to be stored and possibly used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
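The coordinate defaults can be made concrete with a small sketch. GeoRow is a hypothetical class, not part of the service; it only demonstrates the x2/y2 fallback rule stated above.<br />

```java
import java.util.Map;

public class GeoRow {
    final double x1, y1, x2, y2;

    // attrs mimics the attribute map of a ROW element; x1/y1 are mandatory,
    // while x2/y2 fall back to x1/y1, so a point is a degenerate rectangle.
    GeoRow(Map<String, String> attrs) {
        x1 = Double.parseDouble(attrs.get("x1"));
        y1 = Double.parseDouble(attrs.get("y1"));
        x2 = Double.parseDouble(attrs.getOrDefault("x2", attrs.get("x1")));
        y2 = Double.parseDouble(attrs.getOrDefault("y2", attrs.get("y1")));
    }
}
```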
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list enumerating all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible GeoIndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to process a query. First, a very efficient filtering step collects all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
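The TwoStep idea can be sketched as follows. This is a hedged illustration, not the GeoIndex internals: the Box class is a hypothetical stand-in for both MBRs and exact geometry, and the exact refinement test is simplified to rectangle containment.<br />

```java
import java.util.ArrayList;
import java.util.List;

public class TwoStep {

    static class Box {
        final double x1, y1, x2, y2;
        Box(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean intersects(Box o) {
            return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
        }
        boolean contains(Box o) {
            return x1 <= o.x1 && o.x2 <= x2 && y1 <= o.y1 && o.y2 <= y2;
        }
    }

    // Step 1: cheap filtering on the MBR only -- fast, but may return false hits.
    static List<Box> filter(Box queryMbr, List<Box> entries) {
        List<Box> hits = new ArrayList<>();
        for (Box e : entries) if (e.intersects(queryMbr)) hits.add(e);
        return hits;
    }

    // Step 2: costlier refinement with the exact predicate (here: containment),
    // which removes the false hits that survived filtering.
    static List<Box> refine(Box query, List<Box> candidates) {
        List<Box> result = new ArrayList<>();
        for (Box c : candidates) if (query.contains(c)) result.add(c);
        return result;
    }
}
```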
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty since the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of milliseconds since the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access it through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
        return 1.0 / (spanSize + 1); // floating-point division: integer 1/(spanSize+1) would truncate to 0<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in the same manner as for the SpanSizeRanker. However, in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element: a String representation of the maxSpanSize. In order for this value to be usable, we will parse it into a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, but we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add the entry to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to many keys) range queries, retrieving the documents whose key values fall within the corresponding ranges. The Forward Index Service design is similar, if not identical, to the Full Text Index Service design. The Forward Index supports the following schema for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
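A minimal sketch of such queries, assuming a numeric key for brevity: a multi-dimensional range query is modelled as a conjunction of single-key criteria. All names here are illustrative, not the service API.<br />

```java
import java.util.List;
import java.util.Map;

public class RangeQuery {

    // One criterion refers to exactly one indexed key (a numeric one here).
    static class Criterion {
        final String key;
        final double low, high;   // inclusive bounds
        Criterion(String key, double low, double high) {
            this.key = key; this.low = low; this.high = high;
        }
        boolean matches(Map<String, Double> doc) {
            Double v = doc.get(key);
            return v != null && low <= v && v <= high;
        }
    }

    // A multi-dimensional range query is simply a conjunction of single criteria:
    // a document matches only if every criterion's range contains its key value.
    static boolean matches(List<Criterion> query, Map<String, Double> doc) {
        for (Criterion c : query) if (!c.matches(doc)) return false;
        return true;
    }
}
```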
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and update operations on the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example:<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
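The choice described above amounts to a one-line policy. The following is a hypothetical sketch, not service code; the variable names simply mirror the jndi setting:<br />

```java
// Illustrative sketch of the clusterID selection: with useClusterId=true the
// cluster is identified by the indexID, otherwise by the scope.
public class ClusterIdPolicy {
    static String clusterId(boolean useClusterId, String indexID, String scope) {
        return useClusterId ? indexID : scope;
    }
}
```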
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can be configured with a number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deploying the service, it is important to change some properties in deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server, and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect with each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' of each bucket (index) can be specified as well in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting the credentials, port, data_path etc.) needs to be done separately, for the first time after the Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any opportunity to eliminate parts of the query that cannot produce results, by applying "cut-off" rules. The merge operation on the range queries' results is performed internally by the Forward Index.<br />
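A "cut-off" rule can be illustrated with a minimal sketch (the `CutOffSketch` and `Range` classes below are hypothetical, not part of the actual service): a range query whose combined bounds for some key form an empty interval can be discarded without ever being executed.<br />

```java
// Hypothetical sketch of a "cut-off" rule: a range query (a conjunction of
// single-key criteria) cannot produce results when the intersection of the
// bounds collected for some key is empty, so that branch can be pruned.
public class CutOffSketch {

    /** Lower and upper bound of one criterion on a single indexed key. */
    public static class Range {
        public final double low, high;
        public Range(double low, double high) { this.low = low; this.high = high; }
        /** Intersect with another range on the same key. */
        public Range intersect(Range other) {
            return new Range(Math.max(low, other.low), Math.min(high, other.high));
        }
        public boolean isEmpty() { return low > high; }
    }

    /** True if the conjunction of these same-key criteria can be discarded. */
    public static boolean canCutOff(Range... criteriaOnSameKey) {
        Range acc = new Range(Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY);
        for (Range r : criteriaOnSameKey) {
            acc = acc.intersect(r);
        }
        return acc.isEmpty();
    }
}
```

For example, the conjunction "key > 10 AND key < 5" yields the empty interval (10, 5) and is pruned before execution.<br />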
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
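A tuple of this shape can be assembled programmatically. The `RowsetTuple` helper below is a hypothetical sketch (not part of the service); it emits every pair both as an indexable <KEY> and as a presentable <FIELD>, mirroring the example above.<br />

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that emits one ROWSET tuple of the schema shown above:
// each pair appears as an indexable <KEY> and as a presentable <FIELD>.
public class RowsetTuple {
    private final Map<String, String> pairs = new LinkedHashMap<>();

    public RowsetTuple put(String keyName, String keyValue) {
        pairs.put(keyName, keyValue);
        return this; // allow chained calls
    }

    public String toXml() {
        StringBuilder sb = new StringBuilder("<ROWSET><INSERT><TUPLE>");
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            sb.append("<KEY><KEYNAME>").append(e.getKey())
              .append("</KEYNAME><KEYVALUE>").append(e.getValue())
              .append("</KEYVALUE></KEY>");
        }
        sb.append("<VALUE>");
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            sb.append("<FIELD name=\"").append(e.getKey()).append("\">")
              .append(e.getValue()).append("</FIELD>");
        }
        sb.append("</VALUE></TUPLE></INSERT></ROWSET>");
        return sb.toString();
    }
}
```

Calling `new RowsetTuple().put("title", "sun is up").toXml()` produces a single-pair tuple equivalent to the first KEY/FIELD pair of the example.<br />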
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
//StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. A initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET for that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specifies the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21337Index Management Framework2014-04-11T10:45:44Z<p>Alex.antoniadi: /* Index Service */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations each supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:<br />
<br />
<source lang="xml"><br />
<br />
<!-- index service web app --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service</artifactId><br />
<version>...</version><br />
<br />
<br />
<!-- index service commons library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-commons</artifactId><br />
<version>...</version><br />
<br />
<!-- index service client library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>index-service-client-library</artifactId><br />
<version>...</version><br />
<br />
<!-- helper common library --><br />
<groupId>org.gcube.index</groupId><br />
<artifactId>indexcommon</artifactId><br />
<version>...</version><br />
<br />
</source><br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, which follows the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node, and consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; the preferred setup is one resource (one node) on each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small ones. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
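The effect of ''defaultSameCluster'' can be summarised in a one-line rule (a sketch; the actual service code may differ):<br />

```java
// Sketch of how the clusterID is derived from the deploy.properties flag:
// with defaultSameCluster=true each index forms its own cluster (clusterID
// equals the indexID); with false, all nodes in a scope join one cluster.
public class ClusterIdRule {
    public static String clusterId(boolean defaultSameCluster, String indexID, String scope) {
        return defaultSameCluster ? indexID : scope;
    }
}
```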
<br />
''ElasticSearch'', the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the matching content from the presentable fields. This snippet is usually a concatenation of a number of fragments from the fields that match the query. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' respectively in the ''deploy.properties'' file:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
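The interaction of the two limits can be sketched as follows (an illustrative simplification; the service's actual highlighter, backed by the underlying search engine, is more sophisticated): fragments are truncated to ''maxFragmentSize'' characters and at most ''maxFragmentCnt'' of them are concatenated into the snippet.<br />

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of how maxFragmentCnt / maxFragmentSize bound a
// highlighting snippet: each fragment is truncated to maxFragmentSize
// characters and at most maxFragmentCnt fragments are joined together.
public class SnippetSketch {
    public static String buildSnippet(List<String> fragments, int maxFragmentCnt, int maxFragmentSize) {
        List<String> kept = new ArrayList<>();
        for (String f : fragments) {
            if (kept.size() == maxFragmentCnt) break;
            kept.add(f.length() > maxFragmentSize ? f.substring(0, maxFragmentSize) : f);
        }
        return String.join(" ... ", kept);
    }
}
```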
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is the folder the container runs in).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use Resource Registry or not (for translation of field ids to field names) we can change the value of the variable ''useRRAdaptor'' in the ''deploy-jndi-config.xml''<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node as well as the scope that the node runs in have to be set in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR and NOT equivalently.<br />
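The table's transformation can be sketched as a small mapping function. This is a simplification for illustration only (the real service additionally handles allIndexes expansion over multiple fields and the boolean composition described above):<br />

```java
// Simplified sketch of the CQL triple -> Lucene query-string mapping shown
// in the table above. Relations not listed fall through to a plain term query.
public class CqlToLucene {
    public static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy term query
                return index + ":" + term + "~";
            case "proximity": {
                // term is "<distance> <words...>", e.g. "5 sun up"
                int sp = term.indexOf(' ');
                return index + ":\"" + term.substring(sp + 1) + "\"~" + term.substring(0, sp);
            }
            case "within": {
                // term is "<low> <high>", e.g. "2005 2008"
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:          // "=" and "==" map to a plain term query
                return index + ":" + term;
        }
    }
}
```

For instance, `translate("title", "proximity", "5 sun up")` yields the Lucene query `title:"sun up"~5` from the table.<br />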
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) consists of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
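These constraints can be checked mechanically. The `RowsetValidator` below is a hypothetical sketch (not part of the service): it requires the idxType and colID attributes on <ROWSET> and an "ObjectID" field in every <ROW>, while treating the lang attribute as optional.<br />

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical validator for the constraints described above: idxType and
// colID are required attributes of <ROWSET>, lang is optional, and every
// <ROW> must carry an "ObjectID" <FIELD>. Not part of the actual service.
public class RowsetValidator {
    public static boolean isValid(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            Element rowset = doc.getDocumentElement();
            if (rowset.getAttribute("idxType").isEmpty() || rowset.getAttribute("colID").isEmpty()) {
                return false;
            }
            NodeList rows = rowset.getElementsByTagName("ROW");
            for (int i = 0; i < rows.getLength(); i++) {
                boolean hasObjectID = false;
                NodeList fields = ((Element) rows.item(i)).getElementsByTagName("FIELD");
                for (int j = 0; j < fields.getLength(); j++) {
                    if ("ObjectID".equals(((Element) fields.item(j)).getAttribute("name"))) {
                        hasObjectID = true;
                    }
                }
                if (!hasObjectID) {
                    return false;
                }
            }
            return true;
        } catch (Exception e) {
            return false; // malformed XML is invalid
        }
    }
}
```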
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
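As a sketch (a hypothetical helper, not part of the framework), a <field> entry can be generated from these flags; sort and boost are emitted with their customary values since they are not used:<br />

```java
// Hypothetical generator of one IndexType <field> entry from the flags
// described above; <sort> and <boost> get fixed values as they are unused.
public class IndexTypeField {
    public static String fieldXml(String name, boolean index, boolean store,
                                  boolean ret, boolean tokenize) {
        return "<field name=\"" + name + "\">"
             + "<index>" + yn(index) + "</index>"
             + "<store>" + yn(store) + "</store>"
             + "<return>" + yn(ret) + "</return>"
             + "<tokenize>" + yn(tokenize) + "</tokenize>"
             + "<sort>no</sort>"
             + "<boost>1.0</boost>"
             + "</field>";
    }

    private static String yn(boolean b) { return b ? "yes" : "no"; }
}
```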
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki><!-- not a subfield --></nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, i.e. all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
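The expansion over sub-fields can be sketched as a recursive walk over the field tree (hypothetical code; the field names are taken from the example above):<br />

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the sub-field expansion described above: querying a
// field reaches the field itself plus, recursively, all of its sub-fields.
public class SubfieldExpansion {
    public static List<String> expand(String field, Map<String, List<String>> children) {
        List<String> out = new ArrayList<>();
        out.add(field);
        for (String child : children.getOrDefault(field, Collections.<String>emptyList())) {
            out.addAll(expand(child, children));
        }
        return out;
    }
}
```

With the example IndexType, expanding "title" reaches "bookTitle" and "chapterTitle", while expanding "startChapter" reaches only itself.<br />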
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run Index Service on a node we will need the following:<br />
* index-service-....war<br />
* smartgears (to publish the running instance of the service on the IS and be discoverable)<br />
* an application container (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application service we should provide the configuration needed by Resource Registry.<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are: ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false)<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resource with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (the documents have two fields: title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR and NOT equivalently.<br />
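The triple transformation above can be sketched as a small helper. This is illustrative only, following the mapping in the table; it is not the actual service code and skips escaping and error handling.<br />

```java
// Sketch of the CQL Index-Relation-Term to Lucene transformation
// described in the table above. Illustrative only.
public class CqlToLucene {

    public static String transform(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy term query
                return index + ":" + term + "~";
            case "within": {  // term is "low high" -> range query
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "proximity": { // term is "distance word1 word2 ..." -> proximity query
                int space = term.indexOf(' ');
                String distance = term.substring(0, space);
                String words = term.substring(space + 1);
                return index + ":\"" + words + "\"~" + distance;
            }
            default:          // plain "=" relation
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("title", "adj", "sun is up"));      // title:"sun is up"
        System.out.println(transform("date", "within", "2005 2008"));    // date:[2005 TO 2008]
        System.out.println(transform("title", "proximity", "5 sun up")); // title:"sun up"~5
    }
}
```

A complete query would then join such fragments with AND, OR and NOT, as described above.<br />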
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify, respectively, the Index Type that the Index must have been created with and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that each document must contain an "ObjectID" field that specifies its unique identifier.<br />
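A feeder producing such documents can be sketched as plain string assembly. This is illustrative only; a real feeder would use an XML library and proper escaping, and the field names beyond ObjectID are just examples.<br />

```java
// Sketch: building a minimal ROWSET document for feeding, as described above.
// Illustrative only; no XML escaping is performed.
public class RowsetBuilder {

    // One ROW with the required ObjectID field plus two content fields.
    public static String row(String objectId, String title, String contents) {
        return "<ROW>"
             + "<FIELD name=\"ObjectID\">" + objectId + "</FIELD>"
             + "<FIELD name=\"title\">" + title + "</FIELD>"
             + "<FIELD name=\"contents\">" + contents + "</FIELD>"
             + "</ROW>";
    }

    // ROWSET with the required idxType and colID attributes and optional lang.
    public static String rowset(String idxType, String colId, String lang, String... rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colId)
          .append("\" lang=\"").append(lang).append("\">");
        for (String r : rows) sb.append(r);
        sb.append("</ROWSET>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rowset("IndexTypeName", "colA", "en",
                row("doc1", "How to create an Index", "Just read the WIKI")));
    }
}
```

The resulting string corresponds to one document of the ROWSET example shown above.<br />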
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except references. Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no), and saved in the appropriate configuration file (see how to install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
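The factory mechanism can be sketched as follows. The Plugin interface and DummyPlugin class here are illustrative stand-ins for the gCube plugin classes, not the real ones; only the reflective loading via Class.forName mirrors the factories described above.<br />

```java
// Sketch of factory-based plugin loading: an implementation is chosen
// at runtime by its fully qualified class name. Illustrative stand-ins only.
public class PluginLoading {

    public interface Plugin {
        String process(String text);
    }

    // Stand-in for a dummy language-id plugin that always returns "nolang".
    public static class DummyPlugin implements Plugin {
        public String process(String text) { return "nolang"; }
    }

    // Mirrors LanguageIdFactory/LemmatizerFactory: load by class name.
    public static Plugin load(String pluginName) throws Exception {
        return (Plugin) Class.forName(pluginName)
                             .getDeclaredConstructor()
                             .newInstance();
    }

    public static void main(String[] args) throws Exception {
        Plugin p = load("PluginLoading$DummyPlugin");
        System.out.println(p.process("any text")); // prints "nolang"
    }
}
```

Swapping the class name string is all that is needed to switch between the real and dummy plugins.<br />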
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library, and when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the package name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may specify the language per field. If present, this language is used when indexing the document and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
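The query prefixes described above can be composed with a trivial helper. This sketch only assembles the _querylang_ and _lang_ prefix strings shown in the examples; it is illustrative, not gCube API code.<br />

```java
// Sketch: composing language-aware query prefixes as described above.
// Illustrative string assembly only.
public class LangQuery {

    // Limit the whole query to one language: "_querylang_en: car OR bus OR plane"
    public static String queryLang(String lang, String query) {
        return "_querylang_" + lang + ": " + query;
    }

    // Limit a single field:term pair to one language: "_lang_en_title:bus"
    public static String termLang(String lang, String field, String term) {
        return "_lang_" + lang + "_" + field + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(queryLang("en", "car OR bus OR plane")); // _querylang_en: car OR bus OR plane
        System.out.println(termLang("en", "title", "bus"));         // _lang_en_title:bus
    }
}
```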
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used. The terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library, and when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' must therefore be include in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the env variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
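Query expansion can be sketched as replacing a term by an OR of all its known forms. The tiny in-memory form table below is illustrative only; the real module uses Fast's .aut dictionaries.<br />

```java
// Sketch of lemmatization by query expansion, as described above:
// a term is replaced by an OR of all known forms of the word.
import java.util.*;

public class QueryExpansion {

    // Illustrative stand-in for the lemmatizer's dictionary.
    private static final Map<String, List<String>> FORMS = new HashMap<>();
    static {
        FORMS.put("car", Arrays.asList("car", "cars"));
        FORMS.put("foot", Arrays.asList("foot", "feet"));
    }

    public static String expand(String term) {
        List<String> forms = FORMS.getOrDefault(term, Collections.singletonList(term));
        return "(" + String.join(" OR ", forms) + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("car"));  // (car OR cars)
        System.out.println(expand("bus"));  // (bus) unknown terms pass through unchanged
    }
}
```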
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. This design aims at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output for this simple CQL query. The "false" indication to the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that higher scores must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections provide the details for the refiners and rankers.<br />
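The modifier syntax above can be illustrated with a small parser sketch. This is not the actual gCube query parser; the class and method names here are invented for illustration only.<br />

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract the optional /name="value" modifiers of a
// geosearch relation into a map. Not the actual gCube CQL parser.
public class GeosearchModifiers {
    public static Map<String, String> parse(String relation) {
        Map<String, String> mods = new LinkedHashMap<>();
        // Each modifier has the form /name="value"
        Matcher m = Pattern.compile("/(\\w+)=\"([^\"]*)\"").matcher(relation);
        while (m.find()) mods.put(m.group(1), m.group(2));
        return mods;
    }

    public static void main(String[] args) {
        String rel = "geosearch/colID=\"colA\"/lang=\"en\"/inclusion=\"0\"";
        System.out.println(parse(rel)); // {colID=colA, lang=en, inclusion=0}
    }
}
```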
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that cannot produce any results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
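The cut-off check can be sketched in plain Java, using bounding rectangles as a stand-in for the polygons. This is illustrative only; the class and method names are invented, and the real implementation works on arbitrary polygons rather than rectangles.<br />

```java
import java.awt.geom.Rectangle2D;

// Hypothetical sketch of the "cut-off" rule described above: before executing
// a conjunction of two geosearch criteria with inclusion "contains", compare
// the collections and the query regions. If either check fails, no document
// can satisfy both criteria, so the subquery can be eliminated.
public class CutOffCheck {
    static boolean canProduceResults(String colA, Rectangle2D p1,
                                     String colB, Rectangle2D p2) {
        // Different collections: a document belongs to exactly one collection,
        // so the conjunction is necessarily empty.
        if (!colA.equals(colB)) return false;
        // Same collection: the conjunction is empty unless the regions overlap.
        return p1.intersects(p2);
    }

    public static void main(String[] args) {
        Rectangle2D p1 = new Rectangle2D.Double(0, 0, 10, 10);
        System.out.println(canProduceResults("colA", p1, "colA",
                new Rectangle2D.Double(20, 20, 5, 5)));  // false: disjoint regions
        System.out.println(canProduceResults("colA", p1, "colA",
                new Rectangle2D.Double(5, 5, 10, 10)));  // true: overlapping regions
    }
}
```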
<br />
The transformation of an initial CQL query into a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. The MergeSorter also supports flow control, pausing and synchronizing the workers that execute the single R-tree operations depending on the behavior of the client that reads the results.<br />
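The union-by-score idea behind the MergeSorter can be sketched as a k-way merge. This is a simplified illustration with invented names, not the actual MergeSorter implementation, and it omits the flow-control part.<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: each worker produces results from one R-tree in
// descending score order; the union is built by repeatedly taking the
// highest-scored head among all per-tree result streams.
public class ScoreMerge {
    public record Hit(String docId, double score) {}

    public static List<Hit> merge(List<List<Hit>> perTreeResults) {
        // Cursor points at the next unread position of one per-tree list.
        record Cursor(List<Hit> hits, int pos) {}
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            (a, b) -> Double.compare(b.hits().get(b.pos()).score(),
                                     a.hits().get(a.pos()).score()));
        for (List<Hit> hits : perTreeResults)
            if (!hits.isEmpty()) heap.add(new Cursor(hits, 0));
        List<Hit> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();                    // stream with highest head score
            merged.add(c.hits().get(c.pos()));
            if (c.pos() + 1 < c.hits().size())
                heap.add(new Cursor(c.hits(), c.pos() + 1));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> merged = merge(List.of(
            List.of(new Hit("a", 0.9), new Hit("b", 0.2)),
            List.of(new Hit("c", 0.5))));
        System.out.println(merged.stream().map(Hit::docId).toList()); // [a, c, b]
    }
}
```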
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set equal to x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided), in addition to any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. As with fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The former is required, while the latter is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list enumerating all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. First, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
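The two-step idea can be sketched with standard Java geometry classes. This is a simplified, hypothetical example with invented names; in the real index the filtering step runs inside the R-tree rather than over a flat list.<br />

```java
import java.awt.Polygon;
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-step sketch: a cheap filtering pass using the query's
// minimal bounding rectangle, followed by a refinement pass against the
// exact polygon that discards the false hits.
public class TwoStepQuery {
    static List<int[]> query(List<int[]> points, Polygon queryRegion) {
        Rectangle mbr = queryRegion.getBounds();
        List<int[]> candidates = new ArrayList<>();
        for (int[] p : points)                          // filtering step: MBR only
            if (mbr.contains(p[0], p[1])) candidates.add(p);
        List<int[]> hits = new ArrayList<>();
        for (int[] p : candidates)                      // refinement step: exact polygon
            if (queryRegion.contains(p[0], p[1])) hits.add(p);
        return hits;
    }

    public static void main(String[] args) {
        Polygon triangle = new Polygon(new int[]{0, 10, 0}, new int[]{0, 0, 10}, 3);
        List<int[]> points = List.of(new int[]{2, 2}, new int[]{8, 8}, new int[]{20, 1});
        // (2,2) is inside the triangle; (8,8) passes the MBR filter but is a false hit
        System.out.println(query(points, triangle).size()); // 1
    }
}
```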
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied in the request. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with time spans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the time span, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the ResultSet page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1); // 1.0 forces floating-point division<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments supplied in the request. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span a period longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last: the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument list and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results list.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It can answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving the documents whose key values fall within the corresponding ranges. The Forward Index Service design closely follows the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
<pre><br />
key: integer, value: string<br />
key: float, value: string<br />
key: string, value: string<br />
key: date, value: string<br />
</pre><br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through a single service, following the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup, and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities of the service implementing the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters, and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
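The clusterID selection controlled by '''useClusterId''' can be sketched as follows. This is a minimal illustration of the rule described above; the class and method names are hypothetical and not part of the service API.<br />

```java
// Hypothetical sketch: how the cluster identifier is derived from the
// useClusterId flag in deploy-jndi-config.xml. true groups nodes by
// indexID; false groups all nodes of the scope into one large cluster.
public class ClusterIdResolver {
    static String resolveClusterId(boolean useClusterId, String indexID, String scope) {
        return useClusterId ? indexID : scope;
    }

    public static void main(String[] args) {
        // With useClusterId=true the cluster is identified by the indexID...
        System.out.println(resolveClusterId(true, "myForwardIndex", "/gcube/devsec"));
        // ...with useClusterId=false it is identified by the scope.
        System.out.println(resolveClusterId(false, "myForwardIndex", "/gcube/devsec"));
    }
}
```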
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service, it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server, and the credentials for the Couchbase server need to be specified. All the Couchbase servers within the cluster must share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' quota of each bucket (index) can also be specified in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
Because Couchbase is not embedded (unlike ElasticSearch), a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting the credentials, port, data_path, etc.) needs to be done separately, once, after the Couchbase server installation, before the service can use it. To automate these routine processes (and some others), the following bash scripts have been developed and ship with the service.<br />
<br />
<source lang="bash"><br />
# Initialize a node:<br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down, remove the Couchbase server from the cluster and reinitialize it so that it can be restarted later:<br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want to delete the bucket (index) in order to rebuild it:<br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index exploits any opportunity to eliminate parts of the query that cannot produce results, by applying "cut-off" rules. The merging of the range queries' results is performed internally by the Forward Index.<br />
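The transformation shown in the figure can be illustrated with a small, self-contained example that rewrites a boolean query tree into a union (OR) of conjunctions (AND) of single criteria, i.e. disjunctive normal form. The classes below are illustrative only and not part of the Forward Index API.<br />

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the CQL rewrite described above: a boolean query
// tree becomes a union of range queries, each a conjunction of criteria.
public class DnfSketch {
    interface Node {}
    static class Criterion implements Node { final String text; Criterion(String t) { text = t; } }
    static class And implements Node { final Node l, r; And(Node l, Node r) { this.l = l; this.r = r; } }
    static class Or  implements Node { final Node l, r; Or(Node l, Node r)  { this.l = l; this.r = r; } }

    // Each inner list is one conjunctive range query; the outer list is the union.
    static List<List<String>> toDnf(Node n) {
        List<List<String>> out = new ArrayList<>();
        if (n instanceof Criterion) {
            out.add(new ArrayList<>(Arrays.asList(((Criterion) n).text)));
        } else if (n instanceof Or) {
            out.addAll(toDnf(((Or) n).l));
            out.addAll(toDnf(((Or) n).r));
        } else {
            And a = (And) n;
            for (List<String> l : toDnf(a.l))
                for (List<String> r : toDnf(a.r)) {
                    List<String> c = new ArrayList<>(l);
                    c.addAll(r);
                    out.add(c);
                }
        }
        return out;
    }

    public static void main(String[] args) {
        // (year >= 2000) AND ((lang == es) OR (lang == en))
        Node q = new And(new Criterion("year>=2000"),
                         new Or(new Criterion("lang==es"), new Criterion("lang==en")));
        System.out.println(toDnf(q));
    }
}
```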
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
An example row set:<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
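As a concrete illustration of the layout above, the following self-contained snippet extracts the indexable KEY pairs from a ROWSET document using the standard JAXP DOM API. It only demonstrates the document structure; actual feeding goes through a ResultSet locator, and the class name here is hypothetical.<br />

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: read the KEYNAME/KEYVALUE pairs out of a ROWSET document.
public class RowsetKeys {
    public static void main(String[] args) throws Exception {
        String rowset =
            "<ROWSET><INSERT><TUPLE>" +
            "<KEY><KEYNAME>title</KEYNAME><KEYVALUE>sun is up</KEYVALUE></KEY>" +
            "<KEY><KEYNAME>gDocCollectionLang</KEYNAME><KEYVALUE>es</KEYVALUE></KEY>" +
            "<VALUE><FIELD name=\"title\">sun is up</FIELD></VALUE>" +
            "</TUPLE></INSERT></ROWSET>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowset.getBytes("UTF-8")));
        NodeList keys = doc.getElementsByTagName("KEY");
        for (int i = 0; i < keys.getLength(); i++) {
            Element key = (Element) keys.item(i);
            String name  = key.getElementsByTagName("KEYNAME").item(0).getTextContent();
            String value = key.getElementsByTagName("KEYVALUE").item(0).getTextContent();
            System.out.println(name + " = " + value);
        }
    }
}
```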
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. A initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET for that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specifies the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21336Index Management Framework2014-04-11T10:37:11Z<p>Alex.antoniadi: /* Create an Index Service Node, feed and query using the corresponding client library */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, >, >=, <=, adj, fuzzy, proximity, within<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup, and updating of the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) on each container that constitutes the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters, and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the ''noReplicas'' and ''noShards'' variables in the ''deploy.properties'' file.<br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old one). If highlighting is enabled, the index returns a snippet of the matching query performed on the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the ''maxFragmentSize'' and ''maxFragmentCnt'' variables in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
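The effect of ''maxFragmentCnt'' and ''maxFragmentSize'' can be illustrated with a simplified, self-contained fragmenter. This only mimics the behaviour described above; it is not the service's actual highlighter, and all names are illustrative.<br />

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: collect up to maxFragmentCnt fragments of at most
// maxFragmentSize characters around each match of a term, then
// concatenate them into a single snippet.
public class SnippetSketch {
    static String snippet(String text, String term, int maxFragmentCnt, int maxFragmentSize) {
        List<String> fragments = new ArrayList<>();
        int from = 0;
        while (fragments.size() < maxFragmentCnt) {
            int hit = text.indexOf(term, from);
            if (hit < 0) break;
            // Center a window of maxFragmentSize characters on the match.
            int start = Math.max(0, hit - (maxFragmentSize - term.length()) / 2);
            int end = Math.min(text.length(), start + maxFragmentSize);
            fragments.add("..." + text.substring(start, end) + "...");
            from = hit + term.length();
        }
        return String.join(" ", fragments);
    }

    public static void main(String[] args) {
        String doc = "the sun is up and the sun is bright over the hills";
        System.out.println(snippet(doc, "sun", 2, 10));
    }
}
```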
<br />
<br />
<br />
The folder where the data of the index is stored can be configured by setting the ''dataDir'' variable in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is the folder where the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether to use the Resource Registry (for the translation of field ids to field names), we can change the value of the ''useRRAdaptor'' variable in the ''deploy-jndi-config.xml'':<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances are persisted locally has to be set in the ''resourcesFoldername'' variable in the ''deploy-jndi-config.xml'':<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope the node is running in, have to be set in the ''hostname'' and ''scope'' variables in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT equivalently.<br />
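The triple-by-triple translation in the table can be mimicked with a small helper. The method below is an illustrative sketch built from the examples in the table, not the Index Service's actual translator.<br />

```java
// Illustrative sketch of the CQL-triple-to-Lucene translation shown in
// the table above. The relation names come from the table; the class and
// method are hypothetical.
public class CqlToLucene {
    static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy term query
                return index + ":" + term + "~";
            case "within": {  // term is "low high" -> range query
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "proximity": { // term is "distance word1 word2 ..." -> sloppy phrase
                String[] parts = term.split(" ", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            }
            default:          // =, == : plain term query
                return index + ":" + term;
        }
    }

    public static void main(String[] args) {
        System.out.println(translate("title", "adj", "sun is up"));
        System.out.println(translate("date", "within", "2005 2008"));
        System.out.println(translate("title", "proximity", "5 sun up"));
    }
}
```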
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no language is specified). Fields present in the ROWSET but not in the IndexType are skipped. The elements under each "field" element define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki><!-- not a subfield --></nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, i.e. in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed with any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
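<br />
For illustration, the transformation of single CQL Index-Relation-Term triples into Lucene query syntax can be sketched as follows. This is a simplified sketch with hypothetical helper names, not the service's actual implementation:<br />
<br />
```java
// Illustrative sketch: how single CQL Index-Relation-Term triples map onto
// Lucene query syntax. Helper names are hypothetical; the real transformation
// lives inside the Full Text Index service.
public class CqlToLucene {
    // title adj "sun is up"  ->  title:"sun is up"  (phrase query)
    public static String adj(String field, String phrase) {
        return field + ":\"" + phrase + "\"";
    }

    // title fuzzy "invorvement"  ->  title:invorvement~  (fuzzy query)
    public static String fuzzy(String field, String term) {
        return field + ":" + term + "~";
    }

    // title proximity "5 sun up"  ->  title:"sun up"~5  (proximity query)
    public static String proximity(String field, int distance, String terms) {
        return field + ":\"" + terms + "\"~" + distance;
    }

    // date within "2005 2008"  ->  date:[2005 TO 2008]  (range query)
    public static String within(String field, String low, String high) {
        return field + ":[" + low + " TO " + high + "]";
    }

    public static void main(String[] args) {
        System.out.println(adj("title", "sun is up"));       // title:"sun is up"
        System.out.println(fuzzy("title", "invorvement"));   // title:invorvement~
        System.out.println(proximity("title", 5, "sun up")); // title:"sun up"~5
        System.out.println(within("date", "2005", "2008"));  // date:[2005 TO 2008]
    }
}
```
In a complete CQL query, the triples are connected with boolean operators (AND, OR, NOT), which map directly onto Lucene's boolean operators.<br />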
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node, the following are needed:<br />
* index-service-....war<br />
* smartgears (to publish the running instance of the service on the IS so that it is discoverable)<br />
* an application container (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application container we should provide the configuration needed by the Resource Registry.<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false).<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.<br />
Both are created according to the Builder pattern.<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
// create a client for the given scope (we can provide endpoint as extra filter)<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try{<br />
indexClient.feedLocator(locator);<br />
indexClient.query(query);<br />
} catch (IndexException e) {<br />
// handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required: they specify the IndexType that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document must contain an "ObjectID" field, which specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no language is specified). Fields present in the ROWSET but not in the IndexType are skipped. The elements under each "field" element define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, i.e. in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed with any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang"). The lemmatizer module has one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and for the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see how to install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The package is:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin.<br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations.<br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created are:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization tool in Java that implements the N-Gram-Based Text Categorization algorithm described here:<br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the package name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The documents being fed may specify the language per field. If present, this specified language is used when indexing the document, and the language id module is not used.<br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin.<br />
Since language is assigned at the collection level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded.<br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++, and the C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory):<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so.<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' must therefore be include in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump.<br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly...<br />
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for a new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which did not fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
EndpointReferenceType managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
FullTextIndexManagementFactoryPortType managementFactory = managementFactoryLocator<br />
	.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
EndpointReferenceType managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
FullTextIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new FullTextIndexUpdaterFactoryServiceAddressingLocator();<br />
EndpointReferenceType updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
FullTextIndexUpdaterFactoryPortType updaterFactory = updaterFactoryLocator<br />
	.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
EndpointReferenceType updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
FullTextIndexUpdaterServiceAddressingLocator updaterInstanceLocator = new FullTextIndexUpdaterServiceAddressingLocator();<br />
FullTextIndexUpdaterPortType updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
FullTextIndexLookupServiceAddressingLocator lookupInstanceLocator = new FullTextIndexLookupServiceAddressingLocator();<br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
} while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
ContentManagementServicePortType cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resources with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results and a ranker for computing a score for each result. All of the modifiers are optional. Let's examine the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by the points (1,1), (1,10), (10,1) and (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication in the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that cannot produce any results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
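The cut-off rule can be illustrated with a small self-contained sketch. The names here are ours, and for simplicity the regions are axis-aligned rectangles rather than the general polygons the service actually handles:<br />

```java
// Illustrative cut-off check (names are ours, not the service API): a
// conjunction of two containment criteria over the same collection can be
// discarded outright when the two query regions do not intersect.
class CutOff {
    // Axis-aligned rectangle (x1,y1)-(x2,y2) intersection test.
    static boolean intersects(double ax1, double ay1, double ax2, double ay2,
                              double bx1, double by1, double bx2, double by2) {
        return ax1 <= bx2 && bx1 <= ax2 && ay1 <= by2 && by1 <= ay2;
    }

    // Conjunction of two "contains" criteria: if the criteria target different
    // collections, or disjoint regions, the subquery can yield no results.
    static boolean canMatch(String colA, double[] r1, String colB, double[] r2) {
        if (!colA.equals(colB)) return false;  // each document has one collection
        return intersects(r1[0], r1[1], r1[2], r1[3],
                          r2[0], r2[1], r2[2], r2[3]);
    }
}
```

When canMatch returns false the whole conjunction is eliminated before any R-tree is consulted.<br />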
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
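The union performed by the MergeSorter can be sketched as a heap-based merge of per-R-tree result streams, each already sorted by descending score. This is a simplified, self-contained illustration with names of our own; the real component also implements the flow control described above, which is omitted here:<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative score-ordered union of several already-sorted R-tree result
// streams, in the spirit of the MergeSorter component (names are ours).
class MergeUnion {
    // Each input stream is a list of {score (Double), id (String)} pairs sorted
    // by descending score; the output is the global descending-score order.
    static List<String> merge(List<List<Object[]>> streams) {
        // Heap entry: {score, id, streamIndex, positionInStream}
        PriorityQueue<Object[]> heap = new PriorityQueue<>(
            (a, b) -> Double.compare((Double) b[0], (Double) a[0]));
        for (int s = 0; s < streams.size(); s++)
            if (!streams.get(s).isEmpty()) {
                Object[] first = streams.get(s).get(0);
                heap.add(new Object[]{first[0], first[1], s, 0});
            }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Object[] top = heap.poll();
            out.add((String) top[1]);
            int s = (Integer) top[2], pos = (Integer) top[3] + 1;
            List<Object[]> stream = streams.get(s);
            if (pos < stream.size())          // pull the next entry of that stream
                heap.add(new Object[]{stream.get(pos)[0], stream.get(pos)[1], s, pos});
        }
        return out;
    }
}
```

Because each stream is already sorted, only one entry per stream needs to sit in the heap at any time.<br />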
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1-mandatory and x2-set to equal x1 if not provided) as well as start and end Y coordinates (y1-mandatory and y2-set to equal y1 if not provided). In addition, a ROW may contain any number of FIELD elements, each containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
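Since only the year of a DATE value is mandatory, a parser for the field can try progressively shorter patterns until one succeeds. The sketch below is illustrative only, not the index's actual parser, and it assumes the trailing ".s" in the format denotes milliseconds ("S" in SimpleDateFormat terms):<br />

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative parser for the DATE field format: try the full pattern first,
// then progressively shorter ones, since only the year is mandatory.
// (Assumption: the ".s" suffix in the documented format means milliseconds.)
class FieldDateParser {
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss.S", "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd'T'HH:mm", "yyyy-MM-dd", "yyyy-MM", "yyyy"
    };

    static Date parse(String value) throws ParseException {
        ParseException last = null;
        for (String pattern : PATTERNS) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern);
            fmt.setLenient(false);          // reject out-of-range field values
            try {
                return fmt.parse(value);
            } catch (ParseException e) {
                last = e;                   // try the next, shorter pattern
            }
        }
        throw last;                         // nothing matched, not even "yyyy"
    }
}
```

Both "2001" and "2001-05-27T14:35:25.523" are accepted, matching the examples in the GeoROWSET above.<br />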
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to process a query. Firstly, a very efficient filtering step will find all possible hits (along with some false hits) using the minimal bounding rectangle (mbr) of the query region. Then, a more costly refinement step will use additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a refiner plugin. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments provided in the code. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining if the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1);<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments provided in the code. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a period longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, however we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object containing the points of the vertices of the query region, an optional RankingRequest object, and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be passed to that RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be passed to that Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
                 .getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
                 new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
                 geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
                 .createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving the documents whose key values lie within the corresponding range. The Forward Index Service design closely follows the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
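As a concrete illustration of this data model, the sketch below (hypothetical helper code, not part of the service API) stores a document as a set of key-value pairs and evaluates a one-dimensional integer range criterion against it, using the schema-declared type for the comparison:<br />
<br />
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not part of the Forward Index API): a document is a
// set of typed key/value pairs, and a one-dimensional range criterion
// selects the documents whose value for a given key falls inside an interval.
public class ForwardIndexSketch {

    // The schema declares this key as an integer, so the stored string
    // value is compared numerically.
    static boolean inIntRange(Map<String, String> doc, String key, int low, int high) {
        String raw = doc.get(key);
        if (raw == null) {
            return false;               // document has no value for this key
        }
        int v = Integer.parseInt(raw);  // schema-driven typed comparison
        return v >= low && v <= high;   // inclusive range criterion
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<String, String>();
        doc.put("year", "2009");
        doc.put("title", "sun is up");

        System.out.println(inIntRange(doc, "year", 2000, 2010)); // true
        System.out.println(inIntRange(doc, "year", 2010, 2020)); // false
    }
}
```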
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through a single service, following the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can be configured with the number of replicas for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service, it is important to change some properties in deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server, and the credentials for the Couchbase server need to be specified. All the Couchbase servers within the cluster must share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' of each bucket-index can also be specified in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once, after the Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# To initialize a node, run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down, you also need to remove the couchbase server from the cluster and reinitialize it in order to restart it later; run: <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want, for some reason, to delete the bucket (index) in order to rebuild it, run: <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
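The transformation in the figure amounts to a standard normalization of the query tree into a union (disjunction) of conjunctions. The code below is a hypothetical, self-contained illustration of that step; the Node/Leaf/And/Or classes and the string criteria are invented for the example and are not part of the Forward Index API:<br />
<br />
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the transformation described above: a boolean query
// tree (AND/OR over single-key criteria) is normalized into a union (OR) of
// conjunctions, each of which the index can answer as a single range query.
public class CqlNormalizer {

    static abstract class Node {}

    static class Leaf extends Node {      // a single-key criterion, e.g. "year >= 2000"
        final String criterion;
        Leaf(String criterion) { this.criterion = criterion; }
    }

    static class And extends Node {
        final Node left, right;
        And(Node left, Node right) { this.left = left; this.right = right; }
    }

    static class Or extends Node {
        final Node left, right;
        Or(Node left, Node right) { this.left = left; this.right = right; }
    }

    // Returns the union as a list of conjunctions; each inner list holds
    // the single criteria of one range query.
    static List<List<String>> toUnionOfConjunctions(Node n) {
        List<List<String>> out = new ArrayList<List<String>>();
        if (n instanceof Leaf) {
            List<String> conj = new ArrayList<String>();
            conj.add(((Leaf) n).criterion);
            out.add(conj);
        } else if (n instanceof Or) {      // union: concatenate both sides
            out.addAll(toUnionOfConjunctions(((Or) n).left));
            out.addAll(toUnionOfConjunctions(((Or) n).right));
        } else {                           // And: distribute over both sides
            And a = (And) n;
            for (List<String> l : toUnionOfConjunctions(a.left)) {
                for (List<String> r : toUnionOfConjunctions(a.right)) {
                    List<String> merged = new ArrayList<String>(l);
                    merged.addAll(r);
                    out.add(merged);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // (year >= 2000 or year <= 1990) and lang == "es"
        Node q = new And(new Or(new Leaf("year >= 2000"), new Leaf("year <= 1990")),
                         new Leaf("lang == es"));
        System.out.println(toUnionOfConjunctions(q));
        // two range queries: [year >= 2000, lang == es] and [year <= 1990, lang == es]
    }
}
```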
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
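A hypothetical consumer-side sketch of this distinction, using only the standard Java DOM API: it extracts the indexable key names from the <KEYNAME> elements and the presentable field names from the name attributes of the <FIELD> elements. The helper class is illustrative and not part of the service:<br />
<br />
```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: parse a ROWSET tuple with the standard DOM API and
// separate the indexable key names (<KEYNAME>) from the presentable field
// names (the name attribute of <FIELD>).
public class RowsetParser {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    }

    // Indexable information: the text content of every <KEYNAME> element.
    static List<String> keyNames(String xml) throws Exception {
        List<String> names = new ArrayList<String>();
        NodeList keys = parse(xml).getElementsByTagName("KEYNAME");
        for (int i = 0; i < keys.getLength(); i++) {
            names.add(keys.item(i).getTextContent());
        }
        return names;
    }

    // Presentable information: the name attribute of every <FIELD> element.
    static List<String> fieldNames(String xml) throws Exception {
        List<String> names = new ArrayList<String>();
        NodeList fields = parse(xml).getElementsByTagName("FIELD");
        for (int i = 0; i < fields.getLength(); i++) {
            names.add(((Element) fields.item(i)).getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ROWSET><INSERT><TUPLE>"
                + "<KEY><KEYNAME>title</KEYNAME><KEYVALUE>sun is up</KEYVALUE></KEY>"
                + "<KEY><KEYNAME>gDocCollectionLang</KEYNAME><KEYVALUE>es</KEYVALUE></KEY>"
                + "<VALUE><FIELD name=\"title\">sun is up</FIELD></VALUE>"
                + "</TUPLE></INSERT></ROWSET>";
        System.out.println(keyNames(xml));   // [title, gDocCollectionLang]
        System.out.println(fieldNames(xml)); // [title]
    }
}
```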
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. A initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET for that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specifies the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21335Index Management Framework2014-04-11T10:34:59Z<p>Alex.antoniadi: /* Index Service */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
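For illustration, here are a few sample CQL queries using some of the relations listed above; the index names (''title'', ''year'') are hypothetical and depend on the searchable fields of the target index:<br />
<br />
<pre><br />
title adj "sun is up"<br />
year >= "2000" and year <= "2010"<br />
title == "sun" or title fuzzy "son"<br />
</pre><br />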
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through a single service, following the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best practice is to have one resource (one node) at each container that constitutes the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
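The clusterID choice described above can be sketched in a few lines; the class and method names below are illustrative only and are not part of the service API:<br />

```java
// Sketch of the clusterID selection: per the text, defaultSameCluster=true
// makes the clusterID equal to the indexID, while false makes it the scope.
public class ClusterIdSketch {
    static String clusterId(boolean defaultSameCluster, String indexId, String scope) {
        return defaultSameCluster ? indexId : scope;
    }

    public static void main(String[] args) {
        System.out.println(clusterId(true, "myIndexID", "/gcube/devNext"));  // myIndexID
        System.out.println(clusterId(false, "myIndexID", "/gcube/devNext")); // /gcube/devNext
    }
}
```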
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, can configure the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file. <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old one). If highlighting is enabled, the index returns a snippet of the query matches found in the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
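As an illustration of how these two limits interact, the following sketch assembles a snippet from matching fragments. It is a simplified model of the behaviour described above, not the service's actual highlighting code:<br />

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of snippet construction: keep at most maxFragmentCnt
// fragments, clip each to maxFragmentSize characters, and join them.
public class SnippetSketch {
    static String buildSnippet(List<String> fragments, int maxFragmentCnt, int maxFragmentSize) {
        StringBuilder snippet = new StringBuilder();
        for (int i = 0; i < fragments.size() && i < maxFragmentCnt; i++) {
            String f = fragments.get(i);
            if (f.length() > maxFragmentSize) f = f.substring(0, maxFragmentSize);
            if (snippet.length() > 0) snippet.append(" ... ");
            snippet.append(f);
        }
        return snippet.toString();
    }

    public static void main(String[] args) {
        List<String> frags = Arrays.asList("the sun is up", "up and away", "sunset boulevard");
        System.out.println(buildSnippet(frags, 2, 80));
        // the sun is up ... up and away
    }
}
```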
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether or not to use the Resource Registry (for translation of field ids to field names), we can change the value of the variable ''useRRAdaptor'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running on, have to be set in the variables ''hostname'' and ''scope'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT accordingly.<br />
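The transformation above can be sketched as follows; the class and method names are illustrative and do not belong to the actual service code:<br />

```java
// Sketch of the triple transformation from the table above, plus the
// boolean connection of transformed triples into a full Lucene query.
public class CqlToLuceneSketch {
    static String triple(String index, String relation, String term) {
        switch (relation) {
            case "adj":   return index + ":\"" + term + "\"";   // phrase query
            case "fuzzy": return index + ":" + term + "~";      // fuzzy match
            case "within": {                                    // term is "low high"
                String[] b = term.split(" ");
                return index + ":[" + b[0] + " TO " + b[1] + "]";
            }
            case "proximity": {                                 // term is "distance words..."
                String[] p = term.split(" ", 2);
                return index + ":\"" + p[1] + "\"~" + p[0];
            }
            default: return index + ":" + term;                 // '=' and '=='
        }
    }

    public static void main(String[] args) {
        // CQL: title adj "sun is up" AND date within "2005 2008"
        String lucene = triple("title", "adj", "sun is up")
                + " AND " + triple("date", "within", "2005 2008");
        System.out.println(lucene); // title:"sun is up" AND date:[2005 TO 2008]
    }
}
```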
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify, respectively, the Index Type that the Index must have been created with and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field, which specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki><!-- not a subfield --></nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
<br />
==Deployment Instructions==<br />
<br />
In order to deploy and run the Index Service on a node, we need the following:<br />
* index-service-....war<br />
* smartgears (to publish the running instance of the service on the IS and make it discoverable)<br />
* an application container (such as Tomcat, JBoss, Jetty)<br />
<br />
Before starting the application service, we should provide the configuration needed by the Resource Registry.<br />
This configuration should be placed in the file ''$CATALINA/conf/infrastructure.properties'', and the variables<br />
that need to be set are: ''infrastructure'', ''scopes'' and ''clientMode'' (clientMode should be set to false).<br />
<br />
Example :<br />
<pre><br />
infrastructure=gcube<br />
scopes=devNext<br />
clientMode=false<br />
</pre><br />
<br />
<br />
==Usage Example==<br />
<br />
===Create an Index Service Node, feed and query using the corresponding client library===<br />
<br />
<source lang="java"><br />
<br />
final String scope = "/gcube/devNext"; <br />
<br />
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();<br />
indexFactoryClient.createResource("myClusterID", scope);<br />
<br />
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)<br />
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();<br />
<br />
try {<br />
    indexClient.feedLocator(locator);<br />
    indexClient.query(query);<br />
} catch (IndexException e) {<br />
    // handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language of the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing, always returning "nolang" when called). The lemmatizer module contains one real implementation (one plugin), since no suitable alternative was found for a second plugin, and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and for the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32-character string. This string must be provided by Fast (contact Stefan Debald, setfan.debald@fast.no) and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created are:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one catalog for RHE3 and one catalog for RHE4 shared objects (.so). The etc catalog contains the config files. The license string is contained in the config file config.txt<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as a part of the create resource call.<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may specify the language per field. If present, this specified language is used when indexing the document; in this case the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (configuration files) for the language to be lemmatized. Both expansion and reduction are supported, but only expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerIdFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory for the RHE3 and one for the RHE4 shared objects (.so). The etc directory contains the configuration files. The license string is contained in the configuration file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so.<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml file contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is therefore important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language-aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language-aware query) will be made available in a future release.<br />
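A minimal sketch of query-side expansion, assuming a toy in-memory dictionary in place of the .aut lemmatization resources (all names here are hypothetical):<br />

```java
import java.util.List;
import java.util.Map;

// Sketch of lemmatization by expansion: each query term is replaced by an
// OR-group of its known surface forms. The dictionary is a toy stand-in
// for the real .aut resources.
public class ExpansionSketch {

    static final Map<String, List<String>> FORMS = Map.of(
        "car", List.of("car", "cars"),
        "bus", List.of("bus", "buses"));

    // Expand a single term into "(form1 OR form2 ...)"; unknown terms pass through.
    public static String expand(String term) {
        List<String> forms = FORMS.get(term);
        if (forms == null) return term;
        return "(" + String.join(" OR ", forms) + ")";
    }

    // Expand every non-operator token of a whitespace-separated query.
    public static String expandQuery(String query) {
        StringBuilder out = new StringBuilder();
        for (String token : query.split(" ")) {
            if (out.length() > 0) out.append(" ");
            out.append(token.equals("OR") || token.equals("AND") ? token : expand(token));
        }
        return out.toString();
    }
}
```

For example, expandQuery("car OR plane") yields "(car OR cars) OR plane".<br />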
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefand.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service. When there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource is created to handle all the content which did not fit on the first resource. Partitioning is handled automatically and is transparent when performing a query; the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
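The overflow behaviour can be sketched as follows. The capacity check is by document count purely for illustration (the real service is constrained by disk space), and all names are hypothetical:<br />

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of transparent partitioning: when a partition reaches capacity,
// a new one is created to take the overflow; a query consults every
// partition and merges the answers, so callers never see the boundaries.
public class PartitionedIndexSketch {
    static final int CAPACITY = 2;                // documents per partition (toy value)
    private final List<List<String>> partitions = new ArrayList<>();

    public void add(String doc) {
        if (partitions.isEmpty()
                || partitions.get(partitions.size() - 1).size() >= CAPACITY) {
            partitions.add(new ArrayList<>());    // "create a new resource"
        }
        partitions.get(partitions.size() - 1).add(doc);
    }

    // Query every partition and collect the hits.
    public List<String> query(String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> p : partitions)
            for (String doc : p)
                if (doc.contains(term)) hits.add(doc);
        return hits;
    }

    public int partitionCount() { return partitions.size(); }
}
```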
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and for exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output for this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that cannot produce any results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
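The cut-off test can be sketched with minimal bounding rectangles. The real service evaluates polygons through Geotools; axis-aligned rectangles and all names below are simplifying assumptions:<br />

```java
// Sketch of the "cut-off" rule: if the two query regions cannot overlap,
// or the two criteria name different collections, the conjunction is
// eliminated without touching any R-tree.
public class CutOffSketch {

    /** Axis-aligned minimal bounding rectangle. */
    public static class Rect {
        final double x1, y1, x2, y2;
        public Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        public boolean intersects(Rect o) {
            return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
        }
    }

    // A conjunction of two geosearch criteria can only yield results when
    // both criteria name the same collection and their regions can overlap.
    public static boolean canProduceResults(Rect p1, String col1, Rect p2, String col2) {
        if (!col1.equals(col2)) return false;  // each document is in one collection
        return p1.intersects(p2);
    }
}
```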
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
<br />
The transformation of an initial CQL query into a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
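The union performed by the MergeSorter can be sketched as a k-way merge of score-ordered streams using a priority queue. This is a sketch under assumed names, not the actual MergeSorter code (flow control and pausing of workers are omitted):<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the MergeSorter's union: each R-tree worker yields results
// already ordered by descending score; a priority queue repeatedly takes
// the best head among the streams, producing one globally ordered stream.
public class MergeSorterSketch {

    /** One scored hit from a single R-tree worker. */
    public static class Hit {
        public final String id;
        public final double score;
        public Hit(String id, double score) { this.id = id; this.score = score; }
    }

    // Merge several score-descending streams into one, preserving the order.
    public static List<Hit> merge(List<List<Hit>> streams) {
        // Queue entry: {stream index, position within that stream}.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> Double.compare(streams.get(b[0]).get(b[1]).score,
                                     streams.get(a[0]).get(a[1]).score));
        for (int s = 0; s < streams.size(); s++)
            if (!streams.get(s).isEmpty()) heads.add(new int[]{s, 0});

        List<Hit> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] h = heads.poll();
            out.add(streams.get(h[0]).get(h[1]));
            if (h[1] + 1 < streams.get(h[0]).size())
                heads.add(new int[]{h[0], h[1] + 1});
        }
        return out;
    }
}
```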
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set to equal x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set to equal y1 if not provided), in addition to any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to full text indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
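As an illustration of the DATE type, a field value could be normalized to seconds since the Epoch (the internal representation mentioned later for dates) roughly as follows. The defaulting rules for missing components are an assumption, as is the helper name:<br />

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Sketch of normalizing a DATE field value in the format
// yyyy-MM-dd'T'HH:mm:ss.s (only yyyy mandatory) to epoch seconds.
// Missing components are assumed to default to the start of the period.
public class DateFieldSketch {

    public static long toEpochSeconds(String value) {
        String full = value;
        if (full.length() == 4)  full += "-01";        // yyyy        -> yyyy-MM
        if (full.length() == 7)  full += "-01";        // yyyy-MM     -> yyyy-MM-dd
        if (full.length() == 10) full += "T00:00:00";  // date only   -> midnight
        // LocalDateTime.parse accepts ISO timestamps with or without a fraction.
        return LocalDateTime.parse(full).toEpochSecond(ZoneOffset.UTC);
    }
}
```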
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. Firstly, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object is within a query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied by the caller. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least, we need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the length of the timespan and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the resultset page size. This means that a somewhat computationally heavy operation can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds since the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1); // 1.0 forces floating-point division; 1/(spanSize + 1) would truncate to 0 for any positive span<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
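The ranking formula is worth sanity-checking in isolation: with a `long` operand, writing `1/(spanSize + 1)` would perform integer division and rank every positive span as 0, so the numerator must be a floating-point literal. Below is a minimal, dependency-free check of the formula (SpanRankCheck is a hypothetical helper class for illustration, not part of the gCube API):<br />

```java
public class SpanRankCheck {
    // Same formula as SpanSizeRanker.rank(): shorter spans rank higher.
    static double rank(long startTime, long endTime) {
        long spanSize = endTime - startTime;
        return 1.0 / (spanSize + 1); // 1.0, not 1: avoids integer division
    }

    public static void main(String[] args) {
        System.out.println(rank(0, 0));  // zero-length span -> 1.0
        System.out.println(rank(0, 9));  // span of 9 -> 0.1
        System.out.println(rank(0, 99)); // span of 99 -> 0.01
    }
}
```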
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.diligentproject.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, providing the plugin with any arguments provided in the request. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a RankEvaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
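The filtering step at the heart of refine() is plain Java and can be sanity-checked without an index in place. In the sketch below, spans are represented as two-element long arrays, an illustrative stand-in for the service's Entry objects (SpanFilterCheck is a hypothetical class, not part of the gCube API):<br />

```java
import java.util.ArrayList;
import java.util.List;

public class SpanFilterCheck {
    // Keep only spans whose length is <= maxSpanSize, normalizing
    // reversed start/end values first, as SpanSizeRefiner.refine() does.
    static List<long[]> refine(List<long[]> spans, long maxSpanSize) {
        List<long[]> result = new ArrayList<>();
        for (long[] span : spans) {
            long start = Math.min(span[0], span[1]);
            long end = Math.max(span[0], span[1]);
            if (end - start <= maxSpanSize) {
                result.add(span);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<long[]> spans = new ArrayList<>();
        spans.add(new long[]{0, 50});    // span 50: kept
        spans.add(new long[]{200, 100}); // reversed, span 100: kept
        spans.add(new long[]{0, 5000});  // span 5000: dropped
        System.out.println(refine(spans, 100).size()); // prints 2
    }
}
```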
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents whose key values fall within the corresponding range. The Forward Index Service design closely follows the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through a single service, following the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities of the service used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|ForwardIndexNode Service]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example:<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, the underlying technology of the new Forward Index, allows the number of replicas of each index to be configured. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml file so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they know how to connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' of each bucket-index can be specified in the deploy-jndi-config.xml file as well.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (unlike ElasticSearch) means that it is possible for a ForwardIndexNode to stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, for the first time after Couchbase server installation, so it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup exploits any opportunity to eliminate parts of the query that cannot produce results, by applying "cut-off" rules. The merge operation on the range queries' results is performed internally by the Forward Index.<br />
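The rewrite sketched above amounts to converting the query to disjunctive normal form: pushing ANDs inside ORs until the query is a union (OR) of conjunctions of single criteria. A minimal, self-contained sketch of that step follows (DnfSketch and its Node type are hypothetical illustrations; the actual service operates on CQL parse trees):<br />

```java
import java.util.ArrayList;
import java.util.List;

public class DnfSketch {
    // A query node: either a single criterion (leaf) or AND/OR of two children.
    static final int LEAF = 0, AND = 1, OR = 2;

    static class Node {
        int kind; String criterion; Node left, right;
        Node(String c) { kind = LEAF; criterion = c; }
        Node(int k, Node l, Node r) { kind = k; left = l; right = r; }
    }

    // Returns the query as a union of range queries: each inner list
    // is a conjunction of single criteria.
    static List<List<String>> toDnf(Node n) {
        List<List<String>> out = new ArrayList<>();
        if (n.kind == LEAF) {
            List<String> single = new ArrayList<>();
            single.add(n.criterion);
            out.add(single);
        } else if (n.kind == OR) {
            // A union of the two sides' unions.
            out.addAll(toDnf(n.left));
            out.addAll(toDnf(n.right));
        } else {
            // AND: cross-product of the two sides' conjunctions.
            for (List<String> l : toDnf(n.left))
                for (List<String> r : toDnf(n.right)) {
                    List<String> merged = new ArrayList<>(l);
                    merged.addAll(r);
                    out.add(merged);
                }
        }
        return out;
    }

    public static void main(String[] args) {
        // (a OR b) AND c  ->  (a AND c) OR (b AND c)
        Node q = new Node(AND,
                new Node(OR, new Node("a"), new Node("b")),
                new Node("c"));
        System.out.println(toDnf(q)); // [[a, c], [b, c]]
    }
}
```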
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
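Since the ROWSET is plain XML, a feeder can assemble it with simple string building. The sketch below produces a minimal one-key tuple following the element names of the schema above (RowsetSketch is a hypothetical helper; a real feeder should also XML-escape the values):<br />

```java
public class RowsetSketch {
    // Build a minimal ROWSET tuple: one indexable <KEY> plus the
    // matching presentable <FIELD> under <VALUE>.
    // Note: values are inserted verbatim; real code must escape XML.
    static String tuple(String keyName, String keyValue) {
        return "<ROWSET><INSERT><TUPLE>"
             + "<KEY><KEYNAME>" + keyName + "</KEYNAME>"
             + "<KEYVALUE>" + keyValue + "</KEYVALUE></KEY>"
             + "<VALUE><FIELD name=\"" + keyName + "\">" + keyValue + "</FIELD></VALUE>"
             + "</TUPLE></INSERT></ROWSET>";
    }

    public static void main(String[] args) {
        System.out.println(tuple("title", "sun is up"));
    }
}
```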
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. A initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET for that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The properties Host and Port must be edited to point to the VO of interest.<br />
<br />
The test client obtains the EPRs of the factory services and uses them<br />
to create the stateful web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test client creates one WS-Resource of each type, inserts some data through the updater resource, and queries<br />
the data by using the lookup WS-Resource.<br />
<br />
Inserting data and deleting tuples:<br />
Tuples can be inserted and deleted by:<br />
*insertingPair(key,value) / deletingPair(key) : simple methods to insert / delete tuples.<br />
*process(rowSet) : method to insert / delete a series of tuples.<br />
*procesResultSet : method to insert / delete a series of tuples from a rowset contained in a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21334Index Management Framework2014-04-11T10:14:05Z<p>Alex.antoniadi: /* Full Text Index */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Index Service=<br />
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''Index Service''' represents an index node. It is used for managing, updating, and looking up the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the recommended setup is one resource (one node) on each container that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small ones. <br />
<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
defaultSameCluster=true<br />
</pre><br />
or<br />
<br />
<pre><br />
defaultSameCluster=false<br />
</pre><br />
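The selection described above can be sketched as follows. This is an illustrative sketch only; the class name is hypothetical and not part of the service:<br />

```java
// Hypothetical sketch of the clusterID selection described above:
// with defaultSameCluster=true the indexID is used as the clusterID,
// otherwise the scope is used.
public class ClusterIdResolver {
    public static String resolve(boolean defaultSameCluster, String indexID, String scope) {
        return defaultSameCluster ? indexID : scope;
    }
}
```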
<br />
''ElasticSearch'', which is the underlying technology of the new Index Service, allows configuring the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file: <br />
<br />
Example:<br />
<pre><br />
noReplicas=1<br />
noShards=2<br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the content matching the query, taken from the presentable fields. This snippet is usually a concatenation of a number of fragments from those fields that match the query. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' respectively in the ''deploy.properties'' file:<br />
<br />
Example:<br />
<pre><br />
maxFragmentCnt=5<br />
maxFragmentSize=80<br />
</pre><br />
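To illustrate how the two limits interact, the following sketch assembles a snippet from matching fragments. It is not the actual service code (the real highlighting is performed by the underlying engine) and all names are illustrative:<br />

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: keep at most maxFragmentCnt fragments, truncate
// each to maxFragmentSize characters, and join them into one snippet.
public class SnippetBuilder {
    public static String build(List<String> fragments, int maxFragmentCnt, int maxFragmentSize) {
        return fragments.stream()
                .limit(maxFragmentCnt)                      // cap the number of fragments
                .map(f -> f.length() <= maxFragmentSize
                        ? f
                        : f.substring(0, maxFragmentSize))  // cap each fragment's size
                .collect(Collectors.joining(" ... "));      // concatenate into the snippet
    }
}
```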
<br />
<br />
<br />
The folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is the folder from which the container runs).<br />
<br />
Example : <br />
<pre><br />
dataDir=./data<br />
</pre><br />
<br />
In order to configure whether the Resource Registry is used or not (for the translation of field ids to field names), the value of the variable ''useRRAdaptor'' in the ''deploy-jndi-config.xml'' can be changed.<br />
<br />
Example : <br />
<pre><br />
useRRAdaptor=true<br />
</pre><br />
<br />
<br />
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
resourcesFoldername=/tmp/resources/index<br />
</pre><br />
<br />
Finally, the hostname of the node, as well as the scope that the node is running on, have to be set in the ''deploy-jndi-config.xml''.<br />
<br />
Example :<br />
<pre><br />
hostname=dl015.madgik.di.uoa.gr<br />
scope=/gcube/devNext<br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, and NOT equivalently.<br />
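The per-triple transformations in the table above can be sketched as a simple string rewrite. This is an illustrative sketch only; the class and method names are assumptions, not the service's API:<br />

```java
// Illustrative mapping of CQL Index-Relation-Term triples to Lucene
// query strings, following the examples in the table above.
public class CqlToLucene {
    public static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":       // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":     // fuzzy term query
                return index + ":" + term + "~";
            case "within": {  // term is "low high" -> range query
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "proximity": { // term is "distance word1 word2 ..." -> sloppy phrase
                int space = term.indexOf(' ');
                String distance = term.substring(0, space);
                String words = term.substring(space + 1);
                return index + ":\"" + words + "\"~" + distance;
            }
            default:          // "=", "==" -> plain term query
                return index + ":" + term;
        }
    }
}
```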
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required: they specify the Index Type with which the Index must have been created, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that the "ObjectID" field is required for each document, as it specifies the document's unique identifier.<br />
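A minimal sketch of producing such a ROWSET by string concatenation is shown below, enforcing the required "ObjectID" field. It is illustrative only (real feeds are served through the ResultSet framework, and values would additionally need XML escaping):<br />

```java
import java.util.Map;

// Illustrative sketch: build a ROWSET document as described above.
// The class name is hypothetical; values are assumed not to need escaping.
public class RowSetWriter {
    public static String row(Map<String, String> fields) {
        if (!fields.containsKey("ObjectID"))
            throw new IllegalArgumentException("ObjectID is required for every document");
        StringBuilder sb = new StringBuilder("  <ROW>\n");
        for (Map.Entry<String, String> e : fields.entrySet())
            sb.append("    <FIELD name=\"").append(e.getKey()).append("\">")
              .append(e.getValue()).append("</FIELD>\n");
        return sb.append("  </ROW>").toString();
    }

    public static String rowset(String idxType, String colID, String lang, String... rows) {
        StringBuilder sb = new StringBuilder("<ROWSET idxType=\"" + idxType
                + "\" colID=\"" + colID + "\" lang=\"" + lang + "\">\n");
        for (String r : rows) sb.append(r).append("\n");
        return sb.append("</ROWSET>").toString();
    }
}
```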
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
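The options above can be modeled as a small value class that renders one <field> element and enforces the constraint that a field must be stored in order to be returned. This is an illustrative sketch, not part of the service:<br />

```java
// Illustrative sketch of the per-field IndexType options described above.
// The class name is hypothetical; sort and boost are rendered with their
// unused default values.
public class FieldSpec {
    final String name;
    final boolean index, store, returned, tokenize;

    public FieldSpec(String name, boolean index, boolean store, boolean returned, boolean tokenize) {
        if (returned && !store)
            throw new IllegalArgumentException("a field must be stored to be returned");
        this.name = name; this.index = index; this.store = store;
        this.returned = returned; this.tokenize = tokenize;
    }

    static String yesNo(boolean b) { return b ? "yes" : "no"; }

    // Renders the <field> element in the IndexType layout shown above.
    public String toXml() {
        return "<field name=\"" + name + "\">"
             + "<index>" + yesNo(index) + "</index>"
             + "<store>" + yesNo(store) + "</store>"
             + "<return>" + yesNo(returned) + "</return>"
             + "<tokenize>" + yesNo(tokenize) + "</tokenize>"
             + "<sort>no</sort><boost>1.0</boost></field>";
    }
}
```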
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki><!-- not a subfield --></nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, that is, in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle", in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
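This sub-field behaviour can be sketched as a recursive expansion over the field hierarchy of the example IndexType above. The class is hypothetical and the hierarchy is hard-coded for illustration:<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a query on a field covers the field itself plus
// all of its (transitive) sub-fields, mirroring the example IndexType.
public class SubFieldExpander {
    static final Map<String, List<String>> CHILDREN = Map.of(
            "contents", List.of("title", "foreword", "startChapter", "endChapter"),
            "title", List.of("bookTitle", "chapterTitle"));

    public static List<String> expand(String field) {
        List<String> out = new ArrayList<>();
        out.add(field);
        for (String child : CHILDREN.getOrDefault(field, List.of()))
            out.addAll(expand(child));   // recurse into sub-fields
        return out;
    }
}
```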
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
==Usage Example==<br />
<br />
===Create a FullTextIndex Node, feed and query using the corresponding client library===<br />
<br />
<source lang="java"><br />
FullTextIndexNodeFactoryCLProxyI proxyRandomf = FullTextIndexNodeFactoryDSL.getFullTextIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = FullTextIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = FullTextIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
FullTextIndexNodeCLProxyI proxyRandom = FullTextIndexNodeDSL.getFullTextIndexNodeProxyBuilder().at((W3CEndpointReference)refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (FullTextIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID", "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained bellow:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, that is, in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle", in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (which does nothing and always returns "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (which always returns an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and for the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see below). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see the Linguistics Licenses section below).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
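The loading pattern described above can be sketched as follows. The class names LangIdPlugin, DummyLangIdPlugin and PluginFactoryDemo are illustrative stand-ins, not the actual gCube classes; only the reflective loading mirrors the documented code.<br />

```java
// Illustrative sketch of the plugin-loading pattern described above.
interface LangIdPlugin {
    String identifyLanguage(String text);
}

class DummyLangIdPlugin implements LangIdPlugin {
    // Mirrors the documented dummy plugin: always returns "nolang".
    public String identifyLanguage(String text) {
        return "nolang";
    }
}

public class PluginFactoryDemo {
    // The factory loads the implementation by its fully qualified class
    // name, as in: Class.forName(pluginName).newInstance();
    static LangIdPlugin load(String pluginName) throws Exception {
        return (LangIdPlugin) Class.forName(pluginName)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        LangIdPlugin plugin = load("DummyLangIdPlugin");
        System.out.println(plugin.identifyLanguage("any text")); // prints "nolang"
    }
}
```

Because the plugin is chosen by name at resource-creation time, new implementations can be added without changing the factory.<br />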
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created are:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library object and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory for RHE3 and one for RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call<br />
(see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The documents being fed may specify the language per field. If present, this language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin. <br />
Since language is assigned at the collection level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed on a per-query or per-term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
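For illustration, the query forms above can be composed programmatically. The helper class below is a sketch that simply concatenates the documented prefixes; it is not a gCube API.<br />

```java
public class LangAwareQuery {
    // Hypothetical helpers that compose the language-aware query forms
    // shown above; only the "_querylang_" and "_lang_" prefixes come
    // from the documentation.
    static String queryLang(String lang, String query) {
        // Limits all terms of the query to the given language.
        return "_querylang_" + lang + ": " + query;
    }

    static String termLang(String lang, String field, String term) {
        // Limits a single fielded term to the given language.
        return "_lang_" + lang + "_" + field + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(queryLang("en", "car OR bus OR plane"));
        System.out.println("car OR " + termLang("en", "title", "bus") + " OR plane");
    }
}
```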
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory for RHE3 and one for RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so.<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
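Since a wrong .aut path crashes the JVM when the native code loads it, the file can be checked from Java first so the failure is a clean error instead of a core dump. The helper below is a defensive sketch; no such method exists in the service itself.<br />

```java
import java.io.File;

public class LemmatizerConfigCheck {
    // Resolves a .aut path (relative to $GLOBUS_LOCATION, as described
    // above) and fails fast with a readable error if the file is missing,
    // instead of letting the native lemmatizer crash the JVM.
    static File resolveAutFile(String globusLocation, String relativePath) {
        File aut = new File(globusLocation, relativePath);
        if (!aut.isFile()) {
            throw new IllegalStateException("Missing lemmatizer resource: " + aut);
        }
        return aut;
    }
}
```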
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is therefore important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available in the future.<br />
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer modules is valid through March 2008.<br />
<br />
If a new license is required, please contact stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefan.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service: in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle the content which did not fit on the first resource. Partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resources with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which will take "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which will take "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" value passed to the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections provide the details for the refiners and rankers.<br />
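The polygon in the search term above is encoded as a whitespace-separated list of coordinates. A minimal sketch of decoding it into points (a hypothetical helper, not the actual Geo Index parser):<br />

```java
public class PolygonTermParser {
    // Sketch: decodes a geosearch term such as "1 1 1 10 10 1 10 10"
    // into an array of (x, y) points.
    static double[][] parsePoints(String term) {
        String[] parts = term.trim().split("\\s+");
        double[][] points = new double[parts.length / 2][2];
        for (int i = 0; i < points.length; i++) {
            points[i][0] = Double.parseDouble(parts[2 * i]);     // x
            points[i][1] = Double.parseDouble(parts[2 * i + 1]); // y
        }
        return points;
    }
}
```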
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that would produce zero results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, then we could eliminate this subquery, since it could not produce any results (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then<br />
there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
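The cut-off test for this example reduces to a polygon intersection check. A minimal sketch using axis-aligned minimal bounding rectangles (the real service works on general polygons; this is an illustration, not the gCube implementation):<br />

```java
public class MbrCutoff {
    // Sketch of the cut-off rule: if the bounding rectangles of P1 and
    // P2 do not intersect, the conjunctive subquery can be eliminated
    // without touching any R-tree. Each rectangle is given as its
    // lower-left (x1, y1) and upper-right (x2, y2) corners.
    static boolean intersects(double ax1, double ay1, double ax2, double ay2,
                              double bx1, double by1, double bx2, double by2) {
        return ax1 <= bx2 && bx1 <= ax2 && ay1 <= by2 && by1 <= ay2;
    }
}
```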
<br />
The transformation of an initial CQL query into a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
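The union step can be sketched as a heap-based merge of per-R-tree result lists that are already sorted by descending score. This is illustrative only; the actual MergeSorter component additionally handles flow control and worker synchronization.<br />

```java
import java.util.*;

public class ScoreMergeSorter {
    // Merges several result lists (each already sorted by descending
    // score) into one globally ranked list of document ids, the way the
    // MergeSorter is described to combine individual R-tree operations.
    static List<String> merge(List<List<Map.Entry<String, Double>>> sources) {
        // Heap entry: {sourceIndex, positionInSource}, ordered by score desc.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) ->
            Double.compare(sources.get(b[0]).get(b[1]).getValue(),
                           sources.get(a[0]).get(a[1]).getValue()));
        for (int i = 0; i < sources.size(); i++)
            if (!sources.get(i).isEmpty()) heap.add(new int[]{i, 0});
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(sources.get(top[0]).get(top[1]).getKey());
            if (top[1] + 1 < sources.get(top[0]).size())
                heap.add(new int[]{top[0], top[1] + 1});
        }
        return merged;
    }
}
```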
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set equal to x1 if not provided), start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided), and any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. As with fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list of all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible GeoIndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element define how that field should be handled. The meaning and expected content of each is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
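A sketch of dispatching on the "type" element when reading a FIELD value from a ROWSET (a hypothetical helper, not part of the service; the DATE case is left as a string here, where real code would parse the yyyy-MM-dd'T'HH:mm:ss.s format):<br />

```java
import java.util.Locale;

public class GeoFieldParser {
    // Converts a FIELD value according to the GeoIndexType "type"
    // element listed above.
    static Object parse(String type, String value) {
        switch (type.toUpperCase(Locale.ROOT)) {
            case "SHORT":  return Short.parseShort(value);
            case "INT":    return Integer.parseInt(value);
            case "LONG":   return Long.parseLong(value);
            case "FLOAT":  return Float.parseFloat(value);
            case "DOUBLE": return Double.parseDouble(value);
            case "STRING":
            case "DATE":   return value; // real code would parse the date format
            default:
                throw new IllegalArgumentException("Unknown field type: " + type);
        }
    }
}
```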
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, the fields a GeoIndex instance should contain can be specified dynamically through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, the GeoIndex itself cannot know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit its usefulness. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of GeoIndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and the ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to return a meaningful rank for each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. First, a very efficient filtering step retrieves all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
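The filter-then-refine idea can be sketched in a few lines of plain Java; this is an illustration of the general technique, not the GeoIndex implementation (all names are hypothetical):<br />

```java
import java.util.ArrayList;
import java.util.List;

public class TwoStepDemo {

    // Axis-aligned rectangle used as a minimal bounding rectangle (MBR).
    static class Rect {
        double minX, minY, maxX, maxY;
        Rect(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        boolean overlaps(Rect o) {
            return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
        }
        boolean contains(Rect o) {
            return minX <= o.minX && o.maxX <= maxX && minY <= o.minY && o.maxY <= maxY;
        }
    }

    // Step 1: cheap MBR-overlap filtering (done by the R-Tree in the real
    // index), which may keep false hits. Step 2: a costlier exact test
    // (done by a Refiner plugin in the real index) that eliminates them.
    static List<Rect> query(List<Rect> entries, Rect queryMbr) {
        List<Rect> candidates = new ArrayList<Rect>();
        for (Rect e : entries) {
            if (e.overlaps(queryMbr)) candidates.add(e);  // filtering step
        }
        List<Rect> results = new ArrayList<Rect>();
        for (Rect c : candidates) {
            if (queryMbr.contains(c)) results.add(c);     // refinement step
        }
        return results;
    }
}
```

The point of the split is that the cheap first pass drastically shrinks the set of entries the expensive second pass has to examine.<br />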
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the RankEvaluator plugin, providing the plugin with the arguments supplied in the RankingRequest. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0/(spanSize + 1);<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
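As a quick sanity check of the ranking formula (a standalone sketch, not service code): the division must be performed in floating point, since an integer division of 1 by any span larger than zero would truncate the rank to 0.<br />

```java
public class RankFormulaDemo {
    // Quick-and-dirty rank: shorter spans rank higher, and a zero-length
    // span gets the maximum rank of 1.0.
    static double rank(long spanSize) {
        return 1.0 / (spanSize + 1);
    }
}
```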
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initialization of the Refiner plugin, providing the plugin with the arguments supplied in the RefinementRequest. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However, in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long representing the maxSpanSize in the same unit as the index's internal date representation (seconds from the Epoch).<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, but we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add the entry to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving the documents whose key values fall within the corresponding range. The Forward Index Service design pattern is similar to the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
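The behaviour described above can be sketched as follows; this is a toy in-memory illustration of a one-dimensional range query over key-value documents, not the Forward Index implementation (all names are hypothetical):<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RangeQueryDemo {
    // Return the documents whose value for the given integer key falls
    // within [low, high]; documents lacking the key are skipped.
    static List<Map<String, Object>> range(List<Map<String, Object>> docs,
                                           String key, int low, int high) {
        List<Map<String, Object>> hits = new ArrayList<Map<String, Object>>();
        for (Map<String, Object> doc : docs) {
            Object v = doc.get(key);
            if (v instanceof Integer) {
                int i = (Integer) v;
                if (low <= i && i <= high) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }
}
```

A multi-dimensional range query would simply add one such criterion per key and keep only the documents satisfying all of them.<br />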
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason the creation of multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that constitutes the cluster.<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example:<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can be configured with the number of replicas for each index. This is done by setting the '''noReplicas''' variable in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml in order for the service to communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and also the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials in order to know how to connect with each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' quota of each bucket (index) can be specified as well in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once, after the Couchbase server installation, before it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
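The transformation in the figure boils down to distributing conjunction over disjunction until the query is a union (OR) of conjunctions (AND) of single criteria. A minimal sketch of one such distribution step, with criteria represented as plain strings (names are hypothetical; this is not the Lookup's actual rewriter):<br />

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CqlDnfDemo {
    // Rewrite "a AND (t1 OR t2 OR ...)" as the union
    // "(a AND t1) OR (a AND t2) OR ...": one conjunction per OR branch,
    // each answerable as a single range query.
    static List<List<String>> distribute(String a, List<String> orTerms) {
        List<List<String>> union = new ArrayList<List<String>>();
        for (String t : orTerms) {
            union.add(Arrays.asList(a, t));
        }
        return union;
    }
}
```

Applying such steps repeatedly yields the union-of-range-queries form, after which conjunctions that can never match are pruned by the "cut-off" rules.<br />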
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible for receiving queries for the index, and returning responses that match the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances get the index from Content Management, and establish a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted by the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET for that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specifies the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21333Index Management Framework2014-04-11T10:00:42Z<p>Alex.antoniadi: /* Contextual Query Language Compliance */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
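For illustration, CQL queries using some of these relations might look as follows (the index names ''title'' and ''date'' are taken from the examples later in this page):<br />

```
title adj "sun is up"
title fuzzy "invorvement"
date within "2005 2008"
title = "italy" and date >= "2005"
```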
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new full text index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''FullTextIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|FullTextIndexNode Service]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each FullTextIndexNode has a 1-1 relationship with an ElasticSearch node. For this reason the creation of multiple resources of the FullTextIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''useClusterId'' variable in the ''deploy-jndi-config.xml'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
or<br />
<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
''ElasticSearch'', the underlying technology of the new Full Text Index, allows the number of replicas and shards to be configured for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy-jndi-config.xml'' file. <br />
<br />
Example:<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
<br />
<environment name="noShards" value="4" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the presentable fields that match the query. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy-jndi-config.xml'' file respectively:<br />
<br />
Example:<br />
<pre><br />
<environment name="maxFragmentCnt" value="100" type="java.lang.Integer" override="false" /><br />
<br />
<environment name="maxFragmentSize" value="100" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
Finally, the folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location ''ServiceContext.getContext().getPersistenceRoot().getAbsolutePath() + "/indexData/elasticsearch/"'' is used).<br />
<br />
Example : <br />
<pre><br />
<environment name="dataDir" value="/tmp" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! Lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR and NOT accordingly.<br />
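For example, a complete CQL query connecting two of the triples from the table above with a boolean operator would be transformed like this:<br />

```
CQL:    title adj "sun is up" and date within "2005 2008"
Lucene: title:"sun is up" AND date:[2005 TO 2008]
```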
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that each document requires an "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki><!-- not a subfield --></nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
==Usage Example==<br />
<br />
===Create a FullTextIndex Node, feed and query using the corresponding client library===<br />
<br />
<source lang="java"><br />
FullTextIndexNodeFactoryCLProxyI proxyRandomf = FullTextIndexNodeFactoryDSL.getFullTextIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = FullTextIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference:<br />
//StatefulQuery q = FullTextIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
FullTextIndexNodeCLProxyI proxyRandom = FullTextIndexNodeDSL.getFullTextIndexNodeProxyBuilder().at((W3CEndpointReference)refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (FullTextIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled; apart from "boost", which takes a number, they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
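A small sketch of modelling these flags, including the rule that a field must be stored to be returned (hypothetical helper class, not service code):<br />
<br />
```java
// Sketch: per-field handling flags of an IndexType entry.
// Enforces the documented constraint: a field must be stored to be returned.
public class FieldSpec {
    public final boolean index, store, returned, tokenize;

    public FieldSpec(boolean index, boolean store, boolean returned, boolean tokenize) {
        if (returned && !store)
            throw new IllegalArgumentException("a field must be stored to be returned");
        this.index = index;
        this.store = store;
        this.returned = returned;
        this.tokenize = tokenize;
    }

    // Parse the "yes"/"no" values used in the IndexType XML.
    public static boolean flag(String yesNo) {
        return "yes".equalsIgnoreCase(yesNo.trim());
    }
}
```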
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disk space.<br />
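A sketch of this sub-field expansion, assuming the field hierarchy of the example IndexType above (illustrative helper, not index code):<br />
<br />
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: expand a queried field into itself plus all nested sub-fields,
// mirroring how a query on "contents" also hits its sub-fields.
public class SubfieldExpander {
    // parent field -> direct sub-fields (taken from the example IndexType above)
    static final Map<String, List<String>> SUBFIELDS = new HashMap<>();
    static {
        SUBFIELDS.put("contents", Arrays.asList("title", "foreword", "startChapter", "endChapter"));
        SUBFIELDS.put("title", Arrays.asList("bookTitle", "chapterTitle"));
    }

    public static List<String> expand(String field) {
        List<String> all = new ArrayList<>();
        collect(field, all);
        return all;
    }

    private static void collect(String field, List<String> acc) {
        acc.add(field);
        for (String sub : SUBFIELDS.getOrDefault(field, Collections.emptyList()))
            collect(sub, acc); // recurse: sub-fields may have sub-fields of their own
    }
}
```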
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed with any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, always returning "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32-character string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see the license installation instructions below).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
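A self-contained sketch of this reflective loading (the interface and plugin names here are stand-ins for the real LanguageIdPlugin/LemmatizerPlugin classes):<br />
<br />
```java
// Sketch: the factory pattern described above, loading a plugin implementation
// by its fully qualified class name via reflection. Names are illustrative.
public class PluginFactory {

    // Minimal stand-in for LanguageIdPlugin / LemmatizerPlugin.
    public interface Plugin {
        String process(String text);
    }

    // A dummy plugin, analogous to the ones returning "nolang" / "".
    public static class DummyPlugin implements Plugin {
        public String process(String text) { return "nolang"; }
    }

    // Mirrors the Class.forName(pluginName).newInstance() call used by the factories.
    public static Plugin load(String pluginName) {
        try {
            return (Plugin) Class.forName(pluginName).newInstance();
        } catch (Exception e) {
            throw new RuntimeException("Could not load plugin: " + pluginName, e);
        }
    }
}
```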
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The following plugin implementations can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
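To give an idea of the underlying approach, a language profile in N-gram-based categorization is essentially a ranked list of the most frequent character n-grams of a text. A minimal n-gram counter could look like this; an illustrative sketch, not JTextCat code:<br />
<br />
```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the core building block of N-Gram-Based Text Categorization:
// count character n-grams of a text. A language profile keeps the most
// frequent ones and is compared against precomputed per-language profiles.
public class NgramProfile {
    public static Map<String, Integer> ngrams(String text, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Pad with '_' so word boundaries contribute n-grams too.
        String padded = "_" + text.toLowerCase() + "_";
        for (int i = 0; i + n <= padded.length(); i++) {
            String gram = padded.substring(i, i + n);
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }
}
```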
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call (see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The fed documents may specify the language per field. If present, this language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
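A sketch of how such language-restricted terms could be decomposed (an illustrative parser; the actual query parsing is done by the index service):<br />
<br />
```java
// Sketch: split a language-aware term such as "_lang_en_title:bus" into its
// language, optional field, and term parts. Illustrative only; the whole-query
// "_querylang_" prefix is not handled here.
public class LangTerm {
    public final String lang, field, term;

    private LangTerm(String lang, String field, String term) {
        this.lang = lang;
        this.field = field;
        this.term = term;
    }

    public static LangTerm parse(String token) {
        int colon = token.indexOf(':');
        if (!token.startsWith("_lang_") || colon < 0)
            return new LangTerm(null, null, token); // not language-restricted
        // prefix is "en" (language only) or "en_title" (language + field)
        String prefix = token.substring("_lang_".length(), colon);
        int us = prefix.indexOf('_');
        String lang = us < 0 ? prefix : prefix.substring(0, us);
        String field = us < 0 ? null : prefix.substring(us + 1);
        return new LangTerm(lang, field, token.substring(colon + 1));
    }
}
```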
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the languages to be lemmatized. Both expansion and reduction are supported, but expansion is used. The terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory with RHE3 and one with RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib directory.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the env variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently the same methods used to specify the language for a language aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
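A minimal sketch of query-side expansion, assuming a toy dictionary of surface forms (the real module uses Fast's .aut files):<br />
<br />
```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: replace a query term by the OR of all its known surface forms,
// as the lemmatizer's expansion mode does. The dictionary here is illustrative.
public class QueryExpansion {
    static final Map<String, List<String>> FORMS = new HashMap<>();
    static {
        FORMS.put("car", Arrays.asList("car", "cars"));
    }

    public static String expand(String term) {
        List<String> forms = FORMS.getOrDefault(term, Collections.singletonList(term));
        // Unknown terms pass through unchanged; known terms become an OR group.
        return forms.size() == 1 ? forms.get(0) : "(" + String.join(" OR ", forms) + ")";
    }
}
```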
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer modules is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefand.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resources with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results and a ranker for computing a score for each result. All of the modifiers are optional. Consider the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by rankerA, which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication to the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the highest score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
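The polygon argument is a flat list of coordinates; a small sketch of parsing it into points (hypothetical helper, not the service code):<br />
<br />
```java
// Sketch: parse the geosearch polygon argument ("x1 y1 x2 y2 ...") into points.
public class GeoPolygon {
    public static double[][] parse(String coords) {
        String[] parts = coords.trim().split("\\s+");
        double[][] pts = new double[parts.length / 2][2];
        for (int i = 0; i < pts.length; i++) {
            pts[i][0] = Double.parseDouble(parts[2 * i]);     // x coordinate
            pts[i][1] = Double.parseDouble(parts[2 * i + 1]); // y coordinate
        }
        return pts;
    }
}
```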
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each of which refers to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that can yield no results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
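The cut-off reasoning above can be sketched in a few lines. The real service evaluates arbitrary polygons through Geotools; to keep this example self-contained it uses axis-aligned bounding rectangles instead, and all names are hypothetical.<br />

```java
// Hypothetical sketch of a "cut-off" rule for a conjunction of two
// containment criteria: the subquery can be eliminated when the two
// collections differ, or when the two query regions cannot intersect.
// The real service works on arbitrary polygons via Geotools; axis-aligned
// rectangles are used here only to keep the example self-contained.
public class CutOffRule {

    public static class Rect {
        final double x1, y1, x2, y2; // lower-left (x1,y1), upper-right (x2,y2)
        public Rect(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean intersects(Rect o) {
            return x1 <= o.x2 && o.x1 <= x2 && y1 <= o.y2 && o.y1 <= y2;
        }
    }

    /** true when "contained in r1 AND contained in r2" can produce no result. */
    public static boolean canEliminate(String col1, Rect r1, String col2, Rect r2) {
        if (!col1.equals(col2)) return true; // each document belongs to one collection only
        return !r1.intersects(r2);           // disjoint regions: nothing is contained in both
    }
}
```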
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. The MergeSorter also supports flow control, pausing and synchronizing the workers that execute the single R-tree operations depending on the behavior of the client that reads the results.<br />
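The union step performed by the MergeSorter can be pictured as a k-way merge over score-ordered result streams. The following is a simplified, self-contained sketch under that assumption (all names are hypothetical); the real component additionally implements flow control over the workers, which is omitted here.<br />

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified sketch (hypothetical names) of the MergeSorter's union step:
// each R-tree operation yields hits in descending score order, and the
// merger repeatedly emits the stream head with the highest score.
public class ScoreMerge {

    public record Hit(String docId, double score) {}

    // cursor over one R-tree operation's score-ordered result list
    private static class Cursor {
        final List<Hit> hits;
        int pos = 0;
        Cursor(List<Hit> hits) { this.hits = hits; }
        Hit peek() { return hits.get(pos); }
        boolean done() { return pos >= hits.size(); }
    }

    public static List<Hit> merge(List<List<Hit>> perTree) {
        // order cursors by their current head's score, highest first
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                Comparator.comparingDouble((Cursor c) -> c.peek().score()).reversed());
        for (List<Hit> hits : perTree) {
            if (!hits.isEmpty()) queue.add(new Cursor(hits));
        }
        List<Hit> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            merged.add(c.peek());
            c.pos++;
            if (!c.done()) queue.add(c); // re-insert under its new head score
        }
        return merged;
    }
}
```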
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory, and x2 is set equal to x1 if not provided) as well as start and end Y coordinates (y1 is mandatory, and y2 is set equal to y1 if not provided). In addition, it may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only some of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of roughly 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
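As a side note on the DATE type, a partial date in the yyyy-MM-dd'T'HH:mm:ss.s format could be normalized to seconds since the Epoch (the representation the GeoIndex uses internally for dates) roughly as follows. This is a hypothetical sketch, not the service's actual parser; missing parts default to the start of the period, and fractional seconds are ignored for brevity.<br />

```java
import java.util.Calendar;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not the service's actual parser): normalizing a
// partial date in the yyyy-MM-dd'T'HH:mm:ss.s format, where only yyyy is
// mandatory, to seconds since the Epoch.
public class PartialDate {

    private static final Pattern FORMAT = Pattern.compile(
            "(\\d{4})(?:-(\\d{2})(?:-(\\d{2})(?:T(\\d{2})(?::(\\d{2})(?::(\\d{2})(?:\\.(\\d+))?)?)?)?)?)?");

    public static long toEpochSeconds(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) throw new IllegalArgumentException("Bad date: " + text);
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.clear(); // unset fields (including milliseconds) default to zero
        cal.set(Integer.parseInt(m.group(1)),
                m.group(2) == null ? 0 : Integer.parseInt(m.group(2)) - 1, // Calendar months are 0-based
                m.group(3) == null ? 1 : Integer.parseInt(m.group(3)),
                m.group(4) == null ? 0 : Integer.parseInt(m.group(4)),
                m.group(5) == null ? 0 : Integer.parseInt(m.group(5)),
                m.group(6) == null ? 0 : Integer.parseInt(m.group(6)));
        return cal.getTimeInMillis() / 1000L;
    }
}
```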
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to process a query. Firstly, a very efficient filtering step selects all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object satisfies a query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied in the ranking request. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the timespan, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only for a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1); // floating-point division: integer 1/(spanSize + 1) would truncate to 0 for any positive span<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments supplied in the refinement request. All arguments are given as Strings, and it's up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time bigger than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations, however we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add the entry to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
.getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
	new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
.createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving the documents whose key values fall within the corresponding range. The Forward Index Service design is similar to the Full Text Index Service design. The forward index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
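Under this schema, a one-dimensional range query can be pictured as filtering documents by the typed value stored for a single key. The following is an illustrative, in-memory sketch with hypothetical names; the service itself answers such queries over Couchbase.<br />

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative, in-memory sketch (hypothetical names, not the service API):
// a document is a set of key-value pairs, and a one-dimensional range query
// returns the documents whose integer value for the given key lies in
// [low, high].
public class ForwardIndexSketch {

    public record Doc(String id, Map<String, Object> fields) {}

    public static List<Doc> rangeQuery(List<Doc> docs, String key, int low, int high) {
        return docs.stream()
                .filter(d -> d.fields().get(key) instanceof Integer v && v >= low && v <= high)
                .collect(Collectors.toList());
    }
}
```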
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service, following the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and update operations on the node. It consolidates the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the components used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|ForwardIndexNode Service]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason the creation of multiple resources of the ForwardIndexNode service is discouraged; instead, the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example:<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
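The naming rule described above can be summarized in one line; this sketch is illustrative only and the names are hypothetical.<br />

```java
// Minimal sketch of the cluster-naming rule: with useClusterId=true the
// clusterID equals the indexID, otherwise it equals the scope.
public class ClusterIdRule {
    public static String clusterId(boolean useClusterId, String indexID, String scope) {
        return useClusterId ? indexID : scope;
    }
}
```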
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can be configured with a number of replicas for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml file in order for the service to communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and also the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they know how to connect with each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml file.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that it is possible for a ForwardIndexNode to stop while the respective Couchbase server is still running. Also, initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately for the first time after Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# To initialize a node, run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down, remove the Couchbase server from the cluster and reinitialize it so that it can be restarted later: <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want to delete the bucket (index) in order to rebuild it, run: <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup exploits every opportunity to eliminate parts of the query that cannot produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
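To make the evaluation strategy concrete, the following self-contained Java sketch models a query as a union of conjunctions of single-key range criteria; the representation of documents as key/value maps and all names here are illustrative assumptions, not the actual Forward Index data model:<br />

```java
// Hypothetical in-memory illustration of the evaluation strategy described above:
// a query is a union (OR) of range queries, each a conjunction (AND) of
// single-key criteria; a document matches if any range query accepts it.
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class RangeQueryEval {
    // One single-key criterion: low <= value(key) <= high
    public static Predicate<Map<String, Integer>> criterion(String key, int low, int high) {
        return doc -> doc.containsKey(key) && doc.get(key) >= low && doc.get(key) <= high;
    }

    // Evaluate a union of conjunctions against one document.
    public static boolean matches(Map<String, Integer> doc,
                                  List<List<Predicate<Map<String, Integer>>>> unionOfConjunctions) {
        return unionOfConjunctions.stream()
                .anyMatch(conjunction -> conjunction.stream().allMatch(c -> c.test(doc)));
    }
}
```

In the real service the merge (union) of the range queries' results is performed internally; this sketch only illustrates the logical structure of the transformed query.<br />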
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. A initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET for that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specifies the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21332Index Management Framework2014-04-11T10:00:12Z<p>Alex.antoniadi: /* Contextual Query Language Compliance */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Full Text Index]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new full text index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''FullTextIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It is a consolidation of the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each FullTextIndexNode has a 1-1 relationship with an ElasticSearch node. For this reason, creating multiple resources of the FullTextIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that constitutes the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
The cluster distinction is made through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''useClusterId'' variable in the ''deploy-jndi-config.xml'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
or<br />
<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Full Text Index, can configure the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy-jndi-config.xml'' file <br />
<br />
Example:<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
<br />
<environment name="noShards" value="4" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the match produced by the query on the presentable fields. This snippet is usually a concatenation of a number of matching fragments from the fields that match the query. The maximum size of each fragment, as well as the maximum number of fragments that will be used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' respectively in the ''deploy-jndi-config.xml'' file:<br />
<br />
Example:<br />
<pre><br />
<environment name="maxFragmentCnt" value="100" type="java.lang.Integer" override="false" /><br />
<br />
<environment name="maxFragmentSize" value="100" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
Finally, the folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location is ''ServiceContext.getContext().getPersistenceRoot().getAbsolutePath() + "/indexData/elasticsearch/"'').<br />
<br />
Example : <br />
<pre><br />
<environment name="dataDir" value="/tmp" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
The Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into Lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR and NOT equivalently.<br />
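The mapping in the table above can be sketched as a small Java helper; the class and method names are hypothetical, and the real service performs this transformation internally:<br />

```java
// Hypothetical sketch of the CQL-triple-to-Lucene mapping shown in the table
// above; class and method names are illustrative, not part of the service API.
public class CqlToLucene {
    public static String tripleToLucene(String index, String relation, String term) {
        switch (relation) {
            case "adj":                               // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":                             // fuzzy term query
                return index + ":" + term + "~";
            case "within": {                          // range query, term = "low high"
                String[] bounds = term.split("\\s+");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "proximity": {                       // term = "<slop> word1 word2 ..."
                String[] parts = term.split("\\s+", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            }
            case "=":
            case "==":                                // plain term query
                return index + ":" + term;
            default:
                throw new IllegalArgumentException("unsupported relation: " + relation);
        }
    }
}
```

The "allIndexes" case from the table would additionally expand one triple into an OR over all indexed fields before applying this per-triple mapping.<br />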
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify, respectively, the Index Type with which the Index must have been created and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional and specifies the language of the documents under the <ROWSET> element. Note that the "ObjectID" field is required for each document, as it specifies the document's unique identifier.<br />
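As a hedged illustration (not part of the client library), a ROWSET document of this shape could be assembled as follows; the class and method names are hypothetical:<br />

```java
// Illustrative sketch of assembling a minimal ROWSET document for feeding;
// values are assumed to need no XML escaping in this sketch.
import java.util.Map;

public class RowsetBuilder {
    // Build one <ROW> from field-name/value pairs.
    public static String buildRow(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("  <ROW>\n");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append("    <FIELD name=\"").append(f.getKey()).append("\">")
              .append(f.getValue()).append("</FIELD>\n");
        }
        return sb.append("  </ROW>").toString();
    }

    // Wrap rows in a <ROWSET> with the required idxType/colID and optional lang.
    public static String buildRowset(String idxType, String colID, String lang, String... rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("<ROWSET idxType=\"").append(idxType)
          .append("\" colID=\"").append(colID)
          .append("\" lang=\"").append(lang).append("\">\n");
        for (String row : rows) sb.append(row).append('\n');
        return sb.append("</ROWSET>").toString();
    }
}
```

The resulting string would be wrapped in a ResultSet and passed to the feed operation, as described above.<br />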
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (ie. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
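As an illustration of how an IndexType document of this form can be inspected, the following self-contained sketch uses the JDK DOM parser to collect the fields marked with <index>yes</index>; the class and method names are hypothetical and not part of the service:<br />

```java
// Illustrative sketch (not part of the service API): reading an IndexType
// document with the JDK DOM parser and collecting fields whose <index>
// element contains "yes". Handles flat field lists; nested sub-fields
// would need recursive handling.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class IndexTypeReader {
    public static List<String> indexedFields(String indexTypeXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(indexTypeXml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                NodeList idx = field.getElementsByTagName("index");
                if (idx.getLength() > 0 && "yes".equals(idx.item(0).getTextContent().trim())) {
                    names.add(field.getAttribute("name"));
                }
            }
            return names;
        } catch (Exception e) {
            throw new RuntimeException("cannot parse IndexType", e);
        }
    }
}
```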
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki><!-- not a subfield --></nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hitsin all its sub-fields, which is all fields except references. Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits in from "startChapter" since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields in the index, and therefore uses more disks pace.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using each of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
==Usage Example==<br />
<br />
===Create a FullTextIndex Node, feed and query using the corresponding client library===<br />
<br />
<source lang="java"><br />
FullTextIndexNodeFactoryCLProxyI proxyRandomf = FullTextIndexNodeFactoryDSL.getFullTextIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = FullTextIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = FullTextIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
FullTextIndexNodeCLProxyI proxyRandom = FullTextIndexNodeDSL.getFullTextIndexNodeProxyBuilder().at((W3CEndpointReference)refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (FullTextIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resource with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that each document must contain an "ObjectID" field, which specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled; apart from "boost", which takes a number, they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of contents</nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki>// not a subfield</nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, always returning "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32 character long string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no), and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. <br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factory uses the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations. <br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The plugin implementations that can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
The JTextCat is maintained by http://textcat.sourceforge.net/. It is a lightweight text categorization language tool in Java. It implements the N-Gram-Based Text Categorization algorithm that is described here: <br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++. The C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library, and when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib directory contains one subdirectory for RHE3 and one for RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index. <br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call (see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language. <br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The documents being fed may specify the language per field. If present, this specified language is used when indexing the document, and the language id module is not used. <br />
If no language is specified in the document, and there is a language identification plugin loaded, the FullTextIndexUpdater Service will try to identify the language of the field using the loaded plugin for language identification. <br />
Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.<br />
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used. The terms (nouns and adjectives) in the query are expanded. <br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++. The C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerIdFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library, and when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib directory contains one subdirectory for RHE3 and one for RHE4 shared objects (.so). The etc directory contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' directory must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the env variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same method used to specify the language for a language aware query is also used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.<br />
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefand.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = lookupInstanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
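The per-(collection, language) partitioning described above can be sketched as follows. This is an illustrative stand-in only: the class and method names are invented, and a plain list of document IDs replaces the real R-tree.<br />

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not the service's actual code): documents are
// partitioned into one spatial index per (collection, language) pair,
// so a query naming both colID and lang touches only one tree.
public class PartitionedIndex {
    // Stand-in for an R-tree: a plain list of document IDs.
    private final Map<String, List<String>> trees = new HashMap<>();

    private static String key(String colID, String lang) {
        return colID + "/" + lang;
    }

    public void feed(String colID, String lang, String docID) {
        trees.computeIfAbsent(key(colID, lang), k -> new ArrayList<>()).add(docID);
    }

    // A geosearch carrying colID and lang modifiers is routed to a single partition.
    public List<String> lookup(String colID, String lang) {
        return trees.getOrDefault(key(colID, lang), Collections.emptyList());
    }
}
```

Because each partition is independent, a query restricted to one collection and language never pays the cost of scanning the others.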
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, ordered by "rankerA", which takes "arg1 arg2" as arguments for the ranking operation, and returned as the output of this simple CQL query. The "false" indication in the ranker modifier signifies that we do not want reverse ordering of the results (true signifies that higher scores must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that contain the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections provide the details of the refiners and rankers.<br />
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries, is to apply boolean algebra rules, and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different then we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect then<br />
there is no area in which the documents should be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
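The cut-off rule can be sketched in a few lines. The names below are hypothetical, and the query regions are simplified to axis-aligned bounding rectangles; the real service reasons about full polygons.<br />

```java
// Hedged sketch of the cut-off rule described above (invented names, not
// the service's API). A conjunction of two "contains" criteria can be
// eliminated without touching any R-tree when the criteria name different
// collections, or when their regions are disjoint.
public class CutOffRule {
    // Minimal axis-aligned rectangle standing in for a query polygon.
    static class Box {
        final double x1, y1, x2, y2;
        Box(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean intersects(Box other) {
            return x1 <= other.x2 && other.x1 <= x2
                && y1 <= other.y2 && other.y1 <= y2;
        }
    }

    // True when the AND of the two criteria can never match a document.
    static boolean canBeCutOff(String colA, Box p1, String colB, Box p2) {
        if (!colA.equals(colB)) return true;  // a document belongs to one collection only
        return !p1.intersects(p2);            // no common area, so no possible match
    }
}
```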
<br />
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
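A minimal sketch of that union step, assuming each worker delivers its hits already ordered by descending score; all names below are invented for illustration and flow control is omitted.<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch of the union the MergeSorter performs: the result
// streams of several single-R-tree operations, each already sorted by
// descending score, are merged into one score-ordered stream.
public class MergeSorterSketch {
    static class Hit {
        final String docID;
        final double score;
        Hit(String docID, double score) { this.docID = docID; this.score = score; }
    }

    static List<Hit> merge(List<List<Hit>> perTreeResults) {
        // One cursor {treeIndex, position} per worker, ordered so the cursor
        // pointing at the highest remaining score is polled first.
        PriorityQueue<int[]> cursors = new PriorityQueue<>((a, b) ->
                Double.compare(perTreeResults.get(b[0]).get(b[1]).score,
                               perTreeResults.get(a[0]).get(a[1]).score));
        for (int i = 0; i < perTreeResults.size(); i++) {
            if (!perTreeResults.get(i).isEmpty()) {
                cursors.add(new int[]{i, 0});
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (!cursors.isEmpty()) {
            int[] cur = cursors.poll();
            List<Hit> source = perTreeResults.get(cur[0]);
            merged.add(source.get(cur[1]));
            if (cur[1] + 1 < source.size()) {
                cursors.add(new int[]{cur[0], cur[1] + 1});
            }
        }
        return merged;
    }
}
```

In the actual service the inputs are streams rather than lists, which is where the pause/synchronize flow control mentioned above comes in.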
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory; x2 is set equal to x1 if not provided) as well as start and end Y coordinates (y1 is mandatory; y2 is set equal to y1 if not provided). In addition, a ROW may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only some of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
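Since only yyyy is mandatory in a DATE value, one way such a partially specified value could be completed and converted to a timestamp is sketched below. The strategy and names are assumptions made for illustration; the index's real date handling is not shown in this document.<br />

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical sketch: complete a partial DATE field value (only yyyy is
// mandatory) with default components, then parse it against the full
// yyyy-MM-dd'T'HH:mm:ss.S pattern to obtain a numeric timestamp.
public class PartialDateParser {
    // Default components used to fill in whatever the value omits.
    private static final String TEMPLATE = "1970-01-01T00:00:00.0";

    public static long toMillis(String value) {
        String full = value.length() >= TEMPLATE.length()
                ? value
                : value + TEMPLATE.substring(value.length());
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        fmt.setLenient(false);
        try {
            return fmt.parse(full).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparsable date: " + value, e);
        }
    }
}
```

With this scheme "2001-06-27" and "2001-06-27T00:00:00.0" map to the same timestamp, which matches the ROWSET example above where dates are given at differing precision.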
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses TwoStep processing in order to process a query. Firstly, a very efficient filtering step retrieves all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, the GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
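The two-step scheme can be illustrated with a toy example in which the query region is a circle, the MBR is its bounding square, and the exact test is a distance check. All names are invented; this is not the GeoIndex implementation.<br />

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of TwoStep processing: a cheap MBR test collects
// candidate hits (possibly including false ones), and an exact, more
// expensive test refines them.
public class TwoStepQuery {
    static class Point {
        final String id;
        final double x, y;
        Point(String id, double x, double y) { this.id = id; this.x = x; this.y = y; }
    }

    // Query region: circle of radius r centred at (cx, cy).
    static List<String> search(List<Point> entries, double cx, double cy, double r) {
        List<String> hits = new ArrayList<>();
        for (Point p : entries) {
            // Filtering step: the circle's bounding square. This lets false
            // hits through, e.g. points near the square's corners.
            boolean inMbr = Math.abs(p.x - cx) <= r && Math.abs(p.y - cy) <= r;
            if (!inMbr) continue;
            // Refinement step: exact distance test removes the false hits.
            if (Math.hypot(p.x - cx, p.y - cy) <= r) {
                hits.add(p.id);
            }
        }
        return hits;
    }
}
```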
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments provided in the code. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting:<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with time spans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining whether the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least... We need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the time span, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the resultset page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
        return 1.0 / (spanSize + 1);<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments provided in the code. All arguments are given as Strings, and it's up to the plugin to parse the string into the datatype needed by the plugin.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting:<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType, before calling the abstract initialize() using the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from a org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it?... So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time bigger than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element which will be a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long, which will represent the maxSpanSize in milliseconds.<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument list and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results list.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, the RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
geoManagementFactory = geoManagementFactoryLocator<br />
        .getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
        new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
        geoManagementFactory.createResource(managementCreateArguments);<br />
geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
        .createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving the documents whose key values fall within the corresponding range. The Forward Index Service design follows the same pattern as the Full Text Index Service design. The Forward Index supports the following schemas for each key-value pair:<br />
*key: integer, value: string<br />
*key: float, value: string<br />
*key: string, value: string<br />
*key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new Forward Index is implemented through a single service, following the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Forward Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason the creation of multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
Clusters are distinguished by a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, can be configured with the number of replicas kept for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. All Couchbase servers within the cluster must share the same credentials in order to be able to connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' quota of each bucket-index can also be specified in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, once, after the Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others) the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket.sh BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
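As a purely illustrative sketch of this rewriting (the field names and relations below are hypothetical, not taken from any gCube schema), the transformation amounts to putting the query into disjunctive normal form, so that each disjunct is a conjunction of single-key range criteria:<br />
<br />
```
-- initial CQL query
price > "10" and (year >= "2000" or lang == "en")

-- equivalent union (OR) of range queries, each a conjunction of single-key criteria
(price > "10" and year >= "2000") or (price > "10" and lang == "en")
```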
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
//StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. A initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET for that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specifies the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Index_Management_Framework&diff=21331Index Management Framework2014-04-11T09:59:07Z<p>Alex.antoniadi: /* Contextual Query Language Compliance */</p>
<hr />
<div>=Contextual Query Language Compliance=<br />
The gCube Index Framework consists of the Full Text Index and the Forward Index. Both are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:<br />
<br />
* [[Index_Management_Framework#CQL_capabilities_implementation | Full Text Index]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity<br />
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch<br />
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : --><br />
<br />
=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new Full Text Index is implemented through a single service, following the Factory pattern:<br />
*The '''FullTextIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the three services that were used in the old Full Text Index. <br />
<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
<br />
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]<br />
<br />
It is actually a wrapper over ElasticSearch, and each FullTextIndexNode has a 1-1 relationship with an ElasticSearch node. For this reason the creation of multiple resources of the FullTextIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that makes up the cluster. <br />
<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager was created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters. <br />
<br />
Clusters are distinguished by a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''useClusterId'' variable in the ''deploy-jndi-config.xml'' file to true or false respectively.<br />
<br />
Example<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
or<br />
<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
''ElasticSearch'', which is the underlying technology of the new Full Text Index, can be configured with the number of replicas and shards for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy-jndi-config.xml'' file. <br />
<br />
Example:<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
<br />
<environment name="noShards" value="4" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
'''Highlighting''' is a feature supported by the new Full Text Index (as it was by the old Full Text Index). If highlighting is enabled, the index returns a snippet for each match of the query against the presentable fields. This snippet is usually a concatenation of a number of matching fragments of those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy-jndi-config.xml'' file respectively:<br />
<br />
Example:<br />
<pre><br />
<environment name="maxFragmentCnt" value="100" type="java.lang.Integer" override="false" /><br />
<br />
<environment name="maxFragmentSize" value="100" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
Finally, the folder where the data of the index are stored can be configured by setting the variable ''dataDir'' in the ''deploy-jndi-config.xml'' file (if the variable is not set, the default location ''ServiceContext.getContext().getPersistenceRoot().getAbsolutePath() + "/indexData/elasticsearch/"'' is used).<br />
<br />
Example : <br />
<pre><br />
<environment name="dataDir" value="/tmp" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
===CQL capabilities implementation===<br />
The Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene, explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have two fields: title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with AND, OR, NOT equivalently.<br />
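The per-triple mapping and the boolean connection described above can be sketched in plain Java. This is an illustrative helper only; the class and method names are hypothetical, and it simply emits Lucene query-string syntax rather than calling the gCube or Lucene APIs:<br />

```java
// Illustrative sketch only: maps a CQL Index-Relation-Term triple to the
// equivalent Lucene query-string syntax, following the table above.
// The class and method names are hypothetical, not part of the gCube API.
public class CqlToLucene {

    static String triple(String index, String relation, String term) {
        switch (relation) {
            case "adj":         // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":       // fuzzy term query
                return index + ":" + term + "~";
            case "proximity": { // term is "<distance> word1 word2 ..."
                String[] parts = term.split(" ", 2);
                return index + ":\"" + parts[1] + "\"~" + parts[0];
            }
            case "within": {    // term is "<lower> <upper>"
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:            // "=", "==": plain term query
                return index + ":" + term;
        }
    }

    // Triples are then connected with the boolean operators AND, OR, NOT.
    public static void main(String[] args) {
        System.out.println(triple("title", "adj", "sun is up")
                + " AND " + triple("date", "within", "2005 2008"));
    }
}
```

For example, the CQL query ''title adj "sun is up" and date within "2005 2008"'' maps to the Lucene query ''title:"sun is up" AND date:[2005 TO 2008]''.<br />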
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:<br />
<pre><br />
<ROWSET idxType="IndexTypeName" colID="colA" lang="en"><br />
<ROW><br />
<FIELD name="ObjectID">doc1</FIELD><br />
<FIELD name="title">How to create an Index</FIELD><br />
<FIELD name="contents">Just read the WIKI</FIELD><br />
</ROW><br />
<ROW><br />
<FIELD name="ObjectID">doc2</FIELD><br />
<FIELD name="title">How to create a Nation</FIELD><br />
<FIELD name="contents">Talk to the UN</FIELD><br />
<FIELD name="references">un.org</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.<br />
<br />
===IndexType===<br />
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType: an XML document conforming to the IndexType schema. An IndexType contains a field list of all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each field should be handled. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="title" lang="en"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="contents" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="references" lang="en"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required, because, by default, all documents have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:<br />
<br />
*'''index'''<br />
:specifies whether the specific field should be indexed or not (i.e., whether the index should look for hits within this field)<br />
*'''store'''<br />
:specifies whether the field should be stored in its original format to be returned in the results from a query.<br />
*'''return'''<br />
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)<br />
*'''tokenize'''<br />
:specifies whether the field should be tokenized. Should usually contain "yes". <br />
*'''sort'''<br />
:Not used<br />
*'''boost'''<br />
:Not used<br />
<br />
For more complex content types, one can also specify sub-fields as in the following example:<br />
<br />
<index-type><br />
<field-list><br />
<field name="contents"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost> <br />
<br />
<span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span><br />
<field name="bookTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="chapterTitle"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<field name="foreword"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="startChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="endChapter"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field> <br />
<br />
<span style="color:green"><nowiki><!-- not a subfield --></nowiki></span><br />
<field name="references"><br />
<index>yes</index><br />
<store>no</store><br />
<return>no</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field> <br />
<br />
</field-list><br />
</index-type><br />
<br />
<br />
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field itself. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.<br />
<br />
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed with any of them, as long as the RowSet follows the corresponding IndexType:<br />
*index-type-default-1.0 (DublinCore)<br />
*index-type-TEI-2.0<br />
*index-type-eiDB-1.0<br />
*index-type-iso-1.0<br />
*index-type-FT-1.0<br />
<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
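As an illustration of this transformation (a simplified sketch, not the service's actual implementation), a few common CQL relations can be mapped to their equivalents in standard Lucene query syntax as follows:<br />

```java
// Simplified sketch of translating a single CQL index/relation/term triple
// into Lucene query syntax. Only a few relations are shown; the real
// transformation is internal to the Full Text Index service.
public class CqlToLucene {
    public static String translate(String index, String relation, String term) {
        switch (relation) {
            case "adj":    return index + ":\"" + term + "\"";   // phrase query
            case "fuzzy":  return index + ":" + term + "~";      // fuzzy match
            case "within": {                                     // range, e.g. "2005 2008"
                String[] bounds = term.split("\\s+");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:       return index + ":" + term;            // plain term query
        }
    }
}
```

For example, translate("date", "within", "2005 2008") yields date:[2005 TO 2008]; in a complete query, such triples are then connected with the boolean operators AND, OR and NOT.<br />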
<br />
==Usage Example==<br />
<br />
===Create a FullTextIndex Node, feed and query using the corresponding client library===<br />
<br />
<source lang="java"><br />
FullTextIndexNodeFactoryCLProxyI proxyRandomf = FullTextIndexNodeFactoryDSL.getFullTextIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get a reference to the created index<br />
StatefulQuery q = FullTextIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference instead:<br />
//StatefulQuery q = FullTextIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
    FullTextIndexNodeCLProxyI proxyRandom = FullTextIndexNodeDSL.getFullTextIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
    //Feed<br />
    proxyRandom.feedLocator(locator);<br />
    //Query<br />
    proxyRandom.query(query);<br />
} catch (FullTextIndexNodeException e) {<br />
    //Handle the exception<br />
}<br />
</source><br />
<br />
<!--=Full Text Index=<br />
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The full text index is implemented through three services. They are all implemented according to the Factory pattern:<br />
*The '''FullTextIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''FullTextIndexUpdater Service''' is responsible for feeding an Index. One FullTextIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexUpdater Service resources with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''FullTextIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and lookup a single instance, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===CQL capabilities implementation===<br />
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:<br />
<br />
{| border="1"<br />
! CQL triple !! explanation !! lucene equivalent<br />
|-<br />
! title adj "sun is up" <br />
| documents with this phrase in their title <br />
| title:"sun is up"<br />
|-<br />
! title fuzzy "invorvement"<br />
| documents with words "similar" to invorvement in their title<br />
| title:invorvement~<br />
|-<br />
! allIndexes = "italy" (documents have 2 fields; title and abstract)<br />
| documents with the word italy in some of their fields<br />
| title:italy OR abstract:italy<br />
|-<br />
! title proximity "5 sun up"<br />
| documents with the words sun, up inside an interval of 5 words in their title<br />
| title:"sun up"~5<br />
|-<br />
! date within "2005 2008"<br />
| documents with a date between 2005 and 2008<br />
| date:[2005 TO 2008]<br />
|}<br />
<br />
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.<br />
<br />
The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.<br />
<br />
The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "returned" property can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have a "stored" attribute set to "yes" can have their "returned" field altered to return content.<br />
<br />
===Query language===<br />
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.<br />
<br />
===Statistics===<br />
<br />
===Linguistics===<br />
The linguistics component is used in the '''Full Text Index'''. <br />
<br />
Two linguistics components are available; the '''language identifier module''', and the '''lemmatizer module'''. <br />
<br />
The language identifier module is used during feeding in the FullTextBatchUpdater to identify the language in the documents. <br />
The lemmatizer module is used by the FullTextLookup module during search operations to search for all possible forms (nouns and adjectives) of the search term.<br />
<br />
The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, always returning "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (always returning an empty String "").<br />
<br />
Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32-character string. This string must be provided by Fast (contact Stefan Debald, stefan.debald@fast.no) and saved in the appropriate configuration file (see install a linguistics license).<br />
<br />
The current license is valid until end of March 2008.<br />
<br />
====Plugin implementation====<br />
The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:<br />
org/gcube/indexservice/common/linguistics/lemmatizerplugin<br />
and <br />
org/gcube/indexservice/common/linguistics/langidplugin<br />
<br />
The class LanguageIdFactory loads an instance of the class LanguageIdPlugin.<br />
The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.<br />
<br />
The language id plugins implement the class org.gcube.indexservice.common.linguistics.langidplugin.LanguageIdPlugin.<br />
The lemmatizer plugins implement the class org.gcube.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin.<br />
The factories use the method:<br />
Class.forName(pluginName).newInstance();<br />
when loading the implementations.<br />
The parameter pluginName is the fully qualified class name of the plugin class to be loaded and instantiated.<br />
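The reflective loading step can be sketched as follows. The PluginFactory class here is illustrative (the real factories are LanguageIdFactory and LemmatizerFactory), and a JDK class is loaded in main only to demonstrate the mechanism:<br />

```java
// Sketch of the reflective plugin loading performed by the factories.
// The real factories cast the result to LanguageIdPlugin or LemmatizerPlugin.
public class PluginFactory {
    public static Object loadPlugin(String pluginClassName) throws Exception {
        // pluginClassName is the fully qualified class name of the plugin class
        return Class.forName(pluginClassName).newInstance();
    }

    public static void main(String[] args) throws Exception {
        // A JDK class is used here purely to demonstrate the mechanism.
        Object plugin = loadPlugin("java.util.ArrayList");
        System.out.println(plugin.getClass().getName()); // prints java.util.ArrayList
    }
}
```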
<br />
====Language Identification====<br />
There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".<br />
<br />
The following plugin implementations can be selected when the FullTextBatchUpdaterResource is created:<br />
<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin <br />
<br />
org.gcube.indexservice.common.linguistics.languageidplugin.DummyLangidPlugin<br />
<br />
=====JTextCat=====<br />
JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization tool in Java, implementing the N-gram-based text categorization algorithm described here:<br />
http://citeseer.ist.psu.edu/68861.html<br />
It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.<br />
<br />
The JTextCat is loaded and accessed by the plugin:<br />
org.gcube.indexservice.common.linguistics.jtextcat.JTextCatPlugin<br />
<br />
The JTextCat contains no config or bigram files, since all the statistical data about the languages is contained in the package itself.<br />
<br />
The JTextCat is delivered in the jar file: textcat-1.0.1.jar.<br />
<br />
The license for the JTextCat:<br />
http://www.gnu.org/copyleft/lesser.html<br />
<br />
=====Fastlangid=====<br />
The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object.<br />
The Fast langid plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code.<br />
The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig. <br />
<br />
The Fast langid module is loaded by the plugin (using the LanguageIdFactory)<br />
<br />
org.gcube.indexservice.linguistics.fastplugin.FastLanguageIdPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.<br />
<br />
The Fastlangid is in the SVN module:<br />
trunk/linguistics/fastlinguistics/fastlangid<br />
<br />
The lib catalog contains one subcatalog of RHE3 shared objects (.so) and one of RHE4 shared objects. The etc catalog contains the config files. The license string is contained in the config file config.txt.<br />
<br />
The shared library object is called liblangid.so<br />
<br />
The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.<br />
<br />
The org_gcube_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared library object.<br />
<br />
The shared library object liblangid.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The license for the Fastlangid plugin:<br />
<br />
=====Language Identifier Usage=====<br />
<br />
The language identifier is used by the Full Text Updater in the Full Text Index.<br />
The plugin to use for an updater is decided when the resource is created, as part of the create resource call (see Full Text Updater). The parameter is the fully qualified class name of the implementation to be loaded and used to identify the language.<br />
<br />
The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.<br />
<br />
The documents being fed may specify the language per field. If present, this language is used when indexing the document, and the language id module is not used.<br />
If no language is specified in the document, and a language identification plugin is loaded, the FullTextIndexUpdater Service will try to identify the language of the field using that plugin.<br />
Since language is assigned at the collection level in Diligent, all fields of all documents in a language-aware collection should contain a "lang" attribute with the language of the collection.<br />
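The per-field language selection described above can be sketched as follows. The LangSelect helper is hypothetical and only illustrates the precedence: a document-supplied language wins, then the plugin's identification, then the default "unknown":<br />

```java
// Hypothetical sketch of per-field language selection during feeding.
// Not the updater's actual code; it only illustrates the precedence
// described in the text above.
public class LangSelect {
    public static String resolveLanguage(String fieldLang, String identified) {
        if (fieldLang != null && !fieldLang.isEmpty()) {
            return fieldLang;    // language supplied in the document wins
        }
        if (identified != null) {
            return identified;   // language identified by the loaded plugin
        }
        return "unknown";        // no language information available
    }
}
```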
<br />
A language aware query can be performed at a query or term basis:<br />
*the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.<br />
*the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)<br />
*Since language is specified at a collection level, language aware queries should only be used for language neutral collections.<br />
<br />
==== Lemmatization ====<br />
There is one real implementation of the lemmatizer plugin available, in addition to the dummy plugin that always returns "" (the empty string).<br />
<br />
The plugin implementation is selected when the FullTextLookupResource is created:<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
org.diligentproject.indexservice.common.linguistics.languageidplugin.DummyLemmatizerPlugin<br />
<br />
=====Fastlemmatizer=====<br />
<br />
The Fast lemmatizer module is developed by Fast. The lemmatizer module depends on .aut files (config files) for the language to be lemmatized. Both expansion and reduction are supported, but expansion is used: the terms (nouns and adjectives) in the query are expanded.<br />
<br />
The lemmatizer is configured for the following languages: German, Italian, Portuguese, French, English, Spanish, Dutch, Norwegian.<br />
To support more languages, additional .aut files must be loaded and the config file LemmatizationQueryExpansion.xml must be updated.<br />
<br />
The lemmatizer is implemented in C++, and the C++ code is loaded as a shared library object. The Fast lemmatizer plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.<br />
<br />
The Java native interface is generated using Swig.<br />
<br />
The Fast lemmatizer module is loaded by the plugin (using the LemmatizerFactory)<br />
<br />
org.diligentproject.indexservice.linguistics.fastplugin.FastLemmatizerPlugin<br />
<br />
The plugin loads the shared object library and, when init is called, instantiates the native C++ objects.<br />
<br />
The Fastlemmatizer is in the SVN module: trunk/linguistics/fastlinguistics/fastlemmatizer<br />
<br />
The lib catalog contains one subcatalog of RHE3 shared objects (.so) and one of RHE4 shared objects. The etc catalog contains the config files. The license string is contained in the config file LemmatizerConfigQueryExpansion.xml.<br />
The shared library object is called liblemmatizer.so.<br />
<br />
The configuration files for the lemmatizer module are installed in $GLOBUS_LOCATION/etc/lemmatizer.<br />
<br />
The org_diligentproject_indexservice_lemmatizer.jar contains the plugin FastLemmatizerPlugin (that is loaded by the LemmatizerFactory) and the Java native interface to the shared library.<br />
<br />
The shared library liblemmatizer.so is deployed in the $GLOBUS_LOCATION/lib catalogue.<br />
<br />
The '''$GLOBUS_LOCATION/lib''' must therefore be included in the '''LD_LIBRARY_PATH''' environment variable.<br />
<br />
===== Fast lemmatizer configuration =====<br />
The LemmatizerConfigQueryExpansion.xml contains the paths to the .aut files that are loaded when a lemmatizer is instantiated.<br />
<br />
<lemmas active="yes" parts_of_speech="NA">etc/lemmatizer/resources/dictionaries/lemmatization/en_NA_exp.aut</lemmas><br />
<br />
The path is relative to the environment variable GLOBUS_LOCATION. If this path is wrong, the Java virtual machine will core dump. <br />
<br />
The license for the Fastlemmatizer plugin:<br />
<br />
===== Fast lemmatizer logging =====<br />
The lemmatizer logs info, debug and error messages to the file "lemmatizer.txt"<br />
<br />
===== Lemmatization Usage =====<br />
The FullTextIndexLookup Service uses expansion during lemmatization: a word in a query is expanded into all known forms of the word. It is of course important to know the language of the query in order to know which words to expand it with. Currently, the same methods used to specify the language for a language-aware query are used to specify the language for the lemmatization process. A way of separating these two specifications (so that lemmatization can be performed without performing a language-aware query) will be made available shortly.<br />
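Expansion can be sketched as follows. The QueryExpansion helper is hypothetical, and the word forms, which would really come from the Fast .aut dictionaries, are passed in explicitly for illustration:<br />

```java
import java.util.List;

// Hypothetical sketch of expansion-based lemmatization of a query term:
// a field query is rewritten as an OR over all known forms of the word.
// The forms would really come from the lemmatizer's dictionaries.
public class QueryExpansion {
    public static String expand(String field, List<String> forms) {
        return field + ":(" + String.join(" OR ", forms) + ")";
    }
}
```

For example, expand("title", List.of("car", "cars")) yields the expanded query title:(car OR cars).<br />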
<br />
==== Linguistics Licenses ====<br />
The current license key for the fastlangid and fastlemmatizer is valid through March 2008.<br />
<br />
If a new license is required please contact: Stefan.debald@fast.no to get a new license key.<br />
<br />
The license must be installed both in the Fastlangid and the Fastlemmatizer module.<br />
<br />
The fastlangid license is installed by updating the SVN text file:<br />
'''linguistics/fastlinguistics/fastlangid/etc/langid/config.txt'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<br />
// The license key<br />
// Contact stefand.debald@fast.no for new license key:<br />
LICSTR=KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG<br />
<br />
A running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/langid/config.txt''' <br />
as described above.<br />
<br />
The fastlemmatizer license is installed by updating the SVN text file:<br />
<br />
'''linguistics/fastlinguistics/fastlemmatizer/etc/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
Use a text editor and replace the 32 character license string with the new license string:<br />
<lemmatization default_mode="query_expansion" default_query_language="en" license="KILMDEPFKHNBNPCBAKONBCCBFLKPOEFG"><br />
<br />
The running system is updated by replacing the license string in the file:<br />
'''$GLOBUS_LOCATION/etc/lemmatizer/LemmatizationConfigQueryExpansion.xml'''<br />
<br />
===Partitioning===<br />
In order to handle situations where an Index replica does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service: in cases where there is not enough space to perform an update/addition on a FullTextIndexLookup Service resource, a new resource will be created to handle all the content which did not fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; however, the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics; this will be fixed shortly.<br />
<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexManagementFactoryService";</nowiki><br />
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();<br />
<br />
managementFactoryEPR = new EndpointReferenceType();<br />
managementFactoryEPR.setAddress(new Address(managementFactoryURI));<br />
managementFactory = managementFactoryLocator<br />
.getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =<br />
new org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResource();<br />
managementCreateArguments.setIndexTypeName(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID("myCollectionID");<br />
managementCreateArguments.setContentType("MetaData"); <br />
<br />
org.gcube.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = <br />
managementFactory.createResource(managementCreateArguments);<br />
<br />
managementInstanceEPR = managementCreateResponse.getEndpointReference();<br />
String indexID = managementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/FullTextIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
updaterFactoryEPR = new EndpointReferenceType();<br />
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));<br />
updaterFactory = updaterFactoryLocator<br />
.getFullTextIndexUpdaterFactoryPortTypePort(updaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource updaterCreateArguments =<br />
new org.gcube.indexservice.fulltextindexupdater.stubs.CreateResource();<br />
<br />
<span style="color:green">//Connect to the correct Index</span><br />
updaterCreateArguments.setMainIndexID(indexID); <br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.fulltextindexupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory<br />
.createResource(updaterCreateArguments);<br />
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); <br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
updaterInstance = updaterInstanceLocator.getFullTextIndexUpdaterPortTypePort(updaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
updaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/FullTextIndexLookupFactoryService";</nowiki><br />
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();<br />
EndpointReferenceType lookupFactoryEPR = null;<br />
EndpointReferenceType lookupEPR = null;<br />
FullTextIndexLookupFactoryPortType lookupFactory = null;<br />
FullTextIndexLookupPortType lookupInstance = null; <br />
<br />
<span style="color:green">//Get factory portType</span><br />
lookupFactoryEPR= new EndpointReferenceType();<br />
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));<br />
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = <br />
new org.gcube.indexservice.fulltextindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;<br />
<br />
lookupCreateResourceArguments.setMainIndexID(indexID); <br />
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);<br />
lookupEPR = lookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);<br />
<br />
<span style="color:green">//Perform a query</span><br />
String query = "good OR evil";<br />
String epr = lookupInstance.query(query); <br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
===Getting statistics from a Lookup resource===<br />
<br />
String statsLocation = lookupInstance.createStatistics(new CreateStatistics());<br />
<br />
<span style="color:green">//Connect to a CMS Running Instance</span><br />
EndpointReferenceType cmsEPR = new EndpointReferenceType();<br />
<nowiki>cmsEPR.setAddress(new Address("http://swiss.domain.ch:8080/wsrf/services/gcube/contentmanagement/ContentManagementServiceService"));</nowiki><br />
ContentManagementServiceServiceAddressingLocator cmslocator = new ContentManagementServiceServiceAddressingLocator();<br />
cms = cmslocator.getContentManagementServicePortTypePort(cmsEPR);<br />
<br />
<span style="color:green">//Retrieve the statistics file from CMS</span><br />
GetDocumentParameters getDocumentParams = new GetDocumentParameters();<br />
getDocumentParams.setDocumentID(statsLocation);<br />
getDocumentParams.setTargetFileLocation(BasicInfoObjectDescription.RAW_CONTENT_IN_MESSAGE);<br />
DocumentDescription description = cms.getDocument(getDocumentParams);<br />
<br />
<span style="color:green">//Write the statistics file from memory to disk </span><br />
File downloadedFile = new File("Statistics.xml");<br />
DecompressingInputStream input = new DecompressingInputStream(<br />
new BufferedInputStream(new ByteArrayInputStream(description.getRawContent()), 2048));<br />
BufferedOutputStream output = new BufferedOutputStream( new FileOutputStream(downloadedFile), 2048);<br />
byte[] buffer = new byte[2048];<br />
int length;<br />
while ( (length = input.read(buffer)) >= 0){<br />
output.write(buffer, 0, length);<br />
}<br />
input.close();<br />
output.close();<br />
--><br />
<br />
<!--=Geo-Spatial Index=<br />
==Implementation Overview==<br />
===Services===<br />
The geo index is implemented through three services, in the same manner as the full text index. They are all implemented according to the Factory pattern:<br />
*The '''GeoIndexManagement Service''' represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of GeoIndexManagement Service, and an index is removed by terminating the corresponding GeoIndexManagement resource. The GeoIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a GeoIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.<br />
*The '''GeoIndexUpdater Service''' is responsible for feeding an Index. One GeoIndexUpdater Service resource can only update a single Index, but one Index can be updated by multiple GeoIndexUpdater Service resources. Feeding is accomplished by instantiating a GeoIndexUpdater Service resource with the EPR of the GeoIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.<br />
*The '''GeoIndexLookup Service''' is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One GeoIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of GeoIndexLookup Service resources. Updates to the Index will be propagated to all GeoIndexLookup Service resources replicating that Index.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:<br />
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]<br />
<br />
===Underlying Technology===<br />
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As we will describe in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.<br />
<br />
===CQL capabilities implementation===<br />
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers that specify the collection and language of the results, the inclusion type of the query, a refiner for further filtering the results, and a ranker for computing a score for each result. All of the modifiers are optional. Let's look at the following example in order to better understand the geosearch relation and its modifiers:<br />
<br />
<pre><br />
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"<br />
</pre><br />
<br />
In this example the results will be the documents that belong to the collection with ID "colA", are in English, and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, will be ordered by "rankerA", which takes "arg1 arg2" as arguments for the ranking operation, and will be returned as the output of this simple CQL query. The "false" indication in the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the higher score must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.<br />
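The modifier syntax above can be sketched in code. The following helper is purely illustrative (GeoSearchQueryBuilder is not part of the service API); it assembles the colID, lang and inclusion modifiers into a geosearch triple, with refiner/ranker modifiers appendable in the same /name="value" form:<br />

```java
// Hypothetical helper: assemble a geosearch CQL triple from its parts.
// Only the colID, lang and inclusion modifiers are shown here.
public class GeoSearchQueryBuilder {
    public static String build(String colID, String lang, int inclusion, String polygonPoints) {
        return String.format(
            "geo geosearch/colID=\"%s\"/lang=\"%s\"/inclusion=\"%d\" \"%s\"",
            colID, lang, inclusion, polygonPoints);
    }

    public static void main(String[] args) {
        // Reproduces the colID/lang/inclusion part of the example above
        System.out.println(build("colA", "en", 0, "1 1 1 10 10 1 10 10"));
    }
}
```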
<br />
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries is to apply boolean algebra rules and transform the initial query into an equivalent one. We aim at producing a query which is a union of operations, each referring to a single R-tree. Additionally, we apply "cut-off" rules that eliminate parts of the initial query that cannot produce any results. Consider the following example of a "cut-off" rule:<br />
<br />
<pre><br />
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>) <br />
</pre> <br />
<br />
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents could be contained, so no document can satisfy the conjunction of the two criteria. This is depicted in the following figure:<br />
<br />
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]<br />
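The cut-off test can be sketched as a simple bounding-rectangle intersection check. This is a simplification for illustration only: the real implementation reasons about the actual polygons, not just their minimal bounding rectangles.<br />

```java
public class CutOffRule {
    // Each rectangle is {minX, minY, maxX, maxY} -- a stand-in for the MBR of a query polygon.
    public static boolean mbrsIntersect(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2]   // overlap on the X axis
            && a[1] <= b[3] && b[1] <= a[3];  // overlap on the Y axis
    }

    public static void main(String[] args) {
        double[] p1 = {1, 1, 10, 10};
        double[] p2 = {20, 20, 30, 30};
        // Disjoint regions: the conjunction of the two criteria can be
        // cut off without ever touching the R-tree.
        System.out.println(mbrsIntersect(p1, p2)); // prints false
    }
}
```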
<br />
The transformation of a initial CQL query to a union of R-tree operations is depicted in the following figure:<br />
<br />
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]<br />
<br />
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.<br />
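The score-based union performed by the MergeSorter can be illustrated with a minimal k-way merge. This is not the actual MergeSorter implementation (which also handles flow control and worker synchronization); it only shows the core idea of merging per-R-tree result streams by descending score:<br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ScoreMergeSketch {
    record Hit(String docID, double score) {}

    // Merge several per-R-tree result lists, each assumed already sorted by
    // descending score, into one list ordered by descending score.
    public static List<Hit> merge(List<List<Hit>> streams) {
        // Heap entry: {streamIndex, positionInStream}, highest score on top
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (x, y) -> Double.compare(streams.get(y[0]).get(y[1]).score(),
                                     streams.get(x[0]).get(x[1]).score()));
        for (int i = 0; i < streams.size(); i++)
            if (!streams.get(i).isEmpty()) heap.add(new int[]{i, 0});
        List<Hit> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(streams.get(top[0]).get(top[1]));
            if (top[1] + 1 < streams.get(top[0]).size())
                heap.add(new int[]{top[0], top[1] + 1}); // advance that stream
        }
        return merged;
    }
}
```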
<br />
===RowSet===<br />
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1, mandatory, and x2, set equal to x1 if not provided) as well as start and end Y coordinates (y1, mandatory, and y2, set equal to y1 if not provided). In addition, it may contain any number of FIELD elements, each with a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. In a similar fashion to fulltext indices, a row in a GeoROWSET may contain only some of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:<br />
<pre><br />
<ROWSET colID="colA" lang="en"><br />
<ROW id="doc1" x1="4321" y1="1234"><br />
<FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD><br />
<FIELD name="EndTime">2001-05-27T14:38:03.764</FIELD><br />
</ROW><br />
<ROW id="doc2" x1="1337" x2="4123" y1="1337" y2="6534"><br />
<FIELD name="StartTime">2001-06-27</FIELD><br />
<FIELD name="EndTime">2001-07-27</FIELD><br />
</ROW><br />
</ROWSET><br />
</pre><br />
<br />
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.<br />
<br />
===GeoIndexType===<br />
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index, is specified through a GeoIndexType: an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list with all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:<br />
<br />
<pre><br />
<index-type><br />
<field-list><br />
<field name="StartTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
<field name="EndTime"><br />
<type>date</type><br />
<return>yes</return><br />
</field><br />
</field-list><br />
</index-type><br />
</pre><br />
<br />
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:<br />
<br />
*'''type''' specifies the data type of the field. Accepted values are:<br />
**SHORT - A number fitting into a Java "short"<br />
**INT - A number fitting into a Java "int"<br />
**LONG - A number fitting into a Java "long"<br />
**DATE - A date in the format yyyy-MM-dd'T'HH:mm:ss.s where only yyyy is mandatory<br />
**FLOAT - A decimal number fitting into a Java "float"<br />
**DOUBLE - A decimal number fitting into a Java "double"<br />
**STRING - A string with a maximum length of approximately 100 characters<br />
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.<br />
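To make the DATE rule ("only yyyy is mandatory") concrete, here is a hypothetical normalizer, not part of the index service, that fills missing components with defaults and converts the value to epoch seconds, the representation the GeoIndex uses internally for dates (as noted in the RankEvaluator walkthrough below). The defaulting choices (January 1st, midnight, fractional seconds ignored) are assumptions for illustration:<br />

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class DateFieldParser {
    // Hypothetical normalizer for values like "2001", "2001-05-27" or
    // "2001-05-27T14:35:25.523": missing date parts default to January 1st,
    // missing time parts to midnight; fractional seconds are ignored.
    public static long toEpochSeconds(String value) {
        int t = value.indexOf('T');
        String datePart = t < 0 ? value : value.substring(0, t);
        String timePart = t < 0 ? "" : value.substring(t + 1);
        String[] d = datePart.split("-");
        int year  = Integer.parseInt(d[0]);
        int month = d.length > 1 ? Integer.parseInt(d[1]) : 1;
        int day   = d.length > 2 ? Integer.parseInt(d[2]) : 1;
        String[] c = timePart.isEmpty() ? new String[0] : timePart.split("[:.]");
        int hour = c.length > 0 ? Integer.parseInt(c[0]) : 0;
        int min  = c.length > 1 ? Integer.parseInt(c[1]) : 0;
        int sec  = c.length > 2 ? Integer.parseInt(c[2]) : 0;
        return LocalDateTime.of(year, month, day, hour, min, sec)
                            .toEpochSecond(ZoneOffset.UTC);
    }
}
```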
<br />
===Plugin Framework===<br />
As explained in the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] section, which fields a GeoIndex instance should contain can be dynamically specified through a GeoIndexType provided during GeoIndexManagement initialization. However, since new GeoIndexTypes can be added at any time with any number of new fields, there is no way for the GeoIndex itself to know how to use the information in such fields in any meaningful manner when processing a query; a static generic algorithm for processing such information would drastically limit the usefulness of the information. In order to allow for dynamic introduction of field evaluation algorithms capable of handling the dynamic nature of IndexTypes, a plugin framework was introduced. The framework allows for the creation of GeoIndexType-specific evaluators handling ranking and refinement.<br />
<br />
====Ranking====<br />
The results of a query are sorted according to their rank, and their ranks are also returned to the caller. A RankEvaluator plugin is used to determine the rank of objects. It is provided with the query region, Object data, GeoIndexType and an optional set of plugin specific arguments, and is expected to use this information in order to return a meaningful rank of each object.<br />
<br />
====Refinement====<br />
The GeoIndex uses two-step processing in order to process a query. Firstly, a very efficient filtering step finds all possible hits (along with some false hits) using the minimal bounding rectangle (MBR) of the query region. Then, a more costly refinement step uses additional object and query information in order to eliminate all the false hits. While the filtering step is handled internally in the index, the refinement step is handled by a Refiner plugin. It is provided with the query region, object data, GeoIndexType and an optional set of plugin-specific arguments, and is expected to use this information in order to determine whether an object matches the query or not.<br />
<br />
====Creating a Rank Evaluator====<br />
A RankEvaluator plugin has to extend the abstract class org.gcube.indexservice.geo.ranking.RankEvaluator which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the RankEvaluator plugin, providing the plugin with any arguments supplied in the query. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public double rank(Object entry) -- the method that calculates the rank of an entry. <br />
<br />
<br />
In addition, the RankEvaluator abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType before calling initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Ok, simple enough... So let's create a RankEvaluator plugin. We'll assume that for a certain use case, entries which span over a long period of time are of less interest than objects which span over a short period of time. Since we're dealing with TimeSpans, we'll assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier.<br />
<br />
The first thing we need to do, is to create a class which extends RankEvaluator:<br />
<pre><br />
package org.mojito.ranking;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
<br />
}<br />
</pre><br />
<br />
Next, we'll implement the isIndexTypeCompatible method. To do this, we need a way of determining if the fields we need are present in the GeoIndexType argument. Luckily, GeoIndexType contains a method called ''containsField'' which expects the String name and GeoIndexField.DataType (date, double, float, int, long, short or string) type of the field in question as arguments. In addition, we'll implement the initialize() method, which we'll leave empty as the plugin we are creating doesn't need to handle any arguments.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
Last, but not least, we need to implement the rank() method. This is of course the method which calculates a rank for an entry, based on the query polygon, any extra arguments and the different fields of the entry. In our implementation, we'll simply calculate the time span, and divide 1 by this number in order to get a quick and dirty rank. Keep in mind that this method is not called for all the entries resulting from the R-Tree filtering step, but only a subset roughly fitting the ResultSet page size. This means that somewhat computationally heavy operations can be performed (if needed) without drastically lowering response time. Please also note how the getDataField() method is used in order to retrieve the evaluated fields from the entry data, and how the result is cast to ''Long'' (even though we are dealing with dates). The reason for this is that the GeoIndex internally represents a date as a long containing the number of seconds from the Epoch. If we wanted to evaluate the Minimal Bounding Rectangle (MBR) of the entries, we could access them through ''entry.getBounds()''.<br />
<pre><br />
package org.mojito.ranking;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.ranking.RankEvaluator;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRanker extends RankEvaluator{<br />
public void initialize(String[] args) {}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public double rank(Object obj){<br />
Entry entry = (Entry)obj;<br />
Data data = (Data)entry.getData();<br />
Long entryStartTime = (Long) this.getDataField("StartTime", data);<br />
Long entryEndTime = (Long) this.getDataField("EndTime", data);<br />
long spanSize = entryEndTime - entryStartTime;<br />
<br />
return 1.0 / (spanSize + 1);<br />
}<br />
<br />
}<br />
</pre><br />
<br />
<br />
And there we are! Our first working RankEvaluator plugin.<br />
<br />
====Creating a Refiner====<br />
A Refiner plugin has to extend the abstract class org.gcube.indexservice.geo.refinement.Refiner which contains three abstract methods:<br />
<br />
*abstract public void initialize(String args[]) -- a method called during the initiation of the Refiner plugin, providing the plugin with any arguments supplied in the query. All arguments are given as Strings, and it is up to the plugin to parse each string into the datatype it needs.<br />
*abstract public boolean isIndexTypeCompatible(GeoIndexType indexType) -- should be able to determine whether this plugin can be used by an index conforming to the GeoIndexType argument<br />
*abstract public List<Entry> refine(List<Entry> entries); -- the method responsible for refining a list of results. <br />
<br />
<br />
In addition, the Refiner abstract class implements two other methods worth noting<br />
*final public void init(Polygon polygon, InclusionType containmentMethod, GeoIndexType indexType, String args[]) -- initializes the protected variables Polygon polygon, Envelope envelope, InclusionType containmentMethod and GeoIndexType indexType before calling the abstract initialize() with the last argument. This means that all four protected variables are available in the initialize() method.<br />
*protected Object getDataField(String field, Data data) -- a method used to retrieve the contents of a specific GeoIndexType field from an org.geotools.index.Data object conforming to the GeoIndexType used by the plugin.<br />
<br />
<br />
Quite similar to the RankEvaluator, isn't it? So let's create a Refiner plugin to go with the [[Geographical/Spatial Index#Creating a Rank Evaluator|previously created RankEvaluator]]. We'll still assume that the data stored in the index will have a "StartTime" field and an "EndTime" field, in accordance with the [[Geographical/Spatial Index#GeoIndexType|GeoIndexType]] created earlier. The "shorter is better" notion from the RankEvaluator example still holds true, and we want to create a plugin which refines a query by removing all objects which span over a time longer than a maxSpanSize value, avoiding those ridiculous everlasting objects... The maxSpanSize value will be provided to the plugin as an initialization argument.<br />
<br />
The first thing we need to do, is to create a class which extends Refiner:<br />
<pre><br />
package org.mojito.refinement;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner{<br />
<br />
}<br />
</pre><br />
<br />
The isIndexTypeCompatible method is implemented in a similar manner as for the SpanSizeRanker. However, in this plugin we have to pay closer attention to the initialize() function, since we expect the maxSpanSize to be given as an argument. Since maxSpanSize is the only argument, the String array argument of initialize(String[] args) will contain a single element holding a String representation of the maxSpanSize. In order for this value to be usable, we will parse it to a long representing the maxSpanSize in seconds (the same unit in which the GeoIndex stores dates, as seconds from the Epoch).<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
} <br />
}<br />
</pre><br />
<br />
And once again we've saved the best, or at least the most important, for last; the refine() implementation is where we decide how to refine the query results. It takes a list of Entry objects as an argument, and is expected to return a similar (though usually smaller) list of Entry objects as a result. As with the RankEvaluator, the synchronization with the ResultSet page size allows for quite computationally heavy operations; however, we have little use for that in this example. We will simply calculate the time span of each entry in the argument List and compare it to the maxSpanSize value. If it is smaller or equal, we'll add it to the results List.<br />
<br />
<pre><br />
package org.mojito.refinement;<br />
<br />
import java.util.ArrayList;<br />
import java.util.List;<br />
<br />
import org.gcube.indexservice.common.GeoIndexField;<br />
import org.gcube.indexservice.common.GeoIndexType;<br />
import org.gcube.indexservice.geo.refinement.Refiner;<br />
import org.geotools.index.Data;<br />
import org.geotools.index.rtree.Entry;<br />
<br />
<br />
public class SpanSizeRefiner extends Refiner {<br />
private long maxSpanSize;<br />
<br />
public void initialize(String[] args) {<br />
this.maxSpanSize = Long.parseLong(args[0]);<br />
}<br />
<br />
public boolean isIndexTypeCompatible(GeoIndexType indexType) {<br />
return indexType.containsField("StartTime", GeoIndexField.DataType.DATE) && <br />
indexType.containsField("EndTime", GeoIndexField.DataType.DATE);<br />
}<br />
<br />
public List<Entry> refine(List<Entry> entries){<br />
ArrayList<Entry> returnList = new ArrayList<Entry>();<br />
Data data;<br />
Long entryStartTime = null, entryEndTime = null;<br />
<br />
<br />
for(Entry entry : entries){<br />
data = (Data)entry.getData();<br />
entryStartTime = (Long) this.getDataField("StartTime", data);<br />
entryEndTime = (Long) this.getDataField("EndTime", data);<br />
<br />
if (entryEndTime < entryStartTime){<br />
long temp = entryEndTime;<br />
entryEndTime = entryStartTime;<br />
entryStartTime = temp; <br />
}<br />
if (entryEndTime - entryStartTime <= maxSpanSize){<br />
returnList.add(entry);<br />
}<br />
}<br />
return returnList;<br />
}<br />
}<br />
</pre><br />
<br />
And that's all there is to it! We have created our first Refinement plugin, capable of getting rid of those annoying long-lived objects.<br />
<br />
===Query language===<br />
A query is specified through a SearchPolygon object, containing the points of the vertices of the query region, an optional RankingRequest object and an optional list of RefinementRequest objects. A RankingRequest object contains the String ID of the RankEvaluator to use, along with an optional String array of arguments to be used by the specified RankEvaluator. Similarly, a RefinementRequest contains the String ID of the Refiner to use, along with an optional String array of arguments to be used by the specified Refiner.<br />
<br />
<br />
<br />
==Usage Example==<br />
===Create a Management Resource===<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoManagementFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexManagementFactoryService";</nowiki><br />
GeoIndexManagementFactoryServiceAddressingLocator geoManagementFactoryLocator = new GeoIndexManagementFactoryServiceAddressingLocator();<br />
<br />
EndpointReferenceType geoManagementFactoryEPR = new EndpointReferenceType();<br />
geoManagementFactoryEPR.setAddress(new Address(geoManagementFactoryURI));<br />
GeoIndexManagementFactoryPortType geoManagementFactory = geoManagementFactoryLocator<br />
    .getGeoIndexManagementFactoryPortTypePort(geoManagementFactoryEPR);<br />
<br />
<span style="color:green">//Create generator resource and get endpoint reference of WS-Resource.</span><br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResource managementCreateArguments =<br />
    new org.gcube.indexservice.geoindexmanagement.stubs.CreateResource();<br />
<br />
managementCreateArguments.setIndexTypeID(indexType);<span style="color:green">//Optional (only needed if not provided in RS)</span><br />
managementCreateArguments.setIndexID(indexID);<span style="color:green">//Optional (should usually not be set, and the service will create the ID)</span><br />
managementCreateArguments.setCollectionID(new String[] {collectionID});<br />
managementCreateArguments.setGeographicalSystem("WGS_1984");<br />
managementCreateArguments.setUnitOfMeasurement("DD");<br />
managementCreateArguments.setNumberOfDecimals(4);<br />
<br />
org.gcube.indexservice.geoindexmanagement.stubs.CreateResourceResponse geoManagementCreateResponse = <br />
    geoManagementFactory.createResource(managementCreateArguments);<br />
EndpointReferenceType geoManagementInstanceEPR = geoManagementCreateResponse.getEndpointReference();<br />
String indexID = geoManagementCreateResponse.getIndexID();<br />
<br />
===Create an Updater Resource and start feeding===<br />
<br />
EndpointReferenceType geoUpdaterFactoryEPR = null;<br />
EndpointReferenceType geoUpdaterInstanceEPR = null;<br />
GeoIndexUpdaterFactoryPortType geoUpdaterFactory = null;<br />
GeoIndexUpdaterPortType geoUpdaterInstance = null;<br />
GeoIndexUpdaterServiceAddressingLocator geoUpdaterInstanceLocator = new GeoIndexUpdaterServiceAddressingLocator();<br />
GeoIndexUpdaterFactoryServiceAddressingLocator updaterFactoryLocator = new GeoIndexUpdaterFactoryServiceAddressingLocator();<br />
<br />
<span style="color:green">//Get the factory portType</span><br />
<nowiki>String geoUpdaterFactoryURI = "http://some.domain.no:8080/wsrf/services/gcube/index/GeoIndexUpdaterFactoryService";</nowiki> <span style="color:green">//could be on any node</span><br />
geoUpdaterFactoryEPR = new EndpointReferenceType();<br />
geoUpdaterFactoryEPR.setAddress(new Address(geoUpdaterFactoryURI));<br />
geoUpdaterFactory = updaterFactoryLocator<br />
.getGeoIndexUpdaterFactoryPortTypePort(geoUpdaterFactoryEPR);<br />
<br />
<br />
<span style="color:green">//Create updater resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResource geoUpdaterCreateArguments =<br />
new org.gcube.indexservice.geoindexupdater.stubs.CreateResource();<br />
<br />
geoUpdaterCreateArguments.setMainIndexID(indexID);<br />
<br />
<br />
<span style="color:green">//Now let's insert some data into the index... Firstly, get the updater EPR.</span><br />
org.gcube.indexservice.geoindexupdater.stubs.CreateResourceResponse geoUpdaterCreateResponse = geoUpdaterFactory<br />
    .createResource(geoUpdaterCreateArguments);<br />
geoUpdaterInstanceEPR = geoUpdaterCreateResponse.getEndpointReference();<br />
<br />
<br />
<span style="color:green">//Get updater instance PortType</span><br />
geoUpdaterInstance = geoUpdaterInstanceLocator.getGeoIndexUpdaterPortTypePort(geoUpdaterInstanceEPR);<br />
<br />
<br />
<span style="color:green">//read the EPR of the ResultSet containing the ROWSETs to feed into the index </span> <br />
BufferedReader in = new BufferedReader(new FileReader(eprFile));<br />
String line;<br />
String resultSetLocator = "";<br />
while((line = in.readLine())!=null){<br />
resultSetLocator += line;<br />
}<br />
<br />
<span style="color:green">//Tell the updater to start gathering data from the ResultSet</span><br />
geoUpdaterInstance.process(resultSetLocator);<br />
<br />
===Create a Lookup resource and perform a query===<br />
<br />
<span style="color:green">//Let's put it on another node for fun...</span><br />
<nowiki>String geoLookupFactoryURI = "http://another.domain.no:8080/wsrf/services/gcube/index/GeoIndexLookupFactoryService";</nowiki><br />
EndpointReferenceType geoLookupFactoryEPR = null;<br />
EndpointReferenceType geoLookupEPR = null;<br />
GeoIndexLookupFactoryServiceAddressingLocator geoFactoryLocator = new GeoIndexLookupFactoryServiceAddressingLocator();<br />
GeoIndexLookupServiceAddressingLocator geoLookupInstanceLocator = new GeoIndexLookupServiceAddressingLocator();<br />
GeoIndexLookupFactoryPortType geoIndexLookupFactory = null;<br />
GeoIndexLookupPortType geoIndexLookupInstance = null;<br />
<br />
<span style="color:green">//Get factory portType</span><br />
geoLookupFactoryEPR = new EndpointReferenceType();<br />
geoLookupFactoryEPR.setAddress(new Address(geoLookupFactoryURI));<br />
geoIndexLookupFactory = geoFactoryLocator.getGeoIndexLookupFactoryPortTypePort(geoLookupFactoryEPR);<br />
<br />
<span style="color:green">//Create resource and get endpoint reference of WS-Resource</span><br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResource geoLookupCreateResourceArguments = <br />
new org.gcube.indexservice.geoindexlookup.stubs.CreateResource();<br />
org.gcube.indexservice.geoindexlookup.stubs.CreateResourceResponse geoLookupCreateResponse = null;<br />
<br />
geoLookupCreateResourceArguments.setMainIndexID(indexID); <br />
geoLookupCreateResponse = geoIndexLookupFactory.createResource(geoLookupCreateResourceArguments);<br />
geoLookupEPR = geoLookupCreateResponse.getEndpointReference(); <br />
<br />
<span style="color:green">//Get instance PortType</span><br />
geoIndexLookupInstance = geoLookupInstanceLocator.getGeoIndexLookupPortTypePort(geoLookupEPR);<br />
<br />
<span style="color:green">//Start creating the query</span><br />
SearchPolygon search = new SearchPolygon();<br />
<br />
Point[] vertices = new Point[] {new Point(-100, 11), new Point(-100, -100),<br />
new Point(100, -100), new Point(100, 11)};<br />
<br />
<span style="color:green">//A request to rank by the ranker created in the previous example</span><br />
RankingRequest ranker = new RankingRequest(new String[]{}, "SpanSizeRanker");<br />
<br />
<span style="color:green">//A request to use the refiner created in the previous example. <br />
//Please make note of the refiner argument in the String array.</span><br />
RefinementRequest refinement = new RefinementRequest(new String[]{"100000"}, "SpanSizeRefiner");<br />
<br />
<span style="color:green">//Perform the query</span><br />
search.setVertices(vertices);<br />
search.setRanker(ranker);<br />
search.setRefinementList(new RefinementRequest[]{refinement});<br />
search.setInclusion(InclusionType.contains);<br />
String resultEpr = geoIndexLookupInstance.search(search);<br />
<br />
<span style="color:green">//Print the results to screen. (refer to the [[ResultSet Framework]] page for a more detailed explanation)</span><br />
RSXMLReader reader=null;<br />
ResultElementBase[] results;<br />
<br />
try{<br />
<span style="color:green">//create a reader for the ResultSet we created</span><br />
reader = RSXMLReader.getRSXMLReader(new RSLocator(resultEpr)); <br />
<br />
<span style="color:green">//Print each part of the RS to std.out</span><br />
System.out.println("<Results>");<br />
do{<br />
System.out.println(" <Part>");<br />
if (reader.getNumberOfResults() > 0){<br />
results = reader.getResults(ResultElementGeneric.class);<br />
for(int i = 0; i < results.length; i++ ){<br />
System.out.println(" "+results[i].toXML());<br />
}<br />
}<br />
System.out.println(" </Part>");<br />
if(!reader.getNextPart()){<br />
break;<br />
}<br />
}<br />
while(true);<br />
System.out.println("</Results>");<br />
}<br />
catch(Exception e){<br />
e.printStackTrace();<br />
}<br />
<br />
<br />
--><br />
<br />
=Forward Index=<br />
<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to a single key) or multi-dimensional (referring to several keys) range queries, retrieving documents whose key values fall within the corresponding range. The Forward Index Service design is essentially the same as the Full Text Index Service design. The forward index supports the following schema for each key-value pair:<br />
* key: integer, value: string<br />
* key: float, value: string<br />
* key: string, value: string<br />
* key: date, value: string<br />
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.<br />
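To make the role of the schema concrete, the following sketch models it as a typed key map. This is purely illustrative: the class and method names are hypothetical and not part of the gCube API.<br />
<br />
<source lang="java"><br />
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only -- not the actual Forward Index implementation.
// It models the four supported key types and the schema that is supplied
// as a parameter when the index is created.
public class ForwardIndexSchema {
    public enum KeyType { INTEGER, FLOAT, STRING, DATE }

    private final Map<String, KeyType> keys = new LinkedHashMap<String, KeyType>();

    // Register an indexed key together with its declared type.
    public ForwardIndexSchema addKey(String name, KeyType type) {
        keys.put(name, type);
        return this;
    }

    // The schema must be known before feeding, so undeclared keys are rejected.
    public boolean accepts(String keyName) {
        return keys.containsKey(keyName);
    }

    public KeyType typeOf(String keyName) {
        return keys.get(keyName);
    }
}
</source><br />
<br />
For example, a schema built as addKey("date", KeyType.DATE).addKey("title", KeyType.STRING) would describe an index with one date key and one string key; values are always strings.<br />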
<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The new forward index is implemented through one service. It is implemented according to the Factory pattern:<br />
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node, consolidating the three services that were used in the old Full Text Index.<br />
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:<br />
<br />
[[File:ForwardIndexNodeService.png|frame|none|ForwardIndexNode Service]]<br />
<br />
<br />
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best practice is to have one resource (one node) on each gHN that makes up the cluster.<br />
Clusters can be created in almost the same way that a group of lookups, updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.<br />
Clusters are distinguished through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false respectively.<br />
Example:<br />
<br />
<pre><br />
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
or<br />
<pre><br />
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" /><br />
</pre><br />
<br />
<br />
Couchbase, which is the underlying technology of the new Forward Index, allows configuring the number of replicas for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.<br />
Example:<br />
<br />
<pre><br />
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials so that they can connect to each other.<br />
<br />
Example:<br />
<br />
<pre><br />
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" /><br />
<br />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" /><br />
<br />
<br />
<!-- should be the same for all nodes in the cluster --><br />
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" /><br />
<br />
<!-- should be the same for all nodes in the cluster --> <br />
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" /><br />
</pre><br />
<br />
<br />
The total '''RAM''' quota of each bucket (index) can be specified as well in the deploy-jndi-config.xml.<br />
<br />
Example: <br />
<pre><br />
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" /><br />
</pre><br />
<br />
<br />
The fact that Couchbase is not embedded (unlike ElasticSearch) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately, the first time after the Couchbase server installation, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.<br />
<br />
<source lang="bash"><br />
# Initialize node run: <br />
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down you need to remove the couchbase server from the cluster as well and reinitialize it in order to restart it later run : <br />
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD<br />
<br />
# If the service is down and you want for some reason to delete the bucket (index) to rebuild it you can simply run : <br />
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD<br />
</source><br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]<br />
<br />
The Forward Index Lookup will exploit any possibility to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.<br />
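The transformation into a union of conjunctions is essentially a rewrite of the query tree into disjunctive normal form. The sketch below is a conceptual illustration only; the class names are hypothetical, and the real service operates on CQL parse trees rather than on this toy model.<br />
<br />
<source lang="java"><br />
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch: rewrite a boolean query tree into a union (outer list)
// of range-query conjunctions (inner lists), i.e. disjunctive normal form.
public class CqlToRanges {
    public interface Node {}
    public static class Criterion implements Node { // a single-key range criterion
        public final String key, op, value;
        public Criterion(String key, String op, String value) {
            this.key = key; this.op = op; this.value = value;
        }
        public String toString() { return key + " " + op + " " + value; }
    }
    public static class And implements Node {
        public final Node l, r;
        public And(Node l, Node r) { this.l = l; this.r = r; }
    }
    public static class Or implements Node {
        public final Node l, r;
        public Or(Node l, Node r) { this.l = l; this.r = r; }
    }

    // Each inner list is one range query that the index can execute directly.
    public static List<List<Criterion>> toDnf(Node n) {
        List<List<Criterion>> out = new ArrayList<List<Criterion>>();
        if (n instanceof Criterion) {
            List<Criterion> single = new ArrayList<Criterion>();
            single.add((Criterion) n);
            out.add(single);
        } else if (n instanceof Or) {      // a union splits into both branches
            out.addAll(toDnf(((Or) n).l));
            out.addAll(toDnf(((Or) n).r));
        } else if (n instanceof And) {     // AND distributes over the branches
            for (List<Criterion> a : toDnf(((And) n).l)) {
                for (List<Criterion> b : toDnf(((And) n).r)) {
                    List<Criterion> merged = new ArrayList<Criterion>(a);
                    merged.addAll(b);
                    out.add(merged);
                }
            }
        }
        return out;
    }
}
</source><br />
<br />
For example, (date >= 2000) AND (lang = es OR lang = en) is rewritten into a union of two range queries, each a conjunction of single-key criteria; parts of the union that can be proven empty can then be cut off before execution.<br />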
<br />
===RowSet===<br />
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.<br />
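A client that prepares such content programmatically just has to serialize its key/value pairs into this shape. The helper below is a hypothetical illustration, not part of the gCube API; it emits one <TUPLE> in which every pair appears both as an indexable <KEY> and as a presentable <FIELD>, as in the example above.<br />
<br />
<source lang="java"><br />
// Hypothetical helper, not part of the gCube API: serializes parallel
// name/value arrays into a minimal ROWSET insert document. Values are
// assumed to be already XML-escaped.
public class RowsetBuilder {
    public static String buildInsertTuple(String[] names, String[] values) {
        StringBuilder sb = new StringBuilder("<ROWSET><INSERT><TUPLE>");
        // indexable part
        for (int i = 0; i < names.length; i++) {
            sb.append("<KEY><KEYNAME>").append(names[i])
              .append("</KEYNAME><KEYVALUE>").append(values[i])
              .append("</KEYVALUE></KEY>");
        }
        // presentable part
        sb.append("<VALUE>");
        for (int i = 0; i < names.length; i++) {
            sb.append("<FIELD name=\"").append(names[i]).append("\">")
              .append(values[i]).append("</FIELD>");
        }
        sb.append("</VALUE></TUPLE></INSERT></ROWSET>");
        return sb.toString();
    }
}
</source><br />
<br />
Calling buildInsertTuple with the names and values of the example above would reproduce its ROWSET, minus whitespace.<br />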
<br />
==Usage Example==<br />
<br />
===Create a ForwardIndex Node, feed and query using the corresponding client library===<br />
<br />
<br />
<source lang="java"><br />
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();<br />
<br />
//Create a resource<br />
CreateResource createResource = new CreateResource();<br />
CreateResourceResponse output = proxyRandomf.createResource(createResource);<br />
<br />
//Get the reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();<br />
<br />
//or get a random reference<br />
StatefulQuery q = ForwardIndexNodeDSL.getSource().build();<br />
<br />
List<EndpointReference> refs = q.fire();<br />
<br />
//Get a proxy<br />
try {<br />
ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();<br />
//Feed<br />
proxyRandom.feedLocator(locator);<br />
//Query<br />
proxyRandom.query(query);<br />
} catch (ForwardIndexNodeException e) {<br />
//Handle the exception<br />
}<br />
<br />
</source><br />
<br />
<!--=Forward Index=<br />
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional(that refer to one key only) or multi-dimensional(that refer to many keys) range queries, for retrieving documents with keyvalues within the corresponding range. The '''Forward Index Service''' design pattern is similar to/the same as the '''Full Text Index''' Service design and the '''Geo Index''' Service design.<br />
The forward index supports the following schema for each key value pair:<br />
<br />
key; integer, value; string<br />
<br />
key; float, value; string<br />
<br />
key; string, value; string<br />
<br />
key; date, value;string<br />
<br />
The schema for an index is given as a parameter when the index is created.<br />
The schema must be known in order to be able to instantiate a class that is capable of comparing the two keys (implements java.util.comparator). <br />
The Objects stored in the database can be anything.<br />
There is no limit to the length of the keys or values (except for the typed keys).<br />
<br />
==Implementation Overview==<br />
===Services===<br />
The forward index is implemented through three services. They are all implemented according to the factory-instance pattern:<br />
*An instance of '''ForwardIndexManagement Service''' represents an index and manages this index. The life-cycle of the index is the same as the life-cycle of the management instance; the index is created when the '''ForwardIndexManagement''' instance is created, and the index is terminated (deleted) when the '''ForwardIndexManagement''' instance resource is removed. The '''ForwardIndexManagement Service''' manage the life-cycle and properties of the forward index. It co-operates with instances of the '''ForwardIndexUpdater Service''' when feeding content into the index, and with instances of the '''ForwardIndexLookup Service''' for getting content from the index. The Content Management service is used for safe storage of an index. A logical file is established in Content Management when the index is created. The index is retrieved from Content Management and established on the local node when an existing forward index is dynamically deployed on a node. The logical file in Content Management is deleted when the '''ForwardIndexManagement''' instance is deleted.<br />
*The '''ForwardIndexUpdater Service''' is responsible for feeding content into the forward index. The content of the forward index consists of key value pairs. A '''ForwardIndexUpdater Service''' resource updates a single Index. One index may be updated by several '''ForwardIndexUpdater Service''' instances simultaneously. When feeding the index, a '''ForwardIndexUpdater Service''' is created, with the EPR of the '''FullTextIndexManagement''' resource connected to the Index to update. The '''ForwardIndexUpdater''' instance is connected to a ResultSet that contains the content to be fed to the Index.<br />
*The '''ForwardIndexLookup Service''' is responsible receiving queries for the index, and returning responses that matches the queries. The '''ForwardIndexLookup''' gets a reference to the '''ForwardIndexManagement''' instance that is managing the index, when it is created. It can only query this index. Several '''ForwardIndexLookup''' instances may query the same index. The '''ForwardIndexLookup''' instances gets the index from Content Management, and establishes a local copy of the index on the file system that is queried. The local copy is kept up to date by subscribing for index change notifications that are emitted my the '''ForwardIndexManagement''' instance.<br />
<br />
It is important to note that none of the three services have to reside on the same node; they are only connected through web service calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]].<br />
<br />
===Underlying Technology===<br />
<br />
The Forward Index Lookup is based on BerkeleyDB. A BerkeleyDB B+tree is used as an internal index for each dimension-key. An additional BerkeleyDB key-value store is used for storing the presentable information of each document. The range query that a Forward Index Lookup resource will execute internally is a conjunction of single range criteria, that each refers to a single key. For each criterion the B+tree of the corresponding key is used. The outcome of the initial range query, is the intersection of the documents that satisfy all the criteria. The following figure shows the internal design of Forward Index Lookup:<br />
<br />
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]<br />
<br />
===CQL capabilities implementation===<br />
As stated in the previous section the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. A initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:<br />
<br />
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]<br />
<br />
In a similar manner with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.<br />
<br />
===RowSet===<br />
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET for that can be fed into the '''Forward Index Updater''':<br />
<br />
The row set "schema"<br />
<br />
<pre><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<KEY><br />
<KEYNAME>title</KEYNAME><br />
<KEYVALUE>sun is up</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>ObjectID</KEYNAME><br />
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionID</KEYNAME><br />
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE><br />
</KEY><br />
<KEY><br />
<KEYNAME>gDocCollectionLang</KEYNAME><br />
<KEYVALUE>es</KEYVALUE><br />
</KEY><br />
<VALUE><br />
<FIELD name="title">sun is up</FIELD><br />
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD><br />
<FIELD name="gDocCollectionLang">es</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</pre><br />
<br />
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specifies the presentable information.<br />
<br />
===Test Client ForwardIndexClient===<br />
<br />
The org.gcube.indexservice.clients.ForwardIndexClient<br />
test client is used to test the ForwardIndex.<br />
<br />
The ForwardIndexClient is in the SVN module test/client <br />
The ForwardIndexClient uses a property file ForwardIndex.properties:<br />
<br />
The property file contains the following properties:<br />
ForwardIndexManagementFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexManagementFactoryService<br />
Host=dili02.osl.fast.no<br />
ForwardIndexUpdaterFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexUpdaterFactoryService<br />
ForwardIndexLookupFactoryResource=<br />
/wsrf/services/gcube/index/ForwardIndexLookupFactoryService<br />
geoManagementFactoryResource=<br />
/wsrf/services/gcube/index/GeoIndexManagementFactoryService<br />
Port=8080<br />
Create-ForwardIndexManagementFactory=true<br />
Create-ForwardIndexLookupFactory=true<br />
Create-ForwardIndexUpdaterFactory=true<br />
<br />
The property Host and Port must be edited to point to VO of interest.<br />
<br />
The test client creates the Factory services (gets the EPRs of) and uses the factory services<br />
to create the statefull web services:<br />
<br />
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.<br />
<br />
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.<br />
<br />
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.<br />
<br />
The test clients creates one WSresource of each type, inserts some data into the update, and queries<br />
the data by using the lookup WS resource.<br />
<br />
Inserting data and deleting tuples<br />
Tuples can be inserted and deleted by:<br />
insertingPair(key,value) / deletingPair(key) -simple methods to insert / delete tuples.<br />
process(rowSet) - method to insert / delete a series of tuples.<br />
procesResultSet - method to insert / delete a series of tuples in a rowset inserted into a resultSet.<br />
<br />
Lookup:<br />
Tuples can be queried by :<br />
getEQ_int(key), getEQ_float(key), getEQ_string(key), getEQ_date(key)<br />
getLT_int(key), getLT_float(key), getLT_string(key), getLT_date(key)<br />
getLE_int(key), getLE_float(key), getLE_string(key), getLE_date(key)<br />
getGT_int(key), getGT_float(key), getGT_string(key), getGT_date(key)<br />
getGE_int(key), getGE_float(key), getGE_string(key), getGE_date(key)<br />
getGTandLT_int(keyGT,keyLT), getGTandLT_float(keyGT,keyLT),getGTandLT_string(keyGT,keyLT), getGTandLT_date(keyGT,keyLT)<br />
getGEandLT_int(keyGE,keyLT), getGEandLT_float(keyGE,keyLT),getGEandLT_string(keyGE,keyLT), getGEandLT_date(keyGE,keyLT)<br />
getGTandLE_int(keyGT,keyLE), getGTandLE_float(keyGT,keyLE),getGTandLE_string(keyGT,keyLE), getGTandLE_date(keyGT,keyLE)<br />
getGEandLE_int(keyGE,keyLE), getGEandLE_float(keyGE,keyLE),getGEandLE_string(keyGE,keyLE), getGEandLE_date(keyGE,keyLE)<br />
getAll<br />
<br />
The result is provided to the client by using the Result Set service.<br />
--><br />
<!--=Storage Handling layer=<br />
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.<br />
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.<br />
--><br />
<br />
=Index Common library=<br />
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Creating_Indices_at_the_VO_Level&diff=21135Creating Indices at the VO Level2014-02-25T16:46:53Z<p>Alex.antoniadi: /* Metadata Broker XSLT */</p>
<hr />
<div>[[Category:Administrator's Guide]]<br />
==Indexing Procedure==<br />
<br />
The Indexing procedure refers to the creation of indices for the collections [[ Content Import | imported ]] in a Virtual Organization. It consists of three steps:<br />
<br />
* Creation of the [[ Index Management Framework | Rowset XSLT ]] generic resources, that transform collection data into data that can be fed to an Index.<br />
* Creation of the [[ Index Management Framework | Index type]] generic resources, that define the Index configuration.<br />
* Definition of an [[ IR Bootstrapper | IRBootstrapper]] job that will perform the steps required to create the Indices.<br />
<br />
In the first two steps we create generic resources for the Rowset XSLTs and Index Types through the [[ Resource Management | Resource Management portlet ]]. You can find detailed descriptions for the Rowset data (the output of the Rowset XSLT transformation) in the following sections:<br />
<br />
* [[ Index_Management_Framework#RowSet| Full Text Index Rowset ]]<br />
* [[ Index_Management_Framework#RowSet_2 | Forward Index Rowset ]]<br />
<br />
You can find detailed descriptions for the Index Type definition here:<br />
<br />
* [[ Index_Management_Framework#IndexType | Full Text Index Type ]]<br />
* [[ Index_Management_Framework#Forward_Index | Forward Index key-value pairs ]]<br />
<br />
For the third step, a definition of an IRBootstrapper job is required. You can find the details for defining such a job in the [[ IR Bootstrapper ]] section. To complete the Index creation, the administrator must go to the IRBootstrapper and run the job. The two examples that follow will clarify the three steps.<br />
<br />
==Creating a Full Text and a Forward Index for a OAI-DC collection==<br />
<br />
=== DataTransformation Programs ===<br />
<br />
====FtsRowset_Transformer====<br />
The following transformation program is invoked for fulltext rowset creation. The transformation unit with id="6" chains multiple XSLTs and applies the final XSLT at the end.<br />
<br />
[[File:FtsRowset_Transformer.xml]]<br />
<br />
====FwRowset_Transformer====<br />
The following transformation program is invoked for forward rowset creation. The transformation unit with id="1" chains multiple XSLTs and applies the final XSLT at the end.<br />
<br />
[[File:FwRowset_Transformer.xml]]<br />
<br />
=== Index Types ===<br />
In this section we present the required IndexTypes for both FullText and Forward Indices.<br />
<br />
====FullTextIndexType====<br />
In order to extract the fields from the OAI-DC payload and build the FullText Index the following FullTextIndexType is required:<br />
<br />
<source lang="xml"><br />
<Name>IndexType_ft_oai_dc_1.0</Name><br />
<SecondaryType>FullTextIndexType</SecondaryType><br />
<Description>Definition of the fulltext index type for the 'oai dc' schema</Description><br />
<Body><br />
<index-type name="default"><br />
<field-list sort-xnear-stop-word-threshold="2E8"><br />
<field name="contributor"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="coverage"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="creator"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="date"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="description"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="format"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="identifier"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="language"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="publisher"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="relation"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="rights"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="source"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="subject"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="type"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="ObjectID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="S"><br />
<index>no</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>no</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</Body><br />
</source><br />
<br />
====ForwardIndexType====<br />
In OAI-DC many fields, such as "title" and "creator", have string values, so a ForwardIndexType for string-string key-value pairs is sufficient to<br />
create the Forward Index:<br />
<br />
<source lang="xml"><br />
<SecondaryType>ForwardIndexType</SecondaryType><br />
<Name>IndexType_fwd_string_string</Name><br />
<Description>Definition of the index type 'string_string' for the forward index</Description><br />
<Body><br />
<field-list> <br />
<field name="key"> <br />
<type>string</type><br />
<sort>ascending</sort><br />
</field><br />
<field name="value"><br />
<type>string</type><br />
</field><br />
</field-list><br />
</Body><br />
</source><br />
<br />
Note that, in contrast to the FullTextIndexType, the ForwardIndexType contains no field-to-datatype mapping, but only a declaration of the datatypes supported by the index.<br />
<br />
=== Bootstrapper Configuration ===<br />
The IRBootstrapper portlet requires a Generic Resource to be available on the IS with name: '''IRBootstrapperConfiguration''' and secondary type: '''IRBootstrapperConfig'''.<br />
For more information please refer to the [https://gcube.wiki.gcube-system.org/gcube/index.php/IR_Bootstrapper#Bootstrapper_Static_Configuration IRBootstrapperConfiguration Generic Resource] section.<br />
<br />
An example of the configuration is the following:<br />
<br />
[[File:Bootstrapper_Configuration.xml]]<br />
<br />
=== Metadata Broker XSLT ===<br />
<br />
*BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage<br />
The following XSLT transforms data elements with the oai-dc schema into fulltext rowsets:<br />
<br />
[[File:BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage.xml]]</div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Creating_Indices_at_the_VO_Level&diff=21134Creating Indices at the VO Level2014-02-25T16:45:46Z<p>Alex.antoniadi: /* Bootstrapper Configuration */</p>
<hr />
<div>[[Category:Administrator's Guide]]<br />
==Indexing Procedure==<br />
<br />
The Indexing procedure refers to the creation of indices for the collections [[ Content Import | imported ]] in a Virtual Organization. It consists of three steps:<br />
<br />
* Creation of the [[ Index Management Framework | Rowset XSLT ]] generic resources, which transform collection data into data that can be fed to an Index.<br />
* Creation of the [[ Index Management Framework | Index type]] generic resources, which define the Index configuration.<br />
* Definition of an [[ IR Bootstrapper | IRBootstrapper]] job that will perform the steps required to create the Indices.<br />
<br />
In the first two steps we create generic resources for the Rowset XSLTs and Index Types through the [[ Resource Management | Resource Management portlet ]]. You can find detailed descriptions for the Rowset data (the output of the Rowset XSLT transformation) in the following sections:<br />
<br />
* [[ Index_Management_Framework#RowSet| Full Text Index Rowset ]]<br />
* [[ Index_Management_Framework#RowSet_2 | Forward Index Rowset ]]<br />
<br />
You can find detailed descriptions for the Index Type definition here:<br />
<br />
* [[ Index_Management_Framework#IndexType | Full Text Index Type ]]<br />
* [[ Index_Management_Framework#Forward_Index | Forward Index key-value pairs ]]<br />
<br />
For the third step, a definition of an IRBootstrapper job is required. You can find the details for defining such a job in the [[ IR Bootstrapper ]] section. To complete the Index creation, the administrator must go to the IRBootstrapper and run the job. The two examples that follow will clarify the three steps.<br />
<br />
==Creating a Full Text and a Forward Index for an OAI-DC collection==<br />
<br />
=== DataTransformation Programs ===<br />
<br />
====FtsRowset_Transformer====<br />
The following transformation program is invoked for fulltext rowset creation. The transformation unit with id="6" chains multiple XSLTs and applies the final XSLT at the end.<br />
<br />
[[File:FtsRowset_Transformer.xml]]<br />
<br />
====FwRowset_Transformer====<br />
The following transformation program is invoked for forward rowset creation. The transformation unit with id="1" chains multiple XSLTs and applies the final XSLT at the end.<br />
<br />
[[File:FwRowset_Transformer.xml]]<br />
<br />
=== Index Types ===<br />
In this section we present the required IndexTypes for both FullText and Forward Indices.<br />
<br />
====FullTextIndexType====<br />
In order to extract the fields from the OAI-DC payload and build the FullText Index the following FullTextIndexType is required:<br />
<br />
<source lang="xml"><br />
<Name>IndexType_ft_oai_dc_1.0</Name><br />
<SecondaryType>FullTextIndexType</SecondaryType><br />
<Description>Definition of the fulltext index type for the 'oai dc' schema</Description><br />
<Body><br />
<index-type name="default"><br />
<field-list sort-xnear-stop-word-threshold="2E8"><br />
<field name="contributor"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="coverage"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="creator"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="date"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="description"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="format"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="identifier"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="language"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="publisher"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="relation"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="rights"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="source"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="subject"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="type"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="ObjectID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="S"><br />
<index>no</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>no</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</Body><br />
</source><br />
<br />
====ForwardIndexType====<br />
In OAI-DC many fields, such as "title" and "creator", have string values, so a ForwardIndexType for string-string key-value pairs is sufficient to<br />
create the Forward Index:<br />
<br />
<source lang="xml"><br />
<SecondaryType>ForwardIndexType</SecondaryType><br />
<Name>IndexType_fwd_string_string</Name><br />
<Description>Definition of the index type 'string_string' for the forward index</Description><br />
<Body><br />
<field-list> <br />
<field name="key"> <br />
<type>string</type><br />
<sort>ascending</sort><br />
</field><br />
<field name="value"><br />
<type>string</type><br />
</field><br />
</field-list><br />
</Body><br />
</source><br />
<br />
Note that, in contrast to the FullTextIndexType, the ForwardIndexType contains no field-to-datatype mapping, but only a declaration of the datatypes supported by the index.<br />
<br />
=== Bootstrapper Configuration ===<br />
The IRBootstrapper portlet requires a Generic Resource to be available on the IS with name: '''IRBootstrapperConfiguration''' and secondary type: '''IRBootstrapperConfig'''.<br />
For more information please refer to the [https://gcube.wiki.gcube-system.org/gcube/index.php/IR_Bootstrapper#Bootstrapper_Static_Configuration IRBootstrapperConfiguration Generic Resource] section.<br />
<br />
An example of the configuration is the following:<br />
<br />
[[File:Bootstrapper_Configuration.xml]]<br />
<br />
=== Metadata Broker XSLT ===<br />
<br />
*BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage<br />
The following XSLT transforms data elements with the oai-dc schema into fulltext rowsets:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<ROW><br />
<xsl:for-each select="//*[local-name()='title']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='creator']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='subject']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='description']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='publisher']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='contributor']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='date']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='type']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='format']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='identifier']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='source']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='language']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='relation']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='coverage']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='rights']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='alternative']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='tableOfContents']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='abstract']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='created']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='valid']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='available']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='issued']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='modified']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateAccepted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateCopyrighted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateSubmitted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='extend']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='medium']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isVersionOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasVersion']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isReplacedBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='replaces']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isRequiredBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='requires']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isPartOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasPart']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isReferencedBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='references']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isFormatOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasFormat']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='conformsTo']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='spatial']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='temporal']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='audience']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualMethod']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualPeriodicity']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualPolicy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='instructionalMethod']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='provenance']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='rightsHolder']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='mediator']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='educationLevel']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accessRights']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='license']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='bibliographicCitation']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
</ROW><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
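As an offline illustration (not part of the gCube codebase), the following Python sketch reproduces the extraction logic of the XSLT above for a few Dublin Core elements: select elements by local name regardless of namespace, apply the equivalent of normalize-space(), and drop values that are empty afterwards. The sample record and the function name are hypothetical.<br />

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical oai_dc record used only for illustration.
record = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>  A Sample Title  </dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>   </dc:subject>
</oai_dc:dc>
"""

def local_name(tag):
    # ElementTree encodes namespaced tags as '{uri}localname'.
    return tag.rsplit('}', 1)[-1]

def to_rowset_fields(xml_text, wanted=('title', 'creator', 'subject')):
    """Mimic the XSLT: match elements by local-name(), apply the
    equivalent of normalize-space(), and skip values that end up empty."""
    root = ET.fromstring(xml_text)
    fields = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in wanted:
            value = ' '.join((elem.text or '').split())  # normalize-space()
            if value:
                fields.append((name, value))
    return fields

# The whitespace-only <dc:subject> is dropped, as in the XSLT.
print(to_rowset_fields(record))
```

Note that the actual transformation is performed by the XSLT within the Data Transformation service; this sketch only helps verify what FIELD entries a given record should yield.<br />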
<br />
*BrokerXSLT_wrapperFT<br />
The following XSLT is applied last to the fulltext rowsets in order to remove duplicate and empty FIELD elements:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<ROW><br />
<xsl:for-each select="//ROWSET/ROW/FIELD"><br />
<xsl:copy-of select="self::node()[text() and not(@name = preceding::FIELD/@name and text() = preceding::FIELD/text())]" /><br />
</xsl:for-each><br />
</ROW><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
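The deduplication semantics of this wrapper can be approximated in plain Python as follows (an illustrative sketch, not gCube code; the function name and sample data are hypothetical). One caveat: the XPath test compares @name and text() against the preceding FIELDs independently, so in rare edge cases it can drop more fields than the pairwise check below.<br />

```python
def dedup_fields(fields):
    """Approximate BrokerXSLT_wrapperFT: drop empty FIELDs and any FIELD
    whose (name, value) pair already appeared earlier in document order."""
    seen = set()
    result = []
    for name, value in fields:
        if not value:                # empty FIELD: removed by the text() test
            continue
        if (name, value) in seen:    # duplicate name+value: removed
            continue
        seen.add((name, value))
        result.append((name, value))
    return result

fields = [('title', 'Alpha'), ('creator', ''), ('title', 'Alpha'),
          ('title', 'Beta'), ('creator', 'Doe, Jane')]
print(dedup_fields(fields))
# → [('title', 'Alpha'), ('title', 'Beta'), ('creator', 'Doe, Jane')]
```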
<br />
*BrokerXSLT_dc_anylanguage_to_fwRowset_anylanguage_title_creator_subject_coverage<br />
The following XSLT transforms data elements with the dc schema into forward rowsets:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:variable name="keys"><br />
<key><br />
<keyName>title</keyName><br />
<keyXPath>//*[local-name()='title']</keyXPath><br />
</key><br />
<key><br />
<keyName>creator</keyName><br />
<keyXPath>//*[local-name()='creator']</keyXPath><br />
</key><br />
<key><br />
<keyName>subject</keyName><br />
<keyXPath>//*[local-name()='subject']</keyXPath><br />
</key><br />
<key><br />
<keyName>coverage</keyName><br />
<keyXPath>//*[local-name()='coverage']</keyXPath><br />
</key><br />
</xsl:variable><br />
<xsl:template match="/"><br />
<xsl:element name="ROWSET"><br />
<xsl:element name="INSERT"><br />
<xsl:element name="TUPLE"><br />
<xsl:element name="VALUE"><br />
<xsl:for-each select="//*[local-name()='title']"><br />
<FIELD name="title"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='creator']"><br />
<FIELD name="creator"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='subject']"><br />
<FIELD name="subject"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='coverage']"><br />
<FIELD name="coverage"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
</xsl:element><br />
</xsl:element><br />
</xsl:element><br />
</xsl:element><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
<br />
*BrokerXSLT_wrapperFWD<br />
The following XSLT is applied last to the forward rowsets in order to remove duplicate KEY entries and duplicate or empty FIELD elements:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<xsl:for-each select="//ROWSET/INSERT/TUPLE/KEY"><br />
<xsl:copy-of select="self::node()[not(KEYNAME = preceding::KEY/KEYNAME and KEYVALUE = preceding::KEY/KEYVALUE)]" /><br />
</xsl:for-each><br />
<VALUE><br />
<xsl:for-each select="//ROWSET/INSERT/TUPLE/VALUE/FIELD"><br />
<xsl:copy-of select="self::node()[text() and not(@name = preceding::FIELD/@name and text() = preceding::FIELD/text())]" /><br />
</xsl:for-each><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source></div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=Creating_Indices_at_the_VO_Level&diff=21133Creating Indices at the VO Level2014-02-25T16:44:03Z<p>Alex.antoniadi: /* FwRowset_Transformer */</p>
<hr />
<div>[[Category:Administrator's Guide]]<br />
==Indexing Procedure==<br />
<br />
The Indexing procedure refers to the creation of indices for the collections [[ Content Import | imported ]] in a Virtual Organization. It consists of three steps:<br />
<br />
* Creation of the [[ Index Management Framework | Rowset XSLT ]] generic resources, that transform collection data into data that can be fed to an Index.<br />
* Creation of the [[ Index Management Framework | Index type]] generic resources, that define the Index configuration.<br />
* Definition of an [[ IR Bootstrapper | IRBootstrapper]] job that will perform the steps required to create the Indices.<br />
<br />
In the first two steps we create generic resources for the Rowset XSLTs and Index Types through the [[ Resource Management | Resource Management portlet ]]. You can find detailed descriptions for the Rowset data (the output of the Rowset XSLT transformation) in the following sections:<br />
<br />
* [[ Index_Management_Framework#RowSet| Full Text Index Rowset ]]<br />
* [[ Index_Management_Framework#RowSet_2 | Forward Index Rowset ]]<br />
<br />
You can find detailed descriptions for the Index Type definition here:<br />
<br />
* [[ Index_Management_Framework#IndexType | Full Text Index Type ]]<br />
* [[ Index_Management_Framework#Forward_Index | Forward Index key-value pairs ]]<br />
<br />
For the third step, a definition of an IRBootstrapper job is required. You can find the details for defining such a job in the [[ IR Bootstrapper ]] section. To complete the Index creation, the administrator must go to the IRBootstrapper and run the job. The two examples that follow will clarify the three steps.<br />
<br />
==Creating a Full Text and a Forward Index for an OAI-DC collection==<br />
<br />
=== DataTransformation Programs ===<br />
<br />
====FtsRowset_Transformer====<br />
The following transformation program is invoked for fulltext rowset creation. The transformation unit with id="6" chains multiple XSLTs and applies a final wrapper XSLT at the end.<br />
<br />
[[File:FtsRowset_Transformer.xml]]<br />
<br />
====FwRowset_Transformer====<br />
The following transformation program is invoked for forward rowset creation. The transformation unit with id="1" chains multiple XSLTs and applies a final wrapper XSLT at the end.<br />
<br />
[[File:FwRowset_Transformer.xml]]<br />
<br />
=== Index Types ===<br />
In this section we present the required IndexTypes for both FullText and Forward Indices.<br />
<br />
====FullTextIndexType====<br />
In order to extract the fields from the OAI-DC payload and build the FullText Index the following FullTextIndexType is required:<br />
<br />
<source lang="xml"><br />
<Name>IndexType_ft_oai_dc_1.0</Name><br />
<SecondaryType>FullTextIndexType</SecondaryType><br />
<Description>Definition of the fulltext index type for the 'oai dc' schema</Description><br />
<Body><br />
<index-type name="default"><br />
<field-list sort-xnear-stop-word-threshold="2E8"><br />
<field name="contributor"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="coverage"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="creator"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="date"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="description"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="format"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="identifier"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="language"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="publisher"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="relation"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="rights"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="source"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="subject"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="type"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="ObjectID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="S"><br />
<index>no</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>no</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</Body><br />
</source><br />
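One way to sanity-check such a definition before registering it as a generic resource is to parse the index-type body and inspect the per-field flags. The following Python sketch (an illustration, not a gCube tool) reads a trimmed fragment of the definition above and maps each field to its flag values:

```python
import xml.etree.ElementTree as ET

# A trimmed fragment of the FullTextIndexType body shown above.
INDEX_TYPE = """<index-type name="default">
  <field-list sort-xnear-stop-word-threshold="2E8">
    <field name="title">
      <index>yes</index><store>yes</store><return>yes</return>
      <tokenize>yes</tokenize><sort>no</sort><boost>1.0</boost>
    </field>
    <field name="S">
      <index>no</index><store>yes</store><return>yes</return>
      <tokenize>no</tokenize><sort>no</sort><boost>1.0</boost>
    </field>
  </field-list>
</index-type>"""

def field_flags(body_xml):
    """Map each field name to its flag values (index, store, tokenize, ...)."""
    root = ET.fromstring(body_xml)
    return {
        f.get("name"): {child.tag: child.text for child in f}
        for f in root.findall("./field-list/field")
    }

flags = field_flags(INDEX_TYPE)
# 'title' is indexed and tokenized; the payload field 'S' is only stored.
print([name for name, fl in flags.items() if fl["index"] == "yes"])  # prints ['title']
```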
<br />
====ForwardIndexType====<br />
In OAI-DC many fields, such as "title" and "creator", hold string values, so a single ForwardIndexType for string-string key-value pairs is enough to build the Forward Index:<br />
<br />
<source lang="xml"><br />
<SecondaryType>ForwardIndexType</SecondaryType><br />
<Name>IndexType_fwd_string_string</Name><br />
<Description>Definition of the index type 'string_string' for the forward index</Description><br />
<Body><br />
<field-list> <br />
<field name="key"> <br />
<type>string</type><br />
<sort>ascending</sort><br />
</field><br />
<field name="value"><br />
<type>string</type><br />
</field><br />
</field-list><br />
</Body><br />
</source><br />
<br />
Note that, in contrast to the FullTextIndexType, the ForwardIndexType contains no field-to-datatype mapping, only a declaration of the datatypes supported by the index.<br />
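This difference is easy to see programmatically: the body only declares the datatype (and optional sort order) of the key and value slots. A minimal Python sketch, purely for illustration:

```python
import xml.etree.ElementTree as ET

# The body of IndexType_fwd_string_string shown above.
FWD_TYPE = """<field-list>
  <field name="key">
    <type>string</type>
    <sort>ascending</sort>
  </field>
  <field name="value">
    <type>string</type>
  </field>
</field-list>"""

root = ET.fromstring(FWD_TYPE)
# Unlike the FullTextIndexType, there are no document fields here --
# just the datatypes of the key and value slots.
slots = {f.get("name"): f.findtext("type") for f in root.findall("field")}
print(slots)  # {'key': 'string', 'value': 'string'}
```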
<br />
=== Bootstrapper Configuration ===<br />
The IRBootstrapper portlet requires a Generic Resource to be available on the IS with name: '''IRBootstrapperConfiguration''' and secondary type: '''IRBootstrapperConfig'''.<br />
For more information please refer to the [https://gcube.wiki.gcube-system.org/gcube/index.php/IR_Bootstrapper#Bootstrapper_Static_Configuration IRBootstrapperConfiguration Generic Resource] section.<br />
<br />
An example of the configuration is the following:<br />
<source lang='xml'><br />
<BootstrapInfo><br />
<types><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.TreeManagerCollectionDataType" name="TreeManagerCollection" /><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.OpenSearchDataType" name="OpenSearch" /><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.FullTextIndexNodeDataType" name="FullTextIndexNode" /><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.ForwardIndexNodeDataType" name="ForwardIndexNode" /><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.GCUBECollectionDataType" name="GCUBECollection" /><br />
<tasktype class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.task.OpenSearchGenerationTaskType" name="OpenSearchGenerationTaskType"><br />
<input type="GCUBECollection" /><br />
<output type="OpenSearch" /><br />
<run>true</run><br />
</tasktype><br />
<tasktype class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.task.ForwardIndexNodeGenerationTaskType" name="ForwardIndexNodeGenerationTask"><br />
<input type="TreeManagerCollection" /><br />
<output type="ForwardIndexNode" /><br />
<run>true</run><br />
</tasktype><br />
<tasktype class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.task.FullTextIndexNodeGenerationTaskType" name="FullTextIndexNodeGenerationTask"><br />
<input type="TreeManagerCollection" /><br />
<output type="FullTextIndexNode" /><br />
<run>true</run><br />
</tasktype><br />
<jobtype description="Creates the required fulltext indices for a collection." name="FTIndexNodeCollection"><br />
<input type="TreeManagerCollection" /><br />
<jobDefinition><br />
<parallel><br />
<sequential><br />
<assign to="%Create_ft_node_index.input" value="%FTIndexNodeCollection.input" /><br />
<assign to="%Create_ft_node_index.output.IndexedCollectionID" value="%Create_ft_node_index.input.ColID" /><br />
<task name="Create_ft_node_index" tasktype="FullTextIndexNodeGenerationTask" /><br />
</sequential><br />
</parallel><br />
</jobDefinition><br />
</jobtype><br />
<jobtype description="Creates the required forward indices for a collection." name="FWDIndexNodeCollection"><br />
<input type="TreeManagerCollection" /><br />
<jobDefinition><br />
<parallel><br />
<sequential><br />
<assign to="%Create_fwd_node_index.input" value="%FWDIndexNodeCollection.input" /><br />
<assign to="%Create_fwd_node_index.output.IndexedCollectionID" value="%Create_fwd_node_index.input.ColID" /><br />
<task name="Create_fwd_node_index" tasktype="ForwardIndexNodeGenerationTask" /><br />
</sequential><br />
</parallel><br />
</jobDefinition><br />
</jobtype><br />
<jobtype description="Creates the open search resource for an open search collection." name="CreateOpenSearchCollectionResource"><br />
<input type="GCUBECollection" /><br />
<jobDefinition><br />
<parallel><br />
<sequential><br />
<assign to="%Create_OSR.input" value="%CreateOpenSearchCollectionResource.input" /><br />
<assign to="%Create_OSR.output.OpenSearchCollectionID" value="%Create_OSR.input.ColID" /><br />
<task name="Create_OSR" tasktype="OpenSearchGenerationTaskType" /><br />
</sequential><br />
</parallel><br />
</jobDefinition><br />
</jobtype><br />
<br />
</types><br />
<br />
<jobs><br />
<job jobtype="FTIndexNodeCollection" name="FullText Index SPD Tree Collections"><br />
<initialization><br />
<assign to="%FTIndexNodeCollection.input.Type" value="ns5:SPD" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IndexTypeID" value="ft_SPD_1.0" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFT" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_PROVENANCE_anylanguage_to_ftRowset_anylanguage, $BrokerXSLT_DwC_anylanguage_to_ftRowset_anylanguage, $BrokerXSLT_Properties_anylanguage_to_ftRowset_anylanguage ]" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FTIndexNodeCollection" name="FullText Index OAI Tree Collections"><br />
<initialization><br />
<assign to="%FTIndexNodeCollection.input.Type" value="ns5:OAI" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IndexTypeID" value="ft_oai_dc_1.0" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFT" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_FARM_dc_anylanguage_to_ftRowset_anylanguage ]" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FTIndexNodeCollection" name="FullText Index FIGIS Tree Collections"><br />
<initialization><br />
<assign to="%FTIndexNodeCollection.input.Type" value="ns5:FIGIS" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IndexTypeID" value="ft_FIGIS_1.0" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFT" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_FIGIS_anylanguage_to_ftRowset_anylanguage ]" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FWDIndexNodeCollection" name="Forward Index SPD Tree Collections"><br />
<initialization><br />
<assign to="%FWDIndexNodeCollection.input.Type" value="ns5:SPD" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFWD" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyNames" value="[ ObjectID, gDocCollectionID, gDocCollectionLang, scientificName, scientificNameAuthorship, genus, phylum, kingdom, order, family, specificEpithet ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyTypes" value="[ fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_DwC_anylanguage_to_fwRowset_anylanguage ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FWDIndexNodeCollection" name="Forward Index FIGIS Tree Collections"><br />
<initialization><br />
<assign to="%FWDIndexNodeCollection.input.Type" value="ns5:FIGIS" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFWD" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyNames" value="[ ObjectID, gDocCollectionID, gDocCollectionLang, scientific_name, family, personal_author ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyTypes" value="[ fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_FIGIS_anylanguage_to_fwRowset_anylanguage ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FWDIndexNodeCollection" name="Forward Index OAI Tree Collections"><br />
<initialization><br />
<assign to="%FWDIndexNodeCollection.input.Type" value="ns5:OAI" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFWD" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyNames" value="[ ObjectID, gDocCollectionID, gDocCollectionLang, title, creator, subject, coverage ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyTypes" value="[ fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_dc_anylanguage_to_fwRowset_anylanguage_title_creator_subject_coverage ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="CreateOpenSearchCollectionResource" name="CreateOSResourceForDRIVERCollection"><br />
<initialization><br />
<assign to="%CreateOpenSearchCollectionResource.input.ColName" value="DRIVER" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FieldParameters" value="[ en:s:allIndexes, en:p:title, en:p:creator, en:p:pubDate ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.OpenSearchResourceID" value="30602b90-603f-11e0-90e5-a7c4e0a7bbf8" /><br />
</initialization><br />
</job><br />
<job jobtype="CreateOpenSearchCollectionResource" name="CreateOSResourceForEcoscopeCollection"><br />
<initialization><br />
<assign to="%CreateOpenSearchCollectionResource.input.ColName" value="Ecoscope" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FieldParameters" value="[ en:s:allIndexes, en:p:title, en:p:link, en:p:description, en:p:S ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.OpenSearchResourceID" value="ae0210b0-dcd2-11e2-be24-9415b2540510" /><br />
</initialization><br />
</job><br />
<job jobtype="CreateOpenSearchCollectionResource" name="CreateOSResourceForBING"><br />
<initialization><br />
<assign to="%CreateOpenSearchCollectionResource.input.ColName" value="Bing" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FieldParameters" value="[ en:s:allIndexes, en:p:title, en:p:link, en:p:description, en:p:S, en:p:pubDate ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FixedParameters" value="[ http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:count=&quot;25&quot;, config:numOfResults=&quot;200&quot; ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.OpenSearchResourceID" value="bcd216d0-dce6-11e2-89e1-9415b2540510" /><br />
</initialization><br />
</job><br />
<job jobtype="CreateOpenSearchCollectionResource" name="CreateOSResourceForINSPIRECollection"><br />
<initialization><br />
<assign to="%CreateOpenSearchCollectionResource.input.ColName" value="INSPIRE" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FieldParameters" value="[ en:s:allIndexes, en:p:title, en:p:creator, en:p:pubDate ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.OpenSearchResourceID" value="30602b90-603f-11e0-90e5-a7c4e0a7bbf8" /><br />
</initialization><br />
</job><br />
<job extends="FullText Index OAI Tree Collections" jobtype="FTIndexNodeCollection" name="Indexing"><br />
<initialization><br />
<assign to="%FTIndexNodeCollection.input.ColID" value="FIGIS" /><br />
</initialization><br />
</job><br />
</jobs> <br />
</BootstrapInfo><br />
</source><br />
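When editing this resource by hand it is easy to reference a jobtype name in a &lt;job&gt; entry that was never declared under &lt;types&gt;. A small consistency check can catch this before running the portlet; the sketch below (not an official validator) runs against a reduced configuration mirroring the structure above:

```python
import xml.etree.ElementTree as ET

# A reduced BootstrapInfo mirroring the structure of the resource above;
# the 'Broken job' entry is a deliberately invalid example.
CONFIG = """<BootstrapInfo>
  <types>
    <tasktype name="FullTextIndexNodeGenerationTask"/>
    <jobtype name="FTIndexNodeCollection">
      <jobDefinition>
        <task name="Create_ft_node_index" tasktype="FullTextIndexNodeGenerationTask"/>
      </jobDefinition>
    </jobtype>
  </types>
  <jobs>
    <job jobtype="FTIndexNodeCollection" name="FullText Index OAI Tree Collections"/>
    <job jobtype="MissingType" name="Broken job"/>
  </jobs>
</BootstrapInfo>"""

def undeclared_jobtypes(config_xml):
    """Return jobtype names used by <job> entries but never declared in <types>."""
    root = ET.fromstring(config_xml)
    declared = {jt.get("name") for jt in root.iter("jobtype")}
    used = {job.get("jobtype") for job in root.iter("job")}
    return sorted(used - declared)

print(undeclared_jobtypes(CONFIG))  # prints ['MissingType']
```

The same set comparison could be extended to cross-check the tasktype attribute of each &lt;task&gt; against the declared &lt;tasktype&gt; names.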
<br />
=== Metadata Broker XSLT ===<br />
<br />
*BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage<br />
The following XSLT transforms data elements conforming to the OAI-DC schema into fulltext rowsets:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<ROW><br />
<xsl:for-each select="//*[local-name()='title']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='creator']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='subject']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='description']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='publisher']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='contributor']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='date']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='type']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='format']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='identifier']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='source']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='language']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='relation']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='coverage']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='rights']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='alternative']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='tableOfContents']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='abstract']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='created']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='valid']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='available']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='issued']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='modified']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateAccepted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateCopyrighted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateSubmitted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='extend']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='medium']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isVersionOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasVersion']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isReplacedBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='replaces']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isRequiredBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='requires']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isPartOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasPart']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isReferencedBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='references']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isFormatOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasFormat']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='conformsTo']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='spatial']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='temporal']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='audience']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualMethod']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualPeriodicity']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualPolicy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='instructionalMethod']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='provenance']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='rightsHolder']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='mediator']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='educationLevel']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accessRights']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='license']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='bibliographicCitation']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
</ROW><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
<br />
*BrokerXSLT_wrapperFT<br />
The following XSLT is applied last to the fulltext rowsets in order to merge them into a single row and to remove duplicate and empty fields:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<ROW><br />
<xsl:for-each select="//ROWSET/ROW/FIELD"><br />
<xsl:copy-of select="self::node()[text() and not(@name = preceding::FIELD/@name and text() = preceding::FIELD/text())]" /><br />
</xsl:for-each><br />
</ROW><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
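<br />
For illustration, consider the following merged fulltext rowset (the values are hypothetical, only for demonstration):<br />
<source lang=xml><br />
<ROWSET><br />
<ROW><br />
<FIELD name="title">Sample title</FIELD><br />
<FIELD name="title">Sample title</FIELD><br />
<FIELD name="creator"></FIELD><br />
<FIELD name="creator">Doe, J.</FIELD><br />
</ROW><br />
</ROWSET><br />
</source><br />
The wrapper keeps only the first occurrence of each name/value pair and drops fields without text, producing:<br />
<source lang=xml><br />
<ROWSET><br />
<ROW><br />
<FIELD name="title">Sample title</FIELD><br />
<FIELD name="creator">Doe, J.</FIELD><br />
</ROW><br />
</ROWSET><br />
</source><br />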
<br />
*BrokerXSLT_dc_anylanguage_to_fwRowset_anylanguage_title_creator_subject_coverage<br />
The following XSLT transforms data elements that follow the dc schema into forward rowsets:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:variable name="keys"><br />
<key><br />
<keyName>title</keyName><br />
<keyXPath>//*[local-name()='title']</keyXPath><br />
</key><br />
<key><br />
<keyName>creator</keyName><br />
<keyXPath>//*[local-name()='creator']</keyXPath><br />
</key><br />
<key><br />
<keyName>subject</keyName><br />
<keyXPath>//*[local-name()='subject']</keyXPath><br />
</key><br />
<key><br />
<keyName>coverage</keyName><br />
<keyXPath>//*[local-name()='coverage']</keyXPath><br />
</key><br />
</xsl:variable><br />
<xsl:template match="/"><br />
<xsl:element name="ROWSET"><br />
<xsl:element name="INSERT"><br />
<xsl:element name="TUPLE"><br />
<xsl:element name="VALUE"><br />
<xsl:for-each select="//*[local-name()='title']"><br />
<FIELD name="title"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='creator']"><br />
<FIELD name="creator"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='subject']"><br />
<FIELD name="subject"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='coverage']"><br />
<FIELD name="coverage"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
</xsl:element><br />
</xsl:element><br />
</xsl:element><br />
</xsl:element><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
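<br />
For example, a hypothetical oai_dc record such as:<br />
<source lang=xml><br />
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/"><br />
<dc:title>Sample title</dc:title><br />
<dc:creator>Doe, J.</dc:creator><br />
<dc:subject>fisheries</dc:subject><br />
</oai_dc:dc><br />
</source><br />
is transformed into the following forward rowset (no coverage FIELD is produced, since the record carries no coverage element):<br />
<source lang=xml><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<VALUE><br />
<FIELD name="title">Sample title</FIELD><br />
<FIELD name="creator">Doe, J.</FIELD><br />
<FIELD name="subject">fisheries</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</source><br />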
<br />
*BrokerXSLT_wrapperFWD<br />
The following XSLT is applied last to the forward rowsets in order to merge them into a single tuple and to remove duplicate keys and duplicate or empty fields:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<xsl:for-each select="//ROWSET/INSERT/TUPLE/KEY"><br />
<xsl:copy-of select="self::node()[not(KEYNAME = preceding::KEY/KEYNAME and KEYVALUE = preceding::KEY/KEYVALUE)]" /><br />
</xsl:for-each><br />
<VALUE><br />
<xsl:for-each select="//ROWSET/INSERT/TUPLE/VALUE/FIELD"><br />
<xsl:copy-of select="self::node()[text() and not(@name = preceding::FIELD/@name and text() = preceding::FIELD/text())]" /><br />
</xsl:for-each><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source></div>Alex.antoniadi<br />
https://wiki.gcube-system.org/index.php?title=Creating_Indices_at_the_VO_Level&diff=21132<br />
Creating Indices at the VO Level<br />
2014-02-25T16:43:18Z<br />
<p>Alex.antoniadi: /* FtsRowset_Transformer */</p>
<hr />
<div>[[Category:Administrator's Guide]]<br />
==Indexing Procedure==<br />
<br />
The Indexing procedure refers to the creation of indices for the collections [[ Content Import | imported ]] in a Virtual Organization. It consists of three steps:<br />
<br />
* Creation of the [[ Index Management Framework | Rowset XSLT ]] generic resources, which transform collection data into data that can be fed to an Index.<br />
* Creation of the [[ Index Management Framework | Index type]] generic resources, which define the Index configuration.<br />
* Definition of an [[ IR Bootstrapper | IRBootstrapper]] job that will perform the steps required to create the Indices.<br />
<br />
In the first two steps we create generic resources for the Rowset XSLTs and Index Types through the [[ Resource Management | Resource Management portlet ]]. You can find detailed descriptions for the Rowset data (the output of the Rowset XSLT transformation) in the following sections:<br />
<br />
* [[ Index_Management_Framework#RowSet| Full Text Index Rowset ]]<br />
* [[ Index_Management_Framework#RowSet_2 | Forward Index Rowset ]]<br />
<br />
You can find detailed descriptions for the Index Type definition here:<br />
<br />
* [[ Index_Management_Framework#IndexType | Full Text Index Type ]]<br />
* [[ Index_Management_Framework#Forward_Index | Forward Index key-value pairs ]]<br />
<br />
For the third step, a definition of an IRBootstrapper job is required. You can find the details for defining such a job in the [[ IR Bootstrapper ]] section. To complete the Index creation, the administrator must go to the IRBootstrapper and run the job. The two examples that follow will clarify the three steps.<br />
<br />
==Creating a Full Text and a Forward Index for an OAI-DC collection==<br />
<br />
=== DataTransformation Programs ===<br />
<br />
====FtsRowset_Transformer====<br />
The following transformation program is invoked for fulltext rowset creation. The transformation unit with id="6" applies multiple XSLTs and then the final XSLT at the end.<br />
<br />
[[File:FtsRowset_Transformer.xml]]<br />
<br />
====FwRowset_Transformer====<br />
The following transformation program is invoked for forward rowset creation. The transformation unit with id="1" applies multiple XSLTs and then the final XSLT at the end.<br />
<gDTSTransformationProgram><br />
<Transformer><br />
<Class>org.gcube.datatransformation.datatransformationlibrary.programs.metadata.indexfeed.FwRowset_Transformer</Class><br />
<GlobalProgramParameters /><br />
</Transformer><br />
<TransformationUnits><br />
<TransformationUnit id="1" isComposite="false"><br />
<Sources><br />
<<nowiki>Source</nowiki>><br />
<Input id="TRInput0" /><br />
<ContentType><br />
<Mimetype>text/xml</Mimetype><br />
<Parameters /><br />
</ContentType><br />
<<nowiki>/Source</nowiki>><br />
</Sources><br />
<ProgramParameters><br />
<Parameter isOptional="false" name="finalfwdxslt" value="-" /><br />
<Parameter isOptional="true" name="xslt" value="-" /><br />
<Parameter isOptional="true" name="xslt:1" value="-" /><br />
<Parameter isOptional="true" name="xslt:2" value="-" /><br />
<Parameter isOptional="true" name="xslt:3" value="-" /><br />
<Parameter isOptional="true" name="xslt:4" value="-" /><br />
<Parameter isOptional="true" name="xslt:5" value="-" /><br />
<Parameter isOptional="true" name="xslt:6" value="-" /><br />
<Parameter isOptional="true" name="xslt:7" value="-" /><br />
<Parameter isOptional="true" name="xslt:8" value="-" /><br />
<Parameter isOptional="true" name="xslt:9" value="-" /><br />
</ProgramParameters><br />
<Target><br />
<Output id="TROutput" /><br />
<ContentType><br />
<Mimetype>text/xml</Mimetype><br />
<Parameters><br />
<Parameter isOptional="false" name="schemaURI" value="http://fwrowset.xsd" /><br />
</Parameters><br />
</ContentType><br />
</Target><br />
</TransformationUnit><br />
</TransformationUnits><br />
</gDTSTransformationProgram><br />
<br />
=== Index Types ===<br />
In this section we present the required IndexTypes for both FullText and Forward Indices.<br />
<br />
====FullTextIndexType====<br />
In order to extract the fields from the OAI-DC payload and build the FullText Index, the following FullTextIndexType is required:<br />
<br />
<source lang="xml"><br />
<Name>IndexType_ft_oai_dc_1.0</Name><br />
<SecondaryType>FullTextIndexType</SecondaryType><br />
<Description>Definition of the fulltext index type for the 'oai dc' schema</Description><br />
<Body><br />
<index-type name="default"><br />
<field-list sort-xnear-stop-word-threshold="2E8"><br />
<field name="contributor"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="coverage"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="creator"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="date"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="description"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="format"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="identifier"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="language"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="publisher"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="relation"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="rights"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="source"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="subject"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="title"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="type"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="ObjectID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionID"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="gDocCollectionLang"><br />
<index>yes</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>yes</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
<field name="S"><br />
<index>no</index><br />
<store>yes</store><br />
<return>yes</return><br />
<tokenize>no</tokenize><br />
<sort>no</sort><br />
<boost>1.0</boost><br />
</field><br />
</field-list><br />
</index-type><br />
</Body><br />
</source><br />
<br />
====ForwardIndexType====<br />
In OAI-DC many fields, such as "title" and "creator", have string values, so a ForwardIndexType for string-string key-value pairs is all that is needed to create the <br />
Forward Index:<br />
<br />
<source lang="xml"><br />
<SecondaryType>ForwardIndexType</SecondaryType><br />
<Name>IndexType_fwd_string_string</Name><br />
<Description>Definition of the index type 'string_string' for the forward index</Description><br />
<Body><br />
<field-list> <br />
<field name="key"> <br />
<type>string</type><br />
<sort>ascending</sort><br />
</field><br />
<field name="value"><br />
<type>string</type><br />
</field><br />
</field-list><br />
</Body><br />
</source><br />
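<br />
For example, a forward rowset KEY such as the following (the value is hypothetical) can be indexed under this type, since both the key and its value are plain strings:<br />
<source lang=xml><br />
<KEY><KEYNAME>title</KEYNAME><KEYVALUE>Sample title</KEYVALUE></KEY><br />
</source><br />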
<br />
Note that, in contrast to the FullTextIndexType, the ForwardIndexType contains no field-to-datatype mapping but only a declaration of the datatypes supported by the index.<br />
<br />
=== Bootstrapper Configuration ===<br />
The IRBootstrapper portlet requires a Generic Resource to be available on the IS with name '''IRBootstrapperConfiguration''' and secondary type '''IRBootstrapperConfig'''.<br />
For more information, please refer to the [https://gcube.wiki.gcube-system.org/gcube/index.php/IR_Bootstrapper#Bootstrapper_Static_Configuration IRBootstrapperConfiguration Generic Resource] section.<br />
<br />
An example of the configuration is the following:<br />
<source lang='xml'><br />
<BootstrapInfo><br />
<types><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.TreeManagerCollectionDataType" name="TreeManagerCollection" /><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.OpenSearchDataType" name="OpenSearch" /><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.FullTextIndexNodeDataType" name="FullTextIndexNode" /><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.ForwardIndexNodeDataType" name="ForwardIndexNode" /><br />
<type class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.data.GCUBECollectionDataType" name="GCUBECollection" /><br />
<tasktype class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.task.OpenSearchGenerationTaskType" name="OpenSearchGenerationTaskType"><br />
<input type="GCUBECollection" /><br />
<output type="OpenSearch" /><br />
<run>true</run><br />
</tasktype><br />
<tasktype class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.task.ForwardIndexNodeGenerationTaskType" name="ForwardIndexNodeGenerationTask"><br />
<input type="TreeManagerCollection" /><br />
<output type="ForwardIndexNode" /><br />
<run>true</run><br />
</tasktype><br />
<tasktype class="org.gcube.portlets.admin.irbootstrapperportlet.gwt.server.types.task.FullTextIndexNodeGenerationTaskType" name="FullTextIndexNodeGenerationTask"><br />
<input type="TreeManagerCollection" /><br />
<output type="FullTextIndexNode" /><br />
<run>true</run><br />
</tasktype><br />
<jobtype description="Creates the required fulltext indices for a collection." name="FTIndexNodeCollection"><br />
<input type="TreeManagerCollection" /><br />
<jobDefinition><br />
<parallel><br />
<sequential><br />
<assign to="%Create_ft_node_index.input" value="%FTIndexNodeCollection.input" /><br />
<assign to="%Create_ft_node_index.output.IndexedCollectionID" value="%Create_ft_node_index.input.ColID" /><br />
<task name="Create_ft_node_index" tasktype="FullTextIndexNodeGenerationTask" /><br />
</sequential><br />
</parallel><br />
</jobDefinition><br />
</jobtype><br />
<jobtype description="Creates the required forward indices for a collection." name="FWDIndexNodeCollection"><br />
<input type="TreeManagerCollection" /><br />
<jobDefinition><br />
<parallel><br />
<sequential><br />
<assign to="%Create_fwd_node_index.input" value="%FWDIndexNodeCollection.input" /><br />
<assign to="%Create_fwd_node_index.output.IndexedCollectionID" value="%Create_fwd_node_index.input.ColID" /><br />
<task name="Create_fwd_node_index" tasktype="ForwardIndexNodeGenerationTask" /><br />
</sequential><br />
</parallel><br />
</jobDefinition><br />
</jobtype><br />
<jobtype description="Creates the open search resource for an open search collection." name="CreateOpenSearchCollectionResource"><br />
<input type="GCUBECollection" /><br />
<jobDefinition><br />
<parallel><br />
<sequential><br />
<assign to="%Create_OSR.input" value="%CreateOpenSearchCollectionResource.input" /><br />
<assign to="%Create_OSR.output.OpenSearchCollectionID" value="%Create_OSR.input.ColID" /><br />
<task name="Create_OSR" tasktype="OpenSearchGenerationTaskType" /><br />
</sequential><br />
</parallel><br />
</jobDefinition><br />
</jobtype><br />
<br />
</types><br />
<br />
<jobs><br />
<job jobtype="FTIndexNodeCollection" name="FullText Index SPD Tree Collections"><br />
<initialization><br />
<assign to="%FTIndexNodeCollection.input.Type" value="ns5:SPD" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IndexTypeID" value="ft_SPD_1.0" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFT" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_PROVENANCE_anylanguage_to_ftRowset_anylanguage, $BrokerXSLT_DwC_anylanguage_to_ftRowset_anylanguage, $BrokerXSLT_Properties_anylanguage_to_ftRowset_anylanguage ]" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FTIndexNodeCollection" name="FullText Index OAI Tree Collections"><br />
<initialization><br />
<assign to="%FTIndexNodeCollection.input.Type" value="ns5:OAI" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IndexTypeID" value="ft_oai_dc_1.0" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFT" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_FARM_dc_anylanguage_to_ftRowset_anylanguage ]" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FTIndexNodeCollection" name="FullText Index FIGIS Tree Collections"><br />
<initialization><br />
<assign to="%FTIndexNodeCollection.input.Type" value="ns5:FIGIS" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IndexTypeID" value="ft_FIGIS_1.0" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFT" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_FIGIS_anylanguage_to_ftRowset_anylanguage ]" /><br />
<assign to="%Create_ft_node_index.FullTextIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FWDIndexNodeCollection" name="Forward Index SPD Tree Collections"><br />
<initialization><br />
<assign to="%FWDIndexNodeCollection.input.Type" value="ns5:SPD" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFWD" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyNames" value="[ ObjectID, gDocCollectionID, gDocCollectionLang, scientificName, scientificNameAuthorship, genus, phylum, kingdom, order, family, specificEpithet ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyTypes" value="[ fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_DwC_anylanguage_to_fwRowset_anylanguage ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FWDIndexNodeCollection" name="Forward Index FIGIS Tree Collections"><br />
<initialization><br />
<assign to="%FWDIndexNodeCollection.input.Type" value="ns5:FIGIS" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFWD" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyNames" value="[ ObjectID, gDocCollectionID, gDocCollectionLang, scientific_name, family, personal_author ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyTypes" value="[ fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_FIGIS_anylanguage_to_fwRowset_anylanguage ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="FWDIndexNodeCollection" name="Forward Index OAI Tree Collections"><br />
<initialization><br />
<assign to="%FWDIndexNodeCollection.input.Type" value="ns5:OAI" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.TransformationXSLTID" value="$BrokerXSLT_wrapperFWD" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyNames" value="[ ObjectID, gDocCollectionID, gDocCollectionLang, title, creator, subject, coverage ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IndexedKeyTypes" value="[ fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string, fwd_string_string ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.XsltsIDs" value="[ $BrokerXSLT_dc_anylanguage_to_fwRowset_anylanguage_title_creator_subject_coverage ]" /><br />
<assign to="%Create_fwd_node_index.ForwardIndexNodeGenerationTask.IdOfIndexManagerToAppend" userInputLabel="ID of index node to append" value="%userInput" /><br />
</initialization><br />
</job><br />
<job jobtype="CreateOpenSearchCollectionResource" name="CreateOSResourceForDRIVERCollection"><br />
<initialization><br />
<assign to="%CreateOpenSearchCollectionResource.input.ColName" value="DRIVER" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FieldParameters" value="[ en:s:allIndexes, en:p:title, en:p:creator, en:p:pubDate ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.OpenSearchResourceID" value="30602b90-603f-11e0-90e5-a7c4e0a7bbf8" /><br />
</initialization><br />
</job><br />
<job jobtype="CreateOpenSearchCollectionResource" name="CreateOSResourceForEcoscopeCollection"><br />
<initialization><br />
<assign to="%CreateOpenSearchCollectionResource.input.ColName" value="Ecoscope" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FieldParameters" value="[ en:s:allIndexes, en:p:title, en:p:link, en:p:description, en:p:S ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.OpenSearchResourceID" value="ae0210b0-dcd2-11e2-be24-9415b2540510" /><br />
</initialization><br />
</job><br />
<job jobtype="CreateOpenSearchCollectionResource" name="CreateOSResourceForBING"><br />
<initialization><br />
<assign to="%CreateOpenSearchCollectionResource.input.ColName" value="Bing" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FieldParameters" value="[ en:s:allIndexes, en:p:title, en:p:link, en:p:description, en:p:S, en:p:pubDate ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FixedParameters" value="[ http%3A%2F%2Fa9.com%2F-%2Fspec%2Fopensearch%2F1.1%2F:count=&quot;25&quot;, config:numOfResults=&quot;200&quot; ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.OpenSearchResourceID" value="bcd216d0-dce6-11e2-89e1-9415b2540510" /><br />
</initialization><br />
</job><br />
<job jobtype="CreateOpenSearchCollectionResource" name="CreateOSResourceForINSPIRECollection"><br />
<initialization><br />
<assign to="%CreateOpenSearchCollectionResource.input.ColName" value="INSPIRE" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.FieldParameters" value="[ en:s:allIndexes, en:p:title, en:p:creator, en:p:pubDate ]" /><br />
<assign to="%Create_OSR.OpenSearchGenerationTask.OpenSearchResourceID" value="30602b90-603f-11e0-90e5-a7c4e0a7bbf8" /><br />
</initialization><br />
</job><br />
<job extends="FullText Index OAI Tree Collections" jobtype="FTIndexNodeCollection" name="Indexing"><br />
<initialization><br />
<assign to="%FTIndexNodeCollection.input.ColID" value="FIGIS" /><br />
</initialization><br />
</job><br />
</jobs> <br />
</BootstrapInfo><br />
</source><br />
<br />
=== Metadata Broker XSLT ===<br />
<br />
*BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage<br />
The following XSLT transforms data elements that follow the oai-dc schema into fulltext rowsets:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<ROW><br />
<xsl:for-each select="//*[local-name()='title']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='creator']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='subject']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='description']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='publisher']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='contributor']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='date']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='type']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='format']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='identifier']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='source']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='language']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='relation']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='coverage']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='rights']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='alternative']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='tableOfContents']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='abstract']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='created']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='valid']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='available']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='issued']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='modified']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateAccepted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateCopyrighted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='dateSubmitted']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='extent']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='medium']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isVersionOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasVersion']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isReplacedBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='replaces']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isRequiredBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='requires']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isPartOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasPart']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isReferencedBy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='references']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='isFormatOf']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='hasFormat']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='conformsTo']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='spatial']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='temporal']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='audience']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualMethod']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualPeriodicity']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accrualPolicy']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='instructionalMethod']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='provenance']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='rightsHolder']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='mediator']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='educationLevel']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='accessRights']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='license']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='bibliographicCitation']"><br />
<xsl:if test="normalize-space(.)"><br />
<FIELD name="{local-name()}"><br />
<xsl:value-of select="normalize-space(.)" /><br />
</FIELD><br />
</xsl:if><br />
</xsl:for-each><br />
</ROW><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
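To make the mapping concrete, here is a small hypothetical example (the record and its values are illustrative only): a Dublin Core record containing non-empty type and language elements is flattened into a fulltext rowset of FIELD elements named after the source elements, with their values whitespace-normalized:<br />
<source lang=xml><br />
<!-- hypothetical input record --><br />
<record xmlns:dc="http://purl.org/dc/elements/1.1/"><br />
<dc:type>  Text </dc:type><br />
<dc:language>en</dc:language><br />
</record><br />
<br />
<!-- resulting fulltext rowset --><br />
<ROWSET><br />
<ROW><br />
<FIELD name="type">Text</FIELD><br />
<FIELD name="language">en</FIELD><br />
</ROW><br />
</ROWSET><br />
</source><br />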
<br />
*BrokerXSLT_wrapperFT<br />
The following XSLT is applied last to fulltext rowsets in order to remove duplicate and empty fields:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<ROW><br />
<xsl:for-each select="//ROWSET/ROW/FIELD"><br />
<xsl:copy-of select="self::node()[text() and not(preceding::FIELD[@name = current()/@name and text() = current()/text()])]" /><br />
</xsl:for-each><br />
</ROW><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
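For example (hypothetical values), the wrapper keeps the first occurrence of a field, drops the exact duplicate, and drops the empty field:<br />
<source lang=xml><br />
<!-- hypothetical input --><br />
<ROWSET><br />
<ROW><br />
<FIELD name="title">Sample</FIELD><br />
<FIELD name="title">Sample</FIELD><br />
<FIELD name="language"></FIELD><br />
</ROW><br />
</ROWSET><br />
<br />
<!-- output --><br />
<ROWSET><br />
<ROW><br />
<FIELD name="title">Sample</FIELD><br />
</ROW><br />
</ROWSET><br />
</source><br />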
<br />
*BrokerXSLT_dc_anylanguage_to_fwRowset_anylanguage_title_creator_subject_coverage<br />
The following XSLT transforms Dublin Core (dc) records into forward rowsets, extracting the title, creator, subject, and coverage elements:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:variable name="keys"><br />
<key><br />
<keyName>title</keyName><br />
<keyXPath>//*[local-name()='title']</keyXPath><br />
</key><br />
<key><br />
<keyName>creator</keyName><br />
<keyXPath>//*[local-name()='creator']</keyXPath><br />
</key><br />
<key><br />
<keyName>subject</keyName><br />
<keyXPath>//*[local-name()='subject']</keyXPath><br />
</key><br />
<key><br />
<keyName>coverage</keyName><br />
<keyXPath>//*[local-name()='coverage']</keyXPath><br />
</key><br />
</xsl:variable><br />
<xsl:template match="/"><br />
<xsl:element name="ROWSET"><br />
<xsl:element name="INSERT"><br />
<xsl:element name="TUPLE"><br />
<xsl:element name="VALUE"><br />
<xsl:for-each select="//*[local-name()='title']"><br />
<FIELD name="title"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='creator']"><br />
<FIELD name="creator"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='subject']"><br />
<FIELD name="subject"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
<xsl:for-each select="//*[local-name()='coverage']"><br />
<FIELD name="coverage"><br />
<xsl:value-of select="." /><br />
</FIELD><br />
</xsl:for-each><br />
</xsl:element><br />
</xsl:element><br />
</xsl:element><br />
</xsl:element><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source><br />
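For example (hypothetical values), a record carrying title and subject elements, and no creator or coverage, produces:<br />
<source lang=xml><br />
<!-- hypothetical input record --><br />
<record xmlns:dc="http://purl.org/dc/elements/1.1/"><br />
<dc:title>Sample title</dc:title><br />
<dc:subject>Fisheries</dc:subject><br />
</record><br />
<br />
<!-- resulting forward rowset --><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<VALUE><br />
<FIELD name="title">Sample title</FIELD><br />
<FIELD name="subject">Fisheries</FIELD><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</source><br />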
<br />
*BrokerXSLT_wrapperFWD<br />
The following XSLT is applied last to forward rowsets in order to remove duplicate keys and duplicate or empty fields:<br />
<source lang=xml><br />
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"><br />
<xsl:output indent="yes" method="xml" omit-xml-declaration="yes" /><br />
<xsl:template match="/"><br />
<ROWSET><br />
<INSERT><br />
<TUPLE><br />
<xsl:for-each select="//ROWSET/INSERT/TUPLE/KEY"><br />
<xsl:copy-of select="self::node()[not(preceding::KEY[KEYNAME = current()/KEYNAME and KEYVALUE = current()/KEYVALUE])]" /><br />
</xsl:for-each><br />
<VALUE><br />
<xsl:for-each select="//ROWSET/INSERT/TUPLE/VALUE/FIELD"><br />
<xsl:copy-of select="self::node()[text() and not(preceding::FIELD[@name = current()/@name and text() = current()/text()])]" /><br />
</xsl:for-each><br />
</VALUE><br />
</TUPLE><br />
</INSERT><br />
</ROWSET><br />
</xsl:template><br />
</xsl:stylesheet><br />
</source></div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=File:FwRowset_Transformer.xml&diff=21130File:FwRowset Transformer.xml2014-02-25T16:37:41Z<p>Alex.antoniadi: </p>
<hr />
<div></div>Alex.antoniadihttps://wiki.gcube-system.org/index.php?title=File:FtsRowset_Transformer.xml&diff=21129File:FtsRowset Transformer.xml2014-02-25T16:36:07Z<p>Alex.antoniadi: </p>
<hr />
<div></div>Alex.antoniadi