Difference between revisions of "Full Text Index"

From Gcube Wiki

Revision as of 20:59, 21 November 2007

Introduction

The Full Text Index is responsible for providing quick full text data retrieval capabilities in the DILIGENT environment.

Implementation Overview

Services

The full text index is implemented through three services. They are all implemented according to the Factory pattern:

  • The FullTextIndexManagement Service represents an index manager. There is a one to one relationship between an Index and a Management instance, and their life-cycles are closely related; an Index is created by creating an instance (resource) of FullTextIndexManagement Service, and an index is removed by terminating the corresponding FullTextIndexManagement resource. The FullTextIndexManagement Service should be seen as an interface for managing the life-cycle and properties of an Index, but it is not responsible for feeding or querying its index. In addition, a FullTextIndexManagement Service resource does not store the content of its Index locally, but contains references to content stored in Content Management Service.
  • The FullTextIndexBatchUpdater Service is responsible for feeding an Index. One FullTextIndexBatchUpdater Service resource can only update a single Index, but one Index can be updated by multiple FullTextIndexBatchUpdater Service resources. Feeding is accomplished by instantiating a FullTextIndexBatchUpdater Service resource with the EPR of the FullTextIndexManagement resource connected to the Index to update, and connecting the updater resource to a ResultSet containing the content to be fed to the Index.
  • The FullTextIndexLookup Service is responsible for creating a local copy of an index, and exposing interfaces for querying and creating statistics for the index. One FullTextIndexLookup Service resource can only replicate and look up a single Index, but one Index can be replicated by any number of FullTextIndexLookup Service resources. Updates to the Index will be propagated to all FullTextIndexLookup Service resources replicating that Index.

It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls and the DILIGENT CMS. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:

(illustration will be improved shortly... )

			 ________________________________
			|				 |
			|•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘|
			|•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘|
			|•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘|
			|    So Pretty Index Design...   |
			|•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘|
			|•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘|
			|•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘•∘|
			|________________________________|

RowSet

The content to be fed into an Index must be served as a ResultSet containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:

<ROWSET>
    <ROW id="doc1">
        <FIELD name="title">How to create an Index</FIELD>
        <FIELD name="contents">Just read the WIKI</FIELD>
    </ROW>
    <ROW id="doc2">
        <FIELD name="title">How to create a Nation</FIELD>
        <FIELD name="contents">Talk to the UN</FIELD>
        <FIELD name="references">un.org</FIELD>
    </ROW>
</ROWSET>
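
A ROWSET like the one above can also be produced programmatically. The following is only a minimal sketch, assuming plain string building is acceptable for the feeder; the class and method names (RowSetSketch, row, rowSet) are hypothetical and are not part of the index service code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the index service: renders ROWSET XML.
class RowSetSketch {

    // Escape the characters that must not appear raw in XML text or attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Render one ROW element from a document id and its (field name -> text) map.
    static String row(String id, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("    <ROW id=\"").append(escape(id)).append("\">\n");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append("        <FIELD name=\"").append(escape(f.getKey())).append("\">")
              .append(escape(f.getValue())).append("</FIELD>\n");
        }
        sb.append("    </ROW>\n");
        return sb.toString();
    }

    // Wrap already rendered ROW elements in a single ROWSET element.
    static String rowSet(String... rows) {
        StringBuilder sb = new StringBuilder("<ROWSET>\n");
        for (String r : rows) {
            sb.append(r);
        }
        return sb.append("</ROWSET>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc1 = new LinkedHashMap<>();
        doc1.put("title", "How to create an Index");
        doc1.put("contents", "Just read the WIKI");
        System.out.println(rowSet(row("doc1", doc1)));
    }
}
```

A LinkedHashMap is used so that the FIELD elements are written in insertion order.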

IndexType

How the different fields in the ROWSET should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:

    <index-type>
        <field-list>
            <field name="title" lang="en">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
            <field name="contents" lang="en">
                <index>yes</index>
                <store>no</store>
                <return>no</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
            <field name="references" lang="en">
                <index>yes</index>
                <store>no</store>
                <return>no</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
        </field-list>
    </index-type>

Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:

  • index
specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)
  • store
specifies whether the field should be stored in its original format to be returned in the results from a query.
  • return
specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)
  • tokenize
specifies whether the field should be tokenized. Should usually contain "yes".
  • sort
Not used
  • boost
Not used
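
To illustrate how these yes/no elements can be read, here is a small sketch that parses an IndexType document with the standard Java DOM API and lists the top-level fields for which a given element (for example "index" or "store") is set to "yes". The class IndexTypeSketch and its methods are hypothetical and only serve as an example; the index service itself may read the IndexType differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the index service.
class IndexTypeSketch {

    // Text of the direct child element called 'name', or "" if there is none.
    static String childText(Element parent, String name) {
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(name)) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }

    // Names of the top-level fields whose 'flag' element contains "yes".
    static List<String> fieldsWithFlag(String indexTypeXml, String flag) {
        List<String> names = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(indexTypeXml.getBytes(StandardCharsets.UTF_8)));
            Node fieldList = doc.getElementsByTagName("field-list").item(0);
            NodeList kids = fieldList.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node n = kids.item(i);
                if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals("field")) {
                    Element field = (Element) n;
                    if ("yes".equals(childText(field, flag))) {
                        names.add(field.getAttribute("name"));
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse IndexType", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<index-type><field-list>"
                + "<field name=\"title\"><index>yes</index><store>yes</store></field>"
                + "<field name=\"contents\"><index>yes</index><store>no</store></field>"
                + "</field-list></index-type>";
        System.out.println("indexed: " + fieldsWithFlag(xml, "index"));
        System.out.println("stored:  " + fieldsWithFlag(xml, "store"));
    }
}
```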

For more complex content types, one can also specify sub-fields as in the following example:

<index-type>
    <field-list>
        <field name="contents">
            <index>yes</index>
            <store>no</store>
            <return>no</return>
            <tokenize>yes</tokenize>
            <sort>no</sort>
            <boost>1.0</boost> 

            <!-- subfields of contents -->
            <field name="title">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost> 

                <!-- subfields of title which itself is a subfield of contents -->
                <field name="bookTitle">
                    <index>yes</index>
                    <store>yes</store>
                    <return>yes</return>
                    <tokenize>yes</tokenize>
                    <sort>no</sort>
                    <boost>1.0</boost>
                </field>
                <field name="chapterTitle">
                    <index>yes</index>
                    <store>yes</store>
                    <return>yes</return>
                    <tokenize>yes</tokenize>
                    <sort>no</sort>
                    <boost>1.0</boost>
                </field>
            </field> 

            <field name="foreword">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
            <field name="startChapter">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
            <field name="endChapter">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
        </field> 

        <!-- not a subfield -->
        <field name="references">
            <index>yes</index>
            <store>no</store>
            <return>no</return>
            <tokenize>yes</tokenize>
            <sort>no</sort>
            <boost>1.0</boost>
        </field> 

    </field-list>
</index-type>


Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, that is, in all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.
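
The sub-field behaviour described above can be modelled as a simple tree walk. The sketch below only illustrates which fields a query on a given field covers, using the field hierarchy of the IndexType above; it is not how the index stores sub-fields internally, and the class SubFieldSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the field hierarchy, not part of the index service.
class SubFieldSketch {

    // parent field -> direct sub-fields, mirroring the IndexType example
    static final Map<String, List<String>> CHILDREN = new LinkedHashMap<>();
    static {
        CHILDREN.put("contents", Arrays.asList("title", "foreword", "startChapter", "endChapter"));
        CHILDREN.put("title", Arrays.asList("bookTitle", "chapterTitle"));
    }

    // The field itself followed by all of its direct and indirect sub-fields.
    static List<String> coveredFields(String field) {
        List<String> out = new ArrayList<>();
        out.add(field);
        for (String child : CHILDREN.getOrDefault(field, Collections.<String>emptyList())) {
            out.addAll(coveredFields(child));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(coveredFields("contents"));
        System.out.println(coveredFields("title"));
        System.out.println(coveredFields("startChapter"));
    }
}
```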

We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using any of them, as long as the RowSet follows the IndexType:

  • index-type-default-1.0 (DublinCore)
  • index-type-TEI-2.0
  • index-type-eiDB-1.0
  • index-type-iso-1.0
  • index-type-FT-1.0

The IndexType of a FullTextIndexManagement Service resource can be changed as long as no FullTextIndexBatchUpdater resources have connected to it. The reason for this limitation is that the processing of fields should be the same for all documents in an index; all documents in an index should be handled according to the same IndexType.

The IndexType of a FullTextIndexLookup Service resource is originally retrieved from the FullTextIndexManagement Service resource it is connected to. However, the "return" element can be changed at any time in order to change which fields are returned. Keep in mind that only fields which have their "store" element set to "yes" can have their "return" element altered to return content.

Query language

The Full Text Index uses the Lucene query language, but does not allow the use of fuzzy searches, proximity searches, range searches or boosting of a term. In addition, queries using wildcards will not return usable query statistics.
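
A client may want to reject unsupported queries before sending them to the index. The following is a deliberately crude sketch, assuming the standard Lucene syntax characters (~ for fuzzy and proximity searches, ^ for boosting, [ ] and { } for range searches); it ignores characters inside quoted phrases and after a backslash escape. The class QueryCheckSketch is hypothetical, and the index itself may apply different or stricter checks.

```java
// Hypothetical client-side check, not part of the index service.
class QueryCheckSketch {

    // Returns a description of the first unsupported feature, or null if none is found.
    static String unsupportedFeature(String query) {
        boolean inQuotes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\') {          // skip the escaped character
                i++;
                continue;
            }
            if (c == '"') {           // entering or leaving a quoted phrase
                inQuotes = !inQuotes;
                continue;
            }
            if (inQuotes) {
                continue;             // operators inside a phrase are plain text
            }
            if (c == '~') {
                return "fuzzy or proximity search";
            }
            if (c == '^') {
                return "term boosting";
            }
            if (c == '[' || c == '{') {
                return "range search";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(unsupportedFeature("car OR bus OR plane"));
        System.out.println(unsupportedFeature("roam~"));
        System.out.println(unsupportedFeature("title:[a TO c]"));
    }
}
```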

Linguistics

The linguistics component is used in the Full Text Index.

Two linguistics components are available: the language identifier module and the lemmatizer module. The language identifier module has two real implementations (plugins) and a dummy plugin (doing nothing, always returning "nolang" when called). The lemmatizer module contains one real implementation (no suitable alternative was found for a second plugin) and a dummy plugin (always returning an empty String "").

Fast has provided proprietary technology for one of the language identifier modules (Fastlangid) and the lemmatizer module (Fastlemmatizer). The modules provided by Fast require a valid license to run (see later). The license is a 32-character string. This string must be provided by Fast (contact Stefan Debald, setfan.debald@fast.no), and saved in the appropriate configuration file (see install a linguistics license).

The current license is valid until end of March 2008.

Plugin implementation

The classes implementing the plugin framework for the language identifier and the lemmatizer are in the SVN module common. The packages are:

org/diligentproject/indexservice/common/linguistics/lemmatizerplugin
and 
org/diligentproject/indexservice/common/linguistics/langidplugin

The class LanguageIdFactory loads an instance of the class LanguageIdPlugin. The class LemmatizerFactory loads an instance of the class LemmatizerPlugin.

The language id plugins implement the class org.diligentproject.indexservice.common.linguistics.langidplugin.LanguageIdPlugin. The lemmatizer plugins implement the class org.diligentproject.indexservice.common.linguistics.lemmatizerplugin.LemmatizerPlugin. The factories use the method:

Class.forName(pluginName).newInstance();

when loading the implementations. The parameter pluginName is the fully qualified class name (package name plus class name) of the plugin class to be loaded and instantiated.
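
The reflective loading described above can be sketched as follows. This is a self-contained illustration only: the interface and the dummy plugin are simplified stand-ins defined inside the example (the real LanguageIdPlugin class and its methods are not shown on this page), and getDeclaredConstructor().newInstance() is used in place of the deprecated Class.newInstance().

```java
// Hypothetical, simplified stand-in for the plugin framework.
class PluginLoaderSketch {

    // Assumed minimal plugin contract; the real LanguageIdPlugin may differ.
    public interface LanguageIdPlugin {
        String identify(String text);
    }

    // Stand-in for the dummy plugin that always answers "nolang".
    public static class DummyLanguageIdPlugin implements LanguageIdPlugin {
        public String identify(String text) {
            return "nolang";
        }
    }

    // Load and instantiate a plugin from its fully qualified class name.
    static LanguageIdPlugin load(String pluginName) {
        try {
            Class<?> cls = Class.forName(pluginName);
            return (LanguageIdPlugin) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("cannot load plugin " + pluginName, e);
        }
    }

    public static void main(String[] args) {
        LanguageIdPlugin plugin = load(DummyLanguageIdPlugin.class.getName());
        System.out.println(plugin.identify("Just read the WIKI"));
    }
}
```

Because the plugin is selected by name at runtime, a new implementation can be added without recompiling the factory; it only has to be on the classpath and have a no-argument constructor.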

Language Identification Plugins

There are two real implementations of the language identification plugin available in addition to the dummy plugin that always returns "nolang".

JTextCat

JTextCat is maintained at http://textcat.sourceforge.net/. It is a lightweight text categorization tool written in Java. It implements the N-Gram-Based Text Categorization algorithm described here: http://citeseer.ist.psu.edu/68861.html. It supports the following languages: German, English, French, Spanish, Italian, Swedish, Polish, Dutch, Norwegian, Finnish, Albanian, Slovakian, Slovenian, Danish and Hungarian.

JTextCat is loaded by the plugin: org.diligentproject.indexservice.common.linguistics.jtextcat.JTextCatPlugin

The JTextCat contains no config files or bigram files since all the statistical data about the languages are contained in the package itself.
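
For readers unfamiliar with N-gram based text categorization, the general idea can be sketched as follows: each language is represented by a list of its most frequent character n-grams, ordered by frequency, and a text is assigned to the language whose list is closest to the text's own list according to an "out of place" rank distance. The code below is a simplified illustration of that technique and is not JTextCat's code; the class NGramSketch, the n-gram lengths (1 to 3) and the profile size (300) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of n-gram profile matching; not JTextCat's implementation.
class NGramSketch {
    static final int MAX_N = 3;      // n-gram lengths 1..3 (assumption)
    static final int PROFILE = 300;  // number of top-ranked n-grams kept (assumption)

    // Build a profile: n-grams sorted by descending frequency, ties broken alphabetically.
    static List<String> profile(String text) {
        String padded = " " + text.toLowerCase().replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = counts.get(b) - counts.get(a);
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return grams.size() > PROFILE ? new ArrayList<>(grams.subList(0, PROFILE)) : grams;
    }

    // Out-of-place distance: sum of rank differences, with a fixed penalty for missing n-grams.
    static int distance(List<String> doc, List<String> lang) {
        int d = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = lang.indexOf(doc.get(i));
            d += (j < 0) ? PROFILE : Math.abs(i - j);
        }
        return d;
    }

    // Pick the language whose profile has the smallest distance to the text's profile.
    static String identify(String text, Map<String, List<String>> langProfiles) {
        List<String> doc = profile(text);
        String best = "nolang";
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : langProfiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

In practice the language profiles are built once from large training texts; the tiny profiles one could build from a single sentence are only useful for trying out the code.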

Fastlangid

The Fast language identification module is developed by Fast. It supports "all" languages used on the web. The tool is implemented in C++, and the C++ code is loaded as a shared library object. The Fast langid plugin interfaces with a Java wrapper that loads the shared library objects and calls the native C++ code. The shared library objects are compiled on Linux RHE3 and RHE4.

The Java native interface is generated using Swig.

The Fast langid module is loaded by the plugin (using the LanguageIdFactory):

org.diligentproject.indexservice.linguistics.fastplugin.FastLanguageIdPlugin

The plugin loads the shared object library and, when init is called, instantiates the native C++ objects that identify the languages.

The Fastlangid is in the SVN module: trunk/linguistics/fastlinguistics/fastlangid

The lib directory contains one directory for RHE3 and one directory for RHE4 shared objects (.so). The etc directory contains the config files that must be present on the target system.

The shared library object is called liblangid.so.

The configuration files for the langid module are installed in $GLOBUS_LOCATION/etc/langid.

The org_diligentproject_indexservice_langid.jar contains the plugin FastLangidPlugin (that is loaded by the LanguageIdFactory) and the Java native interface to the shared object.

The shared object liblangid.so is deployed in the $GLOBUS_LOCATION/lib directory.
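
As an illustration of how a Java wrapper typically loads such a shared object, the sketch below uses System.loadLibrary, which on Linux maps the name "langid" to the file liblangid.so and searches the directories in java.library.path. That $GLOBUS_LOCATION/lib is on this path is an assumption of the example, and the class NativeLoadSketch is hypothetical; the Swig-generated wrapper may load the library differently.

```java
// Hypothetical loader sketch; the Swig-generated wrapper may do this differently.
class NativeLoadSketch {

    // Try to load a native library by its short name; report instead of crashing.
    static boolean tryLoad(String libName) {
        try {
            // On Linux, "langid" is resolved to liblangid.so on java.library.path.
            System.loadLibrary(libName);
            return true;
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Could not load " + System.mapLibraryName(libName)
                    + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumes the JVM was started with the library directory on its path,
        // e.g. -Djava.library.path=$GLOBUS_LOCATION/lib
        System.out.println("langid loaded: " + tryLoad("langid"));
    }
}
```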

Language Identifier

The language identifier is used by the Full Text Updater in the Full Text Index. The plugin to use for an updater is decided when the resource is created, as a part of the create resource call (see Full Text Updater). The parameter is the package name of the implementation to be loaded and used to identify the language.

Three packages can be

The language identification module and the lemmatizer module are loaded at runtime by using a factory that loads the implementation that is going to be used.

The mou

The language of content is specified on a field level. If no language is found and a language identification plugin has been loaded, the FullTextIndexBatchUpdater Service will try to identify the language of the field. Since language is assigned at the Collections level in Diligent, all fields of all documents in a language aware collection should contain a "lang" attribute with the language of the collection.

A language aware query can be performed on a per-query or per-term basis:

  • the query "_querylang_en: car OR bus OR plane" will look for English occurrences of all the terms in the query.
  • the queries "car OR _lang_en:bus OR plane" and "car OR _lang_en_title:bus OR plane" will only limit the terms "bus" and "title:bus" to English occurrences. (the version without a specified field will not work in the currently deployed indices)
  • Since language is specified at a collection level, language aware queries should only be used for language neutral collections.
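
The query prefixes shown in the list above can be assembled with a few string helpers. This is only a convenience sketch; the class LangQuerySketch and its method names are hypothetical, while the prefixes themselves are taken from the examples above.

```java
// Hypothetical convenience helpers for building language aware queries.
class LangQuerySketch {

    // Restrict a whole query to one language, e.g. "_querylang_en: car OR bus".
    static String wholeQuery(String lang, String query) {
        return "_querylang_" + lang + ": " + query;
    }

    // Restrict a single term to one language, e.g. "_lang_en:bus".
    static String term(String lang, String term) {
        return "_lang_" + lang + ":" + term;
    }

    // Restrict a single term in a given field to one language, e.g. "_lang_en_title:bus".
    static String fieldTerm(String lang, String field, String term) {
        return "_lang_" + lang + "_" + field + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(wholeQuery("en", "car OR bus OR plane"));
        System.out.println("car OR " + term("en", "bus") + " OR plane");
        System.out.println("car OR " + fieldTerm("en", "title", "bus") + " OR plane");
    }
}
```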

The FullTextIndexLookup Service uses expansion during lemmatization; a word (of a query) is expanded into all known versions of the word. It is of course important to know the language of the query in order to know which words to expand the query with. Currently, the same methods used to specify the language for a language aware query are also used to specify the language for the lemmatization process. A way of separating these two specifications (such that lemmatization can be performed without performing a language aware query) will be made available shortly.

Partitioning

In order to handle situations where an Index replication does not fit on a single node, partitioning has been implemented for the FullTextIndexLookup Service; in cases where there is not enough space to perform an update/addition on the FullTextIndexLookup Service resource, a new resource will be created to handle all the content which didn't fit on the first resource. The partitioning is handled automatically and is transparent when performing a query; the possibility of enabling/disabling partitioning will be added in the future. In the deployed Indices, partitioning has been disabled due to problems with the creation of statistics. This will be fixed shortly.


Dependencies

Will be filled out shortly

Usage Example

Create a Management Resource

//Get the factory portType
String managementFactoryURI = "http://some.domain.no:8080/wsrf/services/diligentproject/index/FullTextIndexManagementFactoryService";
FullTextIndexManagementFactoryServiceAddressingLocator managementFactoryLocator = new FullTextIndexManagementFactoryServiceAddressingLocator();

managementFactoryEPR = new EndpointReferenceType();
managementFactoryEPR.setAddress(new Address(managementFactoryURI));
managementFactory = managementFactoryLocator
             .getFullTextIndexManagementFactoryPortTypePort(managementFactoryEPR);

//Create generator resource and get endpoint reference of WS-Resource.
org.diligentproject.indexservice.fulltextindexmanagement.stubs.CreateResource managementCreateArguments =
                           new org.diligentproject.indexservice.fulltextindexmanagement.stubs.CreateResource();
managementCreateArguments.setIndexTypeName(new URI(
                            "http://www.diligentproject.org/index/type/" + indexType));
managementCreateArguments.setIndexFormat(new URI(
                            "http://www.diligentproject.org/index/format/lucene"));
managementCreateArguments.setIndexID(indexID);
managementCreateArguments.setCollectionID(new String[] {collectionID}); //please only add one collection id for now :)
managementCreateArguments.setContentType(contentType); 

org.diligentproject.indexservice.fulltextindexmanagement.stubs.CreateResourceResponse managementCreateResponse = 
                                                             managementFactory.createResource(managementCreateArguments);

managementInstanceEPR = managementCreateResponse.getEndpointReference();

Create an Updater Resource and start feeding

//Get the factory portType
updaterFactoryURI = "http://some.domain.no:8080/wsrf/services/diligentproject/index/FullTextIndexBatchUpdaterFactoryService"; //could be on any node
updaterFactoryEPR = new EndpointReferenceType();
updaterFactoryEPR.setAddress(new Address(updaterFactoryURI));
updaterFactory = updaterFactoryLocator
             .getFullTextIndexBatchUpdaterFactoryPortTypePort(updaterFactoryEPR);


//Create updater resource and get endpoint reference of WS-Resource
org.diligentproject.indexservice.fulltextindexbatchupdater.stubs.CreateResource updaterCreateArguments =
                                                   new org.diligentproject.indexservice.fulltextindexbatchupdater.stubs.CreateResource();

updaterCreateArguments.setMainIndexEpr(managementInstanceEPR);
updaterCreateArguments.setCollectionIDs(new String[] {collectionID});
                        

//Now let's insert some data into the index... Firstly, get the updater EPR.
org.diligentproject.indexservice.fulltextindexbatchupdater.stubs.CreateResourceResponse updaterCreateResponse = updaterFactory
                                        .createResource(updaterCreateArguments);
updaterInstanceEPR = updaterCreateResponse.getEndpointReference(); 


//Get updater instance PortType
updaterInstance = updaterInstanceLocator.getFullTextIndexBatchUpdaterPortTypePort(updaterInstanceEPR);
 

//read the EPR of the ResultSet containing the ROWSETs to feed into the index                        
BufferedReader in = new BufferedReader(new FileReader(eprFile));
String line;
resultSetLocator = "";
while((line = in.readLine())!=null){
    resultSetLocator += line;
}
                       
//Tell the updater to start gathering data from the ResultSet
updaterInstance.processResultSet(resultSetLocator);

Create a Lookup resource and perform a query

//Let's put it on another node for fun...
lookupFactoryURI = "http://another.domain.no:8080/wsrf/services/diligentproject/index/FullTextIndexLookupFactoryService";
FullTextIndexLookupFactoryServiceAddressingLocator lookupFactoryLocator = new FullTextIndexLookupFactoryServiceAddressingLocator();
EndpointReferenceType lookupFactoryEPR = null;
EndpointReferenceType lookupEPR = null;
FullTextIndexLookupFactoryPortType lookupFactory = null;
FullTextIndexLookupPortType lookupInstance = null; 

//Get factory portType
lookupFactoryEPR= new EndpointReferenceType();
lookupFactoryEPR.setAddress(new Address(lookupFactoryURI));
lookupFactory = lookupFactoryLocator.getFullTextIndexLookupFactoryPortTypePort(lookupFactoryEPR);

//Create resource and get endpoint reference of WS-Resource
org.diligentproject.indexservice.fulltextindexlookup.stubs.CreateResource lookupCreateResourceArguments = 
						new org.diligentproject.indexservice.fulltextindexlookup.stubs.CreateResource();
org.diligentproject.indexservice.fulltextindexlookup.stubs.CreateResourceResponse lookupCreateResponse = null;
 
lookupCreateResourceArguments.setMainIndexEpr(managementInstanceEPR);    
lookupCreateResponse = lookupFactory.createResource( lookupCreateResourceArguments);
lookupEPR =  lookupCreateResponse.getEndpointReference(); 

//Get instance PortType (instanceLocator is assumed to be the lookup service's addressing locator, declared elsewhere)
lookupInstance = instanceLocator.getFullTextIndexLookupPortTypePort(lookupEPR);

//Perform a query
String query = "good OR evil";
String epr = lookupInstance.query(query); 

//Print the results to screen. (refer to the ResultSet Framework page for a more detailed explanation)
RSXMLReader reader=null;
ResultElementBase[] results;

try{
    //create a reader for the ResultSet we created
    reader = RSXMLReader.getRSXMLReader(new RSLocator(epr)); 

    //Print each part of the RS to std.out
    System.out.println("<Results>");
    do{
        System.out.println("    <Part>");
        if (reader.getNumberOfResults() > 0){
            results = reader.getResults(ResultElementGeneric.class);
            for(int i = 0; i < results.length; i++ ){
                System.out.println("        "+results[i].toXML());
            }
        }
        System.out.println("    </Part>");
        if(!reader.getNextPart()){
            break;
        }
    }
    while(true);
    System.out.println("</Results>");
}
catch(Exception e){
    e.printStackTrace();
}

--Msibeko 14:36, 1 June 2007 (EEST)