X-Search

From Gcube Wiki
Revision as of 09:26, 18 November 2014 by Yannis.marketakis (Talk | contribs) (Log of Activities)


Overview

X-Search is a meta-search engine that reads the description of an underlying search source, queries that source, and analyzes the returned results in various ways. It also exploits the availability of semantic repositories.

Key features

Provision of textual clustering of the results.
Clustering is performed either over the textual snippets or over the entire contents.
Provision of textual entity mining of the results.
Text entity mining can be performed either over the textual snippets or over the entire contents.
Provision of faceted search-like exploration of the results.
The results of clustering, entity mining and metadata-based grouping are visualized and exploited according to the faceted exploration interaction paradigm: when the user clicks on a cluster or entity, the results are restricted to those that contain that cluster or entity.
Ability to semantically explore an identified entity.
X-Search provides the necessary linkage between the mined entities and semantic information. In particular, by exploiting appropriate Semantic Knowledge Bases, the user can retrieve more information about an entity by querying and browsing over these Knowledge Bases.
Ability to apply entity mining and explore the identified entities during plain Web browsing.
X-Search also offers entity discovery and exploration while the user is browsing on the Web. Specifically, the user is able to inspect the entities of a particular Web page by clicking a bookmarklet (a special bookmark) and then to further retrieve more information about an entity by querying a Knowledge Base.
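
The restriction step of the faceted interaction can be illustrated with a minimal sketch. It is not the X-Search implementation: the class FacetedRestriction, the Hit type and the restrict method are hypothetical names, and each hit is simply assumed to carry the set of entity names mined from it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: restrict results to the hits that contain all selected entities.
final class FacetedRestriction {

    // A search hit together with the entity names mined from it (assumed structure).
    static final class Hit {
        final String title;
        final Set<String> entities;

        Hit(String title, String... entities) {
            this.title = title;
            this.entities = new HashSet<>(Arrays.asList(entities));
        }
    }

    // Keeps only the hits that contain every entity the user has clicked so far.
    static List<Hit> restrict(List<Hit> hits, Set<String> selected) {
        List<Hit> restricted = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.entities.containsAll(selected)) {
                restricted.add(hit);
            }
        }
        return restricted;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("Tuna stocks report", "Thunnus thynnus", "Mediterranean Sea"),
                new Hit("Aquaculture overview", "Mediterranean Sea"));
        Set<String> selected = new HashSet<>(Arrays.asList("Thunnus thynnus"));
        for (Hit hit : restrict(hits, selected)) {
            System.out.println(hit.title);
        }
    }
}
```

Each further click adds one more name to the selected set, so the answer space shrinks gradually; an empty selection leaves the results unchanged.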


Design

Philosophy

X-Search has been designed to offer its functionality on top of other search systems (as a meta-search service).

Architecture

X-Search is composed of several components. These are:

  • Text Clustering: Responsible for clustering the results of the underlying search system. Clustering is performed on the textual snippets of the returned results, although clustering of the full contents is also supported. The identified clusters are also ranked.
  • Text Entity Mining: Responsible for performing entity mining on the textual content. As with the Text Clustering component, mining can be performed either over the textual snippets or over the entire content, and the identified entities are ranked.
  • Search System Mediator: Acts as a mediator between X-Search and the underlying search system. Its role is to read an OpenSearch description document describing the underlying search system (e.g. location of the search system, query format, response format), parse the results of the search system, and fetch the contents of a hit upon user request (e.g. when a user wants to perform entity mining on the whole content of a hit).
  • Linked Open Data Query Component: Builds appropriate SPARQL queries and sends them to particular SPARQL endpoints in order to provide useful information about the mined entities. Several SPARQL endpoints can be used for different purposes (e.g. GeoNames for locations, DBpedia for persons or organizations, the MarineTLO-based Warehouse for fish species).
  • Bookmarklet for Dynamic Semantic Annotation: It allows performing entity mining in the textual contents of any web page (PDF files are also supported).
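
As an illustration of the mediator's first task, the sketch below reads the URL template from an OpenSearch description document and fills in the query terms. It only assumes the standard OpenSearch 1.1 conventions (a Url element with a template attribute, the {searchTerms} parameter, and optional parameters marked with a trailing question mark); the class and method names and the example.org address are hypothetical and are not taken from the X-Search code.

```java
import java.io.ByteArrayInputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch of how a mediator could turn a description document into a query URL.
final class OpenSearchTemplateSketch {

    // Extracts the template attribute of the first <Url> element of the description document.
    static String readUrlTemplate(String descriptionXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(descriptionXml.getBytes(StandardCharsets.UTF_8)));
            Element url = (Element) doc.getElementsByTagName("Url").item(0);
            return url.getAttribute("template");
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable description document", e);
        }
    }

    // Fills {searchTerms} with the encoded query and blanks optional parameters such as {startPage?}.
    static String buildQueryUrl(String template, String query) {
        try {
            String encoded = URLEncoder.encode(query, "UTF-8");
            return template.replace("{searchTerms}", encoded)
                           .replaceAll("\\{[^}]*\\?\\}", "");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String description =
                "<OpenSearchDescription xmlns=\"http://a9.com/-/spec/opensearch/1.1/\">"
              + "<ShortName>Example</ShortName>"
              + "<Url type=\"application/rss+xml\" "
              + "template=\"http://search.example.org/find?q={searchTerms}&amp;page={startPage?}\"/>"
              + "</OpenSearchDescription>";
        System.out.println(buildQueryUrl(readUrlTemplate(description), "Mediterranean Tuna"));
    }
}
```

A real mediator would also have to select among several Url elements by response type and parse the returned results, which this sketch leaves out.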

Deployment

The following component diagram describes the components that constitute X-Search and how they are connected to provide the desired functionality. All the components have been implemented in Java. Several components also use third-party libraries; for example, the Text Entity Mining component uses the GATE Annie tool to extract entities from text. Furthermore, the semantic data analysis facilities can be exploited either through the X-Search UI component (for humans) or through the RESTful X-Search API.

XSearchComponentModel.jpg

The component diagram below depicts the gCube components with which X-Search communicates. In particular, X-Search-portlet has been developed using GWT to allow interaction with a user. The portlet retrieves the search results from the ASLSession and sends them to X-Search-service for semantic data analysis. The submission of the search results (and the retrieval of the semantic analysis results) is performed using the gRS2 pipelining mechanism. Note that the diagram is not exhaustive: it includes only the elements that are necessary for the functionality of X-Search, without explicitly adding the elements that those use directly or indirectly.

XSearchComponentModel within gCube.jpg

The following sequence diagram shows the entire process, starting from the query that a user submits through the search portlet. When X-Search-portlet is triggered, it fetches the top-K results (the default value of K is 100) from the ASL-Session and sends them to X-Search-service for semantic data analysis. If the user requests more results than the K that have been retrieved and analyzed, X-Search-portlet requests the next K results (if any), sends them for semantic data analysis, and shows them to the user.

XSearchFunctionality.jpg
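
The incremental fetching of results in batches of K can be sketched as follows. This is only an illustration of the offset bookkeeping under stated assumptions: ResultSource is a hypothetical stand-in for the ASL-Session, and the point where X-Search would submit a batch for semantic analysis is marked with a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fetch further batches of K results only when the user asks for them.
final class TopKPagingSketch {

    // Stand-in for the result source (the ASL-Session in X-Search); hypothetical interface.
    interface ResultSource {
        List<String> fetch(int count, int startOffset);
    }

    private final ResultSource source;
    private final int k;
    private final List<String> analyzed = new ArrayList<>();

    TopKPagingSketch(ResultSource source, int k) {
        this.source = source;
        this.k = k;
    }

    // Returns the first `wanted` results, fetching the next K only while more are needed.
    List<String> resultsUpTo(int wanted) {
        while (analyzed.size() < wanted) {
            List<String> batch = source.fetch(k, analyzed.size());
            if (batch.isEmpty()) {
                break; // the source has no more results
            }
            // In X-Search, this is where the batch would be sent for semantic data analysis.
            analyzed.addAll(batch);
        }
        return new ArrayList<>(analyzed.subList(0, Math.min(wanted, analyzed.size())));
    }

    // Number of results fetched (and, in X-Search, analyzed) so far.
    int analyzedCount() {
        return analyzed.size();
    }

    public static void main(String[] args) {
        // A fake source holding 250 results.
        ResultSource fake = (count, start) -> {
            List<String> batch = new ArrayList<>();
            for (int i = start; i < Math.min(start + count, 250); i++) {
                batch.add("result-" + i);
            }
            return batch;
        };
        TopKPagingSketch paging = new TopKPagingSketch(fake, 100);
        System.out.println(paging.resultsUpTo(10).size() + " shown, " + paging.analyzedCount() + " fetched");
        System.out.println(paging.resultsUpTo(150).size() + " shown, " + paging.analyzedCount() + " fetched");
    }
}
```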

Implementation details

X-Search-service is a generic meta-search engine that analyzes the contents of the search results. It has been implemented in Java as a Web application. To provide its functionality, it uses a set of other components:

  • text-entitiy-mining. This component is responsible for identifying and extracting the textual entities from a given unstructured text. It uses the GATE Annie framework, which relies on gazetteers and natural language processing techniques.
  • stella-results-text-clustering. This component is responsible for clustering a set of search results. It supports different clustering algorithms including STC, STC+ and NM-STC.
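
To give an idea of what gazetteer-based lookup means, the sketch below matches a text against per-category name lists. It is a deliberately naive illustration and does not use the GATE API: the class GazetteerSketch and its mine method are hypothetical, matching is a case-insensitive substring test, and no tokenization or other natural language processing is applied.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of list-based entity lookup; not the GATE Annie implementation.
final class GazetteerSketch {

    // Returns, per category, the gazetteer entries that occur in the text (case-insensitive).
    static Map<String, Set<String>> mine(String text, Map<String, Set<String>> gazetteer) {
        String haystack = text.toLowerCase(Locale.ROOT);
        Map<String, Set<String>> found = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> category : gazetteer.entrySet()) {
            for (String name : category.getValue()) {
                if (haystack.contains(name.toLowerCase(Locale.ROOT))) {
                    found.computeIfAbsent(category.getKey(), k -> new LinkedHashSet<>()).add(name);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> gazetteer = new LinkedHashMap<>();
        gazetteer.put("Location", new LinkedHashSet<>(Arrays.asList("Mediterranean Sea", "Vienna")));
        gazetteer.put("Species", new LinkedHashSet<>(Arrays.asList("Thunnus thynnus")));
        String snippet = "Catches of Thunnus thynnus in the Mediterranean Sea have been reported.";
        System.out.println(mine(snippet, gazetteer));
    }
}
```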

X-Search-service is fully configurable either through its user interface or through a set of configuration files. These files, as well as the files required by GATE Annie, are downloaded at the time of the deployment.

X-Search-portlet has been developed for presenting the results in the Liferay portal. It is the front-end of X-Search and is responsible for retrieving the top-K search results from gCube search. It then sends them to X-Search-service for semantic data analysis, presents the outcome to the users, and handles the user interaction.

X-Search-portlet retrieves a set of top-K results (K is configurable) from the ASL-Session and sends a subset of those to X-Search-service for analysis. As soon as the user requests more than K results, further results are fetched and submitted for semantic data analysis. The source code for retrieving the top-K results from the ASL-Session is shown below:

// resConsumer is an instance of ResultSetConsumerI (aslsearch)
// numOfResConsume declares the number of results to be fetched 
// startOffset is the index of the first result to be fetched
// aslSession is an instance of the current session
ArrayList<String> results = resConsumer.getResultsToText(numOfResConsume, startOffset, aslSession);

The results are then sent to X-Search-service using the gRS2 framework. The following code demonstrates how X-Search-portlet initializes the gRS2 pipeline:

// creates the fields for the entries that will be added in gRS2
RecordDefinition[] recordDefinitions = new RecordDefinition[]{
        new GenericRecordDefinition(new FieldDefinition[]{
                new StringFieldDefinition("title"),
                new StringFieldDefinition("snippet")})};

// creates a gRS2 writer to which the result records will be added
RecordWriter<GenericRecord> writer =
        new RecordWriter<>(new LocalWriterProxy(),
                recordDefinitions, RecordWriter.DefaultBufferCapacity,
                RecordWriter.DefaultConcurrentPartialCapacity,
                RecordWriter.DefaultMirrorBufferFactor, 1, TimeUnit.DAYS);

// exposes the writer through a TCP locator and opens a new HTTP connection;
// url is not defined in this excerpt and is assumed to be the java.net.URL of the X-Search-service endpoint
URI TCPLocator = Locators.localToTCP(writer.getLocator());
HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
httpConn.connect();

After the connection has been opened, the results are added and submitted. The semantic analysis results are returned in JSON format.

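// For every result, create a GenericRecord and add it to the gRS2 writer,
// from which X-Search-service reads it through the TCP locator.
// createNewGenericRecord is a helper that is not shown in this excerpt;
// the 60-second timeout is only an illustrative value.
for (String xmlRepresentation : results) {
    GenericRecord rec = createNewGenericRecord(xmlRepresentation);
    writer.put(rec, 60, TimeUnit.SECONDS);
}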

Since X-Search-service is not co-located with X-Search-portlet, the portlet must locate a running instance of X-Search-service. The running instances of X-Search-service are maintained by the Information Collector service, which X-Search-portlet queries to get a list of the available running instances. These are stored in a queue: whenever an X-Search-service instance is required, the portlet dequeues the instance at the head of the queue and immediately enqueues it again at the tail (round-robin). This policy ensures that all the available running instances of X-Search-service are used and that the load is balanced between them. Furthermore, X-Search-portlet periodically checks whether the queue of running instances is still valid (some instances might have become unreachable, or new instances might be available). The interval between these checks is configurable. The following portion of source code shows how the running instances can be retrieved using the ic-client:

// despite its name, xsearchEndpointsStack is used as a FIFO queue
Queue<String> xsearchEndpointsStack = new LinkedList<>();
ScopeProvider.instance.set(this.gCubeScope);
SimpleQuery query = queryFor(GCoreEndpoint.class)
        .addCondition("$resource/Profile/ServiceName/text() eq '" + this.XSEARCH_SERVICENAME + "'")
        .setResult("$resource/Profile/AccessPoint/RunningInstanceInterfaces/Endpoint/text()");
List<String> results = DiscoverClient.client().submit(query);
for (String result : results) {
    xsearchEndpointsStack.add(result);
}
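
The dequeue-and-enqueue policy itself can be sketched independently of gCube as shown below. The class RoundRobinEndpoints and its methods are hypothetical names chosen for this illustration; the refresh method stands for whatever the periodic validity check does with a freshly discovered list of instances.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Queue;

// Hypothetical sketch of round-robin selection over the discovered service endpoints.
final class RoundRobinEndpoints {

    private final Queue<String> endpoints = new ArrayDeque<>();

    RoundRobinEndpoints(Collection<String> discovered) {
        endpoints.addAll(discovered);
    }

    // Takes the endpoint at the head of the queue and puts it back at the tail.
    synchronized String next() {
        String endpoint = endpoints.poll();
        if (endpoint == null) {
            throw new IllegalStateException("no running instance available");
        }
        endpoints.add(endpoint);
        return endpoint;
    }

    // Replaces the queue contents, e.g. after a periodic validity check.
    synchronized void refresh(Collection<String> discovered) {
        endpoints.clear();
        endpoints.addAll(discovered);
    }

    public static void main(String[] args) {
        RoundRobinEndpoints instances = new RoundRobinEndpoints(
                Arrays.asList("http://node1.example.org/xsearch", "http://node2.example.org/xsearch"));
        for (int i = 0; i < 4; i++) {
            System.out.println(instances.next());
        }
    }
}
```

The methods are synchronized because several portlet requests may ask for an instance at the same time; whether the actual portlet guards its queue in this way is not stated here.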

When a user submits a query at X-Search-portlet, the portlet builds a connection with the SemanticSearch service of X-Search-service and sends (as request parameter) the URN of the locator that contains the results. In this way, X-Search-service knows the location of the results that have to be analyzed. The following code shows how X-Search-service reads the titles and the snippets of the results, based on a given locator.

wseResults = new ArrayList<SearchResult>();
URI locURI = new URI(locator);
ForwardReader<GenericRecord> reader = new ForwardReader<GenericRecord>(locURI);

int numOfRecords = 0;
GenericRecord rec;

Iterator<GenericRecord> it = reader.iterator();
while (it.hasNext()) {
   rec = it.next();
   String title = "", snippet = "", url = "";
   if (rec != null) {
      if (((StringField) rec.getField("title")) != null) {
          title = ((StringField) rec.getField("title")).getPayload();
      }
      if ((StringField) rec.getField("snippet") != null) {
          snippet = ((StringField) rec.getField("snippet")).getPayload();
      }
   } else {
      System.out.println("- The record " + numOfRecords + " is null.");
   }

   SearchResult searchresult = new SearchResult(title, "", snippet, numOfRecords);
   wseResults.add(searchresult);

   numOfRecords++;
}
reader.close();

X-Search-service then performs textual clustering and entity mining on these results:

Clustering clusteringComp = new Clustering(wseResults, query, only_snippets, numOfClusters, Resources.CLUSTERING_ALGORITHM);
clustersContent = clusteringComp.getClustersContent();

Mining miningComp = new Mining(wseResults, query, Resources.MINING_ACCEPTED_CATEGORIES, Resources.SPARQL_ENDPOINTS, Resources.SPARQL_TEMPLATES);
miningComp.giveRankToElements();
entities = miningComp.getEntities();

and produces JSON outputs that are sent back to X-Search-portlet.

For entity enrichment and exploration, X-Search-service provides two services: the InspectEntity service and the ShowProperties service. The InspectEntity service tries to find semantic resources (i.e. URIs) that match a given entity name. Specifically, it reads the SPARQL endpoint and the SPARQL template query that correspond to the category of the given entity name, and runs a SPARQL query to retrieve the corresponding semantic information. Likewise, the ShowProperties service runs a SPARQL query that requests the outgoing properties of a given semantic resource.
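
The lookup of an endpoint and a template by entity category, followed by the instantiation of the template, can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the %ENTITY% placeholder, the class and method names, the example template and the example.org endpoint are not taken from the X-Search configuration. The request URL follows the standard SPARQL protocol convention of passing the query in a query parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of building an InspectEntity-style SPARQL request from a template.
final class InspectEntitySketch {

    static final String PLACEHOLDER = "%ENTITY%"; // assumed placeholder syntax

    // Endpoint and template query configured for one entity category.
    static final class Source {
        final String endpoint;
        final String template;

        Source(String endpoint, String template) {
            this.endpoint = endpoint;
            this.template = template;
        }
    }

    private final Map<String, Source> sourcesByCategory = new HashMap<>();

    void register(String category, String endpoint, String template) {
        sourcesByCategory.put(category, new Source(endpoint, template));
    }

    // Escapes backslashes and double quotes so the name is safe inside a SPARQL string literal.
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Instantiates the template of the given category with the entity name.
    String buildQuery(String category, String entityName) {
        Source source = sourcesByCategory.get(category);
        if (source == null) {
            throw new IllegalArgumentException("no SPARQL source for category " + category);
        }
        return source.template.replace(PLACEHOLDER, escapeLiteral(entityName));
    }

    // Builds the GET request URL: <endpoint>?query=<url-encoded query>.
    String buildRequestUrl(String category, String entityName) {
        try {
            String query = buildQuery(category, entityName);
            return sourcesByCategory.get(category).endpoint + "?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        InspectEntitySketch inspect = new InspectEntitySketch();
        inspect.register("Species", "http://sparql.example.org/query",
                "SELECT DISTINCT ?uri WHERE { ?uri <http://www.w3.org/2000/01/rdf-schema#label> \"%ENTITY%\"@en } LIMIT 10");
        System.out.println(inspect.buildQuery("Species", "Thunnus thynnus"));
        System.out.println(inspect.buildRequestUrl("Species", "Thunnus thynnus"));
    }
}
```

A ShowProperties-style request could reuse the same mechanism with a template whose placeholder is the URI of the resource instead of an entity name.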

Use Cases

Well suited Use Cases

The user wants to search documents using an Information Retrieval system, so X-Search is “parameterized” to use that system as its underlying search system. For example, suppose that he wants to search the fisheries domain (through the FIGIS search component) for publications about “Mediterranean Tuna”. Apart from getting the results, he also wants to exploit available semantic sources (e.g. the FLOD dataset) to annotate the responses at query time. The following five sub use cases apply:

  • UC1: Getting advanced search results using X-Search
  • UC2: Restricting (gradually) the search results
  • UC3: Mine (on-demand) all named entities of a hit
  • UC4: Exploit Linked Data sources to semantically annotate resulted entities
  • UC5: Enrich web browsing with semantic search facilities


UC1: Getting advanced search results using X-Search

The user exploits X-Search to search for “Mediterranean Tuna” in the context of FAO publications about fisheries and aquaculture. To this end, X-Search redirects his query to the FIGIS search component. Before the answers are shown to the user, some post-search activities are performed on them to enrich the answer. More precisely, real-time clustering and entity mining are performed over the top-K hits of the answer.


UC2: Restricting (gradually) the search results

After performing a search the user has available the search results, the clustering results and a set of mined entities (grouped in different categories). Instead of searching the hits of the answers he can select the entities which interest him. The user selects some of these entities and the answers are restricted to those having the selected entities. This way the user gets a more descriptive answer space.


UC3: Mine (on-demand) all named entities of a hit

After performing a search, the user finds a hit in the answers that interests him. He has the ability to mine all the entities of this document. This is useful because it gives the user a quick view of the content of a document (in terms of its entities), and it is especially helpful when documents are large.


UC4: Exploit Linked Data sources to semantically annotate resulted entities

The user can ask for more information about the resulting entities. Upon user request, a SPARQL query is built and sent to appropriate SPARQL endpoints of Linked Open Data. The user can then start browsing over the contents of these data. For marine species the MarineTLO-based warehouse can be exploited; several other datasets can be used for other entity categories, including GeoNames, Freebase, and WordNet.


UC5: Enrich web browsing with semantic search facilities

The user is browsing a web page that contains information from the fisheries domain that interests him. He wants to quickly identify the entities of this document. He just clicks on a bookmark in his browser and the page is shown with its entities highlighted. It is not a simple bookmark but rather a bookmarklet, which adds one-click functionality to a web page or browser.

References

  • X-Search description document and User's Manual. Found at iMarine workspace
  • P. Fafalios and Y. Tzitzikas, Exploratory Professional Search through Semantic Post-Analysis of Search Results, Professional Search in the Modern World, Lecture Notes in Computer Science, Vol. 8830, Springer, 2014 (pdf).
  • P. Fafalios and Y. Tzitzikas, Post-Analysis of Keyword-based Search Results using Entity Mining, Linked Data and Link Analysis at Query Time, IEEE 8th International Conference on Semantic Computing (ICSC'14), Newport Beach, California, USA, June 2014 (pdf | slides).
  • P. Fafalios and P. Papadakos, Theophrastus: On Demand and Real-Time Automatic Annotation and Exploration of (Web) Documents using Open Linked Data, Web Semantics: Science, Services and Agents on the World Wide Web, Elsevier (ISSN: 1570-8268), 2014 (pdf).
  • P. Fafalios, I. Kitsos, Y. Marketakis, C. Baldassarre, M. Salampasis and Y. Tzitzikas, Web Searching with Entity Mining at Query Time, In Proceedings of the 5th Information Retrieval Facility Conference (IRF'2012), Vienna, July 2012 (pdf | slides).
  • P. Fafalios, M. Salampasis and Y. Tzitzikas, Exploratory Patent Search with Faceted Search and Configurable Entity Mining, In Proceedings of the 1st International Workshop of Integrating IR technologies for Professional Search in conjunction with the 35th European Conference on Information Retrieval (ECIR'13), Moscow, Russia, March 2013 (pdf).
  • P. Fafalios and Y. Tzitzikas, X-ENS: Semantic Enrichment of Web Search Results at Real-Time (demo paper), In Proceedings of the 36th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'13), Dublin, Ireland, August 2013 (pdf).

Management (plans, tickets, etc)

Current Activities

A shortcut that automatically shows all the tickets related to the Semantic Data Analysis functional area can be found here. Below we provide a more up-to-date view of the current situation.

Log of Activities

  • Various xsearch-portlet activities for improving scalability and extended functionality
    • Dynamic fetching of xsearch configuration files (Ticket #783) - Closed - Nov 2012
    • Exploitation of IS in XSearch portlet (Ticket #780) - Closed - Feb 2013
    • Implementation of the new incremental algorithm for extended functionality (Ticket #1823) - Closed - Jun 2013
    • Exploitation of multiple xsearch-service instances (Ticket #1828) - Closed - Jun 2013
    • Retrieval of semantic information about the mined entities (Ticket #1813) - Closed - Jun 2013
    • XSearch Portlet Memory Consumption (Ticket #628) - Closed - Jul 2013
    • GUI improvements for offering a homogenized view within the iMarine portal (Ticket #1854) - Closed - Jul 2013
    • Remove gCore dependencies from XSearch-portlet (Ticket #2253) - Closed - Oct 2013
    • Resolve bugs with XSearch bookmarklet (Ticket #2252) - Closed - Oct 2013
    • Portlet improved user interaction (Ticket #2254) - Closed - Apr 2014
  • Various activities about xsearch and gCube search
    • Support search results snippets (Ticket #7) - Closed - Jun 2012
    • Provision of textual snippets from gCube search (Ticket #838) - Closed - Nov 2012
    • Configurability of TCPLocator (Ticket #627) - Closed - Feb 2013
    • Searching over multiple collections (Ticket #839) - Closed - Mar 2013
    • Enriching RDF files with the URIs of Named Entities (an XSearch Tagger like Agrotagger, i.e. an iMarine annotator) (Ticket #1187) - Closed - Jun 2013
    • Exploitation of RDF-properties in XSearch (Ticket #960) - Closed - Jan 2014