Social Networking Data Discovery

Overview

The purpose of this document is to show how the search facility over the D4Science Infrastructure Social Networking data, primarily stored in a Cassandra Cluster, has been realized. Apache Cassandra is a horizontally scalable, distributed NoSQL database used by many companies around the world (eBay, Netflix, Instagram and many more). It offers high availability by means of data sharding and replication.

The engine enabling full-text search is ElasticSearch. ElasticSearch is a highly scalable, distributed, open source search and analytics engine based on the Apache Lucene library. It runs over one or more cluster nodes and is reachable over http(s). Moreover, it allows documents to be organised into one or more indices/types according to their schema. This schema can be defined in JSON format, even though ElasticSearch also tries to detect it automatically. ElasticSearch is a near real time search platform, meaning that there is a low latency (normally one second) from the time a document is indexed until the time it becomes searchable.

The glue between Cassandra and ElasticSearch is a component based on the gCube SmartExecutor service. The SmartExecutor service allows custom "Tasks" to be executed in the form of plugin components and their execution status to be monitored. Each instance of the SmartExecutor service can run the "Tasks" related to the plugins available on that instance. The SmartExecutor plugin related to the Social Networking Data Discovery is named social-data-indexer plugin.

The main goal of the Social Networking Data search facility is to let users quickly search over this potentially huge amount of data, taking into account the data they are allowed to access. In fact, D4Science is a Research Infrastructure offering many Virtual Research Environments (VREs). A user is allowed to see only the data of the VREs in which she is present. To this end, a client Java library has been developed. It receives the query submitted by the user and returns the list of posts belonging to the user's VREs, if any, sorted according to their score.

The Social Networking Library

The gCube Social Networking Library is the bridge between gCube Applications and the social networking facilities. The library discovers the Cassandra Cluster in the infrastructure and offers many methods, such as post creation/deletion, comment creation/deletion, notifications and so on. All the information about the library can be retrieved here. As far as the search mechanism is concerned, the library is used to fetch data from the NoSQL Cassandra cluster and to build up enhanced posts. The concept of enhanced post is explained later in the document.

Key features

To understand the key features of the Social Networking Data Discovery facilities, we need to understand what data Cassandra stores and how we can help users quickly retrieve information.

These data are mainly:

  • Users' posts;
  • Users' comments;
  • Users' file attachment metadata (the payload of such attachments is stored in a different database).

Of course, other information needs to be saved to offer a large set of social facilities, such as post notifications, comment notifications, message exchange and so on, but these are not of interest here.

The full-text search mainly focuses on the data cited above. In principle, a single user's post can be composed of the following elements:

  • Post's text, that is the initial content of the post [mandatory];
  • Post's author, that is the full name of the user who made the post [mandatory];
  • zero or more comments to the post, i.e. zero or more comments' texts and zero or more comments' authors;
  • zero or more attachments (PDFs, images, CSVs and so on);
  • a VRE within the infrastructure in which the post/comments were published [mandatory].

Moreover, users can only access and see the data of the VREs to which they are registered.

A post with the related text, comments, authors and attachments will be called an enhanced post.

A user can:

  • retrieve posts by author (of both post/comments);
  • retrieve posts by content (of both post/comments);
  • retrieve posts by attachments' names.

These are the post fields that are currently discoverable, even though there is no "advanced" search allowing the user to specify which of these fields must be queried. This aspect is discussed later.

In the following "Use Cases" paragraph, we are going to discuss each scenario.

Use cases

As stated above, the search facility allows a post to be retrieved when the query matches at least one among its content, author, comments' content, comments' authors or attachments' names.

Our approach is to use a MultiMatch query: the query is matched against each field of the document, producing a partial score per field. The partial scores are then (mostly) summed up to get the final score of the document with respect to the query (so we are using a MultiMatch most_fields query). The most_fields type of multi-match query makes sense when we query different document fields analyzed in different ways, as in our case. How the document is analyzed is shown later.
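
To give an idea of what such a query looks like, the following sketch uses the ElasticSearch Java API to build a most_fields multi-match query. It is only an illustration: the field names follow the mapping reported later in this document, and the actual code used by the client library may differ.

import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
 
// Build a most_fields multi-match query: the user's text is matched against each
// searchable field and the per-field partial scores are combined into the final score.
MultiMatchQueryBuilder query = QueryBuilders
   .multiMatchQuery("social data indexing",
      "post.description", "post.author",
      "comments.description", "comments.author",
      "attachments.name")
   .type(MultiMatchQueryBuilder.Type.MOST_FIELDS);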

A list of possible use cases includes the following:

  1. Search all posts whose author is a specified user, e.g. "John Smith": in this case the full name "John Smith" is inserted into the search textbox;
  2. Search all posts having comments written by a specified user, e.g. "John Smith": in this case the full name "John Smith" is inserted into the search textbox;
  3. Suppose that a user remembers having seen a post with an attached file named "report.pdf": in this case she can insert "report" or "report.pdf" into the search textbox;
  4. Search for all the feeds with a .doc document attached: in this case ".doc" is inserted into the search textbox;
  5. Search for a specific topic, e.g. "social data indexing": in this case she inserts "social", "social data", "social indexing" or other possible combinations to look up the posts.

Design

There are several components involved in making the Social Networking Data search feature work (see the Architecture paragraph). We have already discussed Apache Cassandra, but more needs to be said about how ElasticSearch has been prepared and how the Social-Data-Indexer plugin works.

Specifically, we need to tell the ElasticSearch engine which fields of the document have to be indexed, so that the engine can create the inverted indices and all the related structures needed to speed up the query and retrieval phases. Moreover, we need to instruct it on how to analyse the documents at indexing time (i.e., when a document is inserted or updated) and at query time (so that the query is analyzed consistently with each field). This is an important aspect because it influences the accuracy of the results.

A simplified enhanced post can be represented by the following Java class

/**
 * Enhanced post class.
 */
public class EnhancedPost{
 
     private Post post;
     private List<Comment> comments;
     private List<Attachment> attachments;
 
....
}

Where a Post object looks like

/**
 * Post class.
 */
public class Post {
 
     private String key;
     private long time;
     private String description;
     private String author;
     private String authorEmail;
     private long commentsNo;
     private long likesNo;
     private String context;
     private String privacy;
....
}

A comment looks like

/**
 * Comment class.
 */
public class Comment {
 
     private String key;
     private long creationTime;
     private long lastEditTime;
     private String description;
     private String author;
     private String authorEmail;
....
}

And, finally, an Attachment object looks like

/**
 * Attachment class.
 */
public class Attachment {
 
     private String key;
     private String uri;
     private String name;
     private String description;
     private String thumbnailURL;
     private String mimeType;
....
}

Now, we are mainly interested in making the following fields searchable:

  1. Post's description;
  2. Post's author;
  3. Comment's description;
  4. Comment's author;
  5. Feed's context and
  6. Attachment's name.

To this end, we need to specify the schema of the documents we are going to push to ElasticSearch. We state which fields must be indexed (and how they must be analyzed), which ones will not be analyzed at all and thus will not be indexed/searchable, and which fields will be searchable but not analyzed. The schema is in JSON format and resembles the Java classes reported so far.

Mapping Schema

A simple "mapping" schema for our document is reported below. The "_timestamp" field records the time at which each document is inserted or updated in the index:

"mappings": {
  "enhanced_posts": {
    "_timestamp": {
      "enabled": true,
      "ignore_missing": false
    },
    "properties": {
      "attachments": {
        "properties": {
          "description": {
            "type": "string",
            "index": "no"
          },
          "key": {
            "type": "string",
            "index": "no"
          },
          "mimeType": {
            "type": "string",
            "index": "no"
          },
          "name": {
            "type": "string",
            "analyzer": "filename_index",
            "search_analyzer": "filename_search"
          },
          "thumbnailURL": {
            "type": "string",
            "index": "no"
          },
          "uri": {
            "type": "string",
            "index": "no"
          }
        }
      },
      "comments": {
        "properties": {
          "author": {
            "type": "string",
            "analyzer": "author_analyzer"
          },
          "key": {
            "type": "string",
            "index": "no"
          },
          "lastEditTime": {
            "type": "long",
            "index": "no"
          },
          "description": {
            "type": "string",
            "analyzer": "feed_comment_text_index_analyzer",
            "search_analyzer": "feed_comment_text_search_analyzer"
          },
          "thumbnailURL": {
            "type": "string",
            "index": "no"
          },
          "creationTime": {
            "type": "long",
            "index": "no"
          },
          "authorEmail": {
            "type": "string",
            "index": "no"
          }
        }
      },
      "post": {
        "properties": {
          "commentsNo": {
            "type": "long",
            "index": "no"
          },
          "description": {
            "type": "string",
            "analyzer": "feed_comment_text_index_analyzer",
            "search_analyzer": "feed_comment_text_search_analyzer"
          },
          "authorEmail": {
            "type": "string",
            "index": "no"
          },
          "author": {
            "type": "string",
            "analyzer": "author_analyzer"
          },
          "key": {
            "type": "string",
            "index": "no"
          },
          "likesNo": {
            "type": "long",
            "index": "no"
          },
          "privacy": {
            "type": "string",
            "index": "no"
          },
          "thumbnailURL": {
            "type": "string",
            "index": "no"
          },
          "time": {
            "type": "long",
            "index": "no"
          },
          "context": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

Please note, for example, that the "context" field within the post class will not be analyzed. This means that the search engine will not perform any preprocessing on this string, but all the structures needed to search it exactly will still be created. This is useful because this field is used to filter the retrieved results, as sketched below.
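
For instance, restricting a query to the VREs a user belongs to could be expressed as follows with the ElasticSearch Java API. This is a minimal sketch using the 2.x-style bool/filter syntax; the field path "post.context" follows the mapping above, and the exact query built by the client library may differ.

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
 
// The multi-match part computes the relevance score, while the terms filter on the
// not-analyzed "context" field simply restricts the results to the allowed VREs
// without affecting the score.
BoolQueryBuilder filteredQuery = QueryBuilders.boolQuery()
   .must(QueryBuilders.multiMatchQuery("john smith",
         "post.description", "post.author",
         "comments.description", "comments.author",
         "attachments.name"))
   .filter(QueryBuilders.termsQuery("post.context", "/gcube/devsec/devVRE"));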

On the other hand, there are fields that won't be analyzed nor indexed at all, and therefore won't be searchable, due to the declaration:

"field_name":{
   "type": "field_type",
   "index": "no"
}

As far as the remaining fields are concerned, we need to specify the analyzers that the engine will apply to preprocess them. In fact, in the mappings above we stated, for instance, that the author's name must be analyzed according to the "author_analyzer". So, how is an analyzer made? It is mainly composed of

  1. a Tokenizer used to split words (e.g., the Whitespace tokenizer);
  2. zero or more TokenFilters used to modify/remove/add tokens starting from the stream of tokens received by the Tokenizer (e.g. the Standard Token filter, the AsciiFolding Token filter, etc), and
  3. zero or more CharFilters that run before the Tokenizer to delete/transform characters (e.g. the html_strip char filter to remove HTML tags).

Of course, ElasticSearch offers many ready-made Tokenizers, TokenFilters, CharFilters and Analyzers. More information can be retrieved here. New analyzers can be created by composing these building blocks. Moreover, as stated above, different analyzers can be used for the same field at index time and at query time.

To choose the analyzer that best fits the field to which it will be applied, we need to know our data. For instance, the author's full name is supposed to be composed of

  1. Name
  2. Middlename
  3. Lastname

These are separated by whitespace and can contain characters such as "ä", "ö" etc. Since it is unlikely that a person looking for someone else's posts remembers that person's name exactly (even more so if the name contains many special characters), it is helpful to transform those characters into their ASCII equivalents, so that "ä" becomes "a", "ö" becomes "o" and so on. This operation must be performed both at query and at index time. Thus, an analyzer for this field can be declared this way:

"author_analyzer": {
   "tokenizer": "whitespace",
      "filter": [
         "asciifolding",
         "lowercase"
      ]
 }

The strings are also lowercased so that searching for "John Smith" or "john smith" makes no difference at all (both will be converted to "john smith" and tokenized to "john" and "smith").
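
The effect of such an analyzer can be verified with the ElasticSearch Analyze API. Below is a minimal sketch using the Java client, where the index name "social_index" is purely illustrative:

import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;
import org.elasticsearch.client.Client;
 
// Prints the tokens produced by the "author_analyzer" for a sample author name:
// "Jörg Müller" is expected to yield "jorg" and "muller".
void printAuthorTokens(Client client) {
   AnalyzeResponse response = client.admin().indices()
      .prepareAnalyze("social_index", "Jörg Müller") // "social_index" is an illustrative index name
      .setAnalyzer("author_analyzer")
      .get();
   for (AnalyzeResponse.AnalyzeToken token : response.getTokens())
      System.out.println(token.getTerm());
}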

A more complex analyzer has been defined for the post/comment description. In principle, this description could contain HTML tags (e.g. <p>This is a paragraph</p>), hashtags (e.g. #elasticsearch), non-ASCII characters, and many meaningless words (e.g. "this", "a", "an", "the", also known as 'stopwords'). To properly analyze this field we need to consider that

  1. hashtags must be retained (default analyzers remove the "#" character, but we want to preserve it);
  2. HTML tags are useless, thus must be removed;
  3. stopwords are meaningless, so it is better to remove them (they would also increase space usage and document retrieval time).

Furthermore, suppose there is a hashtag #elasticsearch and we are interested in retrieving documents that contain it but, unfortunately, we do not remember the entire hashtag. A way to help the user is to store in the index not only the word "#elasticsearch", but also "#elas", "#elastic" and so on. This is what the 'edge-n-gram' filter does: it needs two parameters, the minimum length of the tokens to be generated and the maximum one. For instance, if we use an edge-n-gram filter with values [2, 20], the word "elasticsearch" generates this stream of tokens: el, ela, elas, elast, elasti, elastic, elastics, elasticse, elasticsea, elasticsear, elasticsearc and elasticsearch.
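
The following plain-Java fragment simply mimics what such a filter does on a single token, to make the example concrete (it is only an illustration, not how ElasticSearch implements it):

// Emit the edge n-grams of a token for min_gram = 2 and max_gram = 20,
// i.e. all its prefixes whose length falls within that range.
String token = "elasticsearch";
int minGram = 2, maxGram = 20;
for (int len = minGram; len <= Math.min(maxGram, token.length()); len++)
   System.out.println(token.substring(0, len)); // el, ela, elas, ..., elasticsearch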

Please note that it is reasonable to apply the edge-n-gram filter at indexing time, but not at query time. Doing it at query time would increase the time needed to evaluate the query, as well as to sort the large number of documents that would match.

We defined the feed_comment_text_index_analyzer and the feed_comment_text_search_analyzer (referenced in the mapping above) this way

"description_comment_post_index_analyzer": {
  "type": "custom",
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "hashtag_filter",
    "edge_ngram",
    "asciifolding",
    "stopwords_eng_remover"
  ]
}
 
"description_comment_post_search_analyzer": {
  "type": "custom",
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "hashtag_filter",
    "asciifolding",
    "stopwords_eng_remover"
  ]
}

The hashtag_filter simply specifies that, when a word containing "#" is found, this symbol must be treated as a word delimiter and must not be deleted. Finally, to perform a full-text search over attachments' names, we used a pattern tokenizer that splits the original filename by means of a regular expression, so that most punctuation is removed. After that, an edge-n-gram token filter is applied, followed by the lowercase and asciifolding filters. A small illustration of the effect on a filename is sketched below.
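
As a purely illustrative sketch (the actual regular expression used by the pattern tokenizer is not reported here), splitting a filename on non-alphanumeric characters behaves roughly as follows:

// Hypothetical illustration: a pattern split on non-alphanumeric characters turns
// "Annual_Report-2016.pdf" into the tokens "Annual", "Report", "2016" and "pdf";
// the subsequent filters then lowercase them, expand them into edge n-grams and ASCII-fold them.
String filename = "Annual_Report-2016.pdf";
for (String word : filename.split("[^a-zA-Z0-9]+"))
   System.out.println(word.toLowerCase()); // annual, report, 2016, pdf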

The social-data-indexer-se-plugin is a SmartExecutor plugin that runs every T minutes in order to push the Cassandra data, prepared according to the document schema, into the ElasticSearch engine. Furthermore, it removes documents related to enhanced feeds no longer present in Cassandra and updates changed feeds. We decided to perform this synchronization task every T minutes, rather than pushing every single change to the Elastic cluster, for several reasons:

  1. it reduces the load of both clusters;
  2. it decouples ElasticSearch and Cassandra;
  3. it does not require the additional data structures that would otherwise be needed to support document deletion;
  4. users expect to be helped in retrieving old feeds rather than the most recent ones.

SmartExecutor SocialDataIndexer Plugin

The SmartExecutor takes care of the lifecycle of the plugin, and each plugin execution can be monitored. The simplified code that the plugin cyclically executes is reported here

public void SocialDataIndexerPlugin(){
 
   // Save the start time
   long startTime = currentTime();
 
   // Discover ElasticSearch and Cassandra Clusters in the Infrastructure
   discoverEndPoints();
 
   // Retrieve Contexts 
   List<VRE> vres = getAllVREs();
 
   // For each context, retrieve the posts and the related comments/attachments to build up enhanced feeds
   List<EnhancedFeed> enhancedFeeds = new ArrayList<EnhancedFeed>();
   for(VRE vre: vres){
      List<Post> posts = vre.getPosts();
      for(Post post: posts){
         EnhancedFeed enhancedFeed = buildEnhancedFeedFromPost(post);
         enhancedFeeds.add(enhancedFeed);
      }
   }
 
   // Use ElasticSearch bulk API
   addEnhancedFeeds(enhancedFeeds);
 
   // Delete not updated documents (using the _timestamp property)
   deleteDocumentsWithTimestampLowerThan(startTime);
}
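
The bulk insertion step mentioned above (addEnhancedFeeds) could be sketched with the ElasticSearch Bulk API as follows. The index/type names, the toJson helper and the getters are illustrative, and the real plugin code may differ.

import java.util.List;
 
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
 
// Push all the enhanced feeds to ElasticSearch with a single bulk request.
void addEnhancedFeeds(Client client, List<EnhancedFeed> enhancedFeeds) {
 
   BulkRequestBuilder bulkRequest = client.prepareBulk();
 
   for (EnhancedFeed enhancedFeed : enhancedFeeds) {
      String json = toJson(enhancedFeed); // serialize the bean according to the document schema (illustrative helper)
      bulkRequest.add(client.prepareIndex("social_index", "enhanced_posts", enhancedFeed.getPost().getKey())
                            .setSource(json));
   }
 
   BulkResponse response = bulkRequest.get();
   if (response.hasFailures())
      System.err.println("Some documents could not be indexed: " + response.buildFailureMessage());
}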

Architecture

Figure 1: Social Networking Data Discovery architecture

Figure 1 shows how the different components interact and the roles they have. The Cassandra Cluster stores all the raw information we need, and it is queried by the Social-Data-Indexer plugin every T minutes to fetch new documents. The Social-Data-Indexer plugin builds up the enhanced posts, pushes them to the ElasticSearch cluster by using the ElasticSearch Bulk APIs, and deletes the documents that refer to posts no longer present in Cassandra.

The ElasticSearch Client Library (ESCL) is deployed on the Infrastructure Gateway (Gateway in the Figure). A user interested in retrieving documents uses it transparently. When one or more documents matching the query are retrieved by ElasticSearch, each JSON document is mapped back to the EnhancedPost Java class, so that a list of those beans is returned. Finally, the "News Feed portlet" deployed at the gateways renders the posts properly. An example of the News Feed portlet displaying the matching posts is shown in Figure 2.

Figure 2: News Feed portlet showing retrieved posts matching a query

API

To use the ElasticSearch Client Library, you need to add the following dependency to the pom.xml file of your project

<dependency>
   <groupId>org.gcube.socialnetworking</groupId>
   <artifactId>social-data-search-client</artifactId>
   <version>[1.0.0-SNAPSHOT,2.0.0-SNAPSHOT)</version>
</dependency>

After that, you need to instantiate it, for instance in your servlet's init method

private ElasticSearchClientInterface escl;
 
@Override
public void init() {
   try {
      escl = new ElasticSearchClientImpl();
      logger.info("ElasticSearch client library instantiated");
   } catch (Exception e) {
      logger.error("Unable to instantiate the ElasticSearch client library", e);
   }
}

Up to now, the client library offers just two methods

public interface ElasticSearchClientInterface {
 
/**
* Given a query, the method finds matching enhanced feeds in the ElasticSearch index and returns
* at most <b>quantity</b> hits starting from <b>from</b>.
* @param query the query to match
* @param vreIDS specifies the vre(s) to which the returning feeds must belong
* @param from start hits index
* @param quantity max number of hits to return starting from <b>from</b>
* @return A list of matching enhanced feeds or null
*/
List<EnhancedFeed> searchInEnhancedFeeds(String query, Set<String> vreIDS, int from, int quantity);
 
/**
* Delete from the index a document with id docID.
* @param docID the id of the doc to delete
* @return true on success, false otherwise
*/
boolean deleteDocument(String docID);
 
}

Usage/Examples

Java

   // Retrieve at most ten feeds whose author is "Andrea Rossi" in the devVRE context
   Set<String> vres = new HashSet<String>();
   vres.add("/gcube/devsec/devVRE");
 
   String query = "Andrea Rossi";
 
   int from = 0, quantity = 10;
 
   List<EnhancedFeed> enhancedFeeds = escl.searchInEnhancedFeeds(query, vres, from, quantity);
 
   ...

Gateways

Try the search functionality here https://i-marine.d4science.org/group/imarine-gateway

Deployment

Currently the deployment configuration is as follows:

  1. A Cluster of 3 nodes (with full replication) for Cassandra with a consistency level of two;
  2. An ElasticSearch Cluster made of a single node, because there are fewer than 10,000 documents to be stored;
  3. One instance of the Social-Data-Indexer plugin that runs every 10 minutes, even though it takes less than half a minute to push new documents and delete old ones. This means that new feeds will become searchable within at most ten minutes, and posts no longer present in Cassandra may remain searchable for at most ten minutes.