Difference between revisions of "Social Networking Data Discovery"

From Gcube Wiki

Revision as of 16:13, 27 July 2016

Overview

The purpose of this document is to show how the search facility over the D4Science Infrastructure Social Networking data, primarily stored in a Cassandra cluster, has been realized. Cassandra is a highly scalable, distributed database used by many companies around the world (eBay, Netflix, Instagram and many more). It offers high availability by means of data sharding and replication.

The engine that enables the full-text search is ElasticSearch. ElasticSearch is a highly scalable, distributed, open source full-text search and analytics engine based on the Apache Lucene software library. It runs on one or more nodes and is reachable over HTTP. It organizes documents into one or more indexes according to their schema. This schema can be defined in JSON format, even though ElasticSearch tries to detect it automatically.

The glue between Cassandra and ElasticSearch is a SmartExecutor plugin, namely the social-data-indexer plugin. The following sections describe the role it plays.

The main goal of the search facility is to let the users quickly search over this potentially huge amount of data, taking into account the data they are allowed to access. In fact, D4Science is a Research Infrastructure that offers many Virtual Research Environments (VREs). A user is allowed to see only the data of the VREs in which she is present.

The Social Networking Library

The gCube Social Networking Library is the bridge between gCube Applications and the social networking facilities. The library discovers the Cassandra cluster in the infrastructure and offers many methods, such as post creation/deletion, comment creation/deletion, notification generation and so on. All the information about the library can be retrieved here. As far as the search mechanism is concerned, the library is used to fetch data from the NoSQL Cassandra cluster and to build enhanced feeds. The concept of an enhanced feed is introduced later.

Key features

To identify the key features of the Social Networking Data Discovery facilities, we first need to understand what data Cassandra stores and how we can help users retrieve information quickly.

These data are mainly:

  • Users' posts;
  • Users' comments, and
  • Users' attachment metadata (the payload of such attachments is stored in a different database).

Of course, a lot of other information needs to be stored to offer a large set of social facilities, such as post notifications, comment notifications, message exchange and so on.

The full-text search mainly focuses on the data listed above. In principle, a single user's post can be composed of the following elements:

  • Post's text, that is, the initial content of the post [mandatory];
  • Post's author, that is, the full name of the user who made the post [mandatory];
  • zero or more comments to the post, so zero or more comments' texts and zero or more comments' authors;
  • zero or more attachments (pdfs, images, cvs and so on);
  • a VRE within the infrastructure in which the post/comments were published [mandatory].

Users can only access and see the data of the VREs in which they are registered.

A post together with its text, comments, authors and attachments is an enhanced feed.
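As a minimal sketch of how such a post and its related elements could be grouped into one object, the following Python code models the structure described above. The class and field names are illustrative assumptions and are not the actual types of the Social Networking Library.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Comment:
    # Hypothetical representation of a single comment on a post.
    text: str
    author: str


@dataclass
class EnhancedFeed:
    # Hypothetical field names. The mandatory elements have no default value.
    post_text: str
    post_author: str
    vre: str
    # Zero or more comments and zero or more attachment names.
    comments: List[Comment] = field(default_factory=list)
    attachment_names: List[str] = field(default_factory=list)

    def to_document(self) -> dict:
        """Flatten the object into a plain dict that can be serialized as JSON."""
        return asdict(self)


feed = EnhancedFeed(
    post_text="Results of the new experiment are available",
    post_author="Jane Doe",
    vre="ExampleVRE",
    comments=[Comment(text="Great work", author="John Smith")],
    attachment_names=["results.pdf"],
)
document = feed.to_document()
```

Keeping the post, its comments and its attachment names in a single document means that one search hit can be matched through any of these elements.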

Up to now, a user can:

  • retrieve a feed by author;
  • retrieve a feed by content (of both post/comments);
  • retrieve a feed by attachments' names.
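The three kinds of retrieval listed above, combined with the restriction to the user's VREs, could be expressed as a single ElasticSearch query body. The sketch below only builds the JSON structure as a Python dict. The function name and the field names (`post_author`, `post_text`, `comments.author`, `comments.text`, `attachment_names`, `vre`) are assumptions made for illustration, and the `vre` field is assumed to be indexed as an exact, non-analyzed value.

```python
from typing import List


def build_search_query(text: str, user_vres: List[str]) -> dict:
    """Build a query body that searches author, content and attachment
    names, and keeps only documents published in one of the user's VREs."""
    return {
        "query": {
            "bool": {
                # Full-text match over the searchable fields (hypothetical names).
                "must": {
                    "multi_match": {
                        "query": text,
                        "fields": [
                            "post_author",
                            "post_text",
                            "comments.author",
                            "comments.text",
                            "attachment_names",
                        ],
                    }
                },
                # Access restriction: the document's VRE must be one of the user's.
                "filter": {"terms": {"vre": user_vres}},
            }
        }
    }


query = build_search_query("experiment", ["ExampleVRE", "AnotherVRE"])
```

Placing the VRE restriction in the `filter` clause keeps it out of relevance scoring, so it only decides whether a document is visible to the user.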

The following "Use cases" section discusses these scenarios.

Use cases

TODO

Design

TODO

Architecture

The engine underneath the full-text search is an ElasticSearch cluster. ElasticSearch (ES) is a highly scalable, distributed, open source full-text search and analytics engine based on the Apache Lucene software library. ES runs on one or more nodes and is reachable over HTTP. It organizes documents into one or more indexes according to their schema. This schema is defined in JSON format.
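To give an idea of what a JSON schema for the indexed documents might look like, the sketch below builds a possible set of field definitions and serializes it to JSON. The field names are hypothetical, and the `text` and `keyword` type names are those of recent ElasticSearch releases (older releases used `string`); the actual schema used by the index is not shown in this document.

```python
import json

# Hypothetical field definitions for a post with its comments and attachments.
# "text" fields are analyzed for full-text search; "keyword" fields are
# stored as exact values, which suits filtering by VRE.
enhanced_feed_mapping = {
    "properties": {
        "post_text": {"type": "text"},
        "post_author": {"type": "text"},
        "vre": {"type": "keyword"},
        "comments": {
            "properties": {
                "text": {"type": "text"},
                "author": {"type": "text"},
            }
        },
        "attachment_names": {"type": "text"},
    }
}

# JSON form of the schema, as it would be sent to the cluster over HTTP.
mapping_json = json.dumps(enhanced_feed_mapping, indent=2)
```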

A SmartExecutor plugin, namely the SocialDataIndexer plugin, has the role of fetching documents from the Cassandra nodes, organizing them, and putting them into the ElasticSearch cluster. Instead of pushing each new piece of information to the index as soon as it is published to Cassandra, we decided to update the index by means of the plugin, whose execution is scheduled pre

Philosophy

API

Usage/Examples

TODO

Deployment

TODO