Social Networking Data Discovery

From Gcube Wiki

Overview

This document describes how the search facility over the D4Science Infrastructure social networking data, which is primarily stored in a Cassandra cluster, has been implemented.

These data are mainly:

  • Users' posts;
  • Users' comments; and
  • Users' attachments (the payload of such attachments is stored in the Workspace).

A single user's post can, in principle, be composed of the following elements:

  • Post's text, that is the initial content of the post;
  • Post's author, that is who made the post;
  • zero or more comments to the post, so zero or more comments' texts and zero or more comments' authors;
  • zero or more attachments (PDFs, images, CSV files and so on);
  • a scope within the infrastructure in which the post/comments were published.

The goal of the search facility is to let the users quickly search over this potentially huge amount of data.

The Social Networking Library

The gCube Social Networking Library is the bridge between gCube Applications and the social networking facilities. All information about the library can be retrieved here. As far as the search mechanism is concerned, the library is used to fetch data from the NoSQL Cassandra cluster and to build enhanced feeds.

An enhanced feed is a post with its comments, the authors and the other optional information reported above.
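
The structure of an enhanced feed can be sketched as a small data model. The following Python sketch is illustrative only: the class names, field names and the build_enhanced_feed helper are hypothetical and are not the Social Networking Library's API; the rows are assumed to be plain dictionaries already fetched from Cassandra.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Comment:
    author: str
    text: str


@dataclass
class Attachment:
    name: str        # file name only; the payload itself lives in the Workspace
    mime_type: str


@dataclass
class EnhancedFeed:
    post_id: str
    author: str
    text: str
    scope: str       # scope in which the post/comments were published
    comments: List[Comment] = field(default_factory=list)
    attachments: List[Attachment] = field(default_factory=list)


def build_enhanced_feed(post: Dict, comments: List[Dict],
                        attachments: List[Dict]) -> EnhancedFeed:
    """Join one post row with the comment and attachment rows that reference it.

    All dictionary keys used here are assumptions made for the example.
    """
    pid = post["id"]
    return EnhancedFeed(
        post_id=pid,
        author=post["author"],
        text=post["text"],
        scope=post["scope"],
        comments=[Comment(c["author"], c["text"])
                  for c in comments if c["post_id"] == pid],
        attachments=[Attachment(a["name"], a["mime_type"])
                     for a in attachments if a["post_id"] == pid],
    )
```

A post with no comments and no attachments simply yields an enhanced feed with two empty lists, which matches the "zero or more" wording above.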

Key features

TODO

Use cases

TODO

Design

TODO

Architecture

The engine underneath the full text search is an ElasticSearch Cluster. ES is a highly scalable, distributed, open source full text search and analytics engine based on the Apache Lucene software library. ES runs on one or more nodes and is reachable over HTTP. It organizes documents into one or more indexes according to their schema, which is defined in JSON format.
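
To make the idea of a JSON-defined schema reached over HTTP concrete, here is a minimal Python sketch. It is not the schema actually used by the infrastructure: the index name, host, and field names are assumptions, and the mapping layout follows recent Elasticsearch releases (version 7 and later), whereas older releases nest the properties under a type name. The request is only built, not sent.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: a locally reachable node
INDEX = "social-feeds"             # hypothetical index name

# Hypothetical schema for one enhanced feed document.
MAPPING = {
    "mappings": {
        "properties": {
            "author": {"type": "keyword"},   # exact-match field
            "text": {"type": "text"},        # analyzed for full text search
            "scope": {"type": "keyword"},
            "comments": {
                "type": "nested",
                "properties": {
                    "author": {"type": "keyword"},
                    "text": {"type": "text"},
                },
            },
            "attachment_names": {"type": "text"},
        }
    }
}


def create_index_request(base_url, index, mapping):
    """Build (but do not send) the HTTP PUT request that creates the index."""
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(mapping).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_body(query_text, scope):
    """Query body: full text match on the post text, filtered by scope."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"text": query_text}},
                "filter": {"term": {"scope": scope}},
            }
        }
    }
```

Sending the request would be a matter of passing it to urllib.request.urlopen against a running cluster. The query sketch matches only the post text; matching comment texts in a nested field would need a nested query as well.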

A SmartExecutor plugin, the SocialDataIndexer plugin, fetches documents from the Cassandra nodes, organizes them, and puts them into the ElasticSearch cluster. Instead of pushing each new item to the index as soon as it is published to Cassandra, we decided to update the index by means of the plugin whose execution is scheduled pre
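
The pull-based approach (a scheduled run that fetches, reshapes and indexes) can be sketched as follows. This is a minimal Python illustration and not the SocialDataIndexer code: run_indexing_pass and its two callables are hypothetical stand-ins for the Cassandra read and the ElasticSearch write. The sketch reindexes every feed on each run, which is only one simple option; the strategy the plugin actually uses is not stated here.

```python
from typing import Callable, Dict, Iterable


def run_indexing_pass(
    fetch_feeds: Callable[[], Iterable[Dict]],
    put_document: Callable[[str, Dict], None],
) -> int:
    """One scheduled run: fetch every feed, reshape it, and write it by id.

    fetch_feeds stands in for reading enhanced feeds from Cassandra;
    put_document stands in for writing one document to the search index.
    Returns the number of documents written.
    """
    count = 0
    for feed in fetch_feeds():
        doc = {
            "author": feed["author"],
            "text": feed["text"],
            "scope": feed["scope"],
            "comments": feed.get("comments", []),
        }
        # Writing under the same id overwrites the previous version,
        # so repeating a run does not create duplicates.
        put_document(feed["id"], doc)
        count += 1
    return count
```

With an in-memory dictionary as the index, a run looks like this:

```python
index = {}
feeds = [{"id": "p1", "author": "alice", "text": "hello", "scope": "/vre1"}]
run_indexing_pass(lambda: feeds, index.__setitem__)
```

Because the write is keyed by post id, a post that gains a comment between two runs is simply replaced by its newer version on the next run.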

Philosophy

API

Usage/Examples

TODO

Deployment

TODO