Result Set components

Overview

Result Set provides a common data transfer mechanism that aims to establish high-throughput, on-demand, point-to-point communication.

This document outlines the design rationale, key features and high-level architecture of the Result Set components, the options for their deployment, as well as some use cases.

Key features

Point-to-point transfer
one writer, one reader as the core functionality
No disk writes
records are stored in a memory buffer that behaves as a blocking queue
Produce only what is requested
a producer-consumer model that blocks when needed and avoids unnecessary data transfers
Configurable lifetime policies based on activity
a result set is persisted and reused only when specifically requested
No Web Service invocations
the Result Set is designed as core functionality of the overall system
Bidirectional communication with writer/reader “events”
Records are field containers
each field carries payload as well as transport and access details
Definitions describe Records and Fields
e.g. transport directive, MIME type, compression, chunk size
Intuitive stream and iterator based interface
simplified usage with reasonable default behavior for common use cases and a variety of features for increased usability and flexibility
Support for multiple protocols
data transfer currently supports the TCP and HTTP protocols
HTTP Broker Servlet
results can be exposed as an HTTP endpoint

Design

Philosophy

Result Set has been designed and implemented to provide reliable, high-throughput point-to-point communication and bulk movement of large data sets between distributed components, and to overcome problems that arise in systems with a Web Service based architecture, such as gCube.

Architecture

Records are field containers. Each field carries the payload as well as transport and access details. Definitions describe Records and Fields, specifying properties such as the transport directive, MIME type, compression and chunk size. Fields come in a variety of types, such as String, File, Object (interface) or URL. For URL fields, a general-purpose URL resolution library has been developed that supports a variety of protocols such as http, ftp, sftp, ftps, gridftp, bittorrent, etc. In order to orchestrate bidirectional communication between the reader and the writer, Event objects can be used. Events are key-value pairs that provide a simple yet powerful data structure for this communication.
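
To make this structure concrete, the following is a minimal, illustrative sketch of records, field definitions and events. All class and field names are assumptions made for the example and do not reproduce the actual library API.

  import java.util.LinkedHashMap;
  import java.util.Map;

  // A field definition captures transport and access details for one field (illustrative only).
  class FieldDefinition {
    final String name;        // field name, e.g. "payload"
    final String mimeType;    // e.g. "text/plain"
    final boolean compressed; // whether the payload is compressed in transit
    final int chunkSize;      // chunk size used for partial transfers
    FieldDefinition(String name, String mimeType, boolean compressed, int chunkSize) {
      this.name = name;
      this.mimeType = mimeType;
      this.compressed = compressed;
      this.chunkSize = chunkSize;
    }
  }

  // A record is a container of named fields carrying the actual payload.
  class Record {
    private final Map<String, Object> fields = new LinkedHashMap<>();
    void setField(String name, Object value) { fields.put(name, value); }
    Object getField(String name) { return fields.get(name); }
  }

  // Events are plain key-value pairs exchanged between reader and writer.
  class Event {
    final String key;
    final String value;
    Event(String key, String value) { this.key = key; this.value = value; }
  }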

The main entities of the architecture are:

Writer
The entity which authors records to an underlying buffer so that they can be consumed by a collocated or remote reader (a conceptual sketch of this producer-consumer behaviour follows the entity list below). Writers expose a variety of properties that can be configured:
  • Buffer capacity can be configured in order to tune the memory footprint and total performance rate.
  • Number of concurrent partial transfers. The number of records a reader can concurrently access.
  • Mirroring factor. Factor limiting the size of the chunks in which the buffer is mirrored.
  • Inactivity time can be configured in order to allow control over the lifetime of allocated resources.
Reader
The entity which consumes records authored by a writer. Readers come in two major types:
  • Forward Reader
    • Performance-oriented version offering limited seek functionality. This version supports basic, fast point-to-point data consumption. Disk persistence is avoided for optimal performance.
  • Random Reader
    • Reader offering full random access functionality through disk-persisted records. It targets consumers which require support for multi-pass processing or seeking to arbitrary positions in the result set.
In addition, value-adding features can be layered on top of existing Readers. One such example is the Keep-Alive functionality, implemented in order to support time-delayed or triggered processing.
Proxy
The entity that is responsible for handling the connection of each endpoint with its counterpart over a given protocol. Currently, three types of proxies have been implemented, supporting the Local, TCP and HTTP transfer protocols. Store functionality is also supported for each protocol, so that records can be read again after the first read. Each writer needs to specify the type of proxy in order to create the appropriate locator. The reader parses this locator and automatically decides which type of proxy should be used.
Mirror
The entity that is responsible for the actual data transfer between the two endpoints for every supported protocol. Each proxy initiates the respective mirror in order to communicate with the entity on the other side. Mirrors are transparent to readers and writers.
Buffer
The entity that is responsible for keeping the records in main memory. In the case of local transfer, the same address space is used for increased performance. In the case of remote transfers, on the reader's side the mirror fills the buffer with the retrieved records and the reader consumes them; on the writer's side the produced records are placed in the buffer and the mirror handles their remote transfer. Both the reader and the writer support a configurable buffer size.
Connection Handler
The entity that is responsible for setting up a connection between a reader and the requested writer. A writer registers itself with the connection handler so that readers can locate the writers they request.
HTTP Broker
This component exposes a result set as an HTTP endpoint.
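
As referenced in the Writer entry above, the following is a conceptual, runnable sketch of the producer-consumer behaviour that the Writer and Buffer entities implement: the buffer capacity bounds the memory footprint, the writer blocks when the buffer is full, and inactivity can be bounded with a timeout. This is plain Java illustrating the idea, not the gRS2 implementation.

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.TimeUnit;

  class BlockingBufferDemo {
    public static void main(String[] args) throws InterruptedException {
      // Buffer capacity bounds the memory footprint: the writer blocks when it is full,
      // so records are produced only as fast as the reader consumes them.
      BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10);

      Thread writer = new Thread(() -> {
        try {
          for (int i = 0; i < 100; i++) buffer.put("record #" + i); // blocks when the buffer is full
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
      });

      Thread reader = new Thread(() -> {
        try {
          for (int i = 0; i < 100; i++) {
            // A timed poll mimics the configurable inactivity timeout on the consuming side.
            String record = buffer.poll(60, TimeUnit.SECONDS);
            if (record == null) break; // writer considered inactive
          }
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
      });

      writer.start(); reader.start();
      writer.join(); reader.join();
    }
  }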


The following diagram shows the stages of a full communication between a reader and a writer:

gRS2 communication schema

Note: the locator produced on the writer side is not actually sent to the reader over the network; instead, the client side of the application obtains it and feeds it to the reader.
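
To illustrate the role of the locator, the sketch below shows how a reader could inspect a locator to select the matching proxy type. The locator format shown (a URI whose scheme names the protocol) and all identifiers are assumptions made for the example, not the library's actual locator syntax.

  import java.net.URI;

  class LocatorInspection {
    enum ProxyType { LOCAL, TCP, HTTP }

    // Pick a proxy type based on the locator's scheme (illustrative mapping).
    static ProxyType proxyFor(URI locator) {
      switch (locator.getScheme()) {
        case "local": return ProxyType.LOCAL; // same address space, no mirroring needed
        case "tcp":   return ProxyType.TCP;   // TCP mirror transfers buffer chunks
        case "http":  return ProxyType.HTTP;  // HTTP mirror / broker endpoint
        default: throw new IllegalArgumentException("Unsupported locator: " + locator);
      }
    }

    public static void main(String[] args) {
      // Hypothetical locator handed from the writer side to the client.
      URI locator = URI.create("tcp://node1.example.org:3000/some-result-set-id");
      System.out.println(proxyFor(locator)); // prints TCP
    }
  }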

Deployment

The Result Set is a set of facilities available as Java libraries; therefore, there is no dedicated deployment scheme for them. They are simply co-deployed, individually and independently, with the components that need to use a particular feature that one of them offers.

Only the HTTP Broker Servlet needs to be deployed as a web application on a machine hosting an application server such as Apache Tomcat, Jetty, JBoss, etc.
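
As a rough sketch of how a client could consume a result set exposed by the HTTP Broker Servlet, the snippet below performs a plain HTTP GET. The endpoint URL, its query parameter and the line-oriented response format are assumptions made for the example, not the servlet's documented interface.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.nio.charset.StandardCharsets;

  class HttpBrokerClient {
    public static void main(String[] args) throws Exception {
      // Hypothetical broker endpoint exposing a result set over plain HTTP.
      URL endpoint = new URL("http://broker.example.org:8080/resultset-broker?rsLocator=some-locator");
      HttpURLConnection connection = (HttpURLConnection) endpoint.openConnection();
      connection.setRequestMethod("GET");

      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // each line assumed to carry one record's payload
        }
      } finally {
        connection.disconnect();
      }
    }
  }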

Use Cases

Well suited use cases

  • The Result Set can offer high throughput in point-to-point data transfer for records with fields containing embedded payload of reasonable size, such as the results of any service publishing stream-based results. Depending on the use case in question, clients will select appropriate sizes for the buffers which will host their records, in accordance with the target memory footprint.
  • Pipelined processing: Result Sets located on the same node or distributed over a number of nodes can be combined to form a high-throughput pipeline used for data processing (a conceptual pipeline stage is sketched after this list).
  • The Result Set can be used by components which are part of the presentation layer. The random access requirements of such components are satisfied by the appropriate Readers.
  • Transfer of any type of payload: Records can easily be customized to carry any datatype the user prefers if the datatypes supported by the offered Field types are not suited to their needs.
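
The pipelined-processing case above can be pictured as a chain of stages, each reading from an upstream buffer and writing to a downstream one. The following is a conceptual sketch in plain Java; it illustrates the shape of such a stage and is not part of the Result Set API.

  import java.util.concurrent.BlockingQueue;
  import java.util.function.Function;

  class PipelineStage<I, O> implements Runnable {
    private final BlockingQueue<I> input;   // records arriving from the upstream writer
    private final BlockingQueue<O> output;  // records offered to the downstream reader
    private final Function<I, O> transform; // per-record processing step

    PipelineStage(BlockingQueue<I> input, BlockingQueue<O> output, Function<I, O> transform) {
      this.input = input;
      this.output = output;
      this.transform = transform;
    }

    @Override
    public void run() {
      try {
        while (true) {
          I record = input.take();             // blocks until the upstream produces
          output.put(transform.apply(record)); // blocks until the downstream consumes
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }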

Less well suited Use Cases

In the case of records containing overly large payloads, the user should avoid embedding the payload into fields designed to keep the entire object in memory. If this suggestion is not followed, the Result Set may not perform optimally, as a result of the large incurred memory footprint and network overhead. In such cases the user should opt for fields supporting chunked transfer, such as the FileField, or, if store-and-forward needs to be avoided entirely, for a scheme in which data are published and passed by reference via URLFields. In the latter case, the full functionality of the URL Resolution Library can be exploited. The core functionality offered by the library is, as mentioned, targeted at point-to-point communication. Use cases employing a publish-subscribe scheme, namely the consumption of the same result set by different consumers, should use a utility specially designed for this purpose, the ResultSet Store, keeping in mind that the basic features for which the Result Set was designed - high performance and on-demand production - will not be offered to their full extent.
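
As a hedged illustration of the payload-placement choice discussed above, the sketch below embeds small payloads in memory, falls back to a file-backed field for large ones, and passes already-published data by reference. The field names and the size threshold are assumptions made for the example, not library defaults.

  import java.io.File;
  import java.net.URL;
  import java.nio.file.Files;
  import java.util.LinkedHashMap;
  import java.util.Map;

  class PayloadPlacement {
    private static final long EMBED_LIMIT_BYTES = 1_000_000; // illustrative threshold

    static Map<String, Object> fieldsFor(File payload, URL publishedLocation) throws Exception {
      Map<String, Object> fields = new LinkedHashMap<>();
      if (payload.length() <= EMBED_LIMIT_BYTES) {
        // Small payloads can be embedded and kept entirely in memory.
        fields.put("payload", Files.readAllBytes(payload.toPath()));
      } else if (publishedLocation == null) {
        // Large payloads: prefer a chunked, file-backed field (a FileField-like type).
        fields.put("payloadFile", payload);
      } else {
        // Or avoid store-and-forward entirely and pass the data by reference (a URL field).
        fields.put("payloadUrl", publishedLocation);
      }
      return fields;
    }
  }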