Semantic Data Analysis
From Gcube Wiki
Revision as of 09:24, 26 April 2012 by Yannis.marketakis (Talk | contribs) (FORTH - Semantic Data Analysis (1st draft))
Overview
This task aims to deliver a set of libraries and services that bridge the gap between communities and link distributed data across community boundaries. The introduction of the Semantic Web and the publication of expressive metadata in a shared knowledge framework enable the deployment of services that can intelligently use Web resources.
Emphasis will be placed on two main objectives:
- One or more knowledge repositories, based on the Linked Data principles, that can accommodate the existing semantic information provided by the various partners (FAO, IRD). This includes efforts to reach a common top-level ontology. FLOD will be used as a starting point, and actions for linking and extending it will be undertaken. Such semantic repositories are important assets that can be exploited now and in the future for various purposes.
- Provision of innovative services that exploit the semantic infrastructure described in the first objective. Emphasis will be placed on exploratory search services and on bridging the gap between the responses of non-semantic search systems (e.g. Web search engines, other vertical search systems) and semantic information. We will design and develop a generic meta-search service, for the moment called X-Search, that can read a description of an underlying search source, e.g. the description of the gCube Search Service or of the search services provided by the partners (e.g. FIGIS). To keep the service generic, the description can be an OpenSearch description document. X-Search will be able to query that source and to analyze the returned results in various ways, also exploiting the available semantic repositories. Specifically, X-Search will provide advanced services for satisfying recall-oriented information needs and for semantically enriching the results. These services include results clustering, entity recognition, semantic enrichment, etc. It will be possible to apply these services over the entire answer returned by the underlying system or only over the top-K hits, and to analyze either only the textual snippets or the full contents. The semantic enrichment will be based on the SPARQL endpoints of the repositories described in the first objective, plus other SPARQL endpoints.
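The way a meta-search service might read an OpenSearch description and turn it into a concrete query can be sketched as follows. This is a minimal illustration, not the X-Search implementation: the sample description document, the host `search.example.org`, and the helper names `read_url_template` and `build_query_url` are all assumptions made for the example. Only the OpenSearch 1.1 namespace and the `{searchTerms}` and `{count?}` template parameters come from the OpenSearch specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Namespace of OpenSearch 1.1 description documents.
OS_NS = "{http://a9.com/-/spec/opensearch/1.1/}"

# Hypothetical description of an underlying search source (made up for this sketch).
SAMPLE_DESCRIPTION = """
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example source</ShortName>
  <Url type="application/rss+xml"
       template="http://search.example.org/q?query={searchTerms}&amp;n={count?}"/>
</OpenSearchDescription>
"""


def read_url_template(description_xml, mime_type="application/rss+xml"):
    """Return the URL template that the description offers for the given MIME type."""
    root = ET.fromstring(description_xml.strip())
    for url in root.iter(OS_NS + "Url"):
        if url.get("type") == mime_type:
            return url.get("template")
    raise ValueError("no Url element with type " + mime_type)


def build_query_url(template, terms, count=None):
    """Fill in the search terms and, optionally, the number of hits (top-K)."""
    url = template.replace("{searchTerms}", quote(terms))
    # {count?} is an optional parameter; it is left empty when no K is given.
    return url.replace("{count?}", "" if count is None else str(count))


template = read_url_template(SAMPLE_DESCRIPTION)
query_url = build_query_url(template, "tuna fisheries", count=10)
```

Here `count` plays the role of the top-K restriction: the mediator asks the source for only as many hits as the analysis services will process.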
Key Features
- Provision of results clustering over any search system
  - Applies to any system that returns textual snippets and for which an OpenSearch description exists
- Provision of snippet-based or contents-based entity recognition
  - Generic as well as vertical: based on predetermined entity categories and lists, which can be obtained by querying SPARQL endpoints
- Provision of gradual faceted (session-based) search
  - Allows the answer to be gradually restricted based on the selected entities and/or clusters
- Ability to fetch and display semantic information about an identified entity
  - Achieved by querying the appropriate SPARQL endpoints
- Ability to apply these services to any web page through a web browser
  - Using the functionality of bookmarklets
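Entity recognition over snippets and the gradual restriction of an answer can be illustrated with a small sketch. It assumes the simplest possible approach, case-insensitive substring matching against predetermined lists, which is not necessarily how the Text Entity Miner works. The entity lists, the sample hits, and the function names `mine_entities` and `restrict` are hypothetical.

```python
from collections import Counter

# Hypothetical entity categories and lists; in practice such lists could be
# obtained by querying SPARQL endpoints.
ENTITY_LISTS = {
    "species": ["yellowfin tuna", "swordfish"],
    "country": ["greece", "japan"],
}


def mine_entities(snippets, entity_lists=ENTITY_LISTS):
    """Count, per category, how many snippets mention each listed entity."""
    found = {category: Counter() for category in entity_lists}
    for snippet in snippets:
        text = snippet.lower()
        for category, names in entity_lists.items():
            for name in names:
                if name in text:  # naive substring match, assumed for brevity
                    found[category][name] += 1
    return found


def restrict(snippets, selected_entities):
    """Keep only the snippets that mention every selected entity."""
    return [s for s in snippets
            if all(entity in s.lower() for entity in selected_entities)]


hits = [
    "Yellowfin tuna catches reported by Japan",
    "Swordfish landings in Greece",
    "Japan revises its swordfish quota",
]
entities = mine_entities(hits)                      # facets offered to the user
narrowed = restrict(hits, ["japan"])                # first selection in the session
narrowed_again = restrict(narrowed, ["swordfish"])  # second selection
```

Each call to `restrict` works on the previous answer, which mirrors the session-based narrowing described in the list above. The counts in `entities` could be used to rank the facets shown to the user.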
Subsystems
- Text Clustering component
  - Text Clusterer
  - Cluster Ranker
- Text Entity Mining component
  - Text Entity Miner
  - Entity Ranker
- Search Engine Mediator
  - Search System Parser
  - Hit Resolver
- Caching component
- Linked Open Data Query component
- Bookmarklet component for Dynamic Semantic Annotation
As regards the semantic repository, we assume repositories that provide a SPARQL endpoint. One such repository is FLOD, which contains several datasets from the fisheries domain (the FLOD datasets and top-level ontologies are discussed in detail elsewhere).
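Because the only assumption about a repository is that it exposes a SPARQL endpoint, fetching information about an identified entity comes down to sending a query over HTTP. The sketch below builds such a request without sending it. The endpoint address `http://example.org/sparql`, the lookup by `rdfs:label`, and the function names are assumptions for illustration; they do not describe the actual FLOD endpoint or its vocabulary.

```python
from urllib.parse import urlencode


def describe_entity_query(label, limit=20):
    """Build a SPARQL query listing the properties of entities with this label."""
    # Escape backslashes and double quotes so the label is a valid string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entity ?property ?value WHERE {\n"
        f'  ?entity rdfs:label "{escaped}" .\n'
        "  ?entity ?property ?value .\n"
        f"}} LIMIT {int(limit)}"
    )


def sparql_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical endpoint; a real deployment would use the repository's own URL.
query = describe_entity_query("Thunnus albacares", limit=5)
url = sparql_request_url("http://example.org/sparql", query)
```

The plain literal in the query matches only labels without a language tag, so a real service would likely need a more tolerant pattern. The sketch is meant only to show how little is required from a repository beyond the endpoint itself.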