GCat Background

From Gcube Wiki
Revision as of 19:33, 1 July 2016 by Leonardo.candela (Talk | contribs) (SoBigData.eu: Dataset Metadata)

** THIS DOCUMENT IS A DRAFT **

The gCube Data Catalogue uses CKAN.

CKAN is an open-source data management system (DMS) for powering data hubs and data portals. It makes data accessible by providing tools to streamline publishing, sharing, finding and using data. See: http://ckan.org/

gCube Data Catalogue Metadata

A metadata record in the gCube Data Catalogue is made of two parts: CKAN's default metadata fields and the gCube Metadata Profile.


CKAN's default metadata fields

These metadata fields are common to all metadata types in the gCube Data Catalogue (and are used by default in the CKAN platform).

Label | Field Name (API) | Definition | Guidelines | Example
Title* | title | Name given to the dataset. | Short phrase, written in plain language. Should be sufficiently descriptive to allow for search and discovery. | Aquaculture Production and Consumption in Africa (2011)
Description | description | Short description explaining the content and its origins. | Description of a few sentences, written in plain language. Should provide a sufficiently comprehensive overview of the resource for anyone to understand its content, origins, and any continuing work on it. The description can be written at the end, since it summarizes key information from the other metadata fields. | This dataset contains attributes of aquaculture production and consumption for each of Africa’s provinces in 2011. The data was provided by………
Tags | tags | An array of taxonomic terms stored as tags. | Taxonomic terms | Access to education, Bamboo
License* | license_title | The license that applies to the published dataset. | |
Organization* | organization | Organization the dataset belongs to. | See the list of organizations at https://ckan-d-d4s.d4science.org/organization | D4Science
Version | version | Version of the dataset. | Increase manually after editing. | 1.0
Author* | | Owner of the dataset. | The person who created the dataset. | Joe Bloggs
Author Contact | | Contact details of the owner. | The email or other contact details of the person who created the dataset. | joe@example.com
Maintainer | | Maintainer of the dataset. | The person who maintains the dataset. | Joe Bloggs
Maintainer Contact | | Contact details of the maintainer. | The email or other contact details of the person who maintains the dataset. | joe@example.com

Mandatory fields are marked with an asterisk (*).


gCube Metadata Profile

A gCube Metadata Profile defines an XML-based metadata schema for adding custom metadata fields.

A gCube Metadata Profile is composed of one Metadata Format (<metadataformat>) that contains one or more Metadata Fields (<metadatafield>). The schema is the following:

<?xml version="1.0" encoding="UTF-8"?>
<metadataformat>
    <metadatafield>
        <fieldName>Name</fieldName>
        <mandatory>true</mandatory>
        <isBoolean>false</isBoolean>
        <defaulValue>default value</defaulValue>
        <note>shown as suggestions in the insert/update metadata form of CKAN</note>
        <vocabulary>
            <vocabularyField>field1</vocabularyField>
            <vocabularyField>field2</vocabularyField>
            <!-- ... other vocabulary fields -->
        </vocabulary>
        <validator>
            <regularExpression>a regular expression for validating values</regularExpression>
        </validator>
    </metadatafield>
    <!-- ... other metadata fields -->
</metadataformat>

A Metadata Format document can be validated using the following DTD:


<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT metadataformat (metadatafield+)>
<!ELEMENT metadatafield (fieldName, mandatory, isBoolean?, defaulValue?, note?, vocabulary?, validator?)>
<!ELEMENT fieldName (#PCDATA)>
<!ELEMENT mandatory (#PCDATA)>
<!ELEMENT isBoolean (#PCDATA)>  <!-- MUST BE (true|false) -->
<!ELEMENT defaulValue (#PCDATA)> 
<!ELEMENT note (#PCDATA)> 
<!ELEMENT vocabulary (vocabularyField+)> 
<!ELEMENT vocabularyField (#PCDATA)> 
<!ELEMENT validator (regularExpression)> 
<!ELEMENT regularExpression (#PCDATA)> 

A possible instance of Metadata Field (<metadatafield>):

<metadatafield>
   <fieldName>Accessibility</fieldName>
   <mandatory>true</mandatory>
   <defaulValue>virtual/public</defaulValue>
   <vocabulary>
       <vocabularyField>virtual/public</vocabularyField>
       <vocabularyField>virtual/private</vocabularyField>
       <vocabularyField>transactional</vocabularyField>
   </vocabulary>
</metadatafield>
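
As a minimal sketch, the field above can be wrapped in a complete profile document and tied to the DTD through a DOCTYPE declaration. The file name metadataformat.dtd and its location next to the profile are assumptions made for illustration only; the source does not say where the DTD is published. The second field reuses the External Identifier definition from the SoBigData.eu section and shows that, under the DTD, only <fieldName> and <mandatory> are required.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumption: the DTD shown earlier is saved next to this file as metadataformat.dtd -->
<!DOCTYPE metadataformat SYSTEM "metadataformat.dtd">
<metadataformat>
    <!-- Field with a default value and a vocabulary -->
    <metadatafield>
        <fieldName>Accessibility</fieldName>
        <mandatory>true</mandatory>
        <defaulValue>virtual/public</defaulValue>
        <vocabulary>
            <vocabularyField>virtual/public</vocabularyField>
            <vocabularyField>virtual/private</vocabularyField>
            <vocabularyField>transactional</vocabularyField>
        </vocabulary>
    </metadatafield>
    <!-- Optional free-text field; the optional child elements that are not needed are left out -->
    <metadatafield>
        <fieldName>External Identifier</fieldName>
        <mandatory>false</mandatory>
        <isBoolean>false</isBoolean>
        <note>A DOI, a handle, or any other identifier assigned when publishing the dataset elsewhere.</note>
    </metadatafield>
</metadataformat>
```

With libxml2 installed, a command such as xmllint --noout --valid profile.xml (profile.xml being a hypothetical file name) should report whether the document conforms to the DTD.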

SoBigData.eu: Dataset Metadata

The current list of fields characterising a SoBigData resource is available at https://docs.google.com/spreadsheets/d/1kuhvmDVKpmqt2foyCB9wDo3HgzoAiCuRQ8CjRS-DVOM/edit?usp=sharing

The following fields have been identified:

Field In Catalogue
Internal Fields
Internal Identifier Automatically created
Creation Date Automatically created
Last Modification Automatically updated
General Description
Title Title
Identifier
<fieldName>External Identifier</fieldName>
<mandatory>false</mandatory>
<isBoolean>false</isBoolean>
<defaulValue></defaulValue>
<note>This applies only to datasets that have already been published. 
   Insert here a DOI, a handle, or any other identifier assigned when 
   publishing the dataset elsewhere.</note>
<vocabulary></vocabulary>
<validator></validator>
Creators The Author field exists, but unfortunately it allows only one author per dataset. Moreover, the technology supports only key/value pairs ... no complex types.
<fieldName>Creator</fieldName>
<mandatory>true</mandatory>
<isBoolean>false</isBoolean>
<defaulValue></defaulValue>
<note>The name of the creator, with email and ORCID. The format should be: family, given[, email][, ORCID]. 
   Examples: Smith, John, js@acme.org, orcid.org/0000-0000-0000-0000; Miller, Elizabeth
</note>
<vocabulary></vocabulary>
<validator></validator>
Creation Date
<fieldName>CreationDate</fieldName>
<mandatory>true</mandatory>
<isBoolean>false</isBoolean>
<defaulValue></defaulValue>
<note>The date of creation of the dataset (different from the date of creation of the dataset automatically added by the system)
</note>
<vocabulary></vocabulary>
<validator></validator>
Distributor
Publisher
Publication Date When the dataset is published in the repository ... no field has to be specified.
Contact Isn't this the Author / Maintainer? I would go for Maintainer.
Thematic Cluster

Shall we go for a Topic too? I think so.

<fieldName>ThematicCluster</fieldName>
<mandatory>true</mandatory>
<isBoolean>false</isBoolean>
<defaulValue></defaulValue>
<note>The SoBigData.eu Thematic Clusters
</note>
<vocabulary>
   <vocabularyField>Text and Social Media Mining</vocabularyField> 
   <vocabularyField>Social Network Analysis</vocabularyField> 
   <vocabularyField>Human Mobility Analytics</vocabularyField> 
   <vocabularyField>Web Analytics</vocabularyField> 
   <vocabularyField>Visual Analytics</vocabularyField> 
   <vocabularyField>Social Data</vocabularyField> 
</vocabulary>
<validator></validator>
Area Tag vs domain specific field
Semantic Coverage Tag vs domain specific field
Time Coverage Start Date
<fieldName>TimeCoverage</fieldName>
<mandatory>true</mandatory>
<isBoolean>false</isBoolean>
<defaulValue></defaulValue>
<note>List of time intervals, e.g. 1977-03-10T11:45:30 - 2005-01-15T09:10:00</note>
<vocabulary></vocabulary>
<validator></validator>
Time Coverage End Date not needed see below
Geo Location
<fieldName>spatial</fieldName>
<mandatory>false</mandatory>
<isBoolean>false</isBoolean>
<defaulValue></defaulValue>
<note>The value must be a valid GeoJSON geometry, for example:
   {
      "type":"Polygon",
      "coordinates":[[[2.05827, 49.8625],[2.05827, 55.7447], [-6.41736, 55.7447], [-6.41736, 49.8625], [2.05827, 49.8625]]]
   }
   or:
   {
      "type": "Point",
      "coordinates": [-3.145,53.078]
   }
</note>
<vocabulary></vocabulary>
<validator></validator>

More on GeoJSON geometry.

ProcessingDegree
ManifestationType
Language
Description
RelatedLiterature
RelatedDataset
Accessibility properties
Accessibility
AccessibilityMode
Privacy
Technical properties
Size
DiskSize
Format
FormatSchema
Api
Legal and Ethical Aspects
Personal data/ Non Personal
Personal sensitive data
Data set contains data of children
Consent of the data subject
Consent obtained also covers the envisaged transfer of the personal data outside the EU
Personal data was manifestly made public by the data subject
Data Protection Directive
Intellectual properties
IP/Copyrights
Link to the source
Right holder, if identified and different from the source
License
Link to the license
Field/Scope of use
Basic rights
Restrictions on use
Prohibited actions
Sublicense rights
Attribution requirements
Display requirements
Distribution requirements
Territory of use
License term
Requirement of non-disclosure (confidentiality mark)

    <metadatafield>
        <fieldName>xxx</fieldName>
        <mandatory>true</mandatory>
        <isBoolean>false</isBoolean>
        <defaulValue>default value</defaulValue>
        <note>shown as suggestions in the insert/update metadata form of CKAN</note>
        <vocabulary>
            <vocabularyField>field1</vocabularyField>
            <vocabularyField>field2</vocabularyField>
            <!-- ... other vocabulary fields -->
        </vocabulary>
        <validator>
            <regularExpression>a regular expression for validating values</regularExpression>
        </validator>
    </metadatafield>
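
The template above includes a <validator>, which none of the SoBigData.eu fields listed so far fill in. As an illustrative sketch, the Time Coverage field could carry a regular expression matching a single interval in the format given in its note. The pattern is an assumption, not part of the agreed profile: it covers one interval only (the note speaks of a list), it checks digit positions rather than calendar validity, and it assumes a regular expression dialect that supports ^, $, character classes and {n} quantifiers. The empty <defaulValue> and <vocabulary> elements are omitted; the DTD given earlier requires at least one <vocabularyField> inside <vocabulary>.

```xml
<metadatafield>
    <fieldName>TimeCoverage</fieldName>
    <mandatory>true</mandatory>
    <isBoolean>false</isBoolean>
    <note>List of time intervals, e.g. 1977-03-10T11:45:30 - 2005-01-15T09:10:00</note>
    <validator>
        <!-- Assumed pattern: one "start - end" interval, each side as YYYY-MM-DDThh:mm:ss -->
        <regularExpression>^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2} - [0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}$</regularExpression>
    </validator>
</metadatafield>
```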

SoBigData.eu: Method Metadata

gCube Data Catalogue: Geo Harvesting

This extension contains plugins (ckanext-geonetwork and others) that add geospatial capabilities to CKAN (https://github.com/geosolutions-it/ckanext-geonetwork/wiki).

Several harvesters have been created in the gCube Data Catalogue to import geospatial metadata into CKAN from other sources, in ISO 19139 and other formats. In particular, all metadata created in the gCube GeoNetwork (GeoNetwork is the catalogue application that manages spatially referenced resources generated in the D4Science Infrastructure) are harvested through the 'GeoNetwork Resolver', a "middle tier" able to:

The field mapping from an ISO 19139 metadata record to a CKAN dataset via ckanext-geonetwork is shown in the following table:

ISO19139 | CKAN Dataset
Title | Title
Description | Description
Digital Transfer Option | Data and Resource
CI_OnlineResource
gmd:url | URL
gmd:name | Name
gmd:description | Description
Descriptive Keywords
gmd:keyword | Tag
Additional Info
bbox, metadata language, age, reference system, etc. | key/value

gCube Data Catalogue: Ckan Connector