ENH: Linked Datasets (RDF)

This is very much a meta ticket.
There are a number of bare links here.
They are included for documentation.

Use Case

So I:

retrieved some data
- from somewhere
- about a certain #topic
performed analysis
- with certain transformations and aggregations
- with certain versions of certain tools
- confirmed/rejected a [null] hypothesis

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html

Series (1D)
- index
- data
  - NumPy datatypes
DataFrame (2D)
- index
- column(s)
  - NumPy datatypes
Panel (3D)
Panel4D (4D)

Read or parse a data format into a DataSet:

pandas.read_*
- read_clipboard
- read_csv
- read_excel
- read_fwf
- read_gbq
- read_hdf
- read_html
- read_json
- read_msgpack
- read_pickle
- read_sql
- read_stata
- read_table
pandas.HDFStore
- https://pandas.pydata.org/docs/dev/io.html#hdf5-pytables

Add metadata:

Add RDF metadata (RDFa, JSONLD)

Save or serialize a DataSet into a data format:

pandas.DataFrame.to_*
- to_csv
- to_dict
- to_excel
- to_gbq
- to_html
- to_latex
- to_panel
- to_period
- to_records
- to_sparse
- to_sql
- to_stata
- to_string
- to_timestamp
- to_wide
to_ RDF
to_ CSVW
to_ HTML + RDFa
to_ JSONLD
- create a JSONLD @context
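
For the CSVW item above, a minimal sketch of what the sidecar metadata document could look like. The helper name `csvw_metadata`, the file name, and the example columns are assumptions for illustration, not an existing pandas API. The layout follows the CSV on the Web metadata vocabulary (`url`, `tableSchema`, `columns`).

```python
import json

# Hypothetical helper: describe a CSV file with CSVW table metadata.
def csvw_metadata(url, columns):
    """columns: list of (name, title, datatype) tuples."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }

# Example columns are made up; "integer" and "number" are CSVW datatype names.
metadata = csvw_metadata(
    "filename.csv",
    [("year", "Year", "integer"), ("value", "Value", "number")],
)
print(json.dumps(metadata, indent=2))
```

The metadata would be saved next to the CSV written by `to_csv`, so the data file itself stays unchanged.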

Share or publish a serialized DataSet with the internet:

Email Attachment (Table in a PDF)
- opendatahandbook.org
- project-open-data.github.io
FTP, SFTP, RSYNC, NFS
HTML web upload form with metadata form fields
CLI tool
Version Control: Git, Hg, Svn
- challenge: 'large' files ("binary blobs") in version control systems
HTTP API: Object Storage (~LDP)
- GET/POST /container/filename.csv # [.json|.xml|.xls|.rdf|.html]
- challenge: indexing metadata from a separate document / named graph
  - GET/POST to /container/filename.csv
Push to CKAN
Host DataSet metadata
- python -m SimpleHTTPServer 8088 (Python 3: python -m http.server 8088)
- e.g. http://datasets.schema-labs.appspot.com/about indexes http://schema.org/Dataset items

Implementation

What changes would be needed for Pandas core to support this workflow?

.meta schema
to_rdf for Series, DataFrames, Panels, and Panel4Ds
read_rdf for Series, DataFrames, Panels, and Panel4Ds
~@datastep process decorators
~DataSet
~DataCatalog of precomputed aggregations/views/slices.
Units support (.meta?)

`.meta` schema

It's easy enough to serialize a dict and a table to naive RDF.

For interoperability, it would be helpful to standardize with a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata
within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
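
A minimal sketch of how per-column metadata could drive a JSON-LD @context. The `meta` layout mirrors the `.meta['columns'][colname]['schema']` idea above. The function names and the choice of schema.org properties are assumptions for illustration. Rows are plain dicts (e.g. the output of `DataFrame.to_dict('records')`), so only the standard library is needed.

```python
import json

# Hypothetical column metadata, laid out as meta['columns'][colname]['schema'].
meta = {
    "columns": {
        "name": {"schema": "http://schema.org/name"},
        "price": {"schema": "http://schema.org/price"},
    }
}

def meta_to_context(meta):
    """Map each column name to the URI stored under its 'schema' key."""
    return {name: col["schema"] for name, col in meta["columns"].items()}

def records_to_jsonld(records, meta):
    """Wrap plain row dicts in a JSON-LD document that shares one @context."""
    return {"@context": meta_to_context(meta), "@graph": list(records)}

rows = [{"name": "widget", "price": 9.5}]
doc = records_to_jsonld(rows, meta)
print(json.dumps(doc, indent=2))
```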

Ontology Resources

CSV2RDF (`csvw`)

W3C PROV (`prov:`)

schema.org (`schema:`)

http://schema.org
http://www.w3.org/wiki/WebSchemas
http://schema.rdfs.org/
https://schema.org/docs/full.html :
- schema:Dataset -- A body of structured information describing some topic(s) of interest.
  - [schema:Thing, schema:CreativeWork]
  - distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
  - spatial, temporal
  - catalog -- A data catalog which contains a dataset (DataCatalog)
- schema:DataCatalog -- collection of Datasets
  - [schema:Thing, schema:CreativeWork]
  - dataset -- A dataset contained in a catalog. (Dataset)
- schema:DataDownload -- A dataset in downloadable form.
  - [schema:Thing, schema:CreativeWork]
  - contentSize
  - contentUrl
  - uploadDate

W3C RDF Data Cube (`qb:`)

http://www.w3.org/TR/vocab-data-cube/
http://www.w3.org/2011/gld/wiki/Data_Cube_Vocabulary#The_history_of_Data_Cube.2C_SDMX-RDF_and_SCOVO
http://www.w3.org/TR/vocab-data-cube/#vocab-reference :
- qb:DataSet -- a collection of observations, possibly organized into various slices, conforming to some common dimensional structure
  - qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values.
- qb:Observation -- a single observation in the cube, may have one or more associated measured values.
  - qb:dataSet -- data set of which this observation is a part.
- qb:ObservationGroup -- a, possibly arbitrary, group of observations.
  - qb:observation -- an observation contained within this slice of the data set.
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values, component properties on the Slice.
- [Components, Properties, Dimensions, Attributes, Measures]
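
A rough sketch of how table rows might map onto qb:Observation resources, emitted as N-Triples lines. The function name, the /obs/<n> URI pattern, and the property URIs minted under `base` are assumptions of this sketch. A real mapping would also need a qb:DataStructureDefinition, which is left out here.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"

def observations_to_ntriples(rows, dataset_uri, dimension, measure, base):
    """Emit one qb:Observation per row as N-Triples lines.

    `dimension` and `measure` are column names. Values are assumed to
    contain no characters that need escaping.
    """
    lines = [f"<{dataset_uri}> <{RDF_TYPE}> <{QB}DataSet> ."]
    for i, row in enumerate(rows):
        obs = f"<{dataset_uri}/obs/{i}>"
        lines.append(f"{obs} <{RDF_TYPE}> <{QB}Observation> .")
        lines.append(f"{obs} <{QB}dataSet> <{dataset_uri}> .")
        lines.append(f'{obs} <{base}{dimension}> "{row[dimension]}" .')
        lines.append(f'{obs} <{base}{measure}> "{row[measure]}"^^<{XSD}double> .')
    return lines

rows = [{"year": 2012, "value": 1.5}, {"year": 2013, "value": 2.0}]
lines = observations_to_ntriples(
    rows, "http://example.com/datasets/demo", "year", "value",
    "http://example.com/def/",
)
print("\n".join(lines))
```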

`to_rdf`

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:

output format
JSON-LD: compaction

Series.meta
Series.to_rdf()
DataFrame.meta
DataFrame.to_rdf()
Panel.meta
Panel.to_rdf()
Panel4D.meta
Panel4D.to_rdf()

`read_rdf`

http://pandas.pydata.org/pandas-docs/dev/remote_data.html

Series.read_rdf()
DataFrame.read_rdf()
Panel.read_rdf()
Panel4D.read_rdf()

Arguments to read_rdf would need to describe which dimensions of data to
read into 1D/2D/3D/4D form.
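
One possible shape for that argument, sketched for the 2D case: a mapping from predicate URIs to column names, with one row per subject. The function name `triples_to_records` and the example.com URIs are hypothetical.

```python
def triples_to_records(triples, columns):
    """Pivot (subject, predicate, object) triples into one row per subject.

    `columns` maps predicate URIs to column names; triples whose predicate
    is not listed are ignored.
    """
    rows = {}
    for s, p, o in triples:
        if p in columns:
            rows.setdefault(s, {"subject": s})[columns[p]] = o
    return list(rows.values())

EX = "http://example.com/def/"
triples = [
    ("http://example.com/obs/0", EX + "year", 2012),
    ("http://example.com/obs/0", EX + "value", 1.5),
    ("http://example.com/obs/1", EX + "year", 2013),
    ("http://example.com/obs/1", EX + "label", "not selected"),
]
records = triples_to_records(triples, {EX + "year": "year", EX + "value": "value"})
print(records)
```

The resulting list of dicts can be passed to `pandas.DataFrame(...)`; missing values (here, `value` for the second subject) would show up as NaN.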

@datastep / PROV

Objective: Additive journal of transformations
Link to source script(s) URIs
Decorator for annotating data transformations with metadata.
Generate PROV metadata for data transformations

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
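
A minimal sketch of what such a decorator could look like. It appends one record per call to an additive journal, using PROV-O property names for the timestamps. The names `datastep` and `JOURNAL`, the record layout, and the script URI are assumptions for illustration.

```python
import functools
from datetime import datetime, timezone

JOURNAL = []  # additive, append-only list of recorded steps

def datastep(source=None):
    """Record each call of the decorated transformation in JOURNAL."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            JOURNAL.append({
                "activity": func.__name__,
                "source": source,  # URI of the script that defines the step
                "prov:startedAtTime": started,
                "prov:endedAtTime": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator

@datastep(source="http://example.com/scripts/analysis.py")
def double(values):
    return [v * 2 for v in values]

doubled = double([1, 2, 3])
print(doubled, JOURNAL)
```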

DataCatalog

A collection of Datasets.

DataCatalog = dict(that=df1, this=df1.groupby(...).apply(...), also_this=df2)
- 'this is an aggregation of that'
  - 'this' has a URI
  - 'that' has a URI
What if there is no metadata for df2?
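
A small sketch of the catalog idea, assuming each entry gets a URI under a base and optional metadata. The class, its methods, and the use of prov:wasDerivedFrom as a metadata key are assumptions of this sketch. Plain lists stand in for DataFrames.

```python
class DataCatalog:
    """Hypothetical registry of datasets keyed by URI."""

    def __init__(self, base_uri):
        self.base_uri = base_uri
        self.entries = {}

    def add(self, key, data, meta=None):
        """Register `data` under base_uri + key and return its URI."""
        uri = self.base_uri + key
        self.entries[uri] = {"data": data, "meta": meta}
        return uri

    def without_metadata(self):
        """URIs of entries that were added with no metadata at all."""
        return [uri for uri, e in self.entries.items() if e["meta"] is None]

catalog = DataCatalog("http://example.com/datasets/#")
values = [1, 2, 3, 4]
that = catalog.add("that", values, meta={"title": "raw values"})
# 'this is an aggregation of that', recorded as a link between two URIs.
this = catalog.add("this", sum(values), meta={"prov:wasDerivedFrom": that})
also_this = catalog.add("also_this", [5, 6])
print(catalog.without_metadata())
```

Entries like `also_this` can still be listed and addressed by URI; they just carry no description or provenance.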

Units support

Series.meta
DataFrame.column.meta
NumPy [, PyTables]
http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
https://pint.readthedocs.org/en/latest/
http://pythonhosted.org/quantities/

RDF Datatypes

http://en.wikipedia.org/wiki/ISO_8601
http://www.w3.org/TR/xmlschema-2/#decimal
http://schema.org/Date
http://schema.org/DateTime
http://schema.org/Float
http://schema.org/Quantity
https://github.com/RDFLib/rdflib
- from rdflib.namespace import XSD, RDF, RDFS
- from rdflib import URIRef, Literal
- https://github.com/RDFLib/rdflib-sqlalchemy (SQLAlchemy)
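
A minimal sketch of mapping Python scalars to XSD-typed literals in N-Triples syntax, using only the standard library. The function name `to_literal` is hypothetical. With rdflib installed, `Literal` would do this conversion instead.

```python
from datetime import date, datetime

XSD = "http://www.w3.org/2001/XMLSchema#"

def to_literal(value):
    """Format a Python scalar as an N-Triples literal.

    NaN and infinities need special lexical forms and are not handled here.
    """
    if isinstance(value, bool):  # check before int: bool subclasses int
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int):
        return f'"{value}"^^<{XSD}integer>'
    if isinstance(value, float):
        return f'"{value!r}"^^<{XSD}double>'
    if isinstance(value, datetime):  # check before date: datetime subclasses date
        return f'"{value.isoformat()}"^^<{XSD}dateTime>'
    if isinstance(value, date):
        return f'"{value.isoformat()}"^^<{XSD}date>'
    # Anything else becomes a plain string with backslashes and quotes escaped.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

for example in (True, 3, 1.5, date(2013, 4, 20), 'a "quoted" word'):
    print(to_literal(example))
```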

JSON-LD (RDF in JSON)

https://github.com/digitalbazaar/pyld (JSON-LD)
https://github.com/RDFLib/rdflib-jsonld (JSON-LD)

Linked Data Primer

Linked Data Abstractions

Graphs are represented as triples of (s,p,o)
Subject, Predicate, Object
Queries are patterns with ?references
- graph.triples((None, None, None))
- SELECT ?s ?p ?o WHERE { ?s ?p ?o }
subjects are linked to objects by predicates
- subjects and predicates are identified by URI 'keys'; objects are URIs or literal values
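
The pattern idea can be sketched in plain Python, with None as the wildcard in the same way as `graph.triples((None, None, None))`. The function name `match` and the example.com URIs are made up for illustration.

```python
def match(triples, pattern):
    """Yield triples matching (s, p, o); None is a wildcard, like ?s ?p ?o."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple

EX = "http://example.com/"
graph = [
    (EX + "df1", EX + "about", EX + "topic"),
    (EX + "df2", EX + "derivedFrom", EX + "df1"),
]

everything = list(match(graph, (None, None, None)))
derived = list(match(graph, (None, EX + "derivedFrom", None)))
print(derived)
```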

URIs and URLs

a URI identifies a resource; a URL is a URI that also says where to fetch it
usually, we expect Linked Data URIs to be 'dereferenceable' HTTP URLs
- HTTP GET http://en.wikipedia.org/
a URI may use a different scheme or namespace
- urn:
- urn:uuid:
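
A short sketch of the distinction, using the standard library to check a URI's scheme and to mint a non-dereferenceable `urn:uuid:` identifier. The helper name `is_dereferenceable` is an assumption. It only inspects the scheme and does not perform a request.

```python
import uuid
from urllib.parse import urlparse

def is_dereferenceable(uri):
    """True if the URI uses a scheme that an HTTP client can fetch."""
    return urlparse(uri).scheme in ("http", "https")

minted = uuid.uuid4().urn  # a fresh identifier of the form urn:uuid:<hex digits>

print(is_dereferenceable("http://en.wikipedia.org/"))
print(is_dereferenceable(minted))
```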

SQL and Linked Data

there exist standard mappings for whole SQL tablesets
- rdb2rdf
- similar to application scaffolding
- ACL support adds complexity
Virtuoso supports SQL and RDF and SPARQL
- standard mappings
- Virtuoso powers http://dbpedia.org/
  - dbpedia.org has a high degree of centrality
    - http://lod-cloud.net/
rdflib-sqlalchemy maps RDF onto SQL tables
- fairly inefficiently, when compared to native triplestores

Named Graphs

Quads: (g, s, p, o)
g: sometimes called the 'context' of a triple
Metadata about GRAPH ?g
Multiple named graphs in one file: TriX, TriG
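
A minimal sketch of quads in plain Python: each statement carries its graph name first, and grouping by that name recovers one set of triples per named graph. The function name and the example.com URIs are made up for illustration.

```python
from collections import defaultdict

def group_by_graph(quads):
    """Split (g, s, p, o) quads into a dict of graph name -> list of triples."""
    graphs = defaultdict(list)
    for g, s, p, o in quads:
        graphs[g].append((s, p, o))
    return dict(graphs)

EX = "http://example.com/"
quads = [
    (EX + "graphs/data", EX + "df1", EX + "about", EX + "topic"),
    (EX + "graphs/meta", EX + "graphs/data", EX + "createdBy", EX + "analyst"),
]
graphs = group_by_graph(quads)
print(sorted(graphs))
```

The second quad shows metadata about a graph: its subject is the name of the first graph.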

Linked Data Formats

N-Triples
RDF/XML
- TriX
Turtle, N3
- TriG
JSON-LD

Choosing Schema

XSD, RDF, RDFS, DCTERMS
Which schema is most popular?
Which schema is a best fit for the data?
Which schema will search engines index for us?
What do the queries look like?
Years Later... What is OWL?
Why would we start with RDFS now?

Linked Data Process, Provenance, and Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

where and how was it downloaded? (digital sense)
how was it collected? (process control sense)

Datasets have structure:

Tabular, Hierarchical
1D, 2D, 3D, 4D
Graph-based
- Chains
- Flows
Schema

5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data
