ENH: Linked Datasets (RDF)
- This is very much a meta ticket.
- There are a number of bare links here.
- They are for documentation
(UPDATE: see westurner/pandasrdf#1)
Use Case
So I:
- retrieved some data
- from somewhere
- about a certain #topic
- performed analysis
- with certain transformations and aggregations
- with certain versions of certain tools
- confirmed/rejected a [null] hypothesis
and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.
User Story
As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").
Status Quo: Pandas IO
How do I go from a [CSV] to a DataFrame to something shareable with a URL?
http://pandas.pydata.org/pandas-docs/dev/io.html
- http://pandas.pydata.org/pandas-docs/dev/dsintro.html
- https://github.com/pydata/pandas/blob/master/pandas/core/format.py
- https://github.com/pydata/pandas/blob/master/pandas/rpy/common.py
.
- Series (1D)
- index
- data
- NumPy datatypes
- DataFrame (2D)
- index
- column(s)
- NumPy datatypes
- Panel (3D)
- Panel4D (4D)
Read or parse a data format into a DataSet:
pandas.read_*: read_clipboard, read_csv, read_excel, read_fwf, read_gbq, read_hdf, read_html, read_json, read_msgpack, read_pickle, read_sql, read_stata, read_table
pandas.HDFStore
Add metadata:
- Add RDF metadata (RDFa, JSONLD)
Save or serialize a DataSet into a data format:
pandas.DataFrame: to_csv, to_dict, to_excel, to_gbq, to_html, to_latex, to_panel, to_period, to_records, to_sparse, to_sql, to_stata, to_string, to_timestamp, to_wide
- to_ RDF
- to_ CSVW
- to_ HTML + RDFa
- to_ JSONLD
- create a JSONLD @context
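A minimal sketch of the to_ JSONLD idea, assuming the table is already available as a list of row dicts (the shape that DataFrame.to_dict(orient="records") returns). The helper name records_to_jsonld, the example.com URIs, the "#row=" identifiers, and the column-to-term mapping are all hypothetical; none of this is a pandas API.

```python
import json


def records_to_jsonld(records, context, base_uri):
    """Wrap row dicts in a JSON-LD document with an explicit @context.

    Hypothetical helper for illustration; not part of pandas.
    """
    graph = []
    for i, row in enumerate(records):
        # Give every row its own identifier under the dataset URI.
        node = {"@id": f"{base_uri}#row={i}"}
        node.update(row)
        graph.append(node)
    return {"@context": context, "@graph": graph}


# Assumed mapping from column names to vocabulary terms.
context = {
    "city": "http://schema.org/name",
    "population": {
        "@id": "http://example.com/vocab#population",
        "@type": "http://www.w3.org/2001/XMLSchema#integer",
    },
}

# Toy rows, standing in for DataFrame.to_dict(orient="records").
records = [
    {"city": "A", "population": 100},
    {"city": "B", "population": 200},
]

doc = records_to_jsonld(records, context, "http://example.com/datasets/cities.csv")
print(json.dumps(doc, indent=2))
```

The @context carries the schema, so the row dicts themselves stay plain JSON.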
Share or publish a serialized DataSet with the internet:
- Email Attachment (Table in a PDF)
- opendatahandbook.org
- project-open-data.github.io
- FTP, SFTP, RSYNC, NFS
- HTML web upload form with metadata form fields
- CLI tool
- Version Control: Git, Hg, Svn
- challenge: 'large' files ("binary blobs") in VCS systems
- HTTP API: Object Storage (~LDP)
- GET/POST /container/filename.csv# [.json|.xml|.xls|.rdf|.html]
- challenge: indexing metadata from a separate document / named graph
- GET/POST to /container/filename.csv
- Push to CKAN
- Host DataSet metadata
- python -m SimpleHTTPServer 8088 (Python 2; on Python 3: python -m http.server 8088)
- e.g. http://datasets.schema-labs.appspot.com/about indexes http://schema.org/Dataset items
Implementation
What changes would be needed for Pandas core to support this workflow?
- .meta schema
- to_rdf for Series, DataFrames, Panels, and Panel4Ds
- read_rdf for Series, DataFrames, Panels, and Panel4Ds
- ~ @datastep process decorators
- ~ DataSet
- ~ DataCatalog of precomputed aggregations/views/slices.
- Units support (.meta?)
.meta schema
It's easy enough to serialize a dict and a table to naive RDF.
For interoperability, it would be helpful to standardize with a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.
There is currently no standard method for storing columnar metadata
within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
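To make the "naive RDF" point concrete, here is a rough sketch that serializes row dicts to N-Triples using a metadata dict shaped like .meta['columns'][colname]['schema']. The "predicate" and "datatype" keys inside each schema entry, the function name to_ntriples, and the example.com URIs are assumptions made for this illustration; no such layout is standardized.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def to_ntriples(records, meta, base_uri):
    """Serialize row dicts to naive N-Triples using per-column metadata.

    Hypothetical helper: one typed-literal triple per cell.
    """
    lines = []
    for i, row in enumerate(records):
        subject = f"<{base_uri}#row={i}>"
        for colname, value in row.items():
            schema = meta["columns"][colname]["schema"]
            # Escape the characters that would break a quoted literal.
            literal = (
                str(value)
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            lines.append(
                f'{subject} <{schema["predicate"]}> "{literal}"^^<{schema["datatype"]}> .'
            )
    return "\n".join(lines)


# Assumed metadata layout; the key names are illustrative only.
meta = {
    "columns": {
        "name": {"schema": {"predicate": "http://schema.org/name",
                            "datatype": XSD + "string"}},
        "height": {"schema": {"predicate": "http://example.com/vocab#height",
                              "datatype": XSD + "double"}},
    }
}

records = [{"name": "a", "height": 1.5}]
nt = to_ntriples(records, meta, "http://example.com/datasets/d1")
print(nt)
```

Without shared terms for the predicates, two people serializing the same table this way would produce graphs that do not join, which is the interoperability problem described above.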
Ontology Resources
- http://www.w3.org/TR/rdf-schema/ (rdfs:)
- http://www.w3.org/TR/owl-overview/ (owl:)
- http://www.w3.org/TR/sparql11-query/#sparqlDefinition
- http://lov.okfn.org
- http://prefix.cc
CSV2RDF (csvw)
W3C PROV (prov:)
- http://www.w3.org/TR/prov-primer/#intuitive-overview-of-prov
- http://www.w3.org/TR/prov-o/
- http://www.w3.org/2011/prov/wiki/ProvImplementations
schema.org (schema:)
- http://schema.org
- http://www.w3.org/wiki/WebSchemas
- http://schema.rdfs.org/
- https://schema.org/docs/full.html :
- schema:Dataset -- A body of structured information describing some topic(s) of interest.
- [schema:Thing, schema:CreativeWork]
- distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
- spatial, temporal
- catalog -- A data catalog which contains a dataset (DataCatalog)
- schema:DataCatalog -- collection of Datasets
- [schema:Thing, schema:CreativeWork]
- dataset -- A dataset contained in a catalog. (Dataset)
- schema:DataDownload -- A dataset in downloadable form.
- [schema:Thing, schema:CreativeWork]
- contentSize
- contentUrl
- uploadDate
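As a sketch of how the schema.org terms above fit together, the following builds a schema:Dataset description with one schema:DataDownload distribution as JSON-LD. The helper name describe_dataset and the example.com URL are hypothetical; the property names (name, description, distribution, contentUrl, encodingFormat) are schema.org terms.

```python
import json


def describe_dataset(name, description, download_url, encoding_format):
    """Build a schema.org Dataset description as a JSON-LD dict.

    Hypothetical helper for illustration.
    """
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # A downloadable form of the dataset at a specific location.
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": encoding_format,
        },
    }


dataset = describe_dataset(
    name="Example dataset",
    description="Toy table used to illustrate the metadata shape.",
    download_url="http://example.com/datasets/d1.csv",
    encoding_format="text/csv",
)
print(json.dumps(dataset, indent=2))
```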
W3C RDF Data Cube (qb:)
- http://www.w3.org/TR/vocab-data-cube/
- http://www.w3.org/2011/gld/wiki/Data_Cube_Vocabulary#The_history_of_Data_Cube.2C_SDMX-RDF_and_SCOVO
- http://www.w3.org/TR/vocab-data-cube/#vocab-reference :
- qb:DataSet -- a collection of observations, possibly organized into various slices, conforming to some common dimensional structure
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values.
- qb:Observation -- a single observation in the cube, may have one or more associated measured values.
- qb:dataSet -- data set of which this observation is a part.
- qb:ObservationGroup -- a, possibly arbitrary, group of observations.
- qb:observation -- an observation contained within this slice of the data set.
- qb:Slice -- a subset of a DataSet defined by fixing a subset of the dimensional values, component properties on the Slice.
- [Components, Properties, Dimensions, Attributes, Measures]
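One possible reading of the Data Cube terms for a 2D table is one qb:Observation per row, linked to its qb:DataSet. The sketch below emits plain (s, p, o) tuples under that assumption. The function name, the "/obs/<n>" URI pattern, and the example.com vocabulary are hypothetical, and a complete cube would also need a data structure definition declaring the dimensions and measures, which is omitted here.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def rows_to_observations(records, dataset_uri, dimensions, measures, vocab):
    """Turn each row dict into the triples of one qb:Observation.

    Hypothetical helper: column names are appended to `vocab` to form
    property URIs, which is an assumption rather than a standard mapping.
    """
    triples = []
    for i, row in enumerate(records):
        obs = f"{dataset_uri}/obs/{i}"
        triples.append((obs, RDF_TYPE, QB + "Observation"))
        triples.append((obs, QB + "dataSet", dataset_uri))
        for col in dimensions + measures:
            triples.append((obs, vocab + col, row[col]))
    return triples


records = [{"year": 2013, "region": "north", "count": 10}]
triples = rows_to_observations(
    records,
    dataset_uri="http://example.com/datasets/d1",
    dimensions=["year", "region"],
    measures=["count"],
    vocab="http://example.com/vocab#",
)
for t in triples:
    print(t)
```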
to_rdf
http://pandas.pydata.org/pandas-docs/dev/io.html
Arguments:
- output fmt
- JSON-LD: compaction
.
- Series.meta
- Series.to_rdf()
- DataFrame.meta
- DataFrame.to_rdf()
- Panel.meta
- Panel.to_rdf()
- Panel4D.meta
- Panel4D.to_rdf()
read_rdf
http://pandas.pydata.org/pandas-docs/dev/remote_data.html
- Series.read_rdf()
- DataFrame.read_rdf()
- Panel.read_rdf()
- Panel4D.read_rdf()
Arguments to read_rdf would need to describe which dimensions of data to
read into 1D/2D/3D/4D form.
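For the 2D case, "which dimensions to read" could be as simple as a mapping from predicates to column names, with subjects becoming the index. A rough sketch under that assumption follows; the function name triples_to_table and the ex: identifiers are hypothetical.

```python
def triples_to_table(triples, columns):
    """Pivot (s, p, o) triples into rows keyed by subject.

    `columns` maps predicate -> column name; triples with other
    predicates are ignored. Missing cells stay None.
    """
    rows = {}
    for s, p, o in triples:
        if p in columns:
            row = rows.setdefault(s, {name: None for name in columns.values()})
            row[columns[p]] = o
    return rows


triples = [
    ("ex:a", "ex:height", 1.5),
    ("ex:a", "ex:name", "a"),
    ("ex:b", "ex:name", "b"),
    ("ex:b", "ex:other", "x"),  # not requested, so it is skipped
]
table = triples_to_table(triples, {"ex:name": "name", "ex:height": "height"})
print(table)
```

The resulting dict of dicts is the kind of input that pandas.DataFrame.from_dict(table, orient="index") accepts.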
@datastep / PROV
- Objective: Additive journal of transformations
- Link to source script(s) URIs
- Decorator for annotating data transformations with metadata.
- Generate PROV metadata for data transformations
Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
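A minimal sketch of what a @datastep decorator might look like, assuming an in-memory, append-only journal. The decorator signature, the JOURNAL list, and the record layout are hypothetical; prov:Activity, prov:startedAtTime, and prov:endedAtTime are PROV-O terms, used here only as dict keys and values rather than as real RDF.

```python
import functools
from datetime import datetime, timezone

JOURNAL = []  # additive, append-only journal of transformation records


def datastep(source_uri=None):
    """Annotate a transformation so each call is journaled (hypothetical)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            JOURNAL.append({
                "@type": "prov:Activity",
                "name": func.__name__,
                "source": source_uri,  # link to the source script URI
                "prov:startedAtTime": started,
                "prov:endedAtTime": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@datastep(source_uri="http://example.com/scripts/analysis.py")
def drop_negatives(values):
    return [v for v in values if v >= 0]


cleaned = drop_negatives([3, -1, 2])
print(cleaned)
print(JOURNAL[0]["name"])
```

Because the journal is only ever appended to, the order of entries is the order in which the transformations ran.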
DataCatalog
A collection of Datasets.
- DataCatalog = {that=df1, this=df1.group().apply(), also_this=df2}
- 'this is an aggregation of that'
- 'this' has a URI
- 'that' has a URI
- What if there is no metadata for df2?
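The "this is an aggregation of that" relationship could be recorded when the derived entry is added to the catalog. A small sketch, assuming a plain dict as the catalog and lists standing in for DataFrames; the helper name add_derived and the example.com URIs are hypothetical, and prov:wasDerivedFrom is the PROV-O term for the link.

```python
def add_derived(catalog, key, source_key, func, uri):
    """Store func(source data) under `key` and link it back to its source."""
    source = catalog[source_key]
    catalog[key] = {
        "@id": uri,
        "data": func(source["data"]),
        "prov:wasDerivedFrom": source["@id"],
    }
    return catalog[key]


# 'that' has a URI; plain lists stand in for DataFrames here.
catalog = {
    "that": {"@id": "http://example.com/datasets/that", "data": [1, 2, 3, 4]},
}

# 'this' is an aggregation of 'that', and gets its own URI.
add_derived(catalog, "this", "that", sum, "http://example.com/datasets/this")
print(catalog["this"])
```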
Units support
- Series.meta
- DataFrame.column.meta
- NumPy [, PyTables]
- http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
- https://pint.readthedocs.org/en/latest/
- http://pythonhosted.org/quantities/
RDF Datatypes
- http://en.wikipedia.org/wiki/ISO_8601
- http://www.w3.org/TR/xmlschema-2/#decimal
- http://schema.org/Date
- http://schema.org/DateTime
- http://schema.org/Float
- http://schema.org/Quantity
- https://github.com/RDFLib/rdflib
from rdflib.namespace import XSD, RDF, RDFS
from rdflib import URIRef, Literal
- https://github.com/RDFLib/rdflib-sqlalchemy (SQLAlchemy)
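One piece a to_rdf implementation would likely need is a mapping from NumPy dtype names to XSD datatype URIs. The table below is an assumed mapping for illustration, not an established standard; the fallback to xsd:string is likewise a choice made for this sketch.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumed dtype-name -> XSD datatype mapping (illustrative, not exhaustive).
DTYPE_TO_XSD = {
    "int64": XSD + "long",
    "float64": XSD + "double",
    "bool": XSD + "boolean",
    "datetime64[ns]": XSD + "dateTime",
    "object": XSD + "string",
}


def xsd_datatype(dtype_name):
    """Return the XSD datatype URI for a dtype name, defaulting to string."""
    return DTYPE_TO_XSD.get(dtype_name, XSD + "string")


print(xsd_datatype("float64"))
print(xsd_datatype("complex128"))  # unmapped, falls back to xsd:string
```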
JSON-LD (RDF in JSON)
- https://github.com/digitalbazaar/pyld (JSON-LD)
- https://github.com/RDFLib/rdflib-jsonld (JSON-LD)
Linked Data Primer
Linked Data Abstractions
- Graphs are represented as triples of (s,p,o)
- Subject, Predicate, Object
- Queries are patterns with ?references
- rdflib: graph.triples((None, None, None))
- SPARQL: SELECT ?s ?p ?o WHERE { ?s ?p ?o }
- subjects are linked to objects by predicates
- subjects and predicates are identified by URI 'keys'
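The pattern idea can be shown without any RDF library: a query is an (s, p, o) tuple where None matches anything. This is a toy sketch of the concept, not how a triplestore is implemented; the function name match and the ex: identifiers are hypothetical.

```python
def match(triples, pattern):
    """Yield triples matching an (s, p, o) pattern; None is a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


triples = [
    ("ex:a", "ex:name", "a"),
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:name", "b"),
]

# Everything, like SELECT ?s ?p ?o WHERE { ?s ?p ?o }
everything = list(match(triples, (None, None, None)))

# Only the ex:name statements
names = list(match(triples, (None, "ex:name", None)))
print(names)
```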
URIs and URLs
- a URI is like a URL
- usually, we expect URLs to be 'dereferenceable' HTTP URIs
- HTTP GET http://en.wikipedia.org/
- a URI may start with a different URI prefix
urn:uuid:
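For a DataSet that has no HTTP URL yet, a urn:uuid: identifier can be generated with the standard library. The example.com URL passed to uuid5 is a placeholder.

```python
import uuid

# Random identifier: different on every call.
random_uri = uuid.uuid4().urn

# Name-based identifier: the same input URL always yields the same URN.
stable_uri = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.com/datasets/d1").urn

print(random_uri)
print(stable_uri)
```

Unlike an HTTP URI, a urn:uuid: URI identifies the DataSet but cannot be dereferenced with a GET.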
SQL and Linked Data
- there exist standard mappings for whole SQL tablesets
- rdb2rdf
- similar to application scaffolding
- ACL support adds complexity
- virtuoso supports SQL and RDF and SPARQL
- standard mappings
- virtuoso powers http://dbpedia.org/
- dbpedia.org has a high degree of centrality
- rdflib-sqlalchemy maps RDF onto SQL tables
- fairly inefficiently, when compared to native triplestores
Named Graphs
- Quads: (g, s, p, o)
- g: sometimes called the 'context' of a triple
- Metadata about GRAPH ?g
- Multiple named graphs in one file: TriX, TriG
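A quad is a triple plus the name of the graph it belongs to, so splitting a quad list by g recovers the individual named graphs. A toy sketch with hypothetical identifiers:

```python
from collections import defaultdict


def group_by_graph(quads):
    """Split (g, s, p, o) quads into {graph name: [(s, p, o), ...]}."""
    graphs = defaultdict(list)
    for g, s, p, o in quads:
        graphs[g].append((s, p, o))
    return dict(graphs)


quads = [
    ("ex:data", "ex:row0", "ex:count", 10),
    ("ex:data", "ex:row1", "ex:count", 20),
    # Statements about the data graph itself live in a separate graph.
    ("ex:meta", "ex:data", "ex:source", "http://example.com/datasets/d1.csv"),
]
graphs = group_by_graph(quads)
print(sorted(graphs))
```

Keeping metadata in its own graph is what allows statements about GRAPH ?g without mixing them into the data.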
Linked Data Formats
- NTriples
- RDF/XML
- TriX
- Turtle, N3
- TriG
- JSON-LD
Choosing Schema
- XSD, RDF, RDFS, DCTERMS
- Which schema is most popular?
- Which schema is a best fit for the data?
- Which schema will search engines index for us?
- What do the queries look like?
- Years Later... What is OWL?
- Why would we start with RDFS now?
Linked Data Process, Provenance, and Schema
DataSets have [implicit] URIs:
- http://example.com/datasets/#<key>
Shared or published DataSets have URLs:
- http://ckan.example.org/datasets/<key>
DataSets are about certain things:
- e.g. URIs for #Tags, Categories, Taxonomy, Ontology
DataSets are derived from somewhere, somehow:
- where and how was it downloaded? (digital sense)
- how was it collected? (process control sense)
Datasets have structure:
- Tabular, Hierarchical
- 1D, 2D, 3D, 4D
- Graph-based
- Chains
- Flows
- Schema
5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (eg Turtle, RDFa, JSON-LD, SPARQL)
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.