Implementing Linked Data in Low Resource Conditions Caterina Caracciolo, Johannes Keizer {caterina.caracciolo},{johannes.keizer}@fao.org Food and Agriculture Organization of the UN 09 September 2015
Goals for Today • Give you a high-level view of what is needed to do Linked Data • Identify possible bottlenecks due to working with few resources • Based on our experience, give you some suggestions to overcome those bottlenecks
Our background assumptions Some restrictions are needed… • Target audience: small to medium-sized institutions – This talk is not meant to be a how-to guide for specific technical problems, but rather a support grid for planning your entry into the linked open data world • Target data – We mainly think of textual data, e.g., lists of publications produced by the institution, catalogues of specimens in the local museum, factsheets on plants, events organized, ...
Topics for today • What is a “low-resource” condition • Open Data and Linked Open Data • An overview of Linked Data lifecycle – Bottlenecks in terms of resources – Our suggestions to overcome them • The example of Agris
Low-resource condition = ?
1. IT competencies • Few IT people, and they are overloaded • Technology moves fast, and little of it is taught in school • Staff need to keep themselves up to date – But the working environment may not encourage this – Or there may be language barriers
2. Other IT/IM/cultural issues • Competency on legal issues – licenses, litigation? • “It is my data”, even within the same organization • Different “cultures” in the same workplace – Domain specialists “know” the domain and the data (e.g., the reports they produced) and do not want to spend time on “techy stuff” – IT/IM people may prefer to spend time building a better system once, instead of repeating ad-hoc conversions, and would like to standardize more • All of these may require some investment of time
3. Software • Outdated operating systems and software – Because of cost of licenses, or cultural issues
4. Hardware CPU, memory and technology constraints...
5. Electricity may be unreliable
5. Electricity ..occasionally available…
5. Electricity …expensive…
6. Internet connection may be slow…
6. Internet connection ..dependent on the weather…
Data
The trend Great attention to data • Interoperability of data – data that can be reused = processed in different applications • Standard and open formats are seen as crucial to interoperability • Data made available over the web, for maximum reuse
Open Data
Open data in a nutshell • Like other “open” movements: open and free • See http://opendefinition.org/ • Especially for government-generated data • E.g., census, public investments, housing, environment, ... • A variety of formats used to expose the data • XLS, CSV, XML, JSON, PPT, SDMX, ... • Preference for non-proprietary formats – Most of the data around is “open”, more or less… • But check whether your country has produced a national policy on data!
Who does Open Data? • National and regional initiatives (not exhaustive) – opendataforafrica.org – data.gov.uk – usopendata.org – opendatalatinamerica.org – open-data.europa.eu – data.gov.au – data.gov.in • Global and sectorial initiatives – e.g., GODAN
Why do people go for Open Data • Increase transparency of governments and institutions • Create new business opportunities • It is the way to go now
Linked Open Data
Linked Open Data in a nutshell • Like other “open” movements: open and free – You can have Linked Data with no open license, but today we think of Linked Open Data (LOD) • Any type of data, any domain • The format of choice: RDF – Various serializations possible – RDF/XML, Turtle, N-Triples, N-Quads, JSON-LD, Notation 3, TriX • Not just getting datasets out, but linked pieces of data
Why should I go for Linked Data? • To be able to reuse data published by others • To promote business – made by others or yourself • Not to be isolated, left behind in the information world • Yes but… is the game worth the candle?
Agris - a LOD-based application
Then, Open Data or Linked Data? • They can be seen as two steps along the same line • You should decide based on your situation and goals – Open Data requires less effort. Good if the data will be used primarily by others, or if you have no direct interest in linking to other datasets – Linked Open Data may be more complex because of the linking step. Good if you want to exploit the data yourself, e.g. to enhance your library or document repository catalogue with data produced by others
The Linked Data workflow
A typical Linked Data flow [Diagram. Stages: “before” the LOD, LOD storage, LOD exposure, data consumption. Elements: “original” dataset, maintenance in original format, conversion, maintenance in RDF, RDF store, RDF dump, SPARQL endpoint, HTML/RDF via content negotiation, LOD-based applications]
Data generation
Some remarks on RDF
RDF • RDF is simply triples – Subject, predicate, object – e.g., ID, dct:title, title • Triples may be serialized in various formats – RDF/XML, Turtle, N-Triples, N-Quads, JSON-LD, TriX
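A minimal sketch of the triple idea in Python, using only the standard library. It writes one triple in the N-Triples serialization. The subject URI http://example.org/book/1 is a hypothetical placeholder, and dct:title is expanded to its full Dublin Core Terms URI.

```python
# Dublin Core Terms namespace (the "dct:" prefix used on the slide).
DCT = "http://purl.org/dc/terms/"

def literal(text):
    """Quote a plain string as an N-Triples literal (escapes backslash, quote, newline)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'

def ntriple(subject_uri, predicate_uri, obj):
    """Format one triple as an N-Triples line; obj is an already formatted term."""
    return f"<{subject_uri}> <{predicate_uri}> {obj} ."

# Hypothetical subject URI, for illustration only.
line = ntriple("http://example.org/book/1", DCT + "title",
               literal("Perfect Art of Navigation"))
print(line)
```

The same triple could equally be written in Turtle or RDF/XML; only the syntax changes, not the data.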
The role of predicates • E.g., the dct:title on the previous slide, which indicates the “title” of a book • Important to expose the data without ambiguities • The recommendation is to use standards, or de facto standards, to facilitate reuse of data • Search for the vocabulary appropriate to your data, e.g. with http://lov.okfn.org/dataset/lov/index.html – Look also at the W3C Best Practices for Publishing Linked Data http://www.w3.org/TR/ld-bp/
Conversion from existing formats
Converting data to RDF • Many converters to RDF exist – A list at http://www.w3.org/wiki/ConverterToRdf • Conversion could be done as a one-time migration effort, or could be scheduled regularly – When done regularly, only to expose your data, your established data maintenance is not affected
A simple example of conversion
My dummy table
ID book | Author      | Title                            | Subject
1       | John Dee    | Perfect Art of Navigation        | Navigation, geography
2       | Jethro Tull | The new horse-houghing husbandry | Horse husbandry
1. Get some RDF [Diagram: record 1 with Author “John Dee”, Title “The perfect Art of Navigation”, Subject “Navigation”]
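This first step can be sketched in a few lines of Python with the standard library only. The sketch reads the dummy table as CSV text and emits N-Triples with plain literals, no links yet. The base URI http://example.org/book/ and the choice of a semicolon as separator are assumptions made for illustration.

```python
import csv
import io

DCT = "http://purl.org/dc/terms/"
BASE = "http://example.org/book/"  # hypothetical base URI for the records

# The dummy table as CSV text; semicolon-separated because the
# Subject column itself contains commas.
TABLE = """id;author;title;subject
1;John Dee;Perfect Art of Navigation;Navigation, geography
2;Jethro Tull;The new horse-houghing husbandry;Horse husbandry
"""

def quote(text):
    """Quote a string as an N-Triples literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def rows_to_ntriples(csv_text):
    """Turn each table row into creator, title and subject triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=";"):
        subject = f"<{BASE}{row['id']}>"
        triples.append(f"{subject} <{DCT}creator> {quote(row['author'])} .")
        triples.append(f"{subject} <{DCT}title> {quote(row['title'])} .")
        # One triple per subject keyword.
        for keyword in row["subject"].split(","):
            triples.append(f"{subject} <{DCT}subject> {quote(keyword.strip())} .")
    return triples

triples = rows_to_ntriples(TABLE)
print("\n".join(triples))
```

A script like this can be run on a schedule against an export of the original database, so the established maintenance flow stays untouched.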
2. Get some linked RDF [Diagram: <URI> dct:creator “John Dee”; <URI> dct:title “The perfect Art of Navigation”; <URI> dct:subject http://aims.fao.org/aos/agrovoc/c_15908 (Agrovoc URI)]
3. Get some more links [Diagram: <URI> dct:creator http://dbpedia.org/page/John_Dee; <URI> dct:title “The perfect Art of Navigation”; <URI> dct:subject http://aims.fao.org/aos/agrovoc/c_15908 (Agrovoc URI)]
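The linking steps amount to replacing string values with URIs wherever a match is known. A minimal Python sketch, assuming small hand-made lookup tables: the two links are the ones shown on the slides, while the record URIs under http://example.org/ are hypothetical.

```python
DCT = "http://purl.org/dc/terms/"

# Lookup tables holding the two links shown on the slides.
SUBJECT_LINKS = {"navigation": "http://aims.fao.org/aos/agrovoc/c_15908"}
AUTHOR_LINKS = {"john dee": "http://dbpedia.org/page/John_Dee"}

def term(value, links):
    """Return a URI term when a link is known, otherwise a plain literal."""
    uri = links.get(value.strip().lower())
    if uri:
        return f"<{uri}>"
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

def link_record(record_uri, author, subject):
    """Emit creator and subject triples, linked where possible."""
    return [
        f"<{record_uri}> <{DCT}creator> {term(author, AUTHOR_LINKS)} .",
        f"<{record_uri}> <{DCT}subject> {term(subject, SUBJECT_LINKS)} .",
    ]

# Hypothetical record URIs, for illustration only.
linked = link_record("http://example.org/book/1", "John Dee", "Navigation")
unlinked = link_record("http://example.org/book/2", "Jethro Tull", "Horse husbandry")
print("\n".join(linked + unlinked))
```

Values without a known link stay as literals, so the output remains valid RDF while the lookup tables grow over time.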
Data maintenance
Data maintenance • If data is regularly converted to RDF, the “old” maintenance flow is kept – But with the extra step of linking • If data is migrated to RDF once and for all, maintenance may become a problem – you may need a GUI
Linking your data
What can be linked? 1. Vocabularies used to describe and annotate the data - or ontologies – i.e., the properties of the triples - your “Title” and somebody else’s “Titulo” 2. The entities linked, the “objects” – i.e., the object of the triple – a specific author in your dataset to the same author in somebody else’s dataset, or in Wikipedia • Often, they are also called vocabularies, which may create confusion
1. Linking vocabularies • It is a research area – Ontology Alignment Evaluation Initiative (OAEI) – Note that “ontology” is often used as a generic term, also to mean rather simple vocabularies to describe data – ontology may sometimes also include “individuals”, e.g., country names, .. • Best solution is to go for standard vocabularies from the start! – When you design the conversion of your data
2. Linking “individuals” • Relatively simple problem, but few out-of-the-box tools – Usually the problem is data “cleanliness” – e.g., different name spelling, abbreviations, … • Best solution is to identify the top dataset(s) to link and start linking to it/them – Either manually or semi-automatically (automatic selection of candidate links, then manual check) – Data validation usually outside the rest of the data lifecycle
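The semi-automatic approach (automatic candidate selection, then manual check) can be sketched with Python's standard difflib module. This is only one possible heuristic, not the method used by any particular tool; the target dataset, its URIs and the 0.8 similarity cutoff are assumptions for illustration.

```python
import difflib
import unicodedata

def normalize(name):
    """Lowercase, strip accents and punctuation, and sort the name parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    # Sorting the parts makes "Dee, John" and "John Dee" compare as equal.
    return " ".join(sorted(cleaned.split()))

def candidate_links(local_names, target, cutoff=0.8):
    """Map each local name to candidate target URIs, for a later manual check."""
    by_norm = {normalize(label): uri for label, uri in target.items()}
    result = {}
    for name in local_names:
        close = difflib.get_close_matches(
            normalize(name), list(by_norm), n=3, cutoff=cutoff)
        result[name] = [by_norm[key] for key in close]
    return result

# Hypothetical target dataset: label -> URI.
TARGET = {
    "John Dee": "http://example.org/person/john-dee",
    "Jethro Tull": "http://example.org/person/jethro-tull",
}
candidates = candidate_links(["Dee, John", "Jethro Tul", "Unknown Author"], TARGET)
for name, uris in candidates.items():
    print(name, "->", uris)
```

Names with an empty candidate list are left for manual linking; names with candidates still need a human to confirm the match.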
Hint: Drupal for your catalogue
Drupal = a content management system • Allows you to: 1. import data from CSV, XML, RSS feeds 2. create RDF 3. maintain the data from a GUI 4. expose RDF • Good for your catalogues of documents, people, ... • You need to know Drupal, but no programming skills are required
Similar tools • AgriDrupal – Drupal customized for small institutions – Includes tools for automatic tagging with AGROVOC, which is a linked resource • ScratchPad – Customized for biodiversity data
If you want to have your thesaurus linked… • This is our experience - AGROVOC • Thesauri are used for document indexing (dct:subject “navigation”) • Steps: – Convert the thesaurus into a SKOS concept scheme – Use VocBench for data maintenance, including links – Use SKOSMOS for data visualization and search
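The first step, converting a thesaurus into a SKOS concept scheme, can be sketched as follows. This is a minimal illustration, not the AGROVOC conversion itself: the scheme URI, the term identifiers and the labels are hypothetical, and labels are assumed to contain no quotes or backslashes.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEME = "http://example.org/thesaurus"  # hypothetical scheme URI

# Hypothetical thesaurus entries: id -> (labels by language, broader id or None).
TERMS = {
    "c1": ({"en": "transport", "fr": "transport"}, None),
    "c2": ({"en": "navigation", "es": "navegación"}, "c1"),
}

def to_skos(terms):
    """Emit N-Triples describing the scheme, its concepts, labels and hierarchy."""
    lines = [f"<{SCHEME}> <{RDF_TYPE}> <{SKOS}ConceptScheme> ."]
    for term_id, (labels, broader) in terms.items():
        concept = f"<{SCHEME}/{term_id}>"
        lines.append(f"{concept} <{RDF_TYPE}> <{SKOS}Concept> .")
        lines.append(f"{concept} <{SKOS}inScheme> <{SCHEME}> .")
        # One preferred label per language, tagged with its language code.
        for lang, label in sorted(labels.items()):
            lines.append(f'{concept} <{SKOS}prefLabel> "{label}"@{lang} .')
        if broader:
            lines.append(f"{concept} <{SKOS}broader> <{SCHEME}/{broader}> .")
    return lines

skos_lines = to_skos(TERMS)
print("\n".join(skos_lines))
```

Language-tagged labels are what make a SKOS thesaurus usable for multilingual indexing and search.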
Data storage
Triple stores • Very many around, and also very many benchmarks to compare performance and functionality – Cf. http://www.w3.org/wiki/RdfStoreBenchmarking • Some tech know-how is needed to choose the best solution and keep it up and running
Data exposure
Various options 1. Provide a dump for download 2. Expose de-referenceable URIs 3. Expose a SPARQL endpoint 4. Expose web services
RDF dump for download • Pros – Simply a file to download – Data consumers work on their own copy, so access is under their control -> efficient, fast • Cons – The issue may be to keep the dump in sync – Need to decide a policy on versioning – Need to decide what to include in the dump (only the data? Also the links? ...)
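One simple versioning policy is to put the date in the dump's file name and publish a checksum next to it. A hedged Python sketch using only the standard library; the naming scheme and the dataset name "catalogue" are assumptions, not an established convention.

```python
import gzip
import hashlib
import os
import tempfile
from datetime import date

def write_dump(triples, directory, dataset="catalogue", day=None):
    """Write N-Triples lines to a dated, gzip-compressed dump file."""
    day = day or date.today()
    name = f"{dataset}-{day.isoformat()}.nt.gz"
    path = os.path.join(directory, name)
    payload = ("\n".join(triples) + "\n").encode("utf-8")
    with gzip.open(path, "wb") as handle:
        handle.write(payload)
    # Checksum of the uncompressed content, so consumers can detect changes.
    return path, hashlib.sha256(payload).hexdigest()

lines = ['<http://example.org/book/1> <http://purl.org/dc/terms/title> '
         '"Perfect Art of Navigation" .']
with tempfile.TemporaryDirectory() as tmp:
    dump_path, checksum = write_dump(lines, tmp, day=date(2015, 9, 9))
    dump_name = os.path.basename(dump_path)
    # Read the dump back to confirm it round-trips.
    with gzip.open(dump_path, "rt", encoding="utf-8") as handle:
        restored = handle.read().splitlines()
print(dump_name, checksum)
```

A consumer on a slow connection can compare checksums first and download the dump only when it has changed.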
De-referenceable URIs • Pros: – Data exposed is always up to date – Serving content for URIs – Simple back-ends are available that also render HTML - e.g. Pubby, Loddy • Cons: – Need to set up a content negotiation mechanism. Not a big issue, but the server must be up 24/7 – Data is accessible but not searchable by humans
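Content negotiation means the server picks HTML or an RDF serialization depending on the HTTP Accept header sent by the client. The Python sketch below shows the core decision only. It is a simplified reading of the header (it ignores media type parameters other than q and the finer precedence rules of HTTP), and the list of available representations is an assumption.

```python
# Representations this hypothetical server can produce, in order of preference.
AVAILABLE = ["text/html", "text/turtle", "application/rdf+xml"]

def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type, q = fields[0].lower(), 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if media_type:
            entries.append((media_type, q))
    # sorted() is stable, so ties keep the order given by the client.
    return sorted(entries, key=lambda item: item[1], reverse=True)

def negotiate(header, available=AVAILABLE):
    """Return the media type to serve, or None if nothing is acceptable (HTTP 406)."""
    for media_type, q in parse_accept(header or "*/*"):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
        elif media_type in available:
            return media_type
    return None

print(negotiate("text/turtle"))
print(negotiate("application/rdf+xml;q=0.9, text/html"))
```

In practice this logic usually lives in the web server configuration or in a front-end such as the ones named on the slide, rather than in hand-written code.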
SPARQL endpoint • Pros: – Not much work involved, typically the endpoint is provided by the triple store • Cons: – Requires 24/7 server availability – No limitations on queries -> may be heavy on the server side • Other solutions under study, e.g. http://linkeddatafragments.org
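From the consumer side, a SPARQL endpoint is queried over plain HTTP: the SPARQL 1.1 Protocol allows a GET request with the query passed in a "query" parameter. The Python sketch below only builds the request URL and sends nothing. The endpoint address is hypothetical; the subject URI is the Agrovoc one from the conversion example. The LIMIT clause is a courtesy that keeps the load on the server small.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

def titles_query(subject_uri, limit=10):
    """SPARQL query listing titles of records with the given dct:subject."""
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?record ?title WHERE {\n"
        f"  ?record dct:subject <{subject_uri}> ;\n"
        "          dct:title ?title .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )

def query_url(endpoint, query):
    """Build a GET request URL with the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})

url = query_url(ENDPOINT,
                titles_query("http://aims.fao.org/aos/agrovoc/c_15908", limit=5))
print(url)
```

The resulting URL can be fetched with any HTTP client, asking for a result format such as application/sparql-results+json in the Accept header.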
Web Services • Pros: – Known technology, good performance – More control over data access, less strain on the server – May be built on top of an RDF store • Cons: – Need to be implemented
Multilinguality
Multilingual vocabularies can help
In practice… An institution with limited resources wants to move to Linked Data. What to do?
You have at least two options 1. Consider your specific bottlenecks and go ahead on your own 2. Organize a collaboration – Effort on creating partnership, networks
AGRIS An example of collaborative approach to LOD
The AGRIS network [Diagram: data coordination and partners. Can be much smaller or bigger!]
The AGRIS network
…an original bibliographical record
…the same record in a mashup page http://agris.fao.org/agris-search/search.do?recordID=QM2007000047
Data Flow
AGRIS dataflow and processing
The AGRIMetaMaker
Metadata tools used by AGRIS Providers [Pie chart. Legend, in extracted order: WebAgris, AMM, OJS, Mendeley, WebAGRIS, PubMed, InMagic, DOAJ, GFIS system, Dspace, AgriDrupal, RISC, Others. Shares, in extracted order: 26%, 22%, 14%, 11%, 9%, 4%, 3%, 2%, 2%, 2%, 1%, 1%, 3%]
How linked data is produced
……using title and authors
……using key words
……using key words
…using the journal name or the ISSN
…using alignments between thesauri
http://agris.fao.org/agris-search/search.do?recordID=PL2009000495
http://agris.fao.org/agris-search/search.do?recordID=PH2011000084
Linking URIs
Linking vocabularies
Recap and Conclusions
1. Understand your own constraints
2. Keep an eye on tech improvements
3. Be smart from the start
In brief… • Start small: one dataset only (or a few) • Start relevant: choose a key dataset, either because it is central to your application, or because it is widely used (visibility) • Start from somewhere: try to reuse experience as much as possible • Go in steps: open first, then link • Look for collaborations
4. In union there is strength
Find your own union • Organize a consortium and maximize your resources • Look for experience and support from other organizations
Thank you! caterina.caracciolo@fao.org johannes.keizer@fao.org http://aims.fao.org
