A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases PhD Qualifying Exam Luiz Henrique Zambom Santana Advisor: Prof. Dr. Ronaldo dos Santos Mello UFSC/CTC/INE/PPGCC
Agenda ● Introduction: Motivation, objectives, and contributions ● Background ○ RDF ○ NoSQL ● State of the Art ○ Open Issues ● Rendezvous ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping ○ Querying: Query decomposition and Caching ● Evaluation ● Schedule
Introduction: Motivation ● RDF is currently widespread: ○ Best Buy: ■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html ○ Globo.com: ■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013 ○ US data.gov: ■ https://www.data.gov/developers/semantic-web
Introduction: Motivation (LOD stats)
Introduction: Objectives This PhD thesis proposal presents Rendezvous, a middleware for storing massive RDF graphs. The middleware includes a novel data partitioning approach, a fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models, and a caching structure that accelerates query response.
Introduction: Contributions ● (i) a mapping of RDF data to the columnar, document, and key/value NoSQL data models (SADALAGE; FOWLER, 2012) ● (ii) a workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload ● (iii) a caching schema based on key/value databases for speeding up the query response time ● (iv) an experimental evaluation that compares the current version of our approach against two baselines (Rainbow (GU; HU; HUANG, 2015) and ScalaRDF (HU et al., )) by considering Redis, Apache Cassandra and MongoDB, the most popular key/value, columnar and document NoSQL databases, respectively
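Contribution (i) can be pictured with a small sketch. The three layouts below are assumptions made for illustration (subject-plus-predicate keys, one document per subject, one table per predicate); the slides do not specify the exact schemas Rendezvous uses, and the sample triples are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples; names are illustrative only.
triples = [
    ("alice", "worksFor", "ufsc"),
    ("alice", "name", "Alice"),
    ("bob", "worksFor", "ufsc"),
]

def to_key_value(triples):
    """Key/value layout (assumed): one entry per 'subject:predicate' key."""
    store = defaultdict(list)
    for s, p, o in triples:
        store[f"{s}:{p}"].append(o)
    return dict(store)

def to_documents(triples):
    """Document layout (assumed): one document per subject, predicates as fields."""
    docs = {}
    for s, p, o in triples:
        doc = docs.setdefault(s, {"subject": s})
        doc.setdefault(p, []).append(o)
    return list(docs.values())

def to_columnar(triples):
    """Columnar layout (assumed): one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)
```

The same triple ends up addressable by key, by subject document, or by predicate table, which is what lets a middleware pick a store per query shape.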
Background: RDF and SPARQL
Background: NoSQL
State of the Art - Non-NoSQL Triplestores WARP (h-hop replication), YARS, Hexastore (multiple indexes), 4store, SPIDER, RDF-3X, SHARD, SW-Store (vertical partition), SOLID, SPOVC (horizontal partition), and S2X
State of the Art - NoSQL Triplestores
State of the Art - NoSQL Triplestores RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF.
State of the Art - Categories ● RDF/NoSQL Converters ● Polystores/Multimodel ● In-memory Rainbow (GU; HU; HUANG, 2015) Amada (Aranda-Andújar, 2012)
State of the Art ● BUGIOTTI, F. et al. Invisible glue: scalable self-tuning multi-stores. In: Conference on Innovative Data Systems Research (CIDR). [S.l.: s.n.], 2015.
State of the Art - Open Issues ● To avoid indexing all the triple component permutations ● To consider workload and the usage of statistics for data partitioning ● To exploit in-memory possibilities ● To combine RDF storage with multiple NoSQL models
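The first open issue can be made concrete with a short sketch. A triple has six component orderings (the Hexastore-style full set), but a known workload may need far fewer. The selection rule below is an assumption for illustration, not the indexing scheme of Rendezvous: an ordering serves a pattern when the pattern's bound components form a prefix of it.

```python
from itertools import permutations

# All six orderings of subject (S), predicate (P), and object (O).
all_orderings = ["".join(p) for p in permutations("SPO")]

def covers(ordering, bound):
    """An ordering serves a pattern if its bound components form a prefix."""
    return set(ordering[:len(bound)]) == set(bound)

def needed_indexes(workload):
    """Greedily pick orderings until every pattern in the workload is served."""
    chosen = []
    for bound in workload:
        if not any(covers(o, bound) for o in chosen):
            chosen.append(next(o for o in all_orderings if covers(o, bound)))
    return chosen

# Hypothetical workload: each set lists the bound components of a pattern.
workload = [{"P"}, {"S", "P"}, {"P", "O"}]
print(needed_indexes(workload))
```

For this hypothetical workload two orderings are enough, so four of the six indexes would never be built.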
Rendezvous: Architecture
Rendezvous: Storing ● Fragmentation ● Indexing ● Partitioning ● Mapping
Storing: Fragmentation
Storing: Indexing
Storing: Indexing - Simple queries and fragmentation
Storing: Partitioning ● Partitioning is used when the dataset exceeds the capacity of a single server
Rendezvous: Querying ● Query decomposition ● Caching
Querying: Decomposition
Querying: Decomposition
Q1: SELECT ?x WHERE { ?x p5 ?y . ?y p2 ?z . }
Q2: SELECT ?x WHERE { ?x p9 ?y . M p10 ?y . }
D1: db.partition1.find({p5: {$exists: true}, p2: {$exists: true}})
D2: db.partition1.find({p9: {$exists: true}, subject: "M"})
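The translation from Q1 and Q2 to D1 and D2 can be sketched as a small function that builds a MongoDB-style filter dictionary. This is a minimal sketch that only mirrors the two examples on this slide: `build_filter` is a hypothetical helper, it treats terms starting with `?` as variables, and it ignores object constants and join conditions that a full translator would have to handle.

```python
def build_filter(patterns):
    """Build a MongoDB-style filter from (subject, predicate, object) patterns."""
    query = {}
    for s, p, o in patterns:
        if s.startswith("?"):
            # Variable subject: the document must carry this predicate field.
            query[p] = {"$exists": True}
        else:
            # Constant subject: select that subject's document.
            query["subject"] = s
    return query

q1 = [("?x", "p5", "?y"), ("?y", "p2", "?z")]
q2 = [("?x", "p9", "?y"), ("M", "p10", "?y")]
print(build_filter(q1))
print(build_filter(q2))
```

The resulting dictionaries have the same shape as the arguments of `find` in D1 and D2.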
Querying: Decomposition
Q3: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w . }
C1: SELECT S1,O1 FROM p1
    SELECT S2,O2 FROM p2 WHERE O=S1
    SELECT S3,O3 FROM p3 WHERE O=S2 AND S=D (C1)
Q4: SELECT ?x WHERE { ?x p2 ?y . ?y p3 ?z . ?x p5 ?w . ?w p9 ?k . L p11 ?k . }
SQ5: SELECT ?x WHERE { ?x p2 ?y . ?y p3 ?z . ?x p5 ?w . }
SQ6: SELECT ?x WHERE { ?x p5 ?w . ?w p9 ?k . L p11 ?k . }
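A chain query such as Q3 can be evaluated over per-predicate tables by extending partial paths one predicate at a time. The sketch below assumes each predicate is stored as a list of (subject, object) pairs and performs the join in the middleware; `chain_join` and the sample rows are hypothetical and only illustrate the idea behind C1.

```python
def chain_join(tables, predicates):
    """Join per-predicate (subject, object) tables along a chain of predicates."""
    # Rows of the first predicate are the initial partial paths.
    paths = list(tables.get(predicates[0], []))
    for pred in predicates[1:]:
        # Index the next table by subject so each path is extended by lookup.
        by_subject = {}
        for s, o in tables.get(pred, []):
            by_subject.setdefault(s, []).append(o)
        # Keep only paths whose last node is the subject of some row.
        paths = [path + (o,) for path in paths for o in by_subject.get(path[-1], [])]
    return paths

# Hypothetical data: only the path a -> b -> c -> d spans all three predicates.
tables = {
    "p1": [("a", "b"), ("e", "f")],
    "p2": [("b", "c")],
    "p3": [("c", "d")],
}
result = chain_join(tables, ["p1", "p2", "p3"])
print([path[0] for path in result])  # bindings of ?x
```

Each tuple in `result` binds ?x, ?y, ?z, and ?w in order, so projecting the first element answers Q3.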
Querying: Decomposition
Q5: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w . ?z p5 ?k . ?k p6 G . ?k p7 I . ?k p8 H }
P1: { ?k p6 G . ?k p7 I . ?k p8 H }
P2: { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }
P3: { ?z p5 ?k }
Querying: Caching
Querying: Caching (chain queries)
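A key/value cache for query results can be sketched as follows. This is an illustration under stated assumptions, not the caching schema of Rendezvous: a Python dictionary stands in for a key/value store such as Redis, the cache key is the sorted list of triple patterns, and invalidation is done per predicate. `QueryCache` and its methods are hypothetical names.

```python
class QueryCache:
    """Dict-backed stand-in for a key/value store."""

    def __init__(self):
        self._results = {}       # pattern key -> cached result
        self._by_predicate = {}  # predicate -> keys that depend on it

    @staticmethod
    def key(patterns):
        # Sort patterns so the same patterns in a different order share a key.
        return "|".join(" ".join(p) for p in sorted(patterns))

    def get_or_compute(self, patterns, compute):
        k = self.key(patterns)
        if k not in self._results:
            self._results[k] = compute(patterns)
            for _, pred, _ in patterns:
                self._by_predicate.setdefault(pred, set()).add(k)
        return self._results[k]

    def invalidate(self, predicate):
        # Drop every cached result that touched the updated predicate.
        for k in self._by_predicate.pop(predicate, set()):
            self._results.pop(k, None)
```

Per-predicate invalidation is one simple way to limit stale results after updates; it trades some extra recomputation for consistency.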
Evaluation ● LUBM: ontology for the university domain, synthetic RDF data scalable to any size, and 14 extensional queries representing a variety of properties ● Generated dataset with 4000 universities (around 100 GB, around 500 million triples) ● 12 queries with joins: all of them have at least one subject-subject join, and six also have at least one subject-object join ● Apache Jena 3.2.0 with Java 1.8, Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10 ● Amazon m3.xlarge spot instance with 7.5 GB of memory and 1 x 32 GB SSD
Evaluation: Rendezvous vs. Rainbow
Evaluation: Rendezvous vs. ScalaRDF
Evaluation: Conclusions ● Fragments are scalable ● Bigger boundaries do not necessarily imply a bigger storage size ● Graph-aware partitions perform better than native NoSQL partitions ● Near cache is fast, but it makes it more difficult to keep data consistent
Evaluation: Future Work ● Compression of triples during storage ● Update and delete operations ● Other NoSQL types (e.g., graph) ● Better datasets
Schedule ● Middleware development (continuously until 2018) ○ Compression ○ Graph database ○ More complex and abstract workload awareness ● Submission of papers (continuously until 2018) ○ Special Interest Group on Management of Data (SIGMOD) ○ Very Large Data Bases (VLDB) ○ IEEE Transactions on Knowledge and Data Engineering (TKDE) ● Defense of the PhD thesis (2019)
LUBM model
