Building a Graph-Based RDF Store for Apache Cassandra Name: Ravindra Ranwala ID: 138227T Supervisor: Dr. Amal Shehan Perera 1
Agenda ● Introduction ● Basic Concepts ● The Problem ● Literature Review ● Methodology ● Demo ● Evaluation and Result ● Conclusion 2
Introduction ● RDF is used to support queries on the Semantic Web. ● Large RDF stores can contain up to trillions of triples. ● Today RDF data is everywhere: commercial search engines such as Google, Yahoo and Bing have driven its proliferation. ● SPARQL is used as the query language. ● Different approaches exist for building a triple store. ● The main challenges are system scalability and generality. 3
Basic Concepts - RDF Triple ● An RDF dataset consists of statements of the form <subject, predicate, object>. ● The subject has a property, named by the predicate, whose value is the object. ● Example: <Titanic, has award, Best picture> ● The core of the Semantic Web is built on top of the RDF data model. ● These triples can be stored in different ways. 4
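As a minimal sketch of the triple model on this slide, the Python snippet below stores statements as (subject, predicate, object) tuples and answers a simple pattern in which `None` acts as a wildcard. The `match` helper is hypothetical, and only the first and third sample triples come from the slides; the second is made up for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Sample data: the "has award" and "name" triples appear in the slides,
# the "directed by" triple is an illustrative addition.
TRIPLES = [
    ("Titanic", "has award", "Best picture"),
    ("Titanic", "directed by", "James Cameron"),
    ("person1", "name", "Mike"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None matches anything."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


# "Which awards does Titanic have?" is the pattern (Titanic, has award, ?)
awards = [obj for (_, _, obj) in match(TRIPLES, s="Titanic", p="has award")]
print(awards)
```

A linear scan like this is only meant to show the data model; a real store would index the triples, which is where the storage approaches on the later slides differ.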
The Problem ● Apache Cassandra is a NoSQL, multi-tenant, multi-data-center database. ● Our objective is to build a scalable RDF store for Apache Cassandra. ● Cassandra is used by eBay, Twitter, Cisco, etc. ● An RDF store would significantly increase the value of Cassandra. ● The largest known Cassandra cluster has 300 TB of data over 400 machines. ● This motivates us to build a distributed, scalable RDF store that answers user queries over such data efficiently. 5
Literature Review - Concepts ● A triple store can be built on top of any DBMS or file system. ● An RDF dataset consists of statements of the form <subject, predicate, object>. ● The subject has a property, named by the predicate, whose value is the object. ● Example: <person1, name, Mike> ● A typical triple store holds millions or billions of such triples. ● Efficient and scalable management of RDF data is a fundamental challenge. ● SPARQL queries are submitted to the RDF store. Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, Zhongyuan Wang, "A Distributed Graph Engine for Web Scale RDF Data," 6
Apache Cassandra ● Distributed, fault-tolerant (i.e. no single point of failure), post-relational NoSQL database system. ● Peer-to-peer distributed architecture. Supports both strong and eventual consistency. ● All nodes are the same. There are no master or slave nodes. ● Uses a read/write-anywhere architecture. DataStax Corporation. (2011, October) “Welcome to Apache Cassandra 1.0” 7
Triple Store - Approaches ● Different approaches exist for managing RDF data. ● Each approach has its own advantages and disadvantages. 8
Relational Approach ● Triples are stored using the relational model. Justin J. Levandoski, Mohamed F. Mokbel, "RDF Data-Centric Storage," 9
Relational Approach (contd.) ● Triple store: yields costly self-joins over a huge RDF store (trillions of triples). ● N-ary table: eliminates the need for joins, but leads to a higher number of nulls. ● A third layout reduces null storage, but reintroduces costly joins. 10
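To illustrate why the triple-store layout leads to self-joins, here is a small sketch using Python's built-in `sqlite3` module. The table name `triples`, the sample rows and the query are assumptions made for illustration, not the schema from the cited paper. Each additional triple pattern in a query adds one more join of the table with itself.

```python
import sqlite3

# One three-column table holds every triple (the "triple store" layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("person1", "name", "Mike"),
        ("person1", "worksFor", "org1"),
        ("org1", "name", "Acme"),
        ("person2", "name", "Ann"),
    ],
)

# "Name of each person and name of the organisation they work for"
# uses three triple patterns, so the table is joined with itself twice.
rows = conn.execute(
    """
    SELECT t1.o AS person_name, t3.o AS org_name
    FROM triples AS t1
    JOIN triples AS t2 ON t2.s = t1.s AND t2.p = 'worksFor'
    JOIN triples AS t3 ON t3.s = t2.o AND t3.p = 'name'
    WHERE t1.p = 'name'
    """
).fetchall()
print(rows)
```

With only four rows the joins are trivial, but every join ranges over the whole triple table, which is the cost the slide refers to for very large stores.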
Graph-Based Approaches ● A newer approach that can greatly improve the performance of SPARQL query processing. ● Uses graph exploration instead of joins. ● Unnecessary intermediate results can be pruned. ● Models RDF data in its native graph form. ● Examples: Trinity, TripleRush, etc. 11
Trinity RDF ● Graph-based implementation. Models RDF as a directed graph. ● Subjects and objects are represented as nodes. ● Each predicate is represented as a directed, labelled edge. ● The graph is stored in memory for fast access. B. Shao, H. Wang, and Y. Li, "The Trinity graph engine," Technical Report 161291, Microsoft Research 12
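The graph model on this slide can be sketched in a few lines of Python: subjects and objects become nodes, predicates become labelled outgoing edges, and a query is answered by walking edges rather than joining tables. This is an illustrative in-memory adjacency list, not Trinity's actual data structure; `build_graph`, `follow` and the sample triples are hypothetical.

```python
from collections import defaultdict


def build_graph(triples):
    """Map each subject node to its outgoing edges: predicate -> set of objects."""
    out_edges = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        out_edges[s][p].add(o)
    return out_edges


def follow(graph, start_nodes, predicate):
    """One exploration step: follow `predicate` edges from every start node."""
    reached = set()
    for node in start_nodes:
        if node in graph:  # avoid creating empty entries in the defaultdict
            reached |= graph[node].get(predicate, set())
    return reached


triples = [
    ("person1", "name", "Mike"),
    ("person1", "worksFor", "org1"),
    ("org1", "name", "Acme"),
]
graph = build_graph(triples)

# Two exploration steps replace the two self-joins a triple table would need.
orgs = follow(graph, {"person1"}, "worksFor")
org_names = follow(graph, orgs, "name")
print(org_names)
```

Each step only touches the edges of nodes reached so far, which is how exploration keeps intermediate results small.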
Trinity Architecture ● Distributed in-memory key-value store. ● Partitions the RDF graph across multiple machines by hashing on the nodes. ● Each machine holds a disjoint part of the graph. ● The final result is assembled at the proxy. Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, Zhongyuan Wang, "A Distributed Graph Engine for Web Scale RDF Data," 13
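A minimal sketch of partitioning by hashing on nodes, assuming a fixed number of machines and a stable hash from `hashlib`. The function names, and the choice to place each triple on the machine that owns its subject, are assumptions for illustration and are not taken from the Trinity papers.

```python
import hashlib
from collections import defaultdict


def machine_for(node: str, num_machines: int) -> int:
    """Pick a machine for a node using a hash that is stable across runs."""
    digest = hashlib.md5(node.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def partition(triples, num_machines):
    """Assign every triple to the machine that owns its subject node."""
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[machine_for(s, num_machines)].append((s, p, o))
    return parts


triples = [
    ("person1", "name", "Mike"),
    ("person1", "worksFor", "org1"),
    ("org1", "name", "Acme"),
    ("person2", "name", "Ann"),
]
parts = partition(triples, num_machines=3)
for machine, owned in sorted(parts.items()):
    print(machine, owned)
```

Because a node always hashes to the same machine, the partitions are disjoint and any machine (or the proxy) can compute where a node lives without a lookup table.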
Methodology ● Use case Scenarios ○ Populating data into Cassandra Cluster ○ Building the RDF Graph ○ Querying the RDF Graph ○ Dropping the RDF Store ● Technologies used. ○ Apache Jena RDF API ○ Struts 2 ○ Java/JSP/XSLT/XML/XPath 14
System Architecture 15
Demo 16
Evaluation and Result ● The DBpedia benchmark was used for the comparison. ● The DBpedia geo-coordinates and homepages dataset was used. It contains about 0.7 million triples. ● Our implementation was compared with the 4store and Bigdata RDF stores. ● Queries used ○ Query One: Finds the homepage of the Metropolitan Museum of Art. ○ Query Two: Finds the homepage of Kevin_Bacon. ○ Query Three: Finds all resources located near Berlin, together with their homepages. ○ Query Four: Finds all resources located near New York, together with their homepages. 17
Benchmark Results ● Query complexity increases from Q1 through Q4. ● The table shows the time each RDF store took to execute the four queries above. ● Query execution time is measured in ms.

Store                Q1    Q2    Q3     Q4
Our implementation   216   7     336    279
4store               16    18    455    416
Bigdata              41    30    2355   1600

DBpedia. (2008, Jan 10.) RDF Store Benchmarks with DBpedia [Online]. Available: http://wifo5-03.informatik.uni-mannheim.de/benchmarks-200801/ 18
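The relative differences between the stores can be worked out directly from the reported timings. The sketch below uses only the figures on this slide, converted to milliseconds, and computes how many times longer each competing store took than our implementation; a value below 1 means the other store was faster. The function name `slowdown_vs_ours` is introduced here for illustration.

```python
# Execution times in ms for Q1..Q4, as reported on the slide.
times_ms = {
    "Our implementation": [216, 7, 336, 279],
    "4store": [16, 18, 455, 416],
    "Bigdata": [41, 30, 2355, 1600],
}


def slowdown_vs_ours(times):
    """Return, per competing store, its time divided by ours for each query."""
    ours = times["Our implementation"]
    return {
        store: [round(t / o, 2) for t, o in zip(values, ours)]
        for store, values in times.items()
        if store != "Our implementation"
    }


ratios = slowdown_vs_ours(times_ms)
for store, per_query in ratios.items():
    print(store, per_query)
```

Ratios above 1 on Q2 through Q4 and below 1 on Q1 match the pattern discussed in the analysis slide.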
Benchmarking Results 19
Benchmarking Results 20
Benchmarking Analysis ● The graph-based approach yields larger performance gains as queries become more complex. ● Complexity increases gradually from Query 1 to Query 4. ● This implementation outperforms 4store and Bigdata on Queries 2 through 4. ● The first query is slower because it also builds the index structure. 21
Future Work ● The main limitation of the approach is scalability. ● Larger datasets cause an OutOfMemory error while building the graph model. ● Solution: a distributed implementation. 22
Conclusion ● Approaches used to model and retrieve RDF data. ● New approaches to manage RDF data efficiently. ● Graph-based approach. ● New implementation ○ Use case scenarios ○ Evaluation and results using the DBpedia dataset ○ Benchmark analysis 23
