Deep dive into the native multi-model database ArangoDB

Frank Celler
Percona Live 2016, Santa Clara, 20 April 2016
www.arangodb.com

ArangoDB is a multi-model database

Features: ArangoDB
- is a document store, a key/value store and a graph database,
- offers convenient queries (via HTTP/REST and AQL), including joins between different collections and graph queries,
- provides configurable consistency guarantees using transactions.

=> Allows polyglot persistence with multiple instances of a single technology.

ArangoDB is extensible by JavaScript code

The Foxx Microservice Framework allows you to extend the HTTP/REST API with your own routes, which you implement in JavaScript running on the database server, with direct access to the C++ DB engine.

Unprecedented possibilities for data-centric services:
- custom-made complex queries or authorizations,
- schema validation,
- push feeds, etc.

ArangoDB is a Data Center Operating System app

These days, computing clusters run Data Center Operating Systems.

Idea: distributed applications can be deployed as easily as one installs a mobile app on a phone.
- Cluster resource management is automatic.
- This leads to significantly better resource utilization.
- Fault tolerance, self-healing and automatic failover are guaranteed.

The Multi-Model Approach

A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models.

Important:
- it is able to compete with specialised products on their turf,
- it allows for polyglot persistence using a single database technology.

In a microservice architecture, there will be several different deployments.

Performance
https://www.arangodb.com/2015/10/benchmark-postgresql-mongodb-arangodb/

Why is multi-model possible at all? Document stores and key/value stores

- Document stores have a primary key, so they are key/value stores.
- Without secondary indexes, performance is nearly as good as with opaque data instead of JSON.
- Good horizontal scalability can be achieved for key lookups.

Horizontal scalability

Experiment: single-document writes (1 kB per document) on clusters of 8 to 80 machines (64 to 640 vCPUs), with another 4 to 40 load servers, running on AWS.
https://mesosphere.com/blog/2015/11/30/arangodb-benchmark-dcos/

Why is multi-model possible at all? Document stores and graph databases

A graph database needs:
- a way to associate arbitrary data with vertices and edges, so JSON documents are a good choice,
- a good edge index giving fast access to neighbours (this can be a secondary index),
- graph support in the query language,
- implementations of graph algorithms in the DB engine.

https://www.arangodb.com/2016/04/index-free-adjacency-hybrid-indexes-graph-databases/
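
To make the edge-index point concrete, here is a minimal Python sketch of the idea: edges are ordinary documents carrying `_from` and `_to`, and a secondary index built from two hash maps gives direct access to neighbours. The class, its method names and the sample vertex ids are assumptions made for illustration; this is not ArangoDB's actual implementation.

```python
from collections import defaultdict


class EdgeIndex:
    """Toy edge index: two hash maps from a vertex id to the edges touching it."""

    def __init__(self):
        self._out = defaultdict(list)  # _from -> edges leaving that vertex
        self._in = defaultdict(list)   # _to   -> edges entering that vertex

    def insert(self, edge):
        # An edge is a plain document; only _from and _to are indexed.
        self._out[edge["_from"]].append(edge)
        self._in[edge["_to"]].append(edge)

    def neighbours(self, vertex, direction="outbound"):
        if direction == "outbound":
            return [e["_to"] for e in self._out.get(vertex, [])]
        if direction == "inbound":
            return [e["_from"] for e in self._in.get(vertex, [])]
        raise ValueError("direction must be 'outbound' or 'inbound'")


index = EdgeIndex()
index.insert({"_from": "cities/berlin", "_to": "countries/germany", "type": "is-in"})
index.insert({"_from": "cities/cologne", "_to": "countries/germany", "type": "is-in"})
index.insert({"_from": "countries/germany", "_to": "continents/europe", "type": "is-in"})

outbound = index.neighbours("cities/berlin")
inbound = index.neighbours("countries/germany", "inbound")
```

Each neighbour lookup is one hash access plus a walk over the matching edges, so its cost does not grow with the total number of edges in the collection.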

Replication and Sharding

ArangoDB provides (version 2.8, January 2016):
- sharding with automatic data distribution,
- easy setup of (asynchronous) replication (cluster and single server),
- fault tolerance by automatic failover,
- full integration with Apache Mesos and Mesosphere DC/OS.

Work in progress (version 3.0, RC in April 2016):
- synchronous replication in cluster mode,
- zero administration by a self-repairing and self-balancing cluster architecture.

Data-Center Operating Systems

Resource management:
- installation should be as easy as possible,
- integration into the resource management of the data center gives better resource utilisation,
- full integration with Apache Mesos and Mesosphere DC/OS.

Work in progress:
- Mesosphere DC/OS is a very mature open-source solution,
- integration with Kubernetes and Docker Swarm is planned for later this year.

About Mesosphere’s DC/OS https://dcos.io

Installing Mesosphere’s DC/OS https://dcos.io

Powerful query language

AQL, the built-in Arango Query Language,
- allows complex, powerful and convenient queries,
- has transaction semantics,
- supports joins,
- is independent of the driver used,
- offers protection against injections by design.

Extensible through JavaScript

The Foxx Microservice Framework allows you to extend the HTTP/REST API with your own routes, which you implement in JavaScript running on the database server, with direct access to the C++ DB engine.

Unprecedented possibilities for data-centric services:
- complex queries or authorizations, schema validation, push feeds, etc.,
- easy deployment via web interface or REST API,
- automatic API description through Swagger => discoverability of services.

Use case: aircraft fleet management

One of our customers uses ArangoDB to
- store each part, component, unit or aircraft as a document,
- model containment as a graph,
- thus easily find all parts of some component,
- keep track of maintenance intervals,
- perform queries orthogonal to the graph structure,
- thereby get good efficiency for all needed queries.

http://radar.oreilly.com/2015/07/data-modeling-with-multi-model-databases.html
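
The two query styles in this use case can be sketched in a few lines of Python: a traversal that follows containment edges, and a query "orthogonal" to the graph that only looks at document attributes. All document keys, the `next_check_days` attribute and both helper functions are hypothetical and stand in for the customer's real data model, which the talk does not show.

```python
from collections import deque

# Hypothetical documents: every part, component or aircraft is one document.
parts = {
    "aircraft/A1": {"kind": "aircraft", "next_check_days": 90},
    "engine/E1": {"kind": "component", "next_check_days": 12},
    "pump/P1": {"kind": "part", "next_check_days": 45},
    "pump/P2": {"kind": "part", "next_check_days": 5},
    "seat/S1": {"kind": "part", "next_check_days": 300},
}

# Hypothetical containment edges: parent -> direct children.
contains = {
    "aircraft/A1": ["engine/E1", "seat/S1"],
    "engine/E1": ["pump/P1", "pump/P2"],
}


def all_parts_of(root):
    """Graph query: breadth-first traversal along the containment edges."""
    found, queue = [], deque([root])
    while queue:
        current = queue.popleft()
        for child in contains.get(current, []):
            found.append(child)
            queue.append(child)
    return found


def due_within(days):
    """Document query, orthogonal to the graph: filter on one attribute."""
    return sorted(key for key, doc in parts.items() if doc["next_check_days"] <= days)


engine_parts = all_parts_of("engine/E1")
aircraft_parts = all_parts_of("aircraft/A1")
due_soon = due_within(14)
```

The first query is cheap because of the graph structure, the second because of an ordinary attribute lookup; a multi-model store can serve both from the same data.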

Use case: rights management

Rights management in a relational model is hard:
- it looks like a forest at first,
- then exceptions pop up:
  - one company sub-contracts another for a special station,
  - an engineer works for two companies,
  - someone needs special permissions when acting as a proxy.

This is much easier to express as a graph structure.

Use case: e-commerce

AboutYou uses ArangoDB to
- create channels showing new products,
- allow recommendations to friends,
- show celebrities presenting new fashion,
- blog about fashion products,
- run nightly business analysis,
- provide a news stream.

https://www.arangodb.com/case-studies/aboutyou-data-driven-personalization-with-arangodb/

First deployment: a simple key/value store

One collection “data”, with indexes on “value” (sorted) and “name” (hash).
- Single document requests
- Indexes possible
- Range queries possible
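
As a rough illustration of what the two index types in this demo buy you, the following Python sketch models a collection with a primary key lookup, a hash index on "name" and a sorted index on "value". The `TinyCollection` class and its methods are invented for this sketch; it only mirrors the access patterns, not ArangoDB's storage engine.

```python
import bisect


class TinyCollection:
    """Toy collection with a hash index on 'name' and a sorted index on 'value'."""

    def __init__(self):
        self._docs = {}       # primary index: key -> document
        self._by_name = {}    # hash index:    name -> set of keys
        self._by_value = []   # sorted index:  list of (value, key)

    def insert(self, key, doc):
        self._docs[key] = doc
        self._by_name.setdefault(doc["name"], set()).add(key)
        bisect.insort(self._by_value, (doc["value"], key))

    def get(self, key):
        """Single document request by primary key."""
        return self._docs.get(key)

    def by_name(self, name):
        """Equality lookup: this is all a hash index can do."""
        return sorted(self._by_name.get(name, ()))

    def value_range(self, low, high):
        """Keys of all documents with low <= value <= high, in value order."""
        start = bisect.bisect_left(self._by_value, (low,))
        result = []
        for value, key in self._by_value[start:]:
            if value > high:
                break
            result.append(key)
        return result


data = TinyCollection()
data.insert("k1", {"name": "alice", "value": 10})
data.insert("k2", {"name": "bob", "value": 25})
data.insert("k3", {"name": "alice", "value": 17})

single = data.get("k2")
alices = data.by_name("alice")
in_range = data.value_range(15, 30)
```

The hash index answers equality lookups only, while the sorted index also supports range queries, which is why the demo puts the sorted index on the attribute that is queried by range.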

Second deployment: a microservice as a Foxx app

A simple TODO app, deployed from the app store with the web UI.
- REST/JSON API available
- Swagger generates the API description automatically

Third deployment: a single-server graph database

Graph “worldCountry” with vertex collection “worldVertex” and edge collection “worldEdges”; links from cities to countries to continents to world.
- Show some graph traversals.
- Show graph viewer.

Fourth deployment: a multi-model application

Some data from a web shop. Show some queries.

Life of a query

1. Text and query parameters come from the user.
2. Parse the text, produce an abstract syntax tree (AST).
3. Substitute query parameters.
4. First optimisation: constant expressions, etc.
5. Translate the AST into an execution plan (EXP).
6. Optimise one EXP, produce many, potentially better EXPs.
7. Reason about distribution in the cluster.
8. Optimise distributed EXPs.
9. Estimate costs for all EXPs, and sort by ascending cost.
10. Instantiate the “cheapest” plan, i.e. set up the execution engine.
11. Distribute and link up engines on different servers.
12. Execute the plan, provide a cursor API.
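
Steps 2 to 4 can be illustrated with a small Python sketch. Because parameters are substituted into an already parsed tree, a parameter value can only ever become a value node and cannot change the structure of the query, which is one way to read the "protection against injections by design" claim. The tuple-based AST shape and both functions below are assumptions made for this sketch, not ArangoDB's internal data structures.

```python
# AST nodes are tuples (an illustrative shape, not ArangoDB's):
#   ("value", v), ("param", name), ("attr", var, field),
#   ("op", symbol, left, right)


def substitute(node, params):
    """Replace ("param", name) nodes by value nodes; the tree shape never changes."""
    kind = node[0]
    if kind == "param":
        return ("value", params[node[1]])
    if kind == "op":
        return ("op", node[1], substitute(node[2], params), substitute(node[3], params))
    return node


def fold_constants(node):
    """Evaluate + and * when both operands are constant values."""
    if node[0] != "op":
        return node
    symbol = node[1]
    left = fold_constants(node[2])
    right = fold_constants(node[3])
    if left[0] == "value" and right[0] == "value":
        if symbol == "+":
            return ("value", left[1] + right[1])
        if symbol == "*":
            return ("value", left[1] * right[1])
    return ("op", symbol, left, right)


# AST for the expression:  a.x == @base + 2 * 3
ast = ("op", "==", ("attr", "a", "x"),
       ("op", "+", ("param", "base"), ("op", "*", ("value", 2), ("value", 3))))

bound = substitute(ast, {"base": 4})
folded = fold_constants(bound)

# A hostile parameter value stays a plain string value inside the tree.
name_check = ("op", "==", ("attr", "a", "name"), ("param", "n"))
hostile = substitute(name_check, {"n": "x' || true"})
```

After folding, the comparison is against the single constant 10, so the arithmetic is done once at planning time rather than once per document.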

Execution plans

Query:
    FOR a IN collA
      LET xx = a.x
      FOR b IN collB
        FILTER xx == b.y
        RETURN {x: a.x, z: b.z}

Query → EXP (plan nodes in pipeline order):
    Singleton
    EnumerateCollection a
    Calculation xx
    EnumerateCollection b
    Calculation xx == b.y
    Filter xx == b.y
    Calc {x: a.x, z: b.z}
    Return {x: a.x, z: b.z}

- Black arrows are dependencies.
- Think of a pipeline.
- Each node provides a cursor API.
- Blocks of “items” travel through the pipeline.
- What is an “item”?

Pipeline and items

Query (first part):
    FOR a IN collA
      LET xx = a.x
      FOR b IN collB

Plan nodes: Singleton → EnumerateCollection a → Calculation xx → EnumerateCollection b
- After Singleton, items have no vars.
- After Calculation xx, items have vars a, xx.

- Items are the things traveling through the pipeline.
- An item holds the values of those variables in the current frame.
- Thus, items look different in different parts of the plan.
- We always deal with blocks of items for performance reasons.
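
The pipeline idea can be sketched with Python generators: every node pulls blocks of items from the node it depends on, and an item is simply a dict of the variables defined so far. The node functions, the block size and the sample collections are assumptions for illustration; the real engine is written in C++ and differs in many details.

```python
BLOCK_SIZE = 2  # tiny on purpose; the real block size is an engine detail

coll_a = [{"x": 1}, {"x": 2}, {"x": 3}]
coll_b = [{"y": 1, "z": "p"}, {"y": 3, "z": "q"}, {"y": 3, "z": "r"}]


def blocks(items):
    """Group a stream of items into lists of at most BLOCK_SIZE items."""
    block = []
    for item in items:
        block.append(item)
        if len(block) == BLOCK_SIZE:
            yield block
            block = []
    if block:
        yield block


def singleton():
    yield [{}]  # one block holding one item that has no variables yet


def enumerate_collection(upstream, var, collection):
    def items():
        for block in upstream:
            for item in block:
                for doc in collection:
                    yield {**item, var: doc}  # the item gains variable `var`
    return blocks(items())


def calculation(upstream, var, fn):
    for block in upstream:
        yield [{**item, var: fn(item)} for item in block]


def filter_node(upstream, var):
    def items():
        for block in upstream:
            for item in block:
                if item[var]:
                    yield item
    return blocks(items())


def return_node(upstream, var):
    for block in upstream:
        for item in block:
            yield item[var]


# FOR a IN collA LET xx = a.x FOR b IN collB FILTER xx == b.y RETURN {x: a.x, z: b.z}
plan = singleton()
plan = enumerate_collection(plan, "a", coll_a)
plan = calculation(plan, "xx", lambda i: i["a"]["x"])
plan = enumerate_collection(plan, "b", coll_b)
plan = calculation(plan, "cond", lambda i: i["xx"] == i["b"]["y"])
plan = filter_node(plan, "cond")
plan = calculation(plan, "out", lambda i: {"x": i["a"]["x"], "z": i["b"]["z"]})
result = list(return_node(plan, "out"))
```

Nothing runs until `return_node` is consumed, which mirrors the cursor API: the final node asks its dependency for the next block, and the request propagates back to the singleton.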

Move filters up

Query as written:
    FOR a IN collA
      FOR b IN collB
        FILTER a.x == 10
        FILTER a.u == b.v
        RETURN {u:a.u,w:b.w}

Plan: Singleton → EnumColl a → EnumColl b → Calc a.x == 10 → Filter a.x == 10 → Calc a.u == b.v → Filter a.u == b.v → Return {u:a.u,w:b.w}

The result and behaviour do not change if the first FILTER is pulled out of the inner FOR.

Move filters up

Query after the optimisation:
    FOR a IN collA
      FILTER a.x == 10
      FOR b IN collB
        FILTER a.u == b.v
        RETURN {u:a.u,w:b.w}

Plan: Singleton → EnumColl a → Calc a.x == 10 → Filter a.x == 10 → EnumColl b → Calc a.u == b.v → Filter a.u == b.v → Return {u:a.u,w:b.w}

With the first FILTER pulled out of the inner FOR, the result and behaviour are unchanged, but the number of items traveling in the pipeline is decreased. Note that the two FOR statements could be interchanged!
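
A small Python sketch makes the saving measurable: both variants return the same rows, but the second one creates fewer (a, b) items. The sample collections and the two functions are made up for this illustration.

```python
coll_a = [{"x": 10, "u": 1}, {"x": 7, "u": 2}, {"x": 10, "u": 3}, {"x": 3, "u": 1}]
coll_b = [{"v": 1, "w": "p"}, {"v": 3, "w": "q"}, {"v": 5, "w": "r"}]


def filters_inside():
    """Both FILTERs after the inner FOR: every (a, b) pair becomes an item."""
    items, result = 0, []
    for a in coll_a:
        for b in coll_b:
            items += 1
            if a["x"] == 10 and a["u"] == b["v"]:
                result.append({"u": a["u"], "w": b["w"]})
    return result, items


def filter_moved_up():
    """a.x == 10 is checked before the inner FOR: fewer pairs are produced."""
    items, result = 0, []
    for a in coll_a:
        if a["x"] != 10:
            continue
        for b in coll_b:
            items += 1
            if a["u"] == b["v"]:
                result.append({"u": a["u"], "w": b["w"]})
    return result, items


before, before_items = filters_inside()
after, after_items = filter_moved_up()
```

With this data, the first variant pushes 12 pairs through the inner part of the pipeline and the second only 6, while both return the same two rows.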

Remove unnecessary calculations

Query as written:
    FOR a IN collA
      LET L = LENGTH(a.hobbies)
      FOR b IN collB
        FILTER a.u == b.v
        RETURN {h:a.hobbies,w:b.w}

Plan: Singleton → EnumColl a → Calc L = ... → EnumColl b → Calc a.u == b.v → Filter a.u == b.v → Return {...}

The calculation of L is unnecessary: L is never used.

Remove unnecessary calculations

Query after the optimisation:
    FOR a IN collA
      FOR b IN collB
        FILTER a.u == b.v
        RETURN {h:a.hobbies,w:b.w}

Plan: Singleton → EnumColl a → EnumColl b → Calc a.u == b.v → Filter a.u == b.v → Return {...}

The calculation of L is unnecessary, and since it cannot throw an exception, we can just leave it out.
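
One way to picture this rule is a backwards pass over the plan that tracks which variables are still read further down. The plan representation (dicts with `sets`, `uses` and `can_throw`) and the function are assumptions for this sketch; the real optimizer rule works on C++ plan nodes.

```python
plan = [
    {"node": "Singleton", "sets": None, "uses": []},
    {"node": "EnumColl a", "sets": "a", "uses": []},
    {"node": "Calc L = LENGTH(a.hobbies)", "sets": "L", "uses": ["a"], "can_throw": False},
    {"node": "EnumColl b", "sets": "b", "uses": []},
    {"node": "Calc c = (a.u == b.v)", "sets": "c", "uses": ["a", "b"], "can_throw": False},
    {"node": "Filter c", "sets": None, "uses": ["c"]},
    {"node": "Return {h: a.hobbies, w: b.w}", "sets": None, "uses": ["a", "b"]},
]


def remove_unused_calculations(nodes):
    """Drop Calc nodes whose variable is never read and which cannot throw."""
    kept = []
    used_later = set()
    for node in reversed(nodes):  # walk from Return back to Singleton
        is_calc = node["node"].startswith("Calc")
        unused = node["sets"] not in used_later
        # Be conservative: a node without the flag is assumed able to throw.
        if is_calc and unused and not node.get("can_throw", True):
            continue  # dead calculation: leave it out of the plan
        used_later.update(node["uses"])
        kept.append(node)
    kept.reverse()
    return kept


optimised = [n["node"] for n in remove_unused_calculations(plan)]
```

The `can_throw` check matters: a calculation that might raise an error has an observable effect even when its result is unused, so removing it would change the behaviour of the query.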

Use index for FILTER and SORT

    FOR a IN collA
      FILTER a.x > 17 && a.x <= 23 && a.y == 10
      SORT a.y, a.x
      RETURN a

Plan without index: Singleton → EnumColl a → Calc ... → Filter ... → Sort a.y, a.x → Return a

Assume collA has a skiplist index on “y” and “x” (in this order). Then we can read off the half-open interval between { y: 10, x: 17 } and { y: 10, x: 23 } from the skiplist index. The result will automatically be sorted by y and then by x, so the Sort node can be dropped as well.

Plan with index: Singleton → IndexRange a → Return a
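
The range read can be imitated in Python with a sorted list standing in for the skiplist index on (y, x) and `bisect` standing in for the index seek. The sample documents and the `index_range` helper are assumptions for this sketch.

```python
import bisect

docs = [
    {"_key": "d1", "x": 17, "y": 10},
    {"_key": "d2", "x": 23, "y": 10},
    {"_key": "d3", "x": 20, "y": 10},
    {"_key": "d4", "x": 18, "y": 11},
    {"_key": "d5", "x": 24, "y": 10},
    {"_key": "d6", "x": 19, "y": 9},
]

# Stand-in for the sorted index on (y, x): entries sorted by y, then x.
entries = sorted((d["y"], d["x"], d["_key"]) for d in docs)
keys = [(y, x) for y, x, _ in entries]


def index_range(y, x_low, x_high):
    """Keys with the given y and x_low < x <= x_high, in index order."""
    start = bisect.bisect_right(keys, (y, x_low))   # skips x == x_low
    stop = bisect.bisect_right(keys, (y, x_high))   # includes x == x_high
    return [key for _, _, key in entries[start:stop]]


from_index = index_range(10, 17, 23)

# The same query without an index: full scan, filter, then sort.
scanned = [d for d in docs if d["x"] > 17 and d["x"] <= 23 and d["y"] == 10]
scanned.sort(key=lambda d: (d["y"], d["x"]))
from_scan = [d["_key"] for d in scanned]
```

Both approaches return the same keys in the same order, but the index variant touches only the two matching entries and needs no separate sort step.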

Data distribution in a cluster

[Diagram: requests arrive at the coordinators, which talk to the DB servers holding the numbered shards.]

- The shards of a collection are distributed across the DB servers.
- The coordinators receive queries and organise their execution.
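
A minimal sketch of how a document could be routed to a shard and a DB server. The hash function (CRC32), the shard count, the server names and the round-robin placement are all assumptions chosen for illustration; the talk does not specify how ArangoDB does this.

```python
import zlib

NUMBER_OF_SHARDS = 5
DB_SERVERS = ["DBserver1", "DBserver2", "DBserver3"]


def shard_for(key):
    """Map a document key to a shard number (CRC32 is only a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUMBER_OF_SHARDS


def server_for(shard):
    """Place shards on the DB servers round-robin."""
    return DB_SERVERS[shard % len(DB_SERVERS)]


# A coordinator would compute this for every single-document request.
placement = {}
for i in range(100):
    key = f"doc{i}"
    placement[key] = server_for(shard_for(key))
```

Because the mapping depends only on the key, any coordinator can route a single-document request to the right DB server without asking the others.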

Scatter/gather

On a single server: EnumerateCollection

In a cluster: Scatter → (Remote → EnumShard → Remote), once per shard → Concat/Merge
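
The following Python sketch mimics the scatter/gather shape: the coordinator sends the same sub-plan to every shard, each DB server filters and sorts its own shard, and the coordinator merges the already sorted partial results. The shard contents and function names are hypothetical, and the calls are made sequentially here, whereas a real cluster would issue them over the network.

```python
import heapq

# Hypothetical shards of one collection, each living on some DB server.
shards = {
    "s1": [{"name": "carol", "age": 41}, {"name": "alice", "age": 30}],
    "s2": [{"name": "bob", "age": 25}],
    "s3": [{"name": "erin", "age": 19}, {"name": "dave", "age": 52}],
}


def enum_shard(shard_id, min_age):
    """Runs on a DB server: scan one shard, filter and sort locally."""
    rows = [d for d in shards[shard_id] if d["age"] >= min_age]
    return sorted(rows, key=lambda d: d["name"])


def coordinator_query(min_age):
    """Runs on a coordinator: scatter to all shards, then merge the parts."""
    partial_results = [enum_shard(shard_id, min_age) for shard_id in shards]  # scatter
    return list(heapq.merge(*partial_results, key=lambda d: d["name"]))       # gather


names = [d["name"] for d in coordinator_query(21)]
```

Merging sorted streams is cheaper than re-sorting everything on the coordinator, which is why the gather step is a merge when the query asks for sorted output and a plain concatenation otherwise.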

Links

https://www.arangodb.com
https://docs.arangodb.com/cookbook/index.html
https://github.com/ArangoDB/guesser
http://mesos.apache.org/
https://mesosphere.com/
https://mesosphere.github.io/marathon/
https://dcos.io
