Beyond Relational Databases

Beyond relational databases: NoSQL, NewSQL, TimeSeries DB
Grégory Boissinot, January 2015, v3

Objectives
● Understand the dominance of relational databases
● Know that alternative technologies exist for differing needs
● Give you enough background on how NoSQL databases work
● Make you aware of other movements

Presentation Content
● RDBMS
○ Stability
○ Some RDBMS problems
○ Unsuitable use cases with RDBMS
● NoSQL
○ Why the emergence of this movement?
○ Transactions and scalability issues
○ NoSQL types

Relational Databases: maturity already achieved
Timeline (time axis, marker at 1970): File DB → Hierarchical DB → Network DB → Relational DB

RDBMS (Relational Database Management System)
● Classic way to store data in the world of enterprise applications
● Often used for all database needs
● A powerful tool, in use for decades
● Provides persistence and concurrency control
● Accessible from many programming languages
● Mostly standard
○ Widely understood
○ The degree of standardisation is enough to keep things familiar
● SQL used as an integration mechanism between applications
● ACID transactions to modify multiple rows and multiple tables
○ Atomic, Consistent, Isolated, Durable

RDBMS Schema & Normalization
● Relational databases require an explicitly defined schema
● A schema is a specification that describes the structure of an object
● Data normalization is the process of organizing data into tables in such a way as to reduce the potential for data anomalies (inconsistencies in the data)

Joining process
● Data often has to be read from multiple tables: a join operation is performed on the data
● A join is very easy to express in SQL syntax
● As the size of the tables grows, the join operation takes longer because more data blocks need to be read
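To make the join concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customer` and `orders` tables and their columns are hypothetical, chosen only to illustrate a normalized schema read back through a join.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        total REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Fabio'), (2, 'Mary');
    INSERT INTO orders VALUES (99, 1, 40.0), (100, 1, 10.0), (101, 2, 25.0);
""")

# The customer name is stored once; the join brings it next to each order.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

for name, total in rows:
    print(name, total)
```

The query stays short, but the engine has to match rows from both tables, which is the cost that grows with table size.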

RDBMS: stability across decades
Timeline (time axis, … 1980): RDBMS stayed stable through changes in languages, architectures, platforms and processes

Some RDBMS Problems
● SCALE OUT IS HARD (limited scale)
● RIGID SCHEMA
● IMPEDANCE MISMATCH
● BAD COST CONTROL

Relational Model Example
● Everything is normalized
● No data is repeated in multiple tables
● We have referential integrity
RIGID SCHEMA

Changing a relational database schema is hard
● The relational model is a set of structured data: tables with tuples and relations
● A tuple is a limited data structure
○ We can't use List or Map
○ We can't nest one tuple within another to get nested records
● Promotes data normalization
○ No data is duplicated
○ We have referential integrity
● Data is modeled independently from its usage
● Lets us think of data manipulation as operations that
○ Take tuples as input
○ Return tuples
RIGID SCHEMA

A relational database used as an integration DB
● Widely used in the 1980s
● For a relational database, SQL is used as an integration mechanism between applications
○ Simple
○ Transactional
○ Triggers are available (implementation specific)
● Shared database integration style

Relational databases are not designed to run on clusters
● It is cheaper and more effective to scale horizontally by buying lots of machines, but with an RDBMS this requires DBA expertise
● With a relational database, scaling usually means buying a bigger machine
SCALE OUT IS HARD (with RDBMS)

Difference between the relational model and in-memory data structures
● A lot of application development effort is spent on mapping data between in-memory data structures and a relational database
IMPEDANCE MISMATCH

Attempts to help map data
● OODBMS
● ORM (JPA, Hibernate, etc)
● iBATIS
● Spring Data
● jOOQ
IMPEDANCE MISMATCH

Cost is often difficult to control with a relational database
Multiple criteria
● Number of users accessing the database
● Number of servers
● Volume of data
BAD COST CONTROL

Unsuitable use cases for RDBMS
● Unpredictable data (accepts entries of any form and size): user or session data, logs, sensor data from IoT
● Connected data: social data, recommendation systems
● Real-time analytics: always context dependent; performance, responsiveness

Why NoSQL?
● A new challenger for a new world!
● There's a huge demand for things other than SQL

NoSQL favors new factors
● Arrival of the Internet and new web application needs
○ Large volume of read and write operations
○ Low-latency response time
○ High availability
● Four factors: Scalability, Flexibility, Cost Control, Availability

Supporting large volumes of data: an old objective
● Several actors have already addressed this in the past: Oracle RAC, SQL Server
● New use cases with huge amounts of data
● Influence of Google and Amazon (adopters of large clusters)
● New NoSQL products
○ Google → BigTable
○ Amazon → Dynamo

NoSQL and the BigData Galaxy A combination of V

NoSQL: a movement
Driven by a set of common characteristics
● Open-source
● Not using a relational database
● Running well on clusters
● Schemaless

NoSQL: very ill-defined
● Not Only SQL
● Polyglot Persistence
● M. Fowler approach

NoSQL database types
● Key-Value database
● Document database
● Column Family database
● Graph database

Key-Value database
● Based on distributed hash tables
● 3 operations: set, get, delete
● Data in RAM (cache) or persisted on SSD or disk (true DB)
● A lot of examples: Ehcache, Memcached, Redis, Amazon DynamoDB, Riak (Basho), Voldemort, ...
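The three operations can be sketched with a toy in-memory store. This is an illustration only: the class `KeyValueStore` is hypothetical and is not the API of any product listed on the slide. Values are kept as opaque bytes, which the store never inspects.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, single process)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        # Overwrites any previous value stored under the same key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Returns True if a value was actually removed.
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.set("session:42", b'{"user": "Fabio", "cart": [99]}')
cached = store.get("session:42")
removed = store.delete("session:42")
```

A distributed product adds partitioning and persistence behind the same narrow interface.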

Document database
● A document is a set of ordered key-value pairs
● Any document can differ from all previously inserted documents
⇒ Document databases are designed to accommodate variations in documents within a collection
● Collections are groups of similar documents

Document database
● Similar to key-value DBs, but the value is semi-structured, with arbitrary, nested and varying data formats
● Document DBs enable you to query and filter based on elements of the document
● Sharding can be based on a field that is not the key
● Secondary indexes on nested fields
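Querying on elements inside a document can be sketched in plain Python. The helpers `get_path` and `find` and the dotted-path notation are assumptions made for this illustration; real document databases expose their own query languages.

```python
from typing import Any, Dict, List


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as "address.city"; None if it is missing."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection: List[Dict[str, Any]], path: str, expected: Any) -> List[Dict[str, Any]]:
    """Return the documents whose nested field equals the expected value."""
    return [doc for doc in collection if get_path(doc, path) == expected]


# Documents in one collection do not need to share the same fields.
collection = [
    {"_id": 1, "name": "Fabio", "address": {"city": "Paris"}},
    {"_id": 2, "name": "Mary", "tags": ["vip"]},
    {"_id": 3, "name": "Joe", "address": {"city": "Lyon", "zip": "69001"}},
]

paris = find(collection, "address.city", "Paris")
```

This sketch scans every document; a secondary index on the nested field is what lets a real database avoid that scan.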

Column-oriented database
● Row-based systems are designed to efficiently return data for an entire row
● Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a small subset of all columns of data
● Examples: BigTable, HBase, Druid
● Cassandra is a hybrid between a key-value and a column-oriented database
Row-oriented layout:
001:10,Smith,Joe,40000; 002:12,Jones,Mary,50000; 003:11,Johnson,Cathy,44000; 004:22,Jones,Bob,55000;
Column-oriented layout:
10:001,12:002,11:003,22:004; Smith:001,Jones:002,Johnson:003,Jones:004; Joe:001,Mary:002,Cathy:003,Bob:004; 40000:001,50000:002,44000:003,55000:004;
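The two layouts on this slide can be rebuilt in a few lines of Python to show why an aggregate over one column touches less data in the column-oriented form. The column names (`emp_id`, `last_name`, `first_name`, `salary`) are assumptions; the slide gives only the values.

```python
# Sample records from the slide: a row id followed by four values.
# Column names are hypothetical; the slide does not label them.
COLUMNS = ("emp_id", "last_name", "first_name", "salary")
records = [
    ("001", 10, "Smith", "Joe", 40000),
    ("002", 12, "Jones", "Mary", 50000),
    ("003", 11, "Johnson", "Cathy", 44000),
    ("004", 22, "Jones", "Bob", 55000),
]

# Row-oriented: all values of one row are stored together.
row_store = {row_id: dict(zip(COLUMNS, values)) for row_id, *values in records}

# Column-oriented: all values of one column are stored together.
column_store = {
    name: {row_id: values[i] for row_id, *values in records}
    for i, name in enumerate(COLUMNS)
}

# Summing one column: the row store visits every full row,
# the column store reads a single column.
total_from_rows = sum(row["salary"] for row in row_store.values())
total_from_columns = sum(column_store["salary"].values())
```

Fetching one complete record is the opposite case: the row store returns it in one lookup, while the column store has to visit every column.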

Graph DB
● No need to create tables to model many-to-many relations
● Instead, relations are modeled explicitly using edges
● Several use cases: social graphs, maps, etc
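A small sketch of relations stored as edges, using a social-graph query. The people, the edge list and the helper `friends_of_friends` are invented for the illustration and do not reflect any graph database's API.

```python
from collections import defaultdict
from typing import Dict, Set

# Each edge is an undirected "knows" relation; no join table is needed.
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "dave"),
    ("dave", "carol"),
    ("carol", "erin"),
]

graph: Dict[str, Set[str]] = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)


def friends_of_friends(graph: Dict[str, Set[str]], person: str) -> Set[str]:
    """People two hops away who are not already direct friends."""
    direct = graph.get(person, set())
    candidates: Set[str] = set()
    for friend in direct:
        candidates |= graph.get(friend, set())
    return candidates - direct - {person}


suggestions = friends_of_friends(graph, "alice")
```

The traversal follows edges directly, where a relational model would need a self-join on a link table for each hop.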

NoSQL advantages
● Schemaless
● Scalability
● Rich Content
● Cost Control

Favor scale-out over scale-up
● Scale out: with NoSQL, adding a server often has no impact
○ NoSQL databases are designed to use the servers available in a cluster with minimal intervention by a DBA
● Scale up: with an RDBMS, adding CPU or memory, or buying a new server, raises migration issues and may cause downtime
Scalability

Flexible schema
● Denormalization keeps data that is frequently used together in the same document
● Embedded documents
Schemaless

Denormalization
● NoSQL DBs promote denormalization, which eliminates, or at least reduces, the need for joins
● This improves query performance over more normalized models (a join is a costly operation)
Schemaless, Schemafree

Aggregate Data Model
● A more complex structure than a set of tuples
● An aggregate is a collection of related objects that we wish to treat as a unit for data manipulation, management and consistency (Eric Evans's DDD)
● We can think in terms of a complex record that allows List, Map and other data structures to be nested inside it
● We like to update aggregates with atomic operations
RICH CONTENT

Aggregate Data Model Example
● The customer contains a list of billing addresses
● The order contains a list of order items, a shipping address, and payments
● The payment itself contains a billing address for that payment
● A single address appears 3 times, but instead of using an id it is copied each time
● We like to communicate with our data storage in terms of aggregates
RICH CONTENT

Aggregate Models
A different approach from the relational data model
● Relational databases don't have the concept of aggregate (they are aggregate-ignorant)
● With aggregates, there is often no need for joins
RICH CONTENT

Aggregate Boundaries
● Two aggregates: Customer and Order
● Links between aggregates are relationships
● Instead of using an id, the same data can be stored several times (e.g. the address)
● We can draw our aggregate boundaries differently

// Customer
{ "id": 1, "name": "Fabio", "billingAddress": [ { "city": "Paris" } ] }

// Orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [ .. ],
  "shippingAddress": [ { "city": "Paris" } ],
  "orderPayment": [ { "billingAddress": [ { "city": "Paris" } ], …. } ]
}
RICH CONTENT
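The same two aggregates can be written as Python dictionaries to show what copying the address implies. This is a sketch: `orderItems` is left empty because the slide elides its content, and the update at the end is an invented scenario.

```python
import copy

address = {"city": "Paris"}

# Customer aggregate: owns its own copy of the address.
customer = {
    "id": 1,
    "name": "Fabio",
    "billingAddress": [copy.deepcopy(address)],
}

# Order aggregate: linked to the customer by id, but the address is copied.
order = {
    "id": 99,
    "customerId": customer["id"],
    "orderItems": [],  # elided on the slide, left empty here
    "shippingAddress": [copy.deepcopy(address)],
    "orderPayment": [{"billingAddress": [copy.deepcopy(address)]}],
}

# Invented scenario: the customer moves. Only the customer aggregate changes;
# the order keeps the address that was valid when it was placed.
customer["billingAddress"][0]["city"] = "Lyon"
```

Each aggregate can be read or written as one unit, at the price of keeping the three address copies in sync by hand when that is what the application wants.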

Aggregates, the trade-off
● Solve the impedance mismatch
● Easier to work with on a cluster (the aggregate is the unit for replication and sharding)
● NoSQL doesn't support atomicity that spans multiple aggregates
● Not suited to all needs (e.g. analyzing product sales over the last months)
RICH CONTENT

Aggregates with NoSQL types
● Key-Value and Document databases are strongly aggregate-oriented
● With key-value DBs
○ The aggregate is opaque (blob)
○ The aggregate can be any type of object
○ The aggregate is only accessed by the key
● With document DBs
○ We can see a structure in the aggregate
○ We define structure on the data
○ We can submit queries based on fields

Aggregate: not a systematic solution
● Advanced data denormalization with Redis

NoSQL databases are often free of charge
● The major open-source products are free
○ No licence fee
○ No pricing policy based on the number of users
○ No pricing policy based on the number of servers
● Most companies behind NoSQL products provide commercial support and advanced (frequently indispensable) monitoring tools, along with SaaS solutions
COST CONTROL

Sharding & Replication
Sharding (or partitioning, depending on the product...)
● Data is divided into disjoint sets
● To scale out
Replication
● Duplicate the data (on different nodes)
● To ensure high availability
Both: each shard is replicated

Sharding: benefits and costs
We shard data to allow scale out
● Scale up means using a more powerful machine
● Scale out means using more machines
Scale out to increase
● The throughput or the total amount of data or ...
The main cost of sharding comes from distributed locks and transactions
● Giving up transactions (TX) and relying on atomic operations on aggregates is one way to achieve linear horizontal scalability
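A minimal sketch of hash-based sharding, assuming four in-memory shards and keys of the form `type:id`. The helper names (`shard_for`, `put`, `get`) and the choice of SHA-256 are illustrative assumptions, not the scheme of a specific product.

```python
import hashlib
from typing import Any, Dict, List, Optional


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Each dict stands in for one server holding a disjoint subset of the data.
shards: List[Dict[str, Any]] = [dict() for _ in range(4)]


def put(key: str, aggregate: Any) -> None:
    # A whole aggregate lands on exactly one shard, so the write stays local.
    shards[shard_for(key, len(shards))][key] = aggregate


def get(key: str) -> Optional[Any]:
    return shards[shard_for(key, len(shards))].get(key)


put("customer:1", {"name": "Fabio"})
put("order:99", {"customerId": 1})
```

Because every operation touches a single shard, no distributed lock is needed. Note that with this simple modulo scheme, changing the number of shards moves most keys, which is one reason real systems tend to use more elaborate placement strategies.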

Replication: the way to achieve HA
Replication can be
● Synchronous or asynchronous
○ A trade-off between performance and consistency
● Master/slave or peer-to-peer
○ Master/slave makes locks easier to implement (no distributed locking)
○ Peer-to-peer is better for HA (no election when a failure occurs)
Main motivation
● Mostly to increase high availability

Example 1: sharding and primary/slave replicas
(Diagram to be copied from an old commercial presentation, page 40, CVAT)

Example 2: Sharding and p2p replicas

Focus on Cassandra with P2P architecture
● Cassandra is well suited for write-intensive applications
○ Mainly because each node performs appends on the file system
● Tunable consistency

CAP Theorem
● Distributed databases cannot have consistency (C), availability (A) and partition tolerance (P) at the same time
● Consistency: a read is guaranteed to return the most recent write for a given client
● Availability: every request received by a non-failing node in the system must result in a response
● Partition tolerance: the system continues to operate despite arbitrary partitioning due to network failures
● Also known as Brewer's theorem

CAP theorem gotchas
● Consistent != global state
○ There are several definitions of consistency
○ It is more about linearizability: finding a point of view (an order of events that respects causality) in which the final state is correct
● Availability != liveness
○ A failing node does not remove the availability property, but a dead system is not very useful
● Networks are not reliable
○ Because a read-only system is more convenient, we prefer “CP” to “CA” for distributed systems

NoSQL: quorum to the rescue
● A quorum is the number of servers that must respond to a read or write operation for the operation to be considered successful
● A big enough quorum is often required to ensure the desired consistency
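What "big enough" means can be sketched with the usual overlap rule for N replicas, a write quorum W and a read quorum R: a read is guaranteed to reach at least one replica that acknowledged the latest write when R + W > N. The slide does not state this rule; it is brought in here as a common formulation, and the function names are hypothetical. The second function checks the rule by brute force on small clusters.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Overlap rule: any read quorum intersects any write quorum."""
    return r + w > n


def every_read_sees_write(n: int, w: int, r: int) -> bool:
    """Brute force: does every set of r replicas share a member with every set of w replicas?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# With 3 replicas, W=2 and R=2 overlap; W=1 and R=1 do not.
strong = quorums_overlap(3, 2, 2)
weak = quorums_overlap(3, 1, 1)
```

Lowering R or W makes operations faster and more available, at the cost of reads that may miss the latest write.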

Availability & Consistency in Distributed Databases
● We often sacrifice consistency for scalability, availability or performance
● However, many enterprise use cases need (strong) consistency
● Eventual consistency: “There may be times when the data is inconsistent”
● Eventually consistent means that some replicas might be inconsistent for some period of time but will become consistent at some point

Two Phase Commit (2PC)
● Two-phase commit is a protocol for a transaction that writes data to two or more separate locations: participants first vote on whether they can commit (prepare phase), then all commit or all abort (commit phase)
● Helps ensure consistency
● With 2PC, the DB favors consistency, at the risk of the most recent data not being available for a brief period of time
● While the 2PC is executing, transactions take longer: the updated data is delayed until the 2PC finishes (locks are held longer)
Favor consistency over availability
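The two phases can be sketched with in-memory participants. This is a simplified illustration under strong assumptions: no network, no timeouts, no coordinator failure, and the `Participant` class and `two_phase_commit` function are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


class Participant:
    """One storage location taking part in the transaction."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.committed: Dict[str, str] = {}
        self._staged: Optional[Tuple[str, str]] = None

    def prepare(self, key: str, value: str) -> bool:
        # Phase 1: stage the write and vote yes, or vote no.
        if not self.can_commit:
            return False
        self._staged = (key, value)
        return True

    def commit(self) -> None:
        # Phase 2 (success): make the staged write visible.
        if self._staged is not None:
            key, value = self._staged
            self.committed[key] = value
            self._staged = None

    def abort(self) -> None:
        # Phase 2 (failure): drop the staged write.
        self._staged = None


def two_phase_commit(participants: List[Participant], key: str, value: str) -> bool:
    votes = [p.prepare(key, value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


a, b = Participant("a"), Participant("b")
ok = two_phase_commit([a, b], "stock", "41")
failing = Participant("c", can_commit=False)
rejected = two_phase_commit([a, failing], "stock", "40")
```

Between `prepare` and `commit` the new value is staged but not readable, which is the delay the slide refers to.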

BASE Transactions for NoSQL
● BA: Basically Available
○ There can be partial failures in some parts of the distributed system while the rest of the system continues to function
● S: Soft state
○ Data may eventually be overwritten with more recent data (this property overlaps with eventual consistency)
● E: Eventual consistency
○ There may be times when the database is in an inconsistent state

Schemaless in depth
● Schemaless DBs do not require a formal structure specification
● It doesn't make sense to require data modelers to specify all possible document fields prior to building and populating the database
● Caution: schemaless doesn't mean no schema
○ The schema is often implicit in the code
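"Implicit in the code" can be illustrated with a reader function that only understands certain document shapes. The field names (`name`, `first_name`, `last_name`, `login`) and the function `display_name` are invented for this sketch.

```python
from typing import Any, Dict


def display_name(doc: Dict[str, Any]) -> str:
    """The database accepts any shape; this code only understands two."""
    if "name" in doc:
        return doc["name"]
    if "first_name" in doc and "last_name" in doc:
        return f'{doc["first_name"]} {doc["last_name"]}'
    # The implicit schema surfaces here, at read time, not at insert time.
    raise KeyError("document matches no shape known to this code")


documents = [
    {"name": "Fabio"},
    {"first_name": "Mary", "last_name": "Jones"},
]
names = [display_name(doc) for doc in documents]
```

A document such as `{"login": "joe"}` would be stored without complaint and would only fail when this function reads it, which is why the schema still has to be managed somewhere.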

Polymorphic Schema
● “Polymorphic” is derived from Greek and literally means “many shapes”
● Each document can have a different structure
● The structure is created dynamically when the document is inserted

Which NoSQL database?
Multiple criteria
- Volume of reads and writes (throughput)
- Tolerance for inconsistent data in replicas
- The nature of relations between entities and how that affects query patterns
- Availability and disaster recovery requirements
- The need for flexibility in data models
- Latency requirements
- Volume of data

Quiz - NoSQL DBs use cases
● Applications that use JSON data structures?
● Frequent small reads and writes along with simple data models?
● Caching data from relational DBs to improve performance?
● Applications that are geographically distributed over multiple data centers?
● Social networking?

Additional Key-Value DBs use cases
● Backend support for websites with high volumes of reads and writes → Key-Value DBs
● Storing large objects such as images and audio files → Key-Value DBs
● Tracking transient attributes in a web application such as a shopping cart → Key-Value DBs

Additional Document DBs use cases
● Applications that use JSON data structures → Document DBs
● Tracking variable types of metadata → Document DBs
● Storing configuration and user information for mobile applications → Document DBs

Additional Column Family DBs use cases
● Applications with the potential for truly large volumes of data, such as hundreds of terabytes → Column Family DBs
● Applications with dynamic fields → Column Family DBs

Additional Graph DBs use cases
● Network and IT infrastructure management → Graph DBs
● Recommending products and services → Graph DBs

Quiz - NoSQL DBs use cases (answers)
● Applications that use JSON data structures → Document DBs such as MongoDB
● Frequent small reads and writes along with simple data models → Key-Value DBs such as Redis
● Caching data from relational DBs to improve performance → Key-Value DBs such as Redis
● Applications that are geographically distributed over multiple data centers → Column DBs such as Cassandra
● Social networking → Graph DBs such as Neo4j

NewSQL movement
● The coexistence of RDBMS and NoSQL features in the same product
● NewSQL is a class of modern RDBMSs that seek to provide
○ The same scalable performance as NoSQL systems for read-write workloads
○ The ACID guarantees of a traditional relational database system

TimeSeries DB
● Consists of a sequence of values or events changing with time
○ Data is recorded at regular intervals
● Widely used within Microservices Architecture and with DDD approaches
● Applications
○ Financial: stock price, inflation
○ Biomedical: blood pressure
○ Meteorological: precipitation
● Several technologies already exist
○ Druid
○ InfluxDB
○ Redis

Treat the database as an application database
● The responsibility for database integrity is put in the service
● With an application database, the database is only accessed by a single application codebase ⇒ a single team / a single application
● Only that team needs to know the database structure
● We favor application communication through Web Services
● Gives more freedom to choose a database

Polyglot Persistence
Several DB technologies for a single application
● We use the service wrapping pattern for each DB
● Developers want different APIs for different problems
● Most organizations currently have a mix of data storage technologies for different circumstances

Suitable for Microservices Architecture
● Each service manages its own data
○ Data consistency is delegated to the service
● Each service is an independent functional unit

Conclusion
● Four factors favor NoSQL usage: Scalability, Cost, Flexibility and Availability
● RDBMS and SQL are going to continue to exist
● The solution is likely to be a hybrid of multiple technologies
● The choice always depends on your needs
● RDBMS remains a good choice in many scenarios (strong legacy, critical data, etc)
● We are entering a world of Polyglot Persistence
