Beyond relational databases: NoSQL, NewSQL, TimeSeries DB. Grégory Boissinot, January 2015, v3
Objectives: Understand the dominance of relational databases. Know that alternative technologies exist for differing needs. Get enough background on how NoSQL databases work. Be aware that other movements exist.
Presentation Content. RDBMS: stability; some RDBMS problems; unsuitable use cases with RDBMS. NoSQL: why the emergence of this movement?; transactions and scalability issues; NoSQL types.
Relational Databases: maturity already achieved. Timeline: Files DB, Hierarchical DB, Network DB, Relational DB (1970).
RDBMS (Relational Database Management System): the classic way to store data in the world of enterprise applications. Often used for all database needs. A powerful tool that has been used for several decades. Provides persistence and concurrency control. Accessible from many programming languages. Mostly standard and widely understood: the degree of standardisation is enough to keep things familiar. SQL is used as an integration mechanism between applications. ACID transactions (Atomic, Consistent, Isolated, Durable) modify multiple rows and multiple tables.
RDBMS Schema & Normalization. Relational databases require an explicitly defined schema. A schema is a specification that describes the structure of an object. Data normalization is the process of organizing data into tables in such a way as to reduce the potential for data anomalies (inconsistencies in the data).
Joining process. Data often has to be read from multiple tables: a join operation is then performed on the data. A join is very easy to express in SQL syntax. As the tables grow, the join operation takes longer because more data blocks need to be read.
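To make the join concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customer` and `orders` tables, their columns and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical two-table schema, used only to illustrate a join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        amount REAL
    );
    INSERT INTO customer VALUES (1, 'Fabio'), (2, 'Mary');
    INSERT INTO orders VALUES (99, 1, 40.0), (100, 1, 15.5), (101, 2, 8.0);
""")


def order_totals(connection):
    """Join orders to customers and sum the order amounts per customer."""
    cursor = connection.execute("""
        SELECT c.name, SUM(o.amount)
        FROM customer AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.name
    """)
    return cursor.fetchall()


totals = order_totals(conn)
```

The query reads both tables to answer a single question, which is why the cost of a join grows with the size of the tables involved.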
RDBMS: stable for several decades. Since the 1980s, RDBMS has remained stable through changes in languages, architectures, platforms and processes.
Some RDBMS Problems: SCALE OUT IS HARD (limited scale); RIGID SCHEMA; IMPEDANCE MISMATCH; BAD COST CONTROL
Relational Model Example. Everything is normalized. No data is repeated in multiple tables. We have referential integrity. RIGID SCHEMA
Changing a relational database schema is hard. The relational model is a set of structured data: tables with tuples and relations. A tuple is a limited data structure: we can't use a List or a Map, and we can't nest one tuple within another to get nested records. The model promotes data normalization: no data is duplicated and we have referential integrity. Data is modeled independently of its usage. This lets us think of data manipulation as operations that take tuples as input and return tuples. RIGID SCHEMA
A relational database used as an integration DB. Widely used in the 1980s. With a relational database, SQL is used as an integration mechanism between applications: ● Simple ● Transactional ● Triggers are available (implementation specific). This is the shared database integration style.
Relational databases are not designed to run on clusters. With a relational database, scaling usually means buying a bigger machine. It is often cheaper and more effective to scale horizontally by buying lots of machines, but with an RDBMS this requires DBA expertise. SCALE OUT IS HARD (with RDBMS)
Difference between the relational model and in-memory data structures. A lot of application development effort is spent on mapping data between in-memory data structures and a relational database. IMPEDANCE MISMATCH
Attempts to help map data: OODBMS, ORM (JPA, Hibernate, etc.), iBATIS, Spring Data, jOOQ. IMPEDANCE MISMATCH
It is often difficult to control cost with a relational database. BAD COST CONTROL. Multiple criteria: ● Number of users accessing the database ● Number of servers ● Volume of data
Unsuitable use cases for RDBMS (always context dependent). Unpredictable data (accepts entries of any form and size): user or session data, logs, sensor data from IoT. Connected data: social data, recommendation systems. Real-time analytics. Performance, responsiveness.
Why NoSQL? A new challenger for a new world! There's a huge demand for things other than SQL
NoSQL favors new factors: Scalability, Flexibility, Cost Control, Availability. Arrival of the Internet and new web application needs: ● Large volume of read and write operations ● Low-latency response time ● High availability
Supporting large volumes of data: an old objective. Several actors have already addressed this in the past: Oracle RAC, SQL Server. New use cases with huge amounts of data: influence of Google and Amazon (adopters of large clusters). New NoSQL products: Google → BigTable, Amazon → Dynamo.
NoSQL and the BigData Galaxy A combination of V
NoSQL: a movement driven by a set of common characteristics: open-source; not using a relational database; running well on clusters; schemaless.
NoSQL: very ill-defined. "Not Only SQL". Polyglot Persistence (M. Fowler's approach).
NoSQL database types: Key-Value database, Document database, Column Family database, Graph database
Key-Value database. Based on distributed hash tables ● 3 operations: set, get, delete. Data is kept in RAM (cache) or persisted on SSD or disk (true DB). Many examples: Ehcache, Memcached, Redis, Amazon DynamoDB, Riak (from Basho), Voldemort, ...
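The three operations can be sketched with a plain in-memory dictionary. This is a single-process illustration only: the `KeyValueStore` class and the key naming are hypothetical and do not reproduce the API of any product listed above.

```python
class KeyValueStore:
    """In-memory key-value store exposing only set, get and delete."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        """Store the value under the key, replacing any previous value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value for the key, or the default when it is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove the key and report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.set("session:42", b'{"cart": ["book"]}')  # the value is opaque to the store
cart_blob = store.get("session:42")
```

The store never looks inside the value; the only access path is the key.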
Document database. A document is a set of ordered key-value pairs. Any document can differ from all previously inserted documents ⇒ document databases are designed to accommodate variations in documents within a collection. Collections are groups of similar documents.
Document database. Similar to key-value DBs, except that the value is semi-structured, with arbitrary nested data and a varying format. Document DBs let you query and filter on elements of the document. Sharding can be based on a field that is not the key. Secondary indexes can be defined on nested fields.
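A small sketch of querying on a nested element, using plain Python dictionaries as documents. The helper names `get_path` and `find`, the dotted-path convention and the sample documents are assumptions made for illustration, not the query language of a specific document DB.

```python
def get_path(document, path):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the document has no such element
        current = current[part]
    return current


def find(collection, path, expected):
    """Return the documents whose element at `path` equals `expected`."""
    return [doc for doc in collection if get_path(doc, path) == expected]


# Documents in the same collection do not need the same shape.
customers = [
    {"id": 1, "name": "Fabio", "address": {"city": "Paris"}},
    {"id": 2, "name": "Mary"},
    {"id": 3, "name": "Cathy", "address": {"city": "Paris", "zip": "75001"}},
    {"id": 4, "name": "Bob", "address": {"city": "Lyon"}},
]

in_paris = find(customers, "address.city", "Paris")
```

With a key-value store the same question would require fetching every value and decoding it in the application.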
Column-oriented database. Row-based systems are designed to efficiently return data for an entire row. Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a small subset of the columns. Examples: BigTable, HBase, Druid. Cassandra is a hybrid between a key-value and a column-oriented database.
Column-oriented layout: 10:001,12:002,11:003,22:004; Smith:001,Jones:002,Johnson:003,Jones:004; Joe:001,Mary:002,Cathy:003,Bob:004; 40000:001,50000:002,44000:003,55000:004;
Row-oriented layout: 001:10,Smith,Joe,40000; 002:12,Jones,Mary,50000; 003:11,Johnson,Cathy,44000; 004:22,Jones,Bob,55000;
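The two layouts can be compared with the four employee records from the slide. The field names below (`row_id`, `emp_id`, `last_name`, `first_name`, `salary`) are assumptions, since the slide shows only values.

```python
# Assumed field names; the slide gives only the values.
FIELDS = ("row_id", "emp_id", "last_name", "first_name", "salary")

# Row-oriented layout: all the values of one record are stored together.
rows = [
    (1, 10, "Smith", "Joe", 40000),
    (2, 12, "Jones", "Mary", 50000),
    (3, 11, "Johnson", "Cathy", 44000),
    (4, 22, "Jones", "Bob", 55000),
]

# Column-oriented layout: all the values of one column are stored together.
columns = {name: [row[i] for row in rows] for i, name in enumerate(FIELDS)}

# An aggregate over one column walks every full record in the row layout...
total_from_rows = sum(row[FIELDS.index("salary")] for row in rows)

# ...but touches a single contiguous list in the column layout.
total_from_columns = sum(columns["salary"])
```

Both totals are equal; the difference is how much unrelated data has to be read to compute them.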
Graph DB. No need to create tables to model many-to-many relations: instead, relations are modeled explicitly using edges. Several use cases: social graphs, maps, etc.
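As a rough illustration of modeling relations with edges, the sketch below keeps an adjacency set per node and answers a typical social-graph question. The `Graph` class and the sample names are hypothetical; real graph DBs offer dedicated storage and query languages.

```python
from collections import defaultdict


class Graph:
    """Undirected graph stored as one adjacency set per node."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_edge(self, a, b):
        """Record a relation between a and b, in both directions."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def neighbors(self, node):
        """Return a copy of the nodes directly connected to `node`."""
        return set(self._edges.get(node, ()))

    def friends_of_friends(self, node):
        """Nodes two hops away that are not already direct neighbors."""
        direct = self.neighbors(node)
        two_hops = set()
        for friend in direct:
            two_hops |= self.neighbors(friend)
        return two_hops - direct - {node}


social = Graph()
for a, b in [("alice", "bob"), ("bob", "carol"), ("alice", "dave"),
             ("dave", "carol"), ("carol", "erin")]:
    social.add_edge(a, b)

suggestions = social.friends_of_friends("alice")
```

In a relational model the same many-to-many relation would need a join table and a self-join per hop.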
NoSQL advantages: Schemaless, Scalability, Rich Content, Cost Control
Favor scale-out over scale-up. Scale out: with NoSQL, adding a server often has no impact; NoSQL databases are designed to use the servers available in a cluster with minimal intervention by a DBA. Scale up: with an RDBMS, adding CPU or memory raises migration issues, and buying a new server may cause downtime. SCALABILITY
Flexible schema (SCHEMALESS). Denormalization keeps data that is frequently used together in the same document (embedded document).
Denormalization (SCHEMALESS, SCHEMA-FREE). NoSQL DBs promote denormalization, which eliminates, or at least reduces, the need for joins. This improves query performance over more normalized models (a join is a costly operation).
Aggregate Data Model. A more complex structure than a set of tuples. An aggregate is a collection of related objects that we wish to treat as a unit for data manipulation and management of consistency (Eric Evans's DDD). ● We can think in terms of a complex record that allows Lists, Maps and other data structures to be nested inside it ● We like to update aggregates with atomic operations. RICH CONTENT
Aggregate Data Model Example ● The customer contains a list of billing addresses. The order contains a list of order items, a shipping address, and payments. The payment itself contains a billing address for that payment. A single address appears 3 times, but instead of using an id it is copied each time. We like to communicate with our data storage in terms of aggregates. RICH CONTENT
Aggregate Models. A different approach from the relational data model ● Relational databases have no concept of aggregate (they are aggregate-ignorant) ● With aggregates, there is often no need for joins. RICH CONTENT
Aggregate Boundaries. Two aggregates: Customer and Order. Links between aggregates are relationships. Instead of using an id, the same data can be stored several times (e.g. the address). We can draw our aggregate boundaries differently.
// Customer
{ "id": 1, "name": "Fabio", "billingAddress": [ { "city": "Paris" } ] }
// Orders
{ "id": 99, "customerId": 1, "orderItems": [ .. ], "shippingAddress": [ { "city": "Paris" } ], "orderPayment": [ { "billingAddress": [ { "city": "Paris" } ], …. } ] }
RICH CONTENT
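The Customer and Order aggregates can be sketched as nested Python dictionaries. The content of `orderItems` (`productId`, `price`) is a hypothetical placeholder, since the slide elides it; the point is only that the address is copied into each aggregate rather than referenced by id.

```python
import copy
import json

address = {"city": "Paris"}

customer = {
    "id": 1,
    "name": "Fabio",
    "billingAddress": [copy.deepcopy(address)],
}

order = {
    "id": 99,
    "customerId": 1,  # a link between the two aggregates
    "orderItems": [{"productId": 27, "price": 32.45}],  # hypothetical item
    "shippingAddress": [copy.deepcopy(address)],
    "orderPayment": [{"billingAddress": [copy.deepcopy(address)]}],
}

# The whole order is read or written as one unit, with no join.
serialized_order = json.dumps(order)

# Each aggregate owns its copy of the address: updating the customer
# does not change the address recorded on an existing order.
customer["billingAddress"][0]["city"] = "Lyon"
order_city = order["shippingAddress"][0]["city"]
```

This is the trade-off of duplication: reads are simple, but the copies are independent and the application decides whether they should be kept in sync.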
Aggregates, the trade-off. They solve the impedance mismatch. They are easier to work with on a cluster (the aggregate is the unit for replication and sharding). NoSQL generally doesn't support atomicity spanning multiple aggregates. They do not fit every need (e.g. analyzing product sales over the last months). RICH CONTENT
Aggregates with NoSQL types. Key-value and document databases are strongly aggregate-oriented. With key-value DBs, the aggregate is opaque (a blob): it can be any type of object and is only accessed by the key. With document DBs, we can see a structure in the aggregate: we define a structure on the data and can submit queries based on fields.
Aggregates: not a systematic solution. Advanced data denormalization with Redis
NoSQL databases are often free of cost. COST CONTROL. The major open-source products are free: no licence, no pricing policy based on the number of users, no pricing policy based on the number of servers. Most companies behind NoSQL products provide commercial support and advanced (frequently indispensable) monitoring tools, together with SaaS solutions.
Sharding & Replication. Sharding (or partitioning, depending on the product) ● Data is divided into disjoint sets ● To scale out. Replication ● Data is duplicated (on different nodes) ● To ensure high availability. Both: each shard is replicated.
Sharding: benefits and costs. We shard data to allow scale-out ● Scale up means using a more powerful machine ● Scale out means using more machines. We scale out to increase ● the throughput, the total amount of data, or ... The main cost of sharding comes from distributed locks and transactions ● Giving up transactions (TX) and relying on atomic operations on aggregates is one way to achieve linear horizontal scalability
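A minimal sketch of dividing keys into disjoint sets with a hash function. The function names and the modulo placement are assumptions for illustration; products differ in how they place data (ranges, consistent hashing, and so on).

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(keys, shard_count):
    """Split the keys into `shard_count` disjoint sets."""
    shards = [set() for _ in range(shard_count)]
    for key in keys:
        shards[shard_for(key, shard_count)].add(key)
    return shards


order_keys = [f"order:{i}" for i in range(100)]
shards = partition(order_keys, 4)
```

Because each aggregate lives on exactly one shard, an atomic operation on a single aggregate needs no distributed lock.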
Replication: the way to achieve HA. Replication can be ● Synchronous or asynchronous ○ A trade-off between performance and consistency ● Master/slave or peer-to-peer ○ Master/slave makes locks easier to implement (they are not distributed) ○ Peer-to-peer is better for HA (no election when a failure occurs). Main motivation ● Mostly to increase high availability
Example 1: sharding with primary/slave replicas. [Diagram to be copied from an old commercial presentation (page 40, CVAT)]
Example 2: Sharding and p2p replicas
Focus on Cassandra with P2P architecture. Cassandra is well suited for write-intensive applications, mainly because each node performs appends on the file system. Tunable consistency.
CAP Theorem. Distributed databases cannot have consistency (C), availability (A) and partition tolerance (P) at the same time. Consistency: a read is guaranteed to return the most recent write for a given client. Availability: every request received by a non-failing node in the system must result in a response. Partition tolerance: the system continues to operate despite arbitrary partitioning due to network failures. Also known as Brewer's theorem.
CAP theorem gotchas. Consistent != global state: there are several definitions of consistency. It is more about linearization: finding a point of view (an order of events that respects causality) in which the final state is correct. Availability != liveness: a failing node does not remove the availability property, but a dead system is not very useful. Networks are not reliable, so partitions have to be tolerated. Because a read-only system is more convenient than a dead one, we prefer "CP" to "CA" for distributed systems.
NoSQL: quorum to the rescue. A quorum is the number of servers that must respond to a read or write operation for the operation to be considered successful. A big enough quorum is often required to ensure the desired consistency.
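What "big enough" means can be sketched with the commonly cited rule that a read quorum R and a write quorum W over N replicas overlap when R + W > N. The function names and the (version, value) replica model below are assumptions for illustration; the brute-force check simply enumerates every possible write set and read set.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """True when every read quorum shares a replica with every write quorum."""
    return r + w > n


def read_latest(replicas, replica_ids):
    """Among the contacted replicas, return the entry with the highest version."""
    return max((replicas[i] for i in replica_ids), key=lambda entry: entry[0])


def always_reads_latest(n, w, r):
    """Check every write set of size w against every read set of size r."""
    for written in combinations(range(n), w):
        # Replicas that acknowledged the write hold version 2, the others version 1.
        replicas = [(2, "new") if i in written else (1, "old") for i in range(n)]
        for read_set in combinations(range(n), r):
            if read_latest(replicas, read_set)[1] != "new":
                return False
    return True


strict = always_reads_latest(3, 2, 2)   # majority writes and majority reads
relaxed = always_reads_latest(3, 1, 1)  # fastest, but a read may miss the write
```

Smaller quorums give lower latency and higher availability; larger ones give stronger consistency, which is what tunable consistency exposes.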
Availability & Consistency in Distributed Databases. We often sacrifice consistency for scalability, availability or performance. However, many enterprise use cases need (strong) consistency. Eventual consistency: "There may be times when the data is inconsistent". Eventually consistent means that some replicas might be inconsistent for some period of time but will become consistent at some point.
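Eventual consistency can be illustrated with asynchronous replication: a write is acknowledged by one replica and reaches the others later. The classes below are a hypothetical single-process model (a counter as version, an in-memory queue as the network), not the mechanism of a particular database.

```python
class Replica:
    """Holds key -> (version, value) and keeps only the newest version."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class AsyncReplicatedStore:
    """Writes land on the first replica and are queued for the others."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # updates not yet delivered to the other replicas
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.replicas[0].apply(key, self.version, value)
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, self.version, value))

    def propagate(self):
        """Deliver every queued update, as the network eventually would."""
        while self.pending:
            replica, key, version, value = self.pending.pop(0)
            replica.apply(key, version, value)


store = AsyncReplicatedStore()
store.write("stock:book", 5)
stale_read = store.replicas[1].read("stock:book")   # replica 1 has not seen the write yet
store.propagate()
fresh_reads = [replica.read("stock:book") for replica in store.replicas]
```

Between the write and the propagation, a client reading from another replica sees old data; afterwards all replicas agree.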
Two-Phase Commit (2PC). A two-phase commit is a transaction that requires writing data to two separate locations; it helps ensure consistency. With 2PC, the DB favors consistency, at the risk of the most recent data not being available for a brief period of time. While the 2PC is executing, transactions take longer: the updated data is delayed until the 2PC finishes (locks are held longer). Favor consistency over availability.
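The two phases can be sketched as a prepare (voting) round followed by a commit or rollback round. The `Participant` class and the `two_phase_commit` function are hypothetical, in-process stand-ins; a real implementation also has to handle timeouts, crashes and durable logging.

```python
class Participant:
    """One location that must apply the write, e.g. a replica or a shard."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a node that refuses the write
        self.committed = {}
        self._staged = None

    def prepare(self, key, value):
        """Phase 1: stage the write and vote yes, or vote no."""
        if not self.can_commit:
            return False
        self._staged = (key, value)
        return True

    def commit(self):
        """Phase 2 (success): make the staged write visible."""
        key, value = self._staged
        self.committed[key] = value
        self._staged = None

    def rollback(self):
        """Phase 2 (failure): discard the staged write, if any."""
        self._staged = None


def two_phase_commit(participants, key, value):
    """Commit on every participant, or on none of them."""
    votes = [p.prepare(key, value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


a, b = Participant("a"), Participant("b")
ok = two_phase_commit([a, b], "balance", 100)

c, d = Participant("c"), Participant("d", can_commit=False)
refused = two_phase_commit([c, d], "balance", 100)
```

The new value becomes visible only after every participant has voted, which is why the updated data is delayed while the protocol runs.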
BASE Transactions for NoSQL. BA: Basically Available. S: Soft state. E: Eventual consistency. BA: there can be a partial failure in some parts of the distributed system while the rest of the system continues to function. S: data may eventually be overwritten with more recent data (this property overlaps with eventual consistency). E: there may be times when the database is in an inconsistent state.
Schemaless in depth. Schemaless DBs do not require a formal structure specification. It doesn't make sense to require data modelers to specify all possible document fields before building and populating the database. Caution: schemaless doesn't mean no schema. The schema is often implicit in the code.
Polymorphic Schema. "Polymorphic" is derived from Greek and literally means "many shapes". Each document can have a different structure. The structure is created dynamically when the document is inserted.
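A short sketch of a schema that is implicit in the code: the database accepts documents of different shapes, and the reading code decides which fields it expects. The field names (`name`, `firstName`, `lastName`) and the `display_name` helper are hypothetical.

```python
def display_name(user_doc):
    """The implicit schema lives here: the code knows which fields may exist."""
    if "name" in user_doc:
        return user_doc["name"]
    first = user_doc.get("firstName", "")
    last = user_doc.get("lastName", "")
    full = f"{first} {last}".strip()
    return full or "(unknown)"


# Three documents of the same collection, each with a different structure.
users = [
    {"id": 1, "name": "Fabio"},
    {"id": 2, "firstName": "Mary", "lastName": "Jones"},
    {"id": 3, "email": "bob@example.org"},
]

names = [display_name(doc) for doc in users]
```

Nothing in the database rejects the third document; handling its missing fields is the application's responsibility.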
Which NoSQL database? Multiple criteria: - Volume of reads and writes (throughput) - Tolerance for inconsistent data in replicas - The nature of relations between entities and how that affects query patterns - Availability and disaster recovery requirements - The need for flexibility in data models - Latency requirements - Volume of data
Quiz - NoSQL DB use cases. Applications that use JSON data structures? Frequent small reads and writes along with simple data models? Caching data from relational DBs to improve performance? Applications that are geographically distributed over multiple data centers? Social networking?
Additional key-value DB use cases: backend support for websites with high volumes of reads and writes → Key-Value DBs; storing large objects such as images and audio files → Key-Value DBs; tracking transient attributes in a web application, such as a shopping cart → Key-Value DBs
Additional document DB use cases: applications that use JSON data structures → Document DBs; tracking variable types of metadata → Document DBs; storing configuration and user information for mobile applications → Document DBs
Additional column family DB use cases: applications with the potential for truly large volumes of data, such as hundreds of terabytes → Column family DBs; applications with dynamic fields → Column family DBs
Additional graph DB use cases: network and IT infrastructure management → Graph DBs; recommending products and services → Graph DBs
Quiz - NoSQL DB use cases (answers). Applications that use JSON data structures → Document DBs such as MongoDB. Frequent small reads and writes along with simple data models → Key-Value DBs such as Redis. Caching data from relational DBs to improve performance → Key-Value DBs such as Redis. Applications that are geographically distributed over multiple data centers → Column DBs such as Cassandra. Social networking → Graph DBs such as Neo4j.
NewSQL movement. The coexistence of RDBMS and NoSQL features in the same product. NewSQL is a class of modern RDBMSs that seek to provide the same scalable performance as NoSQL systems for read-write workloads, together with the ACID guarantees of a traditional relational database system.
TimeSeries DB ● Consists of a sequence of values or events changing with time ○ Data is recorded at regular intervals ● Widely used within microservices architectures and with DDD approaches ● Applications ○ Financial: stock price, inflation ○ Biomedical: blood pressure ○ Meteorological: precipitation ● Several technologies already exist ○ DruidDB ○ InfluxDB ○ Redis
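A typical operation on such data is downsampling: grouping timestamped points into fixed windows and keeping one aggregate per window. The sketch below is a plain-Python illustration with hypothetical blood pressure readings; the `downsample` function is not the API of any product listed above.

```python
from collections import defaultdict


def downsample(points, window_seconds):
    """Average (timestamp, value) points over fixed windows of `window_seconds`."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - timestamp % window_seconds
        buckets[window_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical readings: (seconds since start, systolic blood pressure).
readings = [(0, 120.0), (10, 122.0), (20, 121.0), (60, 130.0), (70, 128.0)]
per_minute = downsample(readings, 60)
```

Time series stores usually keep recent points at full resolution and older data in this kind of aggregated form.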
Treat the database as an application database. The responsibility for database integrity is placed in the service. With an application database, the database is only accessed by a single application codebase ⇒ a single team / a single application. Only that team needs to know the database structure. We favor communication between applications through web services. This gives more freedom to choose a database.
Polyglot Persistence. Several DB technologies for a single application ● We use the service wrapping pattern for each DB ● Developers want different APIs for different problems ● Most organizations already have a mix of data storage technologies for different circumstances
Suitable for a microservices architecture ● Each service manages its own data ○ Data consistency is delegated to the service ● Each service is an independent functional unit
Conclusion. Four factors favor NoSQL usage: scalability, cost, flexibility and availability. RDBMS and SQL are going to continue to exist. The solution is likely to be a hybrid of multiple technologies. The choice always depends on your needs. An RDBMS remains a good choice in many scenarios (strong legacy, critical data, etc.). We are entering a world of polyglot persistence.
