Implementation of Cluster-wide Causal Consistency in MongoDB
Outline
- What is causal consistency
- Academics' view on causal consistency
- MongoDB architecture
- Causal consistency building blocks
- Making causal consistency secure
- Making causal consistency fast
- Making causal consistency reliable
- Causal consistency for end users
Client-side properties of causal consistency
- Read your writes
- Writes follow reads
- Monotonic reads
- Monotonic writes
Implementing with a non-causally-consistent system
- Single-server systems are causally consistent
- Read and write from the same node
- Add application logic to handle the scenarios that have to be causally consistent
Ordering of Events in a Distributed System
[Figure: timelines of Process P, Process Q and Process R with events p1, q1, q2, q3, r1, r2 and logical clock values <C1> = 10, <C2> = 10, <C3> = 0, q2: <C2> = 11, r1: <C3> = 11, r2: <C3> = 12]
Server-side causal consistency
Causal consistency is a partial order of events in a distributed system. If an event A causes another event B, then causal consistency guarantees that every other process in the system observes event A before observing event B. If an event A is not causally related to an event B, then the two events are concurrent.
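The ordering of causally related events can be illustrated with a Lamport-style logical clock. The sketch below is a generic textbook version, not MongoDB code; the class name `LamportClock` and the starting values are made up for illustration.

```javascript
// Minimal Lamport logical clock (illustrative sketch, not MongoDB code).
class LamportClock {
  constructor(start = 0) {
    this.time = start;
  }

  // Local event or message send: advance the counter by one.
  tick() {
    this.time += 1;
    return this.time;
  }

  // Message receive: move past the sender's timestamp, counting the
  // receive itself as an event.
  receive(remoteTime) {
    this.time = Math.max(this.time, remoteTime) + 1;
    return this.time;
  }
}

const p = new LamportClock(10); // process P starts at 10
const r = new LamportClock(0);  // process R starts at 0

const sent = p.tick();            // P stamps an outgoing message
const received = r.receive(sent); // R's receive event is ordered after the send
```

If A causes B, the timestamp of A is always smaller than the timestamp of B. The converse does not hold: two concurrent events can still carry different timestamps.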
System’s Architecture
Gossiping clusterTime
ClusterTime: {uint64}, e.g. Timestamp(1495470881, 6)
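A rough sketch of how a `Timestamp(seconds, increment)` pair can fit into one unsigned 64-bit value. It assumes the usual BSON Timestamp layout (seconds in the high 32 bits, counter in the low 32 bits); the helper names `packClusterTime` and `unpackClusterTime` are hypothetical.

```javascript
// Pack (seconds, increment) into one unsigned 64-bit value using BigInt.
// Assumed layout: seconds in the high 32 bits, increment in the low 32 bits.
function packClusterTime(seconds, increment) {
  return (BigInt(seconds) << 32n) | BigInt(increment);
}

// Split a packed value back into its two 32-bit parts.
function unpackClusterTime(packed) {
  return {
    seconds: Number(packed >> 32n),
    increment: Number(packed & 0xFFFFFFFFn),
  };
}

const packed = packClusterTime(1495470881, 6);
const roundTrip = unpackClusterTime(packed);
```

With this layout, comparing two packed values as plain integers orders them first by seconds and then by increment, which is what makes a single uint64 convenient to gossip and compare.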
Ticking clusterTime
Request: insert({x:1}, clusterTime: 4)
Node state: <clusterTime>: 6, <Wall clock>: 11
Oplog: {ts: 6} ..., {ts: 5} ..., {ts: 11} ...
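One plausible reading of the numbers on the slides is a hybrid clock: gossip only merges the received value, while a write ticks the clock to the wall clock or to one past the current clusterTime, whichever is greater. The sketch below assumes that rule; `gossip` and `tickOnWrite` are hypothetical names and the times are simplified to small integers.

```javascript
// Gossip: merge a clusterTime received from a peer or client; never go backwards.
function gossip(node, receivedClusterTime) {
  node.clusterTime = Math.max(node.clusterTime, receivedClusterTime);
  return node.clusterTime;
}

// Tick: only a write to the oplog advances the clock (assumed rule:
// take the wall clock, or clusterTime + 1 if the wall clock is not ahead).
function tickOnWrite(node, wallClock) {
  node.clusterTime = Math.max(node.clusterTime + 1, wallClock);
  node.oplog.push({ ts: node.clusterTime });
  return node.clusterTime;
}

const node = { clusterTime: 6, oplog: [{ ts: 5 }, { ts: 6 }] };

const afterGossip = gossip(node, 4);      // request carried clusterTime 4: no change
const firstWrite = tickOnWrite(node, 11); // wall clock is ahead: jump to it
const secondWrite = tickOnWrite(node, 11); // wall clock not ahead: clusterTime + 1
```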
Reporting operationTime
Request: insert({x:1})
Node state: <clusterTime>: 11, <Wall clock>: 11
Oplog: {ts: 11} ..., {ts: 5} ..., {ts: 12} ...
Response: {ok:1}, {operationTime: 12}
Waiting for afterClusterTime
Request: find({x:1}, afterClusterTime: {10}, clusterTime: {15})
Oplog: {ts: 6} ..., {ts: 5} ..., {ts: 11} ...
Response: {x:1}, {operationTime: {11}}
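The waiting step can be sketched as a node that parks a read until replication has applied an oplog entry at or past the requested time. This is a simplified, synchronous model; `StorageNode`, `readAfter` and `applyOplogEntry` are hypothetical names, not server internals.

```javascript
// Simplified model of a storage node that delays reads until it has caught up.
class StorageNode {
  constructor() {
    this.lastApplied = 0; // greatest clusterTime applied from the oplog
    this.waiters = [];    // reads parked until the node catches up
  }

  // Run `read` once an entry with ts >= afterClusterTime has been applied.
  // The callback receives the operationTime to report back to the client.
  readAfter(afterClusterTime, read) {
    if (this.lastApplied >= afterClusterTime) {
      read(this.lastApplied);
    } else {
      this.waiters.push({ afterClusterTime, read });
    }
  }

  // Called when replication applies an oplog entry; wakes satisfied reads.
  applyOplogEntry(ts) {
    this.lastApplied = Math.max(this.lastApplied, ts);
    const ready = this.waiters.filter((w) => w.afterClusterTime <= this.lastApplied);
    this.waiters = this.waiters.filter((w) => w.afterClusterTime > this.lastApplied);
    for (const w of ready) w.read(this.lastApplied);
  }
}

const secondary = new StorageNode();
secondary.applyOplogEntry(5);
secondary.applyOplogEntry(6);

const replies = [];
secondary.readAfter(10, (operationTime) => replies.push({ operationTime }));
const repliesBeforeCatchUp = replies.length; // the read is still waiting

secondary.applyOplogEntry(11); // replication catches up; the read completes
```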
Breaking clusterTime
Request: insert({x:1}, clusterTime: {0xFFFFFFFF, 0xFFFFFFFE})
LogicalClock:clusterTime = Timestamp(0xFFFFFFFF, 0xFFFFFFFE)
Oplog: {ts: Timestamp(1495470881, 6), term: 1}, ... {ts: Timestamp(1495470881, 5), term: 1}, ... {Timestamp(0xFFFFFFFF, 0xFFFFFFFF)} ...
Protecting clusterTime
"$clusterTime" : {
    "clusterTime" : Timestamp(1495470881, 5),
    "signature" : {
        "hash" : BinData(0,"7olYjQCLtnfORsI9IAhdsftESR4="),
        "keyId" : NumberLong("6422998367101517844")
    }
}
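A minimal sketch of signing and verifying a clusterTime with a keyed hash, using Node's built-in `crypto` module. HMAC-SHA1 is an assumption here (the 20-byte hash on the slide is consistent with it), and the message encoding, key handling, and the names `signClusterTime` and `verifyClusterTime` are illustrative only.

```javascript
const crypto = require("node:crypto");

// Sign a packed 64-bit clusterTime (BigInt) with a shared secret key.
// Assumptions: HMAC-SHA1 and a little-endian 8-byte message encoding.
function signClusterTime(clusterTime, key, keyId) {
  const message = Buffer.alloc(8);
  message.writeBigUInt64LE(clusterTime);
  const hash = crypto.createHmac("sha1", key).update(message).digest("base64");
  return { clusterTime, signature: { hash, keyId } };
}

// Recompute the hash with the key named by keyId and compare in constant time.
function verifyClusterTime(doc, keysById) {
  const key = keysById.get(doc.signature.keyId);
  if (!key) return false;
  const expected = signClusterTime(doc.clusterTime, key, doc.signature.keyId);
  const a = Buffer.from(expected.signature.hash, "base64");
  const b = Buffer.from(doc.signature.hash, "base64");
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// Hypothetical key store with a single demo key.
const keys = new Map([["demo-key", Buffer.from("not-a-real-key")]]);
const packedTime = (1495470881n << 32n) | 5n;

const signed = signClusterTime(packedTime, keys.get("demo-key"), "demo-key");
// A client that inflates the time cannot produce a matching hash.
const forged = { ...signed, clusterTime: 0xFFFFFFFFFFFFFFFEn };
```

A node that receives `forged` would reject it instead of advancing its clock, because the hash no longer matches the claimed clusterTime.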
Protecting against operator errors
Request: insert({x:1}, clusterTime: 100,000)
Node state: <clusterTime>: 6, <Wall clock>: 11
Oplog: {ts: 6} ..., {ts: 5} ...
Response: {Error}, {operationTime: 6}
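The guard on this slide can be sketched as a drift check: a received clusterTime is rejected when it is too far ahead of the node's wall clock, and the node's own clock stays untouched. The limit `MAX_DRIFT` and the function name `advanceClusterTime` are hypothetical, and times are simplified to small integers.

```javascript
// Hypothetical limit on how far a received clusterTime may run ahead of the wall clock.
const MAX_DRIFT = 1000;

// Accept a gossiped clusterTime only if it is within the allowed drift.
function advanceClusterTime(node, received, wallClock, maxDrift = MAX_DRIFT) {
  if (received > wallClock + maxDrift) {
    // Reject the request and keep the local clock as it was.
    return { ok: 0, error: "clusterTime too far ahead of wall clock", operationTime: node.clusterTime };
  }
  node.clusterTime = Math.max(node.clusterTime, received);
  return { ok: 1, operationTime: node.clusterTime };
}

const guarded = { clusterTime: 6 };
const rejected = advanceClusterTime(guarded, 100000, 11); // far-future value
const accepted = advanceClusterTime(guarded, 9, 11);      // plausible value
```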
Signing a range of clusterTime
Request: find({x:1}, clusterTime: <val>, signature: <hash>)
<timeRange> = <val> | 0x0000'0000'0000'FFFF
cache: { <timeRange>: <hash> }
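A sketch of the range idea: the hash is computed over the clusterTime with its low 16 bits set, so all values in the same range share one signature and the expensive HMAC runs once per range. HMAC-SHA1, the message encoding, and the `RangeSigner` class are assumptions for illustration.

```javascript
const crypto = require("node:crypto");

const RANGE_MASK = 0xFFFFn; // low 16 bits, as on the slide

class RangeSigner {
  constructor(key) {
    this.key = key;
    this.cached = null; // single-entry cache: { range, hash }
    this.hmacCalls = 0; // counter kept only to show the saving
  }

  // Upper bound of the range that contains clusterTime.
  timeRange(clusterTime) {
    return clusterTime | RANGE_MASK;
  }

  // Return the signature for the range, recomputing only when the range changes.
  sign(clusterTime) {
    const range = this.timeRange(clusterTime);
    if (this.cached === null || this.cached.range !== range) {
      const message = Buffer.alloc(8);
      message.writeBigUInt64LE(range);
      const hash = crypto.createHmac("sha1", this.key).update(message).digest("base64");
      this.hmacCalls += 1;
      this.cached = { range, hash };
    }
    return this.cached.hash;
  }
}

const signer = new RangeSigner(Buffer.from("not-a-real-key"));
const h1 = signer.sign(0x10000n); // first value in a range: computes the HMAC
const h2 = signer.sign(0x10001n); // same range: served from the cache
const h3 = signer.sign(0x20000n); // next range: computes a new HMAC
```

The trade-off in this sketch is that a signature vouches for every value in its range, so a client could move its clusterTime forward within that range without being detected.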
Use dummy signatures
- When the auth is off
- When a user has advanceClusterTime privilege
How end users see it
let session = db.getMongo().startSession({causalConsistency: true});
db = session.getDatabase(db.getName());
startSession()
Request: update({name: "misha", checking: 100})
Response: {ok:1}, operationTime: 15
Request: find({name: "misha"}), afterClusterTime: 15
Response: {checking: 100}
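What a causally consistent session does on the client side can be sketched with a stub server: the session remembers the greatest operationTime it has seen and attaches it as afterClusterTime to the next request. `makeServer` and `CausalSession` are hypothetical stand-ins, not the driver API, and the request shape is simplified.

```javascript
// Stub server (hypothetical): updates advance its clock, every reply reports operationTime.
function makeServer(startTime) {
  let clusterTime = startTime;
  const received = [];
  return {
    received,
    handle(request) {
      received.push(request);
      if (request.type === "update") clusterTime += 1;
      return { ok: 1, operationTime: clusterTime };
    },
  };
}

// Client-side session that tracks the greatest operationTime it has observed.
class CausalSession {
  constructor(server) {
    this.server = server;
    this.operationTime = null; // nothing observed yet
  }

  run(command) {
    const request = { ...command };
    // Ask the server to wait until it has caught up with what this session saw.
    if (this.operationTime !== null) request.afterClusterTime = this.operationTime;
    const reply = this.server.handle(request);
    if (this.operationTime === null || reply.operationTime > this.operationTime) {
      this.operationTime = reply.operationTime;
    }
    return reply;
  }
}

const server = makeServer(14);
const session = new CausalSession(server);
session.run({ type: "update", filter: { name: "misha" }, set: { checking: 100 } });
session.run({ type: "find", filter: { name: "misha" } });
```

In this run the update is acknowledged with operationTime 15, so the following find is sent with afterClusterTime 15, matching the sequence on the slide.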
Misha Tyulenev misha@mongodb.com

A detailed look at how Causal Consistency is implemented in MongoDB / Mikhail Tyulenev (MongoDB)

Editor's Notes

  • #4 Even if the read request goes to the primary, it is not guaranteed to read its own writes. For example, read concern level = majority may delay it.
  • #11 Add a logical clock object to each cluster node (routers, storage, clients). Every client tracks the greatest operationTime inside a causally consistent session.
  • #12 clusterTime is incremented only on a write to the oplog (storage). (clusterTime + election term) is the primary key in the oplog collection.
  • #13 Every command returns an operationTime: the greatest clusterTime stored with an oplog entry at the time the command finishes its execution.
  • #14 Every request includes the afterClusterTime. A storage node waits for the oplog to replicate an entry with clusterTime >= afterClusterTime.
  • #15 Clients have to participate, but we don't trust the clients. There is a maximum time after which the primary can't do a write, so we want to be sure that all cluster times from clients come from a trusted source.
  • #17 clusterTime is incremented only on a write to the oplog (storage). (clusterTime + election term) is the primary key in the oplog collection.