Traditional disk-oriented database architecture is showing its age. In-memory databases offer significant gains in performance, as all data is freely available: there is no need to page to and from disk. But three issues have stunted their uptake: address spaces only being large enough for a subset of a typical user's data, the 'one more bit' problem, and durability. Distributed in-memory databases solve these three problems, but at the price of losing the single address space. This makes joins a problem: when data must be joined across multiple machines, performance degradation is inevitable. Snowflake Schemas allow us to mix Partitioning and Replication so joins never hit the wire. But this model only goes so far. "Connected Replication" takes us a step further, allowing us to make the best possible use of replication.
The lay of the land: The main architectural constructs in the database industry
Shared Disk
[Chart: operation latencies spanning ms, μs, ns and ps: reading 1MB from disk/network, a cross-continental round trip, a cross-network round trip, reading 1MB from main memory, a main-memory reference, and L1/L2 cache references.] * An L1 cache reference is about 2 clock cycles, or 0.7ns. This is the time it takes light to travel 20cm.
Distributed Cache
Taken from “OLTP Through the Looking Glass, and What We Found There”, Harizopoulos et al.
[Diagram: the Regular Database (Oracle, Sybase, MySQL) evolves by dropping disk into the In-Memory Database (Times Ten, HSQL, KDB), and by distributing into the Shared-Nothing (SN) In-Memory Database (Exasol, VoltDB, Hana). Alongside sit Shared Nothing (Teradata, Vertica, Greenplum…) and Distributed Caching (Coherence, Gemfire, Gigaspaces), with ODC positioned between the shared-nothing in-memory databases and the distributed caches.]
Distributed Architecture: Simplify the Contract. Stick to RAM.
450 processes. 2TB of RAM. Oracle Coherence. Topic-based messaging as the system of record (persistence).
[Architecture diagram: an Access Layer of Java client APIs, a Query Layer, a Data Layer holding Transactions, MTMs and Cashflows, and a Persistence Layer.]
Indexing Partitioning Replication
But your storage is limited by the memory on a node
[Diagram: data partitioned across nodes by key range (e.g. keys Fs-Fz on one node, Xa-Yd on another).] Scalable storage, bandwidth and processing.
[Diagram: successive versions (Version 1 through Version 4) of the replicated Trade, Trader and Party data.] …and you need versioning to do MVCC.
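To make the versioning point concrete, here is a minimal MVCC sketch with hypothetical types (not ODC's actual implementation): each write adds a new version, and a reader sees the newest version at or below its snapshot, so writers never disturb in-flight reads.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative MVCC store: each key maps to every version of its value,
// ordered by a monotonically increasing write version.
final class VersionedStore<K, V> {
    private final ConcurrentMap<K, NavigableMap<Long, V>> store =
            new ConcurrentHashMap<>();

    void put(K key, V value, long version) {
        store.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>())
             .put(version, value);
    }

    // A reader at snapshot 'version' sees the newest write at or below it,
    // so concurrent writers never invalidate an in-flight read.
    V get(K key, long version) {
        NavigableMap<Long, V> versions = store.get(key);
        if (versions == null) return null;
        Map.Entry<Long, V> e = versions.floorEntry(version);
        return e == null ? null : e.getValue();
    }
}
```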
[Diagram: with full replication, every node holds its own copy of Trade, Trader and Party.]
So better to use partitioning, spreading data around the cluster.
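The routing idea, as an illustrative sketch only (Coherence's partitioned cache service performs this key-to-partition routing internally):

```java
// Illustrative only: hash-based routing of keys to partitions.
public class PartitionRouter {
    private final int partitionCount;

    public PartitionRouter(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    // Spread the hash bits, force the result non-negative, then take the
    // modulus so every key lands deterministically in one partition.
    public int partitionFor(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return (h & Integer.MAX_VALUE) % partitionCount;
    }
}
```

Because each node owns only a slice of the keyspace, every node added contributes its own memory, bandwidth and CPU, which is the scalability claim in the diagram that follows.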
[Diagram: Trade, Trader and Party partitioned across the cluster.]
This is what using Snowflake Schemas and the Connected Replication pattern is all about!
Crosscutting Keys Common Keys
Replicated: Trader, Party. Partitioned: Trade.
[Bar chart of record counts, axis 0 to 150,000,000. Facts => big, common keys: Valuation Legs, Valuations, Transaction Mapping, Cashflow Mapping, Party Alias, Transaction, Cashflows, Legs, Parties. Dimensions => small, crosscutting keys: Ledger Book, Source Book, Cost Centre, Product, Risk Organisation Unit, Business Unit, HCS Entity, Set of Books.]
Coherence’s KeyAssociation gives us this: Trades and MTMs share a common key.
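As a hedged sketch of how this might look (the class and field names below are hypothetical, not ODC's), a key can implement Coherence's KeyAssociation interface to nominate another key it should be collocated with:

```java
import com.tangosol.net.cache.KeyAssociation;
import java.io.Serializable;

// Hypothetical key for an MTM entry, collocated with its parent Trade.
public class MtmKey implements KeyAssociation, Serializable {
    private final String mtmId;
    private final String tradeId; // the parent trade's key

    public MtmKey(String mtmId, String tradeId) {
        this.mtmId = mtmId;
        this.tradeId = tradeId;
    }

    // Coherence hashes this associated key, not the whole key, when choosing
    // a partition, so the Trade and its MTMs land on the same node and
    // joins between them stay local.
    @Override
    public Object getAssociatedKey() {
        return tradeId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MtmKey)) return false;
        MtmKey other = (MtmKey) o;
        return mtmId.equals(other.mtmId) && tradeId.equals(other.tradeId);
    }

    @Override
    public int hashCode() {
        return mtmId.hashCode() * 31 + tradeId.hashCode();
    }
}
```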
Replicated: Trader, Party. Partitioned: Trade.
[Diagram: Trader, Party and Trade in the Query Layer; Transactions, MTMs and Cashflows in the Data Layer's partitioned Fact Storage.]
[Diagram: Dimensions (replicate) above the Facts: Transactions, MTMs and Cashflows (distribute/partition), held in partitioned Fact Storage.]
[The same bar chart, axis 0 to 150,000,000, annotated: Facts => big => distribute (Valuation Legs, Valuations, Transaction Mapping, Cashflow Mapping, Party Alias, Transaction, Cashflows, Legs, Parties); Dimensions => small => replicate (Ledger Book, Source Book, Cost Centre, Product, Risk Organisation Unit, Business Unit, HCS Entity, Set of Books).]
We use a variant on a Snowflake Schema: partition the big stuff, which shares a common key, and replicate the small stuff, which has crosscutting keys.
Replicate Distribute
Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’

The query proceeds in three steps, with the facts (Transactions, MTMs, Cashflows) held partitioned:
1. LBs[] = getLedgerBooksFor(CC1); SBs[] = getSourceBooksFor(LBs[]). Now we have all the bottom-level dimensions needed to query facts.
2. Get all Transactions and MTMs (a cluster-side join) for the passed Source Books.
3. Populate the raw facts (Transactions) with dimension data before returning them to the client.
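Put together, the flow might look like the sketch below. This is a minimal sketch assuming hypothetical cache names ("dim-ledger-book", "fact-transaction") and accessor methods (getCostCentre, getLedgerBookId, getSourceBookId); it is not ODC's actual code. The dimension lookups run in-process against replicated caches, and only the final fact query fans out across the cluster:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.filter.EqualsFilter;
import com.tangosol.util.filter.InFilter;
import java.util.Set;

public class CostCentreQuery {
    public Set queryFactsFor(String costCentre) {
        // Replicated dimension caches: reads are local, in-process.
        NamedCache ledgerBooks = CacheFactory.getCache("dim-ledger-book");
        NamedCache sourceBooks = CacheFactory.getCache("dim-source-book");
        // Partitioned fact cache: queries run in parallel across the cluster.
        NamedCache transactions = CacheFactory.getCache("fact-transaction");

        // Step 1: walk the dimensions down to the keys the facts carry.
        Set lbIds = ledgerBooks.keySet(new EqualsFilter("getCostCentre", costCentre));
        Set sbIds = sourceBooks.keySet(new InFilter("getLedgerBookId", lbIds));

        // Step 2: a single parallel query over the partitioned facts; Trades
        // and MTMs share a partition via KeyAssociation, so the join between
        // them never hits the wire.
        Set facts = transactions.entrySet(new InFilter("getSourceBookId", sbIds));

        // Step 3 would enrich these facts with dimension data (again from
        // the local replicated caches) before returning them to the client.
        return facts;
    }
}
```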
[Diagram: the Java client API joins replicated Dimensions with partitioned Facts.] We never have to do a distributed join!
So all the big stuff is held partitioned, and we can join without shipping keys around or creating intermediate results.
[Recap of the earlier diagrams: Trade, Trader and Party partitioned across nodes; Versions 1 through 4 of the replicated data; a full copy of Trade, Trader and Party on every node.]
[Bar chart of record counts, axis 0 to 125,000,000, across the same facts and dimensions. Callout: 'This is a dimension. It has a different key to the Facts. And it's BIG.']
[Bar chart, axis 0 to 5,000,000: record counts for the dimensions Party Alias, Parties, Ledger Book, Source Book, Cost Centre, Product, Risk Organisation Unit, Business Unit, HCS Entity and Set of Books.]
[The same bar chart of dimension record counts, axis 0 to 5,000,000.]
So we only replicate ‘Connected’ or ‘Used’ dimensions
[Diagram: a Processing Layer of replicated Dimension Caches above a Data Layer of partitioned Fact Storage (Transactions, MTMs, Cashflows).] As new Facts are added, the relevant Dimensions that they reference are moved to the processing-layer caches.
[Diagram sequence, saving a Trade through the Query Layer (with its connected dimension caches): the Data Layer stores the fully normalised Trade in partitioned storage, then a trigger caches the dimensions the Trade references (Party Alias, Source Book, Ccy) and recurses to the dimensions those reference in turn (Party, Ledger Book).]
‘Connected Replication’: a simple pattern which recurses through the foreign keys in the domain model, ensuring only ‘Connected’ dimensions are replicated.
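A minimal sketch of that recursion, using hypothetical repository interfaces rather than ODC's real trigger machinery: on each fact write, the trigger follows foreign keys outwards and promotes any dimension it reaches into the replicated caches.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: the real trigger runs inside the data layer, not here.
public class ConnectedReplicationTrigger {

    /** Hypothetical view of an entity's outgoing foreign keys. */
    interface Entity {
        Set<Object> foreignKeys();
    }

    private final Map<Object, Entity> dimensionStore;       // partitioned master copy
    private final Map<Object, Entity> replicatedDimensions; // replicated caches

    public ConnectedReplicationTrigger(Map<Object, Entity> dimensionStore,
                                       Map<Object, Entity> replicatedDimensions) {
        this.dimensionStore = dimensionStore;
        this.replicatedDimensions = replicatedDimensions;
    }

    /** Called whenever a new fact is written. */
    public void onFactWrite(Entity fact) {
        Set<Object> visited = new HashSet<>();
        for (Object fk : fact.foreignKeys()) {
            replicateConnected(fk, visited);
        }
    }

    // Depth-first walk over the foreign keys: any dimension a live fact can
    // reach is 'connected', so it gets promoted to the replicated caches.
    private void replicateConnected(Object key, Set<Object> visited) {
        if (!visited.add(key) || replicatedDimensions.containsKey(key)) {
            return; // already replicated, or already seen on this walk
        }
        Entity dimension = dimensionStore.get(key);
        if (dimension == null) return;
        replicatedDimensions.put(key, dimension);
        for (Object fk : dimension.foreignKeys()) {
            replicateConnected(fk, visited);
        }
    }
}
```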
Java client API. Java schema. Java ‘Stored Procedures’ and ‘Triggers’.
Partitioned Storage
Balancing Replication and Partitioning in a Distributed Java Database

Editor's Notes

  • Big data sets are held distributed and only joined on the grid to collocated objects. Small data sets are held in replicated caches so they can be joined in process (only ‘active’ data is held).