Apache Ignite | Surinder | 2nd March 2022
Agenda  Setting up context  Cache Evolution  Apache Ignite  Data Queries  Compute  Data Partitioning  Eviction policies  Performance Comparison
Stream Consuming Application  Too many reads, writes and updates to the database  Limited DB connections  Can slow down the stream under load
Stream Consuming Application: option 1  Cache serves as the first data layer  Manages persisting data to the database  Processing much faster due to no direct DB access
Stream Consuming Application cont…  Cache serves as a first-class in-memory database  Manages persisting data to native storage  No DB connection or mechanism overhead
Cache Evolution
Cache Evolution  Distributed caches  Shared cache across app instances  Beyond local RAM capacity  Ease of maintenance  No auto sync with DB (yes/no)?  In-app caches  Cached results  More responsive application  Reduced load on DB  Limited to local RAM size
Cache Evolution : Data grids  Benefits  Distributed caches with brains  Compute capabilities  DB Read/Write through  Collocated processing  Better scalability
Cache Evolution : In-memory computing  Memory-centric storage  Scalable to store data in TBs  SQL and transaction support  Collocate related data  DB read/write through  Pluggable external databases  Native storage on disk  No RAM warm-up needed  Compute capabilities  MapReduce  Collocated processing  Better scalability
What is Apache Ignite?  A distributed cache  A distributed in-memory data grid  A distributed in-memory database  High-performance in-memory computing  ANSI SQL-99 compliant  Transactional operations  SQL transactions in beta
Ignite cluster  Group of nodes  Types:  Server : stores data, baseline node  Thick client node : doesn’t store data  Thin client node : not part of the cluster  Attribute-based grouping possible  Scalable  Fault tolerant  Data consistency  Demo
Data Grid  Distributed In-Memory Caching  Read/Write through  Data Consistency  Off-Heap Storage  Distributed SQL  ACID Support  Transactions
Cache Modes  Partitioned : keep the required number of backups  Replicated : everyone knows everything
Cache Queries…  Scan Query : returns data matching a BiPredicate  Predicate is sent to each node  Each node scans its cache  Data consolidated by the requesting node  SQL Query : loads data based on the given SQL  Needs indexing to be enabled  Register indexing in config  Annotations control field visibility  Other queries:  Text Query  Index Query  Continuous Query
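The scan-query flow above can be sketched in plain Java. This is a toy model, not the Ignite API: each "node" is just a local map holding a slice of the cache, the predicate is applied on every node, and the caller consolidates the partial results. All names and data are illustrative.

```java
import java.util.*;
import java.util.function.BiPredicate;

// Toy model of a scan query: the predicate is "sent" to each node,
// each node scans only its local entries, and the requesting side
// consolidates the partial results.
public class ScanQuerySketch {
    // Each "node" holds a slice of the overall cache.
    static final List<Map<Integer, String>> NODES = List.of(
        new HashMap<>(Map.of(1, "alice", 2, "bob")),
        new HashMap<>(Map.of(3, "carol", 4, "dave")));

    static List<String> scan(BiPredicate<Integer, String> pred) {
        List<String> result = new ArrayList<>();
        for (Map<Integer, String> node : NODES)          // predicate goes to every node
            node.forEach((k, v) -> { if (pred.test(k, v)) result.add(v); });
        return result;                                   // consolidated on the caller
    }

    public static void main(String[] args) {
        System.out.println(scan((k, v) -> k % 2 == 1));  // entries with odd keys
    }
}
```

The real Ignite `ScanQuery` works the same way conceptually, which is why an unindexed scan touches every node while an indexed SQL query can prune work up front.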
Data Partitioning  Partitioned caches  Backups  Ensure data availability on node failures  Read from a backup node when the primary node leaves  Demo
Demo Queries  Scan Query  Sql Query  Data collocation  Next week : this slide onwards
Data collocation  Collocate related data for performance  All employees of a dept. can be stored together  Affinity on the dept. attribute  Only a key attribute can be used as the affinity key  Performant CRUD operations  Avoids network trips  Reduced latency  Can cause hot nodes if used inappropriately
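A minimal sketch of why affinity collocation works, assuming the standard hash-mod partitioning described later in the deck (1024 partitions is Ignite's default; the dept numbers are made up). Partitioning an employee record by its department id instead of its own id sends all employees of one department to the same partition, and hence the same node:

```java
// Sketch of affinity collocation: partition by the affinity attribute
// (dept id) rather than the record's own key, so related records land
// in the same partition and therefore on the same node.
public class AffinitySketch {
    static final int PARTITIONS = 1024;   // Ignite's default partition count

    static int partitionFor(Object affinityKey) {
        // floorMod keeps the result non-negative for negative hash codes
        return Math.floorMod(affinityKey.hashCode(), PARTITIONS);
    }

    public static void main(String[] args) {
        int p1 = partitionFor(7);          // employee A, affinity key = dept 7
        int p2 = partitionFor(7);          // employee B, same dept
        System.out.println(p1 == p2);      // true: same partition, same node
    }
}
```

This is also the source of the "hot nodes" warning: if one department is far larger than the rest, one partition (and its node) absorbs a disproportionate share of the data and load.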
Compute Tasks  Run distributed computations on the grid  Tasks can be run on selected nodes  Ignite manages the task distribution  E.g. node-specific aggregates  List each dept.’s students stored on each node  Can be parallelized
Continuous Queries  Exactly-once processing semantics  3 basic components  Cache to monitor for updates  Remote filter to look for data changes  Local listener to act upon data changes  Optional initial query to process pre-existing data  Used to capture data changes on a cache  Use case: reacting to a cache entry change  Listen for a particular state of a cache value  Process the state  Move to the next state
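The filter/listener split above can be illustrated with a toy sketch in plain Java (not the Ignite `ContinuousQuery` API; all names are invented). The remote filter decides on the "server" side which updates are shipped at all, and the local listener reacts only to the ones that pass:

```java
import java.util.*;
import java.util.function.*;

// Toy continuous query: a remote filter decides which updates are
// delivered, a local listener reacts to the ones that pass.
public class ContinuousQuerySketch {
    static Predicate<Integer> remoteFilter;
    static Consumer<Integer> localListener;

    static void put(int value) {               // a "cache update"
        if (remoteFilter.test(value))          // filtered where the data lives
            localListener.accept(value);       // delivered to the subscriber
    }

    public static void main(String[] args) {
        List<Integer> seen = new ArrayList<>();
        remoteFilter = v -> v > 10;            // only ship "large" values
        localListener = seen::add;
        put(5); put(42); put(7); put(99);
        System.out.println(seen);              // [42, 99]
    }
}
```

Pushing the filter to where the data lives is the key design choice: uninteresting updates never cross the network, which is what makes continuous queries cheap enough to leave running.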
Eviction Policies  On-heap [cache level]  LRU : recommended when in doubt  FIFO : ignores the element access order  Sorted : entries ordered by key  Off-heap [data region level]  Random-LRU  Random-2-LRU  Persistence on [page replacement]  Random-LRU  Segmented-LRU  Clock
Persistent Store  CacheStoreAdapter is extendable  Read through  Write through  Write behind  Works behind the cache APIs
Data Distribution  Why distribute data?  Data size can grow beyond a single node’s limits  Load can exceed a single node’s processing limits  Solutions:  Partition the dataset  Migrate to a distributed database  Both involve a set of nodes : the topology
Data Distribution Soln.  Distribution Requirements:  Algorithm  Distribution Uniformity  Minimal disruption  Approaches:  Mod N  Consistent Hashing  Rendezvous(HRW)
Data Distribution in Ignite  Mapping partition to node  Rendezvous hashing  Cluster changes move partitions  Mapping key to partition  Mod N  Number of partitions is fixed  1024 by default
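The two-level scheme above can be sketched in plain Java: a key maps to one of a fixed number of partitions (mod N), and each partition maps to a node via rendezvous (highest-random-weight) hashing. The hash mix below is a stand-in for a real hash function, and the node names are invented; the point is the minimal-disruption property from the previous slide: when a node leaves, only the partitions it owned move.

```java
import java.util.*;

// Sketch of Ignite-style distribution: key -> partition by mod N,
// partition -> node by rendezvous hashing (each node "bids" a weight
// per partition; the highest bidder owns it).
public class RendezvousSketch {
    static final int PARTITIONS = 1024;        // fixed, as in Ignite's default

    static int partition(Object key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    static String nodeFor(int partition, List<String> nodes) {
        String best = null;
        long bestWeight = Long.MIN_VALUE;
        for (String node : nodes) {
            long w = Objects.hash(partition, node);   // stand-in for a real hash mix
            if (w > bestWeight) { bestWeight = w; best = node; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> before = List.of("n1", "n2", "n3");
        List<String> after  = List.of("n1", "n2");    // n3 left the cluster
        int moved = 0;
        for (int p = 0; p < PARTITIONS; p++)
            if (!nodeFor(p, before).equals(nodeFor(p, after))) moved++;
        // Only partitions that lived on n3 move; the rest keep their owner.
        System.out.println(moved + " of " + PARTITIONS + " partitions moved");
    }
}
```

With plain mod-N over the node count instead, removing one node would reshuffle almost every key, which is why rendezvous hashing is used for the partition-to-node step.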
Data Rebalancing  Triggered when a new node joins the grid  In-memory grids start rebalancing immediately  Enabled manually when persistence is enabled  Possibly more backups than configured in such scenarios  Rebalance Modes  SYNC : cache calls are blocked until rebalancing completes  ASYNC : rebalancing happens in the background; the cache responds immediately  NONE : no rebalancing; the cache is loaded on demand or by explicit loading
Partition Map Exchange  Triggered when partitions need to be moved across nodes  A node joins/leaves the cluster  A new cache is created/stopped  An index is created, etc.  Cluster waits for ongoing operations  Oldest/youngest node acts as coordinator
Native Storage Architecture  Work directory  Binary data : internal metadata  Marshaller : marshaller info  DB  Lock file : used to ensure the node lock  Node dir.(s) : cache partitions  cp dir. : checkpoint start/end markers  WAL dir.  Node dir.(s) : WAL segments  Archive dir.  Node dir.(s) : WAL segments
Dirty Pages  Pages always live on disk, optionally in RAM  Each cache update is written to RAM and appended to the WAL  Cache operations cause dirty pages  Dirty pages accumulate in RAM  Checkpoint : a batch of dirty pages written to disk  WAL file cleared after a checkpoint  Updates between checkpoints are logged  Node crashes between checkpoints?  WAL to the rescue
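The crash-recovery logic above can be modeled in a few lines of plain Java (a conceptual sketch, not Ignite's page store): every update goes to RAM and the WAL, a checkpoint flushes state to "disk" and truncates the WAL, and recovery after a crash replays the WAL over the last checkpoint.

```java
import java.util.*;

// Toy WAL + checkpoint model: recovery = last checkpoint + WAL replay.
public class WalSketch {
    Map<String, Integer> ram = new HashMap<>();            // page cache
    Map<String, Integer> disk = new HashMap<>();           // last checkpoint
    List<Map.Entry<String, Integer>> wal = new ArrayList<>();

    void update(String key, int value) {
        wal.add(Map.entry(key, value));                    // WAL append first
        ram.put(key, value);                               // then the dirty page
    }

    void checkpoint() {                                    // flush dirty pages
        disk = new HashMap<>(ram);
        wal.clear();                                       // WAL can now be truncated
    }

    Map<String, Integer> recover() {                       // after a crash
        Map<String, Integer> state = new HashMap<>(disk);
        for (Map.Entry<String, Integer> e : wal)           // replay logged updates
            state.put(e.getKey(), e.getValue());
        return state;
    }

    public static void main(String[] args) {
        WalSketch w = new WalSketch();
        w.update("a", 1);
        w.checkpoint();
        w.update("a", 2); w.update("b", 3);                // "crash" happens here
        System.out.println(w.recover());                   // pre-crash state rebuilt
    }
}
```

Writing to the WAL before the page is the essential ordering: if the node dies at any point, everything acknowledged is either in the last checkpoint or replayable from the log.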
Apache Ignite ~ Cassandra  Insert and update performance is comparable  Read and mixed (read + update) workloads are 2x+ better in Ignite  Cassandra UPDATE outperforms under high load  Cassandra demands upfront query patterns  Major model changes/new tables if  Query changes are required  New queries with different requirements are needed  Ignite supports collocated/non-collocated joins, hence  Queries can be written just like old-school SQL  No major changes required except creating a few indexes if needed  Check the reference slide for more
Next steps  Read the docs  Get hands dirty with Ignite  Explore queries  Ignite compute tasks  Native persistence  Third-party persistence
References  https://ignite.apache.org/docs/latest/  https://www.youtube.com/watch?v=eMs_2vEsbBk  https://dzone.com/articles/apache-ignite-client-connectors-variety  https://apacheignite.readme.io/docs/leader-election  https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood  https://data-science-blog.com/blog/2020/09/25/in-memory-data-grid-vs-distributed-cache-which-is-best/  https://hazelcast.com/blog/imdg-vs-imdb-a-business-level-perspective/  https://www.gridgain.com/resources/blog/apacher-ignitetm-and-apacher-cassandratm-benchmarks-power-in-memory-computing
Questions

Editor's Notes

  • #21 https://ignite.apache.org/docs/latest/memory-configuration/replacement-policies