Migrating Data Pipeline from MongoDB to Cassandra Demi Ben-Ari Cassandra Meetup – 10/11/2015, Israel
About me Demi Ben-Ari Senior Software Engineer at Windward Ltd. B.Sc. Computer Science – Academic College of Tel-Aviv Yaffo In the past: Software Team Leader & Senior Java Software Engineer, Missile Defense and Alert System – "Ofek" unit, IAF
Agenda  Data flow with MongoDB  The Problem  The Solution  Lessons learned from a Newbie  Conclusion
Environment Description  [Diagram] One cluster per environment: Dev, Testing, Live, Staging, Production
Data pipeline flow – Use Case  Batch Apache Spark applications running every 10–60 minutes  Request rate: ◦ Bursts of ~9 million requests per batch job ◦ Beginning of the job – reads ◦ End of the job – writes
Workflow with MongoDB  [Diagram] Spark cluster (Master, Worker 1 … Worker N) writing to and reading from a MongoDB replica set
Spark Slave – Server Specs  Instance type: r3.xlarge  CPUs: 4  RAM: 30.5 GB  Storage: ephemeral  Amount: 10+
MongoDB – Server Specs  MongoDB version: 2.6.1  Instance type: m3.xlarge (AWS)  CPUs: 4  RAM: 15 GB  Storage: EBS  DB size: ~500 GB  Collection indexes: 5 (4 compound)
The Problem  Batch jobs ◦ Should run for 5–10 minutes in total ◦ Actually run for ~40 minutes  Why? ◦ ~20 minutes to write with the Java MongoDB driver – async (Unacknowledged write concern) ◦ ~20 minutes to sync the journal ◦ Total: ~40 minutes of the DB being unavailable ◦ No batch process response and no UI serving
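For context, a minimal sketch (assumed names, not the actual Windward code) of what an unacknowledged write looks like with the MongoDB Java driver of that era (2.x). UNACKNOWLEDGED returns control to the batch job immediately, which is why the job "finishes" while the server is still absorbing the burst and syncing the journal:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

public class UnacknowledgedWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, DB, and collection names, for illustration only.
        MongoClient client = new MongoClient("mongo-primary.internal");
        DB db = client.getDB("analytics");
        DBCollection coll = db.getCollection("batch_results");

        // "Fire and forget": the driver does not wait for the server to acknowledge,
        // so write errors are invisible and the server keeps working long after
        // the Spark job has moved on.
        coll.setWriteConcern(WriteConcern.UNACKNOWLEDGED);
        coll.insert(new BasicDBObject("vesselId", 42L)
                .append("ts", System.currentTimeMillis())
                .append("value", 3.14));

        client.close();
    }
}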
Alternative Solutions  Sharded MongoDB (with replica sets) ◦ Pros:  Increases throughput by the number of shards  Increases the availability of the DB ◦ Cons:  Very hard to manage DevOps-wise (for a small team of developers)  High cost of servers – each shard needs 3 replicas
Workflow with MongoDB (sharded)  [Diagram] Spark cluster (Worker 1 … Worker N) writing to and reading from a sharded MongoDB setup with multiple masters
Our DevOps – after that solution We had no DevOps engineer at all at that time 
Alternative Solutions  DynamoDB (we're hosted on Amazon) ◦ Pros:  No need to manage DevOps ◦ Cons:  A "Catholic wedding" with Amazon's service (vendor lock-in)  Not enough known use cases  Might end up costing a lot for the service
Alternative Solutions  Apache Cassandra ◦ Pros:  Very large developer community  Linearly scalable database  No single-master architecture  Proven to work with distributed engines like Apache Spark ◦ Cons:  We had no experience at all with the database  No geo-spatial index – we needed to implement one ourselves (one common approach is sketched after this slide)
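The slides only note that a geo-spatial index had to be hand-rolled; as one hedged illustration of a common technique (not necessarily Windward's), coordinates can be bucketed into coarse grid cells whose id becomes part of the partition key, so a bounding-box lookup turns into a handful of per-cell partition reads. All names and the cell size below are made up:

/**
 * Grid-cell "geo index" sketch: rounds lat/lon down to a 0.1-degree cell and
 * encodes the cell as a string usable as a partition-key component.
 */
public final class GeoCell {
    private static final double CELL_SIZE_DEG = 0.1; // coarser cells => fewer, larger partitions

    private GeoCell() {}

    public static String cellOf(double lat, double lon) {
        long latCell = (long) Math.floor(lat / CELL_SIZE_DEG);
        long lonCell = (long) Math.floor(lon / CELL_SIZE_DEG);
        return latCell + ":" + lonCell;
    }

    /** All cells covering a bounding box – each one is a single partition to query. */
    public static java.util.List<String> cellsInBox(double minLat, double minLon,
                                                    double maxLat, double maxLon) {
        java.util.List<String> cells = new java.util.ArrayList<String>();
        for (long la = (long) Math.floor(minLat / CELL_SIZE_DEG);
                la <= (long) Math.floor(maxLat / CELL_SIZE_DEG); la++) {
            for (long lo = (long) Math.floor(minLon / CELL_SIZE_DEG);
                    lo <= (long) Math.floor(maxLon / CELL_SIZE_DEG); lo++) {
                cells.add(la + ":" + lo);
            }
        }
        return cells;
    }
}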
The Solution  Migration to Apache Cassandra (steps) ◦ Write to Mongo and Cassandra simultaneously ◦ Easily create a Cassandra cluster using the DataStax Community AMI on AWS ◦ First easy step – use the spark-cassandra-connector  (easy bootstrap move from Spark to Cassandra; a write is sketched after this slide) ◦ Create a monitoring dashboard for Cassandra
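As an illustration of that "first easy step", a hedged sketch of saving an RDD through the spark-cassandra-connector Java API (connector 1.x era); the keyspace, table, and bean are assumptions:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ConnectorWriteSketch {

    /** Hypothetical row bean; the connector maps its properties to table columns. */
    public static class VesselPoint implements Serializable {
        private Long vesselId;
        private Long ts;
        private Double speed;

        public VesselPoint() {}
        public VesselPoint(Long vesselId, Long ts, Double speed) {
            this.vesselId = vesselId; this.ts = ts; this.speed = speed;
        }
        public Long getVesselId() { return vesselId; }
        public void setVesselId(Long vesselId) { this.vesselId = vesselId; }
        public Long getTs() { return ts; }
        public void setTs(Long ts) { this.ts = ts; }
        public Double getSpeed() { return speed; }
        public void setSpeed(Double speed) { this.speed = speed; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("connector-write-sketch")
                .set("spark.cassandra.connection.host", "10.0.0.1"); // placeholder seed node
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<VesselPoint> rdd = sc.parallelize(
                Arrays.asList(new VesselPoint(42L, System.currentTimeMillis(), 12.5)));

        // Write the RDD to keyspace "tracking", table "vessel_points" (hypothetical names).
        javaFunctions(rdd)
                .writerBuilder("tracking", "vessel_points", mapToRow(VesselPoint.class))
                .saveToCassandra();

        sc.stop();
    }
}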
Workflow with Cassandra  [Diagram] Spark cluster (Worker 1 … Worker N) writing directly to and reading from the Cassandra cluster – no single master in the path
Result  Performance improvement ◦ The batch-write parts of the job run in ~3 minutes instead of ~40 minutes with MongoDB  It took 2 weeks to go from "Zero to Hero" and to ramp up a running solution that works without glitches
Lessons learned from a Newbie  Use TokenAwarePolicy when connecting to the cluster – it spreads the load across the coordinators (DataStax Java driver 2.x):

// socketOptions is assumed to be built elsewhere; the contact point is a
// placeholder – build() fails without at least one.
Cluster.Builder builder = Cluster.builder()
    .addContactPoint("10.0.0.1")
    .withSocketOptions(socketOptions)
    .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()));
Cluster cluster = builder.build();
Lessons learned from a Newbie  Monitor everything!!! – all of the metrics ◦ Cassandra ◦ JVM ◦ OS  Feature-flag every parameter of the connection – you'll need it for tuning later (a configuration sketch follows this slide)
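A hedged sketch of what "feature-flag every connection parameter" can look like with the DataStax Java driver 2.x: the tunables are read from system properties (the property names here are made up) so they can be changed per environment without a rebuild:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TunableClusterFactory {

    /** Every tunable comes from configuration; the defaults are placeholders, not recommendations. */
    public static Cluster build() {
        SocketOptions socketOptions = new SocketOptions()
                .setConnectTimeoutMillis(Integer.getInteger("cass.connect.timeout.ms", 5000))
                .setReadTimeoutMillis(Integer.getInteger("cass.read.timeout.ms", 12000));

        PoolingOptions poolingOptions = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL,
                        Integer.getInteger("cass.core.connections.local", 2))
                .setMaxConnectionsPerHost(HostDistance.LOCAL,
                        Integer.getInteger("cass.max.connections.local", 8));

        return Cluster.builder()
                .addContactPoint(System.getProperty("cass.contact.point", "10.0.0.1"))
                .withSocketOptions(socketOptions)
                .withPoolingOptions(poolingOptions)
                .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                .build();
    }
}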
Monitor Everything!!!  DataStax – OpsCenter ◦ Comes bundled with the DataStax Community AMI on AWS
Monitor Everything!!!  Graphite + Grafana ◦ Pluggable metrics – since Cassandra 2.0.x  Cassandra internal metrics  JVM metrics ◦ OS metrics  collectd / StatsD – reporting to Graphite ◦ Should be combined with application-level metrics in the same graphs  Better visibility into correlations between the metrics (an application-level reporting sketch follows this slide)
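For the application-level half, a hedged sketch using the Dropwizard Metrics GraphiteReporter (the same metrics library Cassandra itself builds on); the host, prefix, and metric names are assumptions:

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

public class AppMetricsSketch {
    public static void main(String[] args) throws Exception {
        MetricRegistry registry = new MetricRegistry();

        // Report to the same Graphite the Cassandra/OS metrics go to, so the
        // application series can be overlaid on the same Grafana graphs.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.internal", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("spark.batch")                 // hypothetical prefix
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(30, TimeUnit.SECONDS);

        // Example application-level metric: time the Cassandra write phase of a batch job.
        Timer writePhase = registry.timer("cassandra.write.phase");
        Timer.Context ctx = writePhase.time();
        try {
            Thread.sleep(100); // stand-in for the real write phase
        } finally {
            ctx.stop();
        }
    }
}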
Monitor Everything!!!  Graphite + Grafana
Lessons learned from a Newbie  "nodetool" is your friend ◦ tpstats, cfhistograms, cfstats…  Data modeling ◦ Time-series data ◦ Evenly distributed partitions ◦ Everything becomes more rigid  Know your queries before you model (an example table definition follows this slide)
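To make "know your queries before you model" concrete, a hedged sketch of a time-series table whose partitions stay bounded and evenly distributed; the keyspace, table, and columns are assumptions, not the real Windward schema:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS tracking WITH replication = " +
            "{'class': 'SimpleStrategy', 'replication_factor': 3}");

        // Partition by (vessel_id, day): one vessel's history is split into day-sized,
        // evenly spread partitions. Clustering by ts makes time-window queries a single
        // sequential read inside each partition.
        session.execute(
            "CREATE TABLE IF NOT EXISTS tracking.vessel_points (" +
            "  vessel_id bigint," +
            "  day       text," +        // e.g. '2015-10-11'
            "  ts        timestamp," +
            "  speed     double," +
            "  PRIMARY KEY ((vessel_id, day), ts)" +
            ") WITH CLUSTERING ORDER BY (ts DESC)");

        cluster.close();
    }
}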
Lessons learned from a Newbie  CQL queries ◦ Once we got to know our data model better, it became more efficient performance-wise to use CQL statements instead of the spark-cassandra-connector ◦ Prepared statements, delete queries (of full partitions), range queries… (sketched after this slide)
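A hedged sketch of those three query styles with the DataStax Java driver 2.x, against the hypothetical vessel_points table from the modeling sketch above:

import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CqlQuerySketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect("tracking"); // hypothetical keyspace

        // Prepared once, bound many times inside the batch job.
        PreparedStatement insert = session.prepare(
            "INSERT INTO vessel_points (vessel_id, day, ts, speed) VALUES (?, ?, ?, ?)");
        session.execute(insert.bind(42L, "2015-10-11", new Date(), 12.5));

        // Range query within one partition: sequential thanks to clustering by ts.
        PreparedStatement range = session.prepare(
            "SELECT ts, speed FROM vessel_points WHERE vessel_id = ? AND day = ? AND ts >= ?");
        ResultSet rs = session.execute(range.bind(42L, "2015-10-11", new Date(0L)));
        for (Row row : rs) {
            System.out.println(row.getDate("ts") + " -> " + row.getDouble("speed"));
        }

        // Delete of a full partition: a single partition tombstone instead of row-by-row deletes.
        PreparedStatement deletePartition = session.prepare(
            "DELETE FROM vessel_points WHERE vessel_id = ? AND day = ?");
        session.execute(deletePartition.bind(42L, "2015-10-11"));

        cluster.close();
    }
}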
Useful Cassandra GUI Clients  DevCenter – by DataStax – free  DBeaver – free & open source ◦ Supports a wide variety of databases
Conclusion  Cassandra is a great, linearly scaling distributed database  Monitor as much as you can ◦ Get visibility into what's going on in the cluster  Modeling the data correctly is the key to success  Be ready for your next war ◦ Cassandra performance tuning – you'll get to it for sure
Questions?
Thanks, Resources and Contact  Demi Ben-Ari ◦ LinkedIn ◦ Twitter: @demibenari ◦ Blog: http://progexc.blogspot.com/ ◦ Email: demi.benari@gmail.com ◦ “Big Things” Community  Meetup, YouTube, Facebook, Twitter
