Migrating Data Pipeline from MongoDB to Cassandra Demi Ben-Ari Cassandra Meetup – 10/11/2015, Israel
About me Demi Ben-Ari Senior Software Engineer at Windward Ltd. B.Sc. Computer Science – Academic College of Tel-Aviv Yaffo In the past: Software Team Leader & Senior Java Software Engineer, Missile Defense and Alert System – "Ofek" unit, IAF
Agenda  Data flow with MongoDB  The Problem  The Solution  Lessons learned from a Newbie  Conclusion
Environment Description  [Diagram] One cluster per environment: Dev, Testing, Live, Staging, Production
Data pipeline flow – Use Case  Batch Apache Spark applications running every 10–60 minutes  Request rate: ◦ Bursts of ~9 million requests per batch job ◦ Beginning of the job – reads ◦ End of the job – writes
Workflow with MongoDB  [Diagram] Spark cluster (Master, Worker 1 … Worker N) writing to and reading from a MongoDB replica set
Spark Slave – Server Specs  Instance type: r3.xlarge  CPUs: 4  RAM: 30.5 GB  Storage: ephemeral  Amount: 10+
MongoDB – Server Specs  MongoDB version: 2.6.1  Instance type: m3.xlarge (AWS)  CPUs: 4  RAM: 15 GB  Storage: EBS  DB size: ~500 GB  Collection indexes: 5 (4 compound)
The Problem  Batch jobs ◦ Should run for 5–10 minutes in total ◦ Actually run for ~40 minutes  Why? ◦ ~20 minutes to write with the Java MongoDB driver – async (Unacknowledged write concern) ◦ ~20 minutes to sync the journal ◦ Total: ~40 minutes of the DB being unavailable ◦ No batch process response and no UI serving
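For context, a minimal sketch (assumed names, not the actual Windward code) of what an unacknowledged write looks like with the MongoDB Java driver of that era (2.x). UNACKNOWLEDGED returns control to the batch job immediately, which is why the job "finishes" while the server is still absorbing the burst and syncing the journal:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

public class UnacknowledgedWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, DB, and collection names, for illustration only.
        MongoClient client = new MongoClient("mongo-primary.internal");
        DB db = client.getDB("analytics");
        DBCollection coll = db.getCollection("batch_results");

        // "Fire and forget": the driver does not wait for the server to acknowledge,
        // so write errors are invisible and the server keeps working long after
        // the Spark job has moved on.
        coll.setWriteConcern(WriteConcern.UNACKNOWLEDGED);
        coll.insert(new BasicDBObject("vesselId", 42L)
                .append("ts", System.currentTimeMillis())
                .append("value", 3.14));

        client.close();
    }
}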
Alternative Solutions  Sharded MongoDB (with replica sets) ◦ Pros:  Increases throughput by the number of shards  Increases the availability of the DB ◦ Cons:  Very hard to manage DevOps-wise (for a small team of developers)  High cost of servers – each shard needs 3 replicas
Workflow with MongoDB (sharded)  [Diagram] Spark cluster (Worker 1 … Worker N) writing to and reading from a sharded MongoDB setup with multiple masters
Our DevOps – after that solution We had no DevOps engineer at all at that time 
Alternative Solutions  DynamoDB (we're hosted on Amazon) ◦ Pros:  No need to manage DevOps ◦ Cons:  A "Catholic wedding" with Amazon's service (vendor lock-in)  Not enough known use cases  Might end up costing a lot for the service
Alternative Solutions  Apache Cassandra ◦ Pros:  Very large developer community  Linearly scalable database  No single-master architecture  Proven to work with distributed engines like Apache Spark ◦ Cons:  We had no experience at all with the database  No geo-spatial index – we needed to implement one ourselves (one common approach is sketched after this slide)
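The slides only note that a geo-spatial index had to be hand-rolled; as one hedged illustration of a common technique (not necessarily Windward's), coordinates can be bucketed into coarse grid cells whose id becomes part of the partition key, so a bounding-box lookup turns into a handful of per-cell partition reads. All names and the cell size below are made up:

/**
 * Grid-cell "geo index" sketch: rounds lat/lon down to a 0.1-degree cell and
 * encodes the cell as a string usable as a partition-key component.
 */
public final class GeoCell {
    private static final double CELL_SIZE_DEG = 0.1; // coarser cells => fewer, larger partitions

    private GeoCell() {}

    public static String cellOf(double lat, double lon) {
        long latCell = (long) Math.floor(lat / CELL_SIZE_DEG);
        long lonCell = (long) Math.floor(lon / CELL_SIZE_DEG);
        return latCell + ":" + lonCell;
    }

    /** All cells covering a bounding box – each one is a single partition to query. */
    public static java.util.List<String> cellsInBox(double minLat, double minLon,
                                                    double maxLat, double maxLon) {
        java.util.List<String> cells = new java.util.ArrayList<String>();
        for (long la = (long) Math.floor(minLat / CELL_SIZE_DEG);
                la <= (long) Math.floor(maxLat / CELL_SIZE_DEG); la++) {
            for (long lo = (long) Math.floor(minLon / CELL_SIZE_DEG);
                    lo <= (long) Math.floor(maxLon / CELL_SIZE_DEG); lo++) {
                cells.add(la + ":" + lo);
            }
        }
        return cells;
    }
}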
The Solution  Migration to Apache Cassandra (steps) ◦ Write to Mongo and Cassandra simultaneously ◦ Easily create a Cassandra cluster using the DataStax Community AMI on AWS ◦ First easy step – use the spark-cassandra-connector  (easy bootstrap move from Spark to Cassandra; a write is sketched after this slide) ◦ Create a monitoring dashboard for Cassandra
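As an illustration of that "first easy step", a hedged sketch of saving an RDD through the spark-cassandra-connector Java API (connector 1.x era); the keyspace, table, and bean are assumptions:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ConnectorWriteSketch {

    /** Hypothetical row bean; the connector maps its properties to table columns. */
    public static class VesselPoint implements Serializable {
        private Long vesselId;
        private Long ts;
        private Double speed;

        public VesselPoint() {}
        public VesselPoint(Long vesselId, Long ts, Double speed) {
            this.vesselId = vesselId; this.ts = ts; this.speed = speed;
        }
        public Long getVesselId() { return vesselId; }
        public void setVesselId(Long vesselId) { this.vesselId = vesselId; }
        public Long getTs() { return ts; }
        public void setTs(Long ts) { this.ts = ts; }
        public Double getSpeed() { return speed; }
        public void setSpeed(Double speed) { this.speed = speed; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("connector-write-sketch")
                .set("spark.cassandra.connection.host", "10.0.0.1"); // placeholder seed node
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<VesselPoint> rdd = sc.parallelize(
                Arrays.asList(new VesselPoint(42L, System.currentTimeMillis(), 12.5)));

        // Write the RDD to keyspace "tracking", table "vessel_points" (hypothetical names).
        javaFunctions(rdd)
                .writerBuilder("tracking", "vessel_points", mapToRow(VesselPoint.class))
                .saveToCassandra();

        sc.stop();
    }
}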
Workflow with Cassandra  [Diagram] Spark cluster (Worker 1 … Worker N) writing directly to and reading from the Cassandra cluster – no single master in the path
Result  Performance improvement ◦ The batch-write parts of the job run in ~3 minutes instead of ~40 minutes with MongoDB  It took 2 weeks to go from "Zero to Hero" and to ramp up a running solution that works without glitches
Lessons learned from a Newbie  Use TokenAwarePolicy when connecting to the cluster – it spreads the load across the coordinators (DataStax Java driver 2.x):

// socketOptions is assumed to be built elsewhere; the contact point is a
// placeholder – build() fails without at least one.
Cluster.Builder builder = Cluster.builder()
    .addContactPoint("10.0.0.1")
    .withSocketOptions(socketOptions)
    .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()));
Cluster cluster = builder.build();
Lessons learned from a Newbie  Monitor everything!!! – all of the metrics ◦ Cassandra ◦ JVM ◦ OS  Feature-flag every parameter of the connection – you'll need it for tuning later (a configuration sketch follows this slide)
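A hedged sketch of what "feature-flag every connection parameter" can look like with the DataStax Java driver 2.x: the tunables are read from system properties (the property names here are made up) so they can be changed per environment without a rebuild:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TunableClusterFactory {

    /** Every tunable comes from configuration; the defaults are placeholders, not recommendations. */
    public static Cluster build() {
        SocketOptions socketOptions = new SocketOptions()
                .setConnectTimeoutMillis(Integer.getInteger("cass.connect.timeout.ms", 5000))
                .setReadTimeoutMillis(Integer.getInteger("cass.read.timeout.ms", 12000));

        PoolingOptions poolingOptions = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL,
                        Integer.getInteger("cass.core.connections.local", 2))
                .setMaxConnectionsPerHost(HostDistance.LOCAL,
                        Integer.getInteger("cass.max.connections.local", 8));

        return Cluster.builder()
                .addContactPoint(System.getProperty("cass.contact.point", "10.0.0.1"))
                .withSocketOptions(socketOptions)
                .withPoolingOptions(poolingOptions)
                .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                .build();
    }
}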
Monitor Everything!!!  DataStax – OpsCenter ◦ Comes bundled with the DataStax Community AMI on AWS
Monitor Everything!!!  Graphite + Grafana ◦ Pluggable metrics – since Cassandra 2.0.x  Cassandra internal metrics  JVM metrics ◦ OS metrics  collectd / StatsD – reporting to Graphite ◦ Should be combined with application-level metrics in the same graphs  Better visibility into correlations between the metrics (an application-level reporting sketch follows this slide)
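For the application-level half, a hedged sketch using the Dropwizard Metrics GraphiteReporter (the same metrics library Cassandra itself builds on); the host, prefix, and metric names are assumptions:

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

public class AppMetricsSketch {
    public static void main(String[] args) throws Exception {
        MetricRegistry registry = new MetricRegistry();

        // Report to the same Graphite the Cassandra/OS metrics go to, so the
        // application series can be overlaid on the same Grafana graphs.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.internal", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("spark.batch")                 // hypothetical prefix
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(30, TimeUnit.SECONDS);

        // Example application-level metric: time the Cassandra write phase of a batch job.
        Timer writePhase = registry.timer("cassandra.write.phase");
        Timer.Context ctx = writePhase.time();
        try {
            Thread.sleep(100); // stand-in for the real write phase
        } finally {
            ctx.stop();
        }
    }
}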
Monitor Everything!!!  Graphite + Grafana
Lessons learned from a Newbie  "nodetool" is your friend ◦ tpstats, cfhistograms, cfstats…  Data modeling ◦ Time-series data ◦ Evenly distributed partitions ◦ Everything becomes more rigid  Know your queries before you model (an example table definition follows this slide)
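To make "know your queries before you model" concrete, a hedged sketch of a time-series table whose partitions stay bounded and evenly distributed; the keyspace, table, and columns are assumptions, not the real Windward schema:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS tracking WITH replication = " +
            "{'class': 'SimpleStrategy', 'replication_factor': 3}");

        // Partition by (vessel_id, day): one vessel's history is split into day-sized,
        // evenly spread partitions. Clustering by ts makes time-window queries a single
        // sequential read inside each partition.
        session.execute(
            "CREATE TABLE IF NOT EXISTS tracking.vessel_points (" +
            "  vessel_id bigint," +
            "  day       text," +        // e.g. '2015-10-11'
            "  ts        timestamp," +
            "  speed     double," +
            "  PRIMARY KEY ((vessel_id, day), ts)" +
            ") WITH CLUSTERING ORDER BY (ts DESC)");

        cluster.close();
    }
}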
Lessons learned from a Newbie  CQL queries ◦ Once we got to know our data model better, it became more efficient performance-wise to use CQL statements instead of the spark-cassandra-connector ◦ Prepared statements, delete queries (of full partitions), range queries… (sketched after this slide)
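A hedged sketch of those three query styles with the DataStax Java driver 2.x, against the hypothetical vessel_points table from the modeling sketch above:

import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CqlQuerySketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect("tracking"); // hypothetical keyspace

        // Prepared once, bound many times inside the batch job.
        PreparedStatement insert = session.prepare(
            "INSERT INTO vessel_points (vessel_id, day, ts, speed) VALUES (?, ?, ?, ?)");
        session.execute(insert.bind(42L, "2015-10-11", new Date(), 12.5));

        // Range query within one partition: sequential thanks to clustering by ts.
        PreparedStatement range = session.prepare(
            "SELECT ts, speed FROM vessel_points WHERE vessel_id = ? AND day = ? AND ts >= ?");
        ResultSet rs = session.execute(range.bind(42L, "2015-10-11", new Date(0L)));
        for (Row row : rs) {
            System.out.println(row.getDate("ts") + " -> " + row.getDouble("speed"));
        }

        // Delete of a full partition: a single partition tombstone instead of row-by-row deletes.
        PreparedStatement deletePartition = session.prepare(
            "DELETE FROM vessel_points WHERE vessel_id = ? AND day = ?");
        session.execute(deletePartition.bind(42L, "2015-10-11"));

        cluster.close();
    }
}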
Useful Cassandra GUI Clients  DevCenter – by DataStax – free  DBeaver – free & open source ◦ Supports a wide variety of databases
Conclusion  Cassandra is a great, linearly scaling distributed database  Monitor as much as you can ◦ Get visibility into what's going on in the cluster  Modeling the data correctly is the key to success  Be ready for your next war ◦ Cassandra performance tuning – you'll get to it for sure
Questions?
Thanks, Resources and Contact  Demi Ben-Ari ◦ LinkedIn ◦ Twitter: @demibenari ◦ Blog: http://progexc.blogspot.com/ ◦ Email: demi.benari@gmail.com ◦ “Big Things” Community  Meetup, YouTube, Facebook, Twitter
