Enterprise Grade Spark Processing at Totango 2015-11-10 Oren Raboy, VP Eng. @ Totango
AGENDA (PART 1) • Background about Totango and our data architecture • Spark in the Totango Architecture • Quality: Testing Spark code in production
About us: Founded: 2010 Offices: TLV, SF Team: ~60 Customers: ~200
We help online businesses make their customers more successful through the use of data.
Totango: ~500M accounts, ~$5B revenue under management, ~100M events per day
Our Customers: The World's Leading Cloud Services
ANALYTICS • Usage Metrics • Trends over time • Trends across customers • Health score
AUTOMATION • Alerts • Triggered Workflows • Email Campaigns
Totango Data Architecture
[Diagram: Pixel, 3rd Party (SFDC) and CSV sources → Collection (ELB, Kinesis) → Real-time processing and Batch processing (Kinesis, S3) → Serving Layer]
• ‘Lambda Architecture’
• Hosted on AWS
• AWS and open-source technologies
• Java with a dash of Python
Batch Processing
• Executed once a day (at midnight in the customer’s local time)
• Each task calculates a set of account metrics (e.g. Health, Change)
• One Spark cluster runs all tasks for all customers
• Pipeline executed by Pipeline Runner, using Spotify’s Luigi
[Pipeline diagram: Raw Events → metric-calculation tasks → dependent computations → merge results into final document → Account Documents]
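The deck names Spotify’s Luigi as the pipeline runner but does not show the task code. As a rough illustration of the dependency structure above, here is a plain-Python stand-in (task names are hypothetical, not Totango’s actual tasks; real Luigi would express this with `Task.requires()`):

```python
# Minimal sketch of the batch pipeline's dependency graph. This is a
# plain-Python stand-in for Spotify's Luigi; task names are illustrative.

# Each task maps to the upstream tasks whose outputs it consumes.
DEPENDENCIES = {
    "raw_events": [],
    "calc_health_metric": ["raw_events"],
    "calc_change_metric": ["raw_events"],
    "dependent_computation": ["calc_health_metric"],
    "merge_final_document": ["calc_change_metric", "dependent_computation"],
}

def run_order(deps):
    """Return tasks in an order where every dependency runs first."""
    done, order = set(), []

    def visit(task):
        if task in done:
            return
        for upstream in deps[task]:
            visit(upstream)          # schedule dependencies before the task
        done.add(task)
        order.append(task)

    for task in deps:
        visit(task)
    return order

print(run_order(DEPENDENCIES))
```

Luigi adds what this sketch omits: persisted targets so completed tasks are skipped on re-runs, and retries on failure.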
Environment
• Multi-tenant: shared infrastructure for all Totango customers (‘Services’)
• Daily, hourly and on-demand schedules
• Standalone Spark cluster on AWS EC2 instances
• Input and output on S3; final results also indexed in Elasticsearch
[Diagram: the same pipeline (Raw Events → metric tasks → merge results into final document → Account Documents), instantiated once per service: Service A … Service XYZ]
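With a shared cluster and S3 for input and output, each service’s data presumably lives under tenant- and date-scoped S3 prefixes. A hypothetical layout (bucket and key names are invented for illustration, not Totango’s actual paths):

```python
from datetime import date

def s3_paths(service_id, run_date):
    """Build hypothetical tenant-scoped S3 prefixes for one daily run."""
    day = run_date.strftime("%Y-%m-%d")
    return {
        # one daily partition of raw events per service (tenant)
        "raw_events": f"s3://totango-raw-events/{service_id}/{day}/",
        # the merged final documents for that service and day
        "account_documents": f"s3://totango-account-documents/{service_id}/{day}/",
    }

paths = s3_paths("service_a", date(2015, 11, 10))
print(paths["raw_events"])  # s3://totango-raw-events/service_a/2015-11-10/
```

Keying every path by service and day keeps tenants isolated on shared infrastructure and makes a single day cheap to reprocess.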
Challenge: Quality
Requirements from infrastructure:
• Reliability: calculate metrics accurately at all times
• Velocity: frequent releases of new data-processing code
Challenge: high-quality and highly automated regression testing
[Diagram: the pipeline with a NEW VERSION of one metric task. How do we make sure the new version didn’t break anything?]
Testing In Production: How
• Before deployment, run the release candidate ‘side by side’ with the older version
• The new version runs in shadow mode and does not propagate results
• Compare old and new version results; output unexpected diffs
• Deploy to production only if there are no diffs across all customer data sets
[Diagram: OLD VERSION and NEW VERSION (SHADOW) pipelines run on the same Raw Events; a ‘compare csv’ step checks their outputs]
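The compare step can be sketched as a generic document diff. The data shapes here are assumptions (the deck only says old and new results are compared and unexpected diffs are output):

```python
def diff_results(old, new):
    """Compare per-account metric documents from the old and new (shadow) runs.

    old/new: dict mapping account_id -> dict of metric name -> value.
    Returns (account_id, metric, old_value, new_value) tuples for every
    mismatch; an account or metric missing on one side shows up as None.
    """
    diffs = []
    for account in sorted(set(old) | set(new)):
        old_doc, new_doc = old.get(account, {}), new.get(account, {})
        for metric in sorted(set(old_doc) | set(new_doc)):
            a, b = old_doc.get(metric), new_doc.get(metric)
            if a != b:
                diffs.append((account, metric, a, b))
    return diffs

old = {"acct1": {"health": "good", "change": 3}}
new = {"acct1": {"health": "good", "change": 4}}
print(diff_results(old, new))  # [('acct1', 'change', 3, 4)]
```

An empty diff list across all customer data sets is the gate for promoting the release candidate to production.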
Deployment Flow
1. Unit testing
2. Test environment: integration testing
3. Side-by-side testing in production of the new code
4. New code rolled out, old version kept side by side as backup
5. Rollout complete!
• We know the new version works correctly
• We do not need to think of all the corner test cases
• We do not need to write lots of regression tests
QUESTIONS? • labs.totango.com <-- engineering team blog • oren@totango.com <-- me! • Yes, we are hiring!