1© Cloudera, Inc. All rights reserved. Data Science at Scale Using Apache Spark and Apache Hadoop
2. About Sean
  • Data Science @ Cloudera
  • Oryx project founder
  • Committer, erstwhile VP Apache Mahout
  • Apache Spark committer
  • Co-author, "Advanced Analytics with Spark"
  • sowen@cloudera.com @sean_r_owen
3. About Tom
  • Principal Curriculum Developer @ Cloudera
  • Previously: software engineer/instructor
  • Original author of four Cloudera training courses
    • Data Science
    • Data Analyst
    • Big Data Applications
    • Cloudera Search
4. What is Data Science?
5. Describing a Data Scientist
  • Engineer
    • Programming languages/Systems languages
    • Scale up to huge data
    • Automating, operational
  • Statistician
    • Visual, high-level languages
    • Choosing, tuning models
    • Offline / Ad-hoc
  • Business Expert
6. Data Scientist vs. Data Analyst/Statistician
  • Data scientists write code that is designed to be used by other people
    • Modular
    • Includes unit tests
    • Even documentation (sometimes)
  • Familiar with the process of building and deploying production software
    • Source control
    • Dependency management
    • Code reviews
7. Data Scientist vs. Software Engineer
  • Most software engineers do not understand the assumptions behind statistical models
    • Independence
    • Normally distributed errors
  • Common errors/pitfalls when conducting data analysis
    • Correlation != causation
    • Simpson’s Paradox
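Simpson’s Paradox is easier to see with numbers. The sketch below uses hypothetical counts, invented purely for illustration (the options "A" and "B", the groups "easy" and "hard", and every figure are assumptions, not data from the course). It compares success rates within each subgroup and then on the pooled data.

```python
# Hypothetical counts, invented for illustration only:
# (successes, trials) for options A and B within two subgroups.
counts = {
    "easy": {"A": (9, 10), "B": (80, 100)},
    "hard": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials


def pooled_rate(option):
    """Success rate for one option after merging all subgroups."""
    successes = sum(counts[group][option][0] for group in counts)
    trials = sum(counts[group][option][1] for group in counts)
    return rate(successes, trials)


# Compare the two options inside each subgroup.
for group, by_option in counts.items():
    a = rate(*by_option["A"])
    b = rate(*by_option["B"])
    print(f"{group}: A={a:.2f} B={b:.2f}")

# Compare the two options on the pooled data.
print(f"pooled: A={pooled_rate('A'):.2f} B={pooled_rate('B'):.2f}")
```

With these made-up counts, A has the higher rate inside each subgroup, yet B has the higher pooled rate, because most of A's trials fall in the "hard" group. Whether the subgroup view or the pooled view is the right one depends on the question being asked.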
8. Why Companies Need Data (Scientists)
  • "Experts estimate the amount of data some companies hold could be worth as much as $8 trillion, according to a Wall Street Journal report, while a study conducted by tech research firm IDC at the end of 2013 predicted the big data market would hit $16.1 billion in 2014." http://www.skilledup.com/insights/tackle-data-scientist-skills-gap-company
  • "The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data." http://www.mckinsey.com/features/big_data
9. Data Products Are The New Competitive Advantage
10. Build Data Products: More Questions, Less Time
11. Apache Spark and Hadoop for Data Science
12. Hadoop Data Science: Tools of the Trade
13. Trade-offs of the Tools (Investigative vs. Operational)
  • Data
    • Investigative: Historical Subset, Sample
    • Operational: Production Data, Large-Scale
  • Context
    • Investigative: Workstation, Ad Hoc Investigation
    • Operational: Shared Cluster, Continuous Operation
  • Metrics
    • Investigative: Offline, Accuracy
    • Operational: Online, Throughput, QPS
  • Library
    • Investigative: Many, Sophisticated
    • Operational: Few, Simple
  • Language
    • Investigative: Scripting, High Level, Ease of Development
    • Operational: Systems Language, Performance
14. [Chart: R placed on the Investigative/Operational trade-off grid of slide 13 (Data, Context, Metrics, Library, Language)]
15. [Chart: Crunch and Mahout placed on the Investigative/Operational trade-off grid of slide 13 (Data, Context, Metrics, Library, Language)]
16. Apache Spark: Something for Everyone
  • Apache TLP in 2014
  • Scala-based
    • Expressive, efficient
    • JVM-based
  • Scala-like API
    • Distributed works like local
    • Collection-like, as Crunch is
  • REPL
    • Interactive
    • Distributed
  • Hadoop-friendly
    • Integrate with where data already is
    • ETL no longer separate
  • Spark MLlib
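The "distributed works like local" point can be sketched in Python, one of the course tools. The block below is a minimal illustration, not material from the course: the `readings` list and the function name are hypothetical. It runs a filter/map/reduce pipeline on an ordinary local collection; the trailing comment shows the same pipeline written against the PySpark RDD methods `parallelize`, `filter`, `map`, and `reduce`, and assumes an existing SparkContext named `sc`.

```python
from functools import reduce

# Hypothetical input: one numeric reading per record.
readings = [3, 8, 15, 4, 23, 42]


def sum_of_squares_of_evens(values):
    """Local pipeline: keep even values, square them, add them up."""
    evens = filter(lambda v: v % 2 == 0, values)
    squares = map(lambda v: v * v, evens)
    return reduce(lambda a, b: a + b, squares)


local_result = sum_of_squares_of_evens(readings)
print(local_result)

# Distributed equivalent (not executed here; assumes a SparkContext `sc`):
#   sc.parallelize(readings) \
#     .filter(lambda v: v % 2 == 0) \
#     .map(lambda v: v * v) \
#     .reduce(lambda a, b: a + b)
```

The lambdas are identical in both versions; only the collection they are applied to changes, which is what makes it practical to prototype on a sample and then run on the full data.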
17. [Chart: Spark placed on the Investigative/Operational trade-off grid of slide 13 (Data, Context, Metrics, Library, Language)]
18. Course details
19. Who is this course for?
  • People with a mathematical or scientific background working in an analytical role, OR
  • Software engineers with experience in any of these areas
  • Experience with any scripting language (e.g. Python) and comfort with Linux or UNIX are required
20. What will you learn?
  Introduction to a complete data science problem
  • Acquiring data
  • Data cleansing and transformation
  • Evaluating data using various statistical methods
  • Designing experiments
  • Deploying recommender systems to production
  • Evaluating results
21. What are the course tools?
  • Hadoop Streaming
  • Python
  • Hive
  • R
  • Apache Spark MLlib (pyspark)
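To show how the first two tools fit together, here is a minimal word-count sketch in the Hadoop Streaming style. It is an illustration under stated assumptions, not course material: the function names and the sample lines are hypothetical. In a real streaming job the mapper and reducer would be separate scripts that read records from standard input and write tab-separated key/value lines to standard output, with the framework sorting the mapper output by key in between; here `sorted()` stands in for that step so the logic can be tried locally.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; records must arrive sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run on hypothetical sample lines.
# sorted() plays the role of the shuffle/sort between map and reduce.
sample = ["Spark and Hadoop", "spark MLlib"]
for record in reducer(sorted(mapper(sample))):
    print(record)
```

Writing both stages as generators over lines keeps the logic independent of how the records arrive, so the same functions can be tested on a workstation before being wrapped in scripts for the cluster.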
22. What’s new in the course?
  Spark has taken over as a preferred tool for data scientists.
  • Apache Spark MLlib is in
    • Sub-project of Spark
    • Data scientists are using Spark and are starting to use the built-in machine learning library
  • Mahout is out
    • What is it? A basic library of machine learning algorithms
    • Why is it going? Integrations brought by Spark have made Spark the preferred tool for at-scale machine learning
23. Thank you


Editor's Notes

  • #6 You’re the one (the SME) who understands the business you operate in; we can never teach that knowledge. We are here to give you the tools to analyze it.
  • #7 Code not just for their own use
  • #8 Easy to get nonsense. Simpson’s paradox can be an artifact or real. 1964 Civil Rights Act vote example.
  • #11 Building infrastructure. Don’t want data scientists, want INFRA, MODELS, INSIGHTS. More Qs asked == better results/insights. Moneyball example: buying wins, for which you need to buy runs, not players. For that, you need to be able to ask and answer lots of questions, using the minimum resources. Fewer questions than a data analyst, better than software engineers. Qs == runs in the Moneyball example.
  • #20 Someone coming from the business org. Software engineers with experience in these areas. A more difficult case for this course to accommodate is a student with neither a computing nor an analytics background, although a dedicated student should be able to complete the course from that starting point as well.