Data Science at Scale Using Apache Spark and Apache Hadoop

© Cloudera, Inc. All rights reserved.

About Sean
• Data Science @ Cloudera
• Oryx project founder
• Committer, erstwhile VP Apache Mahout
• Apache Spark committer
• Co-author, "Advanced Analytics with Spark"
• sowen@cloudera.com @sean_r_owen

About Tom
• Principal Curriculum Developer @ Cloudera
• Previously: software engineer/instructor
• Original author of four Cloudera training courses
  • Data Science
  • Data Analyst
  • Big Data Applications
  • Cloudera Search

Describing a Data Scientist
• Engineer
  • Programming languages/Systems languages
  • Scale up to huge data
  • Automating, operational
• Statistician
  • Visual, high-level languages
  • Choosing, tuning models
  • Offline / Ad-hoc
• Business Expert

Data Scientist vs. Data Analyst/Statistician
• Data scientists write code that is designed to be used by other people
  • Modular
  • Includes unit tests
  • Even documentation (sometimes)
• Familiar with the process of building and deploying production software
  • Source control
  • Dependency management
  • Code reviews

Data Scientist vs. Software Engineer
• Most software engineers do not understand the assumptions behind statistical models
  • Independence
  • Normally distributed errors
• Common errors/pitfalls when conducting data analysis
  • Correlation != causation
  • Simpson’s Paradox
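Simpson's Paradox is easier to see with numbers than with a definition. The sketch below uses illustrative counts that are not from the slides: two treatments, `A` and `B`, each tried on `small` and `large` cases. The names `outcomes`, `rate`, and `overall_rate` are hypothetical helpers chosen for this example.

```python
# Illustrative counts (an assumption, not from the slides):
# (successes, trials) for each treatment within each subgroup.
outcomes = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Success rate for a single subgroup."""
    return successes / trials


def overall_rate(groups):
    """Success rate after pooling all subgroups of one treatment."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(t for _, t in groups.values())
    return total_successes / total_trials


# Within every subgroup, A has the higher success rate...
a_wins_each_group = all(
    rate(*outcomes["A"][g]) > rate(*outcomes["B"][g])
    for g in ("small", "large")
)

# ...yet B has the higher rate once the subgroups are pooled,
# because A was tried mostly on the harder "large" cases.
b_wins_overall = overall_rate(outcomes["B"]) > overall_rate(outcomes["A"])

print(a_wins_each_group, b_wins_overall)
```

Both flags come out true for these counts: the per-group comparison and the pooled comparison point in opposite directions, which is why an analysis that ignores group sizes can reach the wrong conclusion.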

Why Companies Need Data (Scientists)
• "Experts estimate the amount of data some companies hold could be worth as much as $8 trillion, according to a Wall Street Journal report, while a study conducted by tech research firm IDC at the end of 2013 predicted the big data market would hit $16.1 billion in 2014."
  http://www.skilledup.com/insights/tackle-data-scientist-skills-gap-company
• "The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data."
  http://www.mckinsey.com/features/big_data

Trade-offs of the Tools

           Investigative                                 Operational
Data       Historical Subset, Sample                     Production Data, Large-Scale
Context    Workstation, Ad Hoc Investigation, Offline    Shared Cluster, Continuous Operation, Online
Metrics    Accuracy                                      Throughput, QPS
Library    Many, Sophisticated                           Few, Simple
Language   Scripting, High Level, Ease of Development    Systems Language, Performance

[Figure: Trade-offs of the Tools chart (Data, Context, Metrics, Library, Language; Investigative vs. Operational), annotated with R]

[Figure: Trade-offs of the Tools chart (Data, Context, Metrics, Library, Language; Investigative vs. Operational), annotated with Crunch, Mahout]

Apache Spark: Something for Everyone
• Apache TLP in 2014
• Scala-based
  • Expressive, efficient
  • JVM-based
• Scala-like API
  • Distributed works like local
  • Collection-like, like Crunch
• REPL
  • Interactive
  • Distributed
• Hadoop-friendly
  • Integrate with where data already is
  • ETL no longer separate
• Spark MLlib
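The "distributed works like local" point can be illustrated with a word count written as a chain of collection operations. The sketch below is plain local Python, not Spark code; the sample `lines` and the helper `add_pair` are assumptions made for this example. The comments name the Spark RDD operations (`flatMap`, `map`, `reduceByKey`) that each local step resembles.

```python
from functools import reduce

# Hypothetical input; in Spark this would typically come from a file or table.
lines = ["spark makes distributed work", "distributed works like local"]

# Resembles rdd.flatMap(lambda line: line.split()): one line becomes many words.
words = [word for line in lines for word in line.split()]

# Resembles rdd.map(lambda w: (w, 1)): pair each word with a count of one.
pairs = [(word, 1) for word in words]


def add_pair(counts, pair):
    """Fold one (word, n) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


# Resembles rdd.reduceByKey(lambda a, b: a + b): sum the counts per word.
counts = reduce(add_pair, pairs, {})

print(counts["distributed"])

# A PySpark version has the same shape, assuming an existing SparkContext `sc`
# and a hypothetical input `path`:
#   sc.textFile(path).flatMap(lambda line: line.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
```

The local and distributed versions read almost the same. This is what a collection-like API offers someone moving from a workstation sample to cluster-scale data.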

[Figure: Trade-offs of the Tools chart (Data, Context, Metrics, Library, Language; Investigative vs. Operational), annotated with Spark]

Who is this course for?
• Mathematical or scientific background working in an analytical role
OR
• Software engineers with experience in any of these areas
• Experience with any scripting language (e.g. Python) and comfortable with Linux or UNIX required

What will you learn?
Introduction to a complete data science problem
• Acquiring data
• Data cleansing and transformation
• Evaluating data using various statistical methods
• Designing experiments
• Deploying recommender systems to production
• Evaluating results

What’s new in the course?

Apache Spark MLlib is in
• Spark has taken over as a preferred tool for data scientists.
• MLlib: Sub-project of Spark. Data scientists are using Spark and are starting to use the built-in machine learning library.

Mahout is out
• What is it? A basic library of machine learning algorithms.
• Why is it going? Integrations brought by Spark have made Spark the preferred tool for at-scale machine learning.
