Building Custom Machine Learning Algorithms with Apache SystemML • Fred Reiss • Chief Architect, IBM Spark Technology Center • Member, IBM Academy of Technology
Roadmap • What is Apache SystemML? • Demo! • How to get SystemML
What is Apache SystemML?
Origins of the SystemML Project • Timeline: 2015, 2016 • You are here.
Timeline: 2011, 2012, 2013, 2014
Timeline: 2007, 2008, 2009, 2010 • 2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop. • 2009: We form a dedicated team for scalable ML. • 2009-2010: Through engagements with customers, we observe how data scientists create ML solutions.
Case Study: An Auto Manufacturer • Warranty Claims • Repair History • Diagnostic Readouts • Predict Reacquired Cars
Case Study: An Auto Manufacturer • Warranty Claims • Repair History • Diagnostic Readouts • Features • Labels • Machine Learning • Algorithm, Algorithm, Algorithm, Algorithm • Predict Reacquired Cars • False Positives • Result: 25x improvement in precision!
The Iterative Development Process • Build a pipeline • Results good enough? (Yes / No) • On No: Customize part of the pipeline
State-of-the-Art: Small Data • Data Scientist • R or Python • Personal Computer • Data • Results
State-of-the-Art: Big Data • Data Scientist • R or Python • Systems Programmer • Scala • Results
State-of-the-Art: Big Data • Data Scientist • R or Python • Systems Programmer • Scala • Results • 😞 Days or weeks per iteration • 😞 Errors while translating algorithms
The SystemML Vision • Data Scientist • R or Python • SystemML • Results
The SystemML Vision • Data Scientist • R or Python • SystemML • Results • 😃 Fast iteration • 😃 Same answer
Timeline: 2007, 2008, 2009, 2010 • 2007-2008: Multiple projects at IBM Research – Almaden involving machine learning on Hadoop. • 2009: We form a dedicated team for scalable ML. • 2009-2010: Through engagements with customers, we observe how data scientists create machine learning algorithms.
Timeline: 2011, 2012, 2013, 2014 • Research
Timeline: 2015, 2016 • Apache SystemML • June 2015: IBM announces open-source SystemML • September 2015: Code available on GitHub • November 2015: SystemML enters Apache incubation • February 2016: First release (0.9) of Apache SystemML • June 2016: Second Apache release (0.10)
SystemML at • Built algorithms for predicting treatment outcomes – Substantial improvement in accuracy • Moved from Hadoop MapReduce to Spark – SystemML supports both frameworks – Exact same code – 300X faster on 1/40th as many nodes
SystemML at Cadent Technology • “SystemML allows Cadent to implement advanced numerical programming methods in Apache Spark, empowering us to leverage specialized algorithms in our predictive analytics software.” Michael Zargham, Chief Scientist • Cadent is a leading provider of TV advertising and data solutions, reaching over 140 million homes and trusted by the world’s largest service providers.
Demo!
Demo Scenario • Application: Targeted ads using demographic information tied to cookies • Problem: The information is incomplete • Solution: Estimate the missing values – Treat the problem as a matrix completion problem
Data • The U.S. Census Public Use Microdata Sample (PUMS) data set for 2010 • 10% sample of the U.S. population – We’ll use just California today • Use this full data set to generate synthetic incomplete data
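One way to picture the last bullet, as a minimal sketch only: start from a complete table and hide a random subset of cells. The function name `hide_entries`, the 25% missing fraction, and the tiny toy table are illustrative assumptions; this is not the demo's code and the values are not PUMS data.

```python
import random

def hide_entries(full, missing_fraction, seed=0):
    """Return a copy of `full` with a random subset of cells set to None.

    Hypothetical helper for illustration; `full` itself is not modified.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    cells = [(i, j) for i, row in enumerate(full) for j in range(len(row))]
    n_hidden = int(round(missing_fraction * len(cells)))
    hidden = set(rng.sample(cells, n_hidden))
    return [[None if (i, j) in hidden else v for j, v in enumerate(row)]
            for i, row in enumerate(full)]

# Toy stand-in for a complete table (rows = people, columns = fields).
full = [[1.0, 0.0, 3.0, 2.0],
        [0.0, 1.0, 1.0, 4.0],
        [2.0, 2.0, 0.0, 1.0]]

# Hide a quarter of the cells to get synthetic incomplete data.
incomplete = hide_entries(full, missing_fraction=0.25)
```

Keeping the full table around is what makes the synthetic setup useful: estimates for the hidden cells can later be compared against the true values.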
Demo Scenario • Application: Identify products that are complementary (often purchased together) • Problem: Customers are not currently buying the best complements at the same time • Solution: Suggest new product pairings – Treat the problem as a matrix completion problem
Matrix Factorization • Demographics matrix: rows are users (i), columns are demographic fields (j); each cell holds the value of demographic field j for customer i • Top Factor × Left Factor: Multiply these two factors to produce a less-sparse matrix. • New nonzero values become interpolated demographic information
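The factorization idea on this slide can be sketched in a few lines of plain Python. This is a hedged illustration, not the algorithm used in the demo: it fits a left factor and a top factor to the observed cells with stochastic gradient descent (a swapped-in, generic technique), then multiplies them to fill every cell. The names `factorize` and `multiply`, the hyperparameters, and the toy rank-1 matrix are all assumptions made for the example.

```python
import random

def factorize(observed, rank=1, steps=2000, lr=0.01, reg=0.01, seed=0):
    """Fit left (n x rank) and top (rank x m) factors to the observed cells.

    `observed` is a list of rows; None marks a missing cell.
    Hypothetical helper: plain SGD on squared error with L2 regularization.
    """
    rng = random.Random(seed)
    n, m = len(observed), len(observed[0])
    left = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    top = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(rank)]
    # Only the known cells contribute to the loss.
    cells = [(i, j, v) for i, row in enumerate(observed)
             for j, v in enumerate(row) if v is not None]
    for _ in range(steps):
        for i, j, v in cells:
            pred = sum(left[i][k] * top[k][j] for k in range(rank))
            err = v - pred
            for k in range(rank):
                l_ik, t_kj = left[i][k], top[k][j]  # values before the update
                left[i][k] += lr * (err * t_kj - reg * l_ik)
                top[k][j] += lr * (err * l_ik - reg * t_kj)
    return left, top

def multiply(left, top):
    """Dense product left x top; fills in every cell, including missing ones."""
    rank, m = len(top), len(top[0])
    return [[sum(row[k] * top[k][j] for k in range(rank)) for j in range(m)]
            for row in left]

# Toy matrix built as an outer product (rank 1), with two cells hidden.
observed = [[1.0, 2.0, 1.0, None],
            [2.0, 4.0, 2.0, 6.0],
            [3.0, None, 3.0, 9.0]]

left, top = factorize(observed)
filled = multiply(left, top)  # less-sparse matrix: every cell now has a value
```

The entries of `filled` at the positions that were `None` play the role of the interpolated values on the slide. How good they are depends on the chosen rank and on how well a low-rank model suits the data.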
Demo Part 1: Data wrangling
Demo Part 2: Custom algorithm
Key Points • SystemML, Spark, and Zeppelin work together • Linear algebra is great for data science • Customization is important
How to get Apache SystemML
The Apache SystemML Web Site • http://systemml.apache.org • Download the binary release! • Try out some tutorials! • Browse the source! • Contribute to the project!
THANK YOU. Please try out Apache SystemML! http://systemml.apache.org Special thanks to Nakul Jindal and Mike Dusenberry for helping with the demo!
