Why Scala for Data Science?
HELLO! I am Guglielmo Iozzia. I am here because I love AI and the With the Best conference series. You can follow me at @GuglielmoIozzia.
Something about me
✘ Big Data Delivery Lead at (UHG)
✘ Previously at and of the UN
✘ Current fields of expertise are Big Data, ML/DL and DevOps
✘ Author of the upcoming book “Hands-on Deep Learning with Apache Spark”
✘ I love preparing home-made pizza
What is Scala? Let’s get everyone on the same page
The Scala PL
Scala is a programming language that blends object-oriented and functional programming concepts on the JVM.
Functional Programming
✘ In FP you write pure functions.
✘ Given the same input, a pure function always returns the same output and produces no side effects.
✘ A function is first-class: it can be used like any other value.
✘ That means it can be assigned to a variable, passed as a parameter to another function, or returned by a function.
Functional Programming in Scala
An example of functional programming in Scala.
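The properties listed on the Functional Programming slide can be illustrated with a minimal sketch. This is not the code from the original slide; the object and function names (`FunctionalSample`, `square`, `increment`, `applyTwice`, `multiplyBy`) are hypothetical and chosen only for illustration.

```scala
object FunctionalSample {
  // A pure function: the result depends only on the argument, with no side effects
  def square(x: Int): Int = x * x

  // A function assigned to a variable (functions are first-class values)
  val increment: Int => Int = x => x + 1

  // A function that takes another function as a parameter
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // A function that returns another function
  def multiplyBy(factor: Int): Int => Int = x => x * factor

  def main(args: Array[String]): Unit = {
    val triple = multiplyBy(3)
    println(square(4))                  // 16
    println(applyTwice(increment, 5))   // 7
    println(triple(7))                  // 21
    println(List(1, 2, 3).map(square))  // List(1, 4, 9)
  }
}
```

Passing `square` to `map` on the last line is the pattern that data-processing code in Scala relies on most: the transformation is a value handed to the collection.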
Why Scala for Data Science? Let’s move towards the main topic of this talk
The Python’s Temptation
When it comes to Data Science, the first programming language people consider is Python.
Here are three valid reasons to consider Scala.
#1 Robustness
Robustness and performance when it comes to production systems and large datasets.
#2 Integration
Most of the systems and tools in the Big Data/ML space run on the JVM.
Think about these systems, which you most probably have in your production tech stack. They all run on the JVM.
#3 Libraries
Good availability of production-ready open source ML/DL frameworks and libraries.
Scala Open Source Projects for AI/ML/DL
✘ Spark MLlib: Spark’s library for ML algorithms, feature extraction, dimensionality reduction, linear algebra, etc.
✘ ND4J: a linear algebra and matrix manipulation library that supports n-dimensional arrays and is integrated with Apache Hadoop and Spark.
✘ DeepLearning4J: a distributed deep learning framework written for Java and Scala. It is integrated with Hadoop and Apache Spark, for use on distributed GPUs and CPUs.
✘ BigDL: a distributed deep learning framework for Apache Spark, created at Intel.
✘ XGBoost: a scalable, portable and distributed Gradient Boosting library.
✘ PredictionIO: an Apache template system for creating machine learning engines.
✘ Smile: a fast and comprehensive machine learning system.
✘ Saddle: a high-performance data manipulation library.
✘ Deeplearning.scala: a simple library for creating complex neural networks. It can be used either in standalone JVM applications or in Jupyter Notebooks.
✘ ScalaNLP: a suite of ML and numerical computing libraries. It includes Breeze and Epic.
Code Examples Let’s get practical!
ND4J Example
ND4J aims to close the gap between JVM languages and Python in terms of the availability of powerful data analysis tools.

    import org.nd4j.linalg.factory.Nd4j

    object Nd4JScalaSample {
      def main(args: Array[String]): Unit = {
        // Create arrays using a NumPy-like syntax
        val arr1 = Nd4j.create(4)
        val arr2 = Nd4j.linspace(1, 10, 10)

        // Fill an array with the value 5 (equivalent to the fill method in NumPy)
        println(s"Assigned value of 5 to the array: ${arr1.assign(5)}")

        // Basic stats methods
        println(s"Mean of the array: ${Nd4j.mean(arr1)}")
        println(s"Standard deviation of the array: ${Nd4j.std(arr2)}")
        println(s"Variance of the array: ${Nd4j.`var`(arr2)}")
        ...
      }
    }
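For comparison, the statistics computed in the ND4J example can be sketched with plain Scala collections. This is only an illustration of the arithmetic, not how ND4J implements it; the object name `PlainScalaStats` is hypothetical, and the use of the bias-corrected (divide by n - 1) variance is an assumption about the default, not something stated in the slides.

```scala
object PlainScalaStats {
  // Arithmetic mean of a sequence of values
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sample variance (assumption: bias-corrected, dividing by n - 1)
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }

  // Standard deviation as the square root of the variance
  def std(xs: Seq[Double]): Double = math.sqrt(variance(xs))

  def main(args: Array[String]): Unit = {
    // Same values as a linspace from 1 to 10 with 10 points: 1.0, 2.0, ..., 10.0
    val values = (1 to 10).map(_.toDouble)
    println(s"Mean: ${mean(values)}")
    println(s"Standard deviation: ${std(values)}")
    println(s"Variance: ${variance(values)}")
  }
}
```

For these ten values the mean is 5.5 and the squared deviations sum to 82.5, so the sample variance is 82.5 / 9. A library such as ND4J becomes worthwhile once the data is n-dimensional or too large for this kind of collection code.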
DL4J Example (1 of 3): Multilayer Neural Network configuration in Scala with DL4J.
DL4J Example (2 of 3): Network initialization and training in Scala with DL4J.
DL4J Example (3 of 3): The DL4J web UI (training time).
Can Scala and Python co-exist in Data Science projects? Is there any bridge between these two worlds?
8,330,000: results of a generic Google search for models implemented through TensorFlow
139,000: results of a Google search for MNN models implemented through TensorFlow
120,000: results of a Google search for MNN examples implemented through TensorFlow
TensorFlow Pros and Cons
✘ Big community
✘ Lots of models, examples and use cases available
✘ Stunning features
Mostly Python. The Java API is currently experimental and is not covered by the TensorFlow API stability guarantees.
Keras to the Rescue
✘ It is an open source neural network library written in Python
✘ It can run on top of TensorFlow (and other backend engines)
✘ Easy prototyping
✘ Lightweight
✘ Can be used to import Python models to DL4J
TensorFlow + Keras + DL4J
Importing Keras Models into DL4J: example
DL4J provides a Java/Scala API to import a pre-trained TensorFlow model through Keras.
The imported model can then be used in a DL4J application implemented in Java or Scala only.
Conclusion Bridging the Gap between Data Engineers and Data Scientists
The Missing Link
Data Engineers
• Scala/Java skills and experience
• Hands-on experience with Big Data and Streaming tools (Hadoop, HBase, Spark, Kafka, Beam, etc.)
• DevOps mindset
• Attention to testing, performance, scalability
• Containerization
• Often no skills in ML/DL
Data Scientists
• Strong ML/DL skills
• Python and R users
• Good data understanding
• Model training and evaluation strategies
• Probably some knowledge of Big Data and Streaming tools
• No DevOps mindset
• Research more than production
To Leverage the Specific Skills of Each Team
[Diagram labels: Data Scientists, TensorFlow (Python), Keras, Scala (DL4J), Data Engineers]
Hands-on Deep Learning with Apache Spark
More on some topics covered in this talk can be found in this book. https://tinyurl.com/y9jkvtuy
THANK YOU! Any questions? You can find me at
✘ @GuglielmoIozzia
✘ https://ie.linkedin.com/in/giozzia
✘ googlielmo.blogspot.com/
✘ https://dzone.com/users/2532948/virtualramblas.html
Credits
Special thanks to all the people who made and released these awesome resources for free:
✘ Presentation template by SlidesCarnival
✘ The painting in slide 9 is a detail of “Eve Tempted” (1887) by John Roddam Spencer Stanhope
