Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Kai Huang, Intel Corporation
Jason Dai, Intel Corporation
Agenda
- Overview of Analytics Zoo
- Introduction to Ray
- Motivations for Ray On Apache Spark
- Implementation details and API design
- Real-world use cases
- Conclusion
AI on Big Data: Accelerating Data Analytics + AI Solutions at Scale

BigDL: Distributed, High-Performance Deep Learning Framework for Apache Spark*
https://github.com/intel-analytics/bigdl

Analytics Zoo: Unified Analytics + AI Platform for TensorFlow*, PyTorch*, Keras*, BigDL, Ray* and Apache Spark*
https://github.com/intel-analytics/analytics-zoo
Analytics Zoo: Unified Data Analytics and AI Platform for distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
https://github.com/intel-analytics/analytics-zoo

- Models & Algorithms: Recommendation, Time Series, Computer Vision, NLP
- Automated ML Workflow: AutoML for Time Series, Automatic Cluster Serving
- Integrated Analytics & AI Pipelines: Distributed TensorFlow & PyTorch on Spark, Spark Dataframes & ML Pipelines for DL, RayOnSpark, InferenceModel
- Compute Environment: Laptop, K8s Cluster, Hadoop Cluster, Spark Cluster; Python Libraries (Numpy/Pandas/sklearn/…), DL Frameworks (TF/PyTorch/OpenVINO/…), Distributed Analytics (Spark/Flink/Ray/…)

Powered by oneAPI
Unified Data Analytics and AI Platform
Seamless scaling from laptop to distributed big data clusters:
- Easily prototype end-to-end pipelines that apply AI models to big data.
- "Zero" code change from laptop to distributed cluster.
- Seamlessly deploy on production Hadoop/K8s clusters.
- Automate the process of applying machine learning to big data.

Production data pipeline: prototype on a laptop using sample data, experiment on clusters with historical data, then deploy to production with a distributed data pipeline.
Ray
Ray is a fast and simple framework for building and running distributed applications. Ray Core provides an easy Python interface for parallelism using remote functions and actors.

import ray
ray.init()

# Remote functions: stateless tasks executed in parallel
@ray.remote(num_cpus=1)
def f(x):
    return x * x

ray.get([f.remote(i) for i in range(5)])

# Remote actors: stateful workers whose methods execute in parallel
@ray.remote(num_cpus=1)
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counters = [Counter.remote() for i in range(5)]
ray.get([c.increment.remote() for c in counters])
Ray is packaged with several high-level libraries to accelerate machine learning workloads:
- Tune: Scalable Experiment Execution and Hyperparameter Tuning
- RLlib: Scalable Reinforcement Learning
- RaySGD: Distributed Training Wrappers
https://github.com/ray-project/ray/
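To make the Tune library concrete, here is a minimal hyperparameter-search sketch. It assumes a Ray version (roughly the 1.x line) where tune.run and tune.report are the launch and reporting APIs; the objective is a toy function of one hyperparameter rather than a real training loop.

# Minimal Ray Tune sketch (assumes Ray ~1.x APIs: tune.run / tune.report).
from ray import tune

def objective(config):
    # pretend the "loss" depends on the sampled learning rate
    loss = (config["lr"] - 0.05) ** 2
    tune.report(loss=loss)

analysis = tune.run(
    objective,
    config={"lr": tune.grid_search([0.001, 0.01, 0.05, 0.1])},
)
print(analysis.get_best_config(metric="loss", mode="min"))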
Motivations for RayOnSpark
- Demand to embrace emerging AI technologies on production data.
- Effort required to directly deploy Ray applications on existing Hadoop/Spark clusters.
- Challenge of preparing the Python environment on each node without modifying the cluster.
- Need for a unified system for big data analytics and advanced AI applications.
Implementation of RayOnSpark: Launch Ray* on Apache Spark*
- Leverage conda-pack and Spark for runtime distribution of Python packages.
- A RayContext on the Spark driver launches Ray across the cluster; Ray processes exist alongside the Spark executors.
- For each Spark executor, a Ray Manager is created to manage the Ray processes.
- In-memory Spark RDDs or DataFrames can be consumed directly by Ray applications, so Ray applications integrate seamlessly into Spark data processing pipelines.

A conceptual sketch of the launch pattern follows this list.
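As a rough illustration of the idea (not RayOnSpark's actual internals), one way to co-locate Ray with Spark is to run a Spark job whose tasks start a local Ray process on every executor. Here sc is the SparkContext created in the next slide, num_executors is assumed from its configuration, and the head-node address is a placeholder.

# Conceptual sketch only -- the real implementation differs in detail
# (e.g. it also handles Python env distribution via conda-pack).
import subprocess

def start_ray_worker(_):
    # "ray start" attaches this executor's node to an existing Ray head
    # node; the address below is a placeholder, not a real endpoint.
    subprocess.run(["ray", "start", "--address", "<head-ip>:6379"])
    yield 0

# One partition per executor, so each executor starts exactly one Ray worker.
sc.parallelize(range(num_executors), num_executors) \
  .mapPartitions(start_ray_worker) \
  .collect()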
Interface of RayOnSpark
Three-step programming with minimum code changes:
1. Initiate a SparkContext (or use an existing one).
2. Initiate a RayContext.
3. Shut down the RayContext and SparkContext after the tasks finish.

More instructions at: https://analytics-zoo.github.io/master/#ProgrammingGuide/rayonspark/

import ray
from zoo import init_spark_on_yarn
from zoo.ray import RayContext

# RayOnSpark code
sc = init_spark_on_yarn(hadoop_conf, conda_name, num_executors, executor_cores, …)
ray_ctx = RayContext(sc, object_store_memory, …)
ray_ctx.init()

# Pure Ray code
@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counters = [Counter.remote() for i in range(5)]
ray.get([c.increment.remote() for c in counters])

# RayOnSpark code
ray_ctx.stop()
sc.stop()
Use Cases of RayOnSpark
Scalable AutoML for time series prediction.
- Automates the feature generation, model selection and hyperparameter tuning processes.
- See more at: https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/automl

Source: Yao, Q., Wang, M., et al. "Taking the Human out of Learning Applications: A Survey on Automated Machine Learning."
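To give a feel for what such a search automates, below is a deliberately tiny, self-contained toy (not the zoo.automl API) that sweeps the three dimensions named above -- features, model choice, hyperparameters -- as parallel Ray tasks; the real library searches far richer spaces via Ray Tune.

# Toy AutoML-style search over (feature window, model degree) on a
# synthetic series, with each candidate evaluated as a parallel Ray task.
import itertools
import numpy as np
import ray

ray.init()

t = np.arange(200, dtype=float)
series = np.sin(t / 10) + 0.1 * np.random.randn(200)

@ray.remote
def evaluate(window, degree):
    # "feature generation": lagged windows of length `window`
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    # "model selection / hyperparameter tuning": polynomial fit of `degree`
    # on the window mean (a deliberately tiny model family)
    feats = X.mean(axis=1)
    coeffs = np.polyfit(feats, y, degree)
    pred = np.polyval(coeffs, feats)
    return window, degree, float(np.mean((pred - y) ** 2))

results = ray.get([evaluate.remote(w, d)
                   for w, d in itertools.product([5, 10, 20], [1, 2, 3])])
print(min(results, key=lambda r: r[2]))  # best (window, degree, mse)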
Use Cases of RayOnSpark
Data-parallel pre-processing and distributed training pipelines for deep neural networks.
- Use PySpark or Ray for parallel data loading and processing.
- Use RayOnSpark to implement thin wrappers that automatically set up the distributed environment.
- Run distributed training with either framework-native modules or Horovod (from Uber) as the backend.
- Users only need to write a single-node training script and make minimal code changes to achieve distributed training; a conceptual sketch of the data-parallel pattern follows this list.
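The sketch below illustrates the underlying data-parallel pattern with plain Ray actors and a hand-rolled synchronous-SGD loop on a linear model; it is conceptual only. The actual wrappers delegate to framework-native distributed modules or Horovod, as noted above.

# Conceptual data-parallel training with plain Ray actors: each worker holds
# a data shard and computes a gradient on the current weights; the driver
# averages the gradients and takes a step (synchronous SGD).
import numpy as np
import ray

ray.init()

@ray.remote
class Worker:
    def __init__(self, X, y):
        self.X, self.y = X, y

    def gradient(self, w):
        err = self.X @ w - self.y
        return self.X.T @ err / len(self.y)

# Synthetic dataset split into 4 shards, one per worker.
X = np.random.randn(400, 8)
w_true = np.random.randn(8)
y = X @ w_true
workers = [Worker.remote(X[i::4], y[i::4]) for i in range(4)]

w = np.zeros(8)
for _ in range(100):
    grads = ray.get([wk.gradient.remote(w) for wk in workers])
    w -= 0.1 * np.mean(grads, axis=0)  # average gradients, take a step

print(np.allclose(w, w_true, atol=1e-2))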
Drive-thru Recommendation System at Burger King
- Burger King performs Spark ETL tasks first, followed by distributed MXNet training.
- Similar to RaySGD, we implement a lightweight shim layer around native MXNet modules for easy deployment on a YARN cluster (see the sketch below).
- The entire pipeline runs on a single cluster; no extra data transfer is needed.
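As a hedged sketch of what such a shim layer can look like: MXNet's parameter-server mode is configured through DMLC_* environment variables (its standard launcher convention), so a Ray actor can set those and then run an otherwise unmodified training function. The actor and method names here are illustrative, not Analytics Zoo's actual MXNet runner.

# Illustrative shim: each Ray actor takes on one MXNet role and runs the
# user's single-node training function against a distributed kvstore.
import os
import ray

@ray.remote
class MXNetRunner:
    def setup(self, role, scheduler_ip, scheduler_port, num_workers, num_servers):
        os.environ["DMLC_ROLE"] = role  # "worker", "server" or "scheduler"
        os.environ["DMLC_PS_ROOT_URI"] = scheduler_ip
        os.environ["DMLC_PS_ROOT_PORT"] = str(scheduler_port)
        os.environ["DMLC_NUM_WORKER"] = str(num_workers)
        os.environ["DMLC_NUM_SERVER"] = str(num_servers)

    def train(self, train_fn):
        import mxnet as mx
        # a "dist_sync" kvstore makes single-node training code data-parallel
        kv = mx.kv.create("dist_sync")
        return train_fn(kv)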
Conclusion
RayOnSpark: Running Emerging AI Applications on Big Data Clusters with Ray and Analytics Zoo
https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a

More information on Analytics Zoo:
▪ https://github.com/intel-analytics/analytics-zoo
▪ https://analytics-zoo.github.io/

We are working on full support and more out-of-the-box solutions for easily scaling out Python AI pipelines from a single node to a cluster, based on Ray and Spark.
Accelerate Your Data Analytics & AI Journey with Analytics Zoo
Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.