Simplifying AI Integration on Spark
Hemshankar Sahu, Principal Software Engineer @ Informatica
About Speaker
Hemshankar Sahu, Principal Software Engineer @ Informatica
▪ M. Tech. in Computer Science and Engineering from IIT Roorkee
▪ 9+ years of experience in the IT industry as a Full Stack Developer and ML Engineer
▪ Currently developing a framework to help integrate Machine Learning algorithms and models into production systems
About Informatica
▪ Enterprise Cloud Data Management leader
▪ 9,500+ customers
▪ 18 trillion cloud transactions per month
▪ 85% of Fortune 100
▪ A Leader in five Gartner Magic Quadrants
Agenda
▪ Context for the Talk
  ▪ Personas Involved
  ▪ Informatica On Spark
▪ Problem Details
  ▪ AI/ML Integration Problems
▪ Solution Details
  ▪ New Offering: AISR
  ▪ Simplifying AI/ML Integration on Spark
▪ Demo
  ▪ Deploying, Integration, Auto CI-CD of AI Solutions
▪ Summary
Context for the Talk
Personas Involved
Data Scientist vs Data Engineer: personas involved in operationalizing ML algorithms
▪ Tasks
  ▪ Data Scientist: Data Exploration, Model Building, Model Training
  ▪ Data Engineer: Data Ingestion, Data Pre-processing, Transformation and Cleansing
▪ Languages
  ▪ Data Scientist: Python, R, Lisp
  ▪ Data Engineer: SQL, Scala, Java/Python
▪ Tools
  ▪ Data Scientist: Notebook, R Studio, Matlab
  ▪ Data Engineer: Spark, Data Engineering Tools (like Informatica)
▪ Libraries
  ▪ Data Scientist: TensorFlow, Keras, Pandas, scikit-learn
  ▪ Data Engineer: Hadoop, Spark
Informatica On Spark
Informatica Data Engineering Integration (DEI)
▪ Data Engineering tool which uses Spark as its execution engine
▪ Generates Spark code
▪ Executes on the cluster
Same, familiar Informatica design-time
▪ Informatica Intelligent Cloud Services: Cloud Data Integration Elastic
▪ Enabling Spark serverless support for auto-scaling and provisioning
▪ Auto-scaling Spark cluster
▪ Deployed to your cloud network
Problem Details
AI/ML Integration Issues
Example problem use-case: Collaborating Data Engineers and Data Scientists
[Diagram: the Data Science Team (two Python Developers and an R Developer, using Python 2.7 and Python 3.6) produces versions V1 and V2; the Data Engineering Team runs Informatica DEI on a Spark Cluster whose Master and worker nodes have Python 2.7; how V2 and the other runtimes reach the cluster is marked "?".]
Issues
▪ Team Collaboration Required
  ▪ Data Scientists and Data Engineers invest time to collaborate
▪ Manual Deployment of the Binaries
  ▪ Downtime for each new version
▪ No Support for Different Runtimes
Solution Details
New Offering: AISR (AI Solutions Repository)
▪ Repository of AI Solutions
▪ A Solution is
  ▪ Code and Metadata
  ▪ Dependencies
  ▪ Runtime Details
▪ A Solution can
  ▪ Be in any language*
  ▪ Have any dependency
  ▪ Run on GPU**
* Only Python is supported in the current release
** Provided the hardware is present, the drivers are installed, and the solution contains the respective code
[Diagram: AISR]
▪ Runtimes: Tensorflow_Numpy, Sickitlearn_OpenCV
▪ Solutions: Sentiment Analysis, Image Processing, Image Classification, Image To Text
▪ Each Solution contains
  ▪ AISR-generated code for executing from various platforms
  ▪ Solution code, which can be in any language
  ▪ Dependencies: files, installed software etc.
[Diagram: Example based on a General Solutions Repository]
▪ Solutions added from: CPP, Python, R, Java
▪ Solutions consumed from: DEI, Spark, REST, Java
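The slide defines a Solution as code and metadata, dependencies, and runtime details. The talk does not show AISR's actual schema or API, so the following is only a minimal sketch of how such a record might be modeled; every class, field, and function name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Runtime:
    """Hypothetical description of an execution environment."""
    name: str                   # e.g. "Tensorflow_Numpy", as labeled on the slide
    language: str = "python"    # the slide notes only Python is supported currently
    needs_gpu: bool = False


@dataclass(frozen=True)
class Solution:
    """Hypothetical record for one entry in a solutions repository."""
    name: str
    version: int
    entry_point: str            # illustrative "module:function" reference to the code
    runtime: Runtime
    dependencies: Tuple[str, ...] = ()   # files, installed software, etc.

    def key(self) -> str:
        """Identifier combining name and version, e.g. 'sentiment:v2'."""
        return f"{self.name}:v{self.version}"


def validate(solution: Solution, supported_languages=("python",)) -> None:
    """Raise ValueError if the solution's runtime language is not supported."""
    if solution.runtime.language not in supported_languages:
        raise ValueError(f"unsupported language: {solution.runtime.language}")
```

Keeping the runtime as a separate record reflects the slide's split between "Runtimes" and "Solutions": several solutions can point at the same runtime.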
Simplifying AI/ML Integration on Spark
Example use-case solution: Collaborating Data Scientists and Data Engineers
[Diagram: the Data Science Team (two Python Developers and an R Developer, using Python 2.7 and Python 3.6) publishes versions V1 and V2 with Runtime-1, Runtime-2 and Runtime-3 to AISR; Informatica DEI, used by the Data Engineering Team, runs V1 with its runtime on the Master and worker nodes of the Cluster.]
Benefits
▪ Minimum Collaboration
  ▪ Between Data Scientist and Data Engineer
▪ Auto Deploy of New Versions
  ▪ No Downtime
▪ Multiple Versions Support
  ▪ Different versions of the same solution can be used
▪ Support for Different Runtimes
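The benefits listed above (auto deploy, no downtime, multiple versions side by side) can be illustrated with a small version registry. This is a sketch under stated assumptions, not how AISR is implemented: `SolutionRegistry`, `deploy`, and `resolve` are hypothetical names, and a solution is reduced to a plain Python callable.

```python
import threading
from typing import Callable, Dict, Optional


class SolutionRegistry:
    """Hypothetical in-memory registry that keeps every deployed version."""

    def __init__(self) -> None:
        self._versions: Dict[str, Dict[int, Callable]] = {}
        self._lock = threading.Lock()

    def deploy(self, name: str, version: int, predict: Callable) -> None:
        # Adding a version never removes older ones, so callers pinned to
        # an earlier version keep working while the new one is rolled out.
        with self._lock:
            self._versions.setdefault(name, {})[version] = predict

    def resolve(self, name: str, version: Optional[int] = None) -> Callable:
        # version=None means "latest". The lookup happens on every call, so a
        # newly deployed version is picked up without restarting the consumer.
        with self._lock:
            versions = self._versions[name]
            chosen = max(versions) if version is None else version
            return versions[chosen]


registry = SolutionRegistry()
registry.deploy("sentiment", 1, lambda text: "v1:" + text)
latest_before = registry.resolve("sentiment")("ok")

registry.deploy("sentiment", 2, lambda text: "v2:" + text)
latest_after = registry.resolve("sentiment")("ok")
pinned = registry.resolve("sentiment", version=1)("ok")
```

Resolving the version at call time rather than at job-build time is the design choice that lets a consumer follow "latest" without any change on its side, while a pinned version stays available.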
Demo
Demo Use Case
Easy Collaboration, No Downtime and CI-CD
[Diagram: the Data Scientist adds an Image Classification solution to AISR; the Data Engineer consumes it from DEI.]
Simplified Integration In Action
[Diagram: AI Solutions Repo, Informatica DEI and the cluster]
▪ AI Solutions Repo (used by the Data Scientist)
  ▪ Runtimes: Python + TF + OpenCV, R Eco System
  ▪ Solutions: Image To Text V1, Object Detection V1
  ▪ Each Solution contains
    ▪ DEI-generated Java code for executing at Spark executors
    ▪ INFA wrapper and core code, which can be in any language
    ▪ Dependencies: files, installed software etc.
▪ Informatica DEI (used by the Data Engineer)
  ▪ Mapping submitted as a Spark Job
▪ Cluster
  ▪ YARN, HDFS
  ▪ Node 1, Node 2, Node 3
  ▪ Spark Job running on Executor 1 and Executor 2
  ▪ Cached Binaries
Demo Recap
▪ Easily Created Solution
  ▪ Easily added a new AI Solution from a Jupyter Notebook
  ▪ Explored the details of the added solution
▪ Deployed and Tested
  ▪ The added Solution was deployed
  ▪ Explored various consumption options
  ▪ Created a REST endpoint and used it for testing
▪ Easily Integrated with Spark
  ▪ Created a mapping job using Informatica
  ▪ Created a new Transformation to use the deployed Solution
  ▪ Ran the mapping on Spark with the selected Solution
▪ CI-CD
  ▪ Retrained the Solution with a few clicks
  ▪ Used the retrained Solution without any changes or downtime
Summary
Summary
▪ Data Scientist vs Data Engineer
  ▪ Collaboration is challenging and time-consuming
▪ Easy Spark Job Creation using DEI
  ▪ Drag-and-drop creation of Spark jobs
▪ Easy Spark-AI Solution Integration using AISR
  ▪ Minimum Collaboration
  ▪ Processing happens at Spark scale within the Spark cluster
  ▪ Better performance compared to other serving platforms
▪ Inbuilt CI-CD for AI Solutions
  ▪ No downtime during Solution upgrades
  ▪ No changes required in the Data Engineering environment
▪ AISR Framework
  ▪ Based on a Generic Solutions Repository implementation
  ▪ Partners can develop plugins to add or consume AI Solutions
▪ Overall Production Cost Reduction
Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
