Simplifying AI integration on Apache Spark

Simplifying AI Integration on Spark
Hemshankar Sahu, Principal Software Engineer @ Informatica

About Speaker
Hemshankar Sahu, Principal Software Engineer @ Informatica
▪ M.Tech. in Computer Science and Engineering from IIT Roorkee
▪ 9+ years of experience in the IT industry as a full-stack developer and ML engineer
▪ Currently developing a framework to help integrate machine learning algorithms and models into production systems

About Informatica
▪ Enterprise Cloud Data Management leader
▪ 9,500+ customers
▪ 18 trillion cloud transactions per month
▪ 85% of Fortune 100
▪ A Leader in five Gartner Magic Quadrants

Agenda
▪ Context for the Talk
  ▪ Personas Involved
  ▪ Informatica On Spark
▪ Problem Details
  ▪ AI/ML Integration Problems
▪ Solution Details
  ▪ New Offering: AISR
  ▪ Simplifying AI/ML integration on Spark
▪ Demo
  ▪ Deploying, Integration, Auto CI-CD of AI Solutions
▪ Summary

Personas Involved
Data Scientist vs Data Engineer: personas involved in operationalizing ML algorithms
▪ Tasks
  ▪ Data Scientist: data exploration, model building, model training
  ▪ Data Engineer: data ingestion, data pre-processing, transformation and cleansing
▪ Languages
  ▪ Data Scientist: Python, R, Lisp
  ▪ Data Engineer: SQL, Scala, Java/Python
▪ Tools
  ▪ Data Scientist: notebooks, RStudio, MATLAB
  ▪ Data Engineer: Spark, data engineering tools (like Informatica)
▪ Libraries
  ▪ Data Scientist: TensorFlow, Keras, Pandas, scikit-learn
  ▪ Data Engineer: Hadoop, Spark

Informatica On Spark
Informatica Data Engineering Integration (DEI): a data engineering tool that uses Spark as its execution engine
▪ Generates Spark code
▪ Executes on the cluster

Informatica Intelligent Cloud Services: Cloud Data Integration Elastic
▪ Same, familiar Informatica design-time
▪ Enabling Spark serverless support for auto-scaling and provisioning
▪ Auto-scaling Spark cluster
▪ Deployed to your cloud network

AI/ML Integration Issues
Example problem use-case: Collaborating Data Engineers and Data Scientists
[Diagram labels: Data Science Team (Python Developer, Python Developer, R Developer); Master, V1, V2; Data Engineering Team (Informatica DEI); Spark Cluster nodes running Python 2.7; Python 3.6]
Issues
▪ Team Collaboration Required
  ▪ Data Scientists and Data Engineers invest time to collaborate
▪ Manually Deploy the Binaries
  ▪ Downtime for each new version
▪ No Support for Different Runtimes

New Offering: AISR
AI Solutions Repository (AISR)
▪ Repository of AI Solutions
▪ A Solution is
  ▪ Code and Metadata
  ▪ Dependencies
  ▪ Runtime Details
▪ A Solution can
  ▪ Be in any language*
  ▪ With any dependency
  ▪ Run on GPU**
* Only Python supported in current release
** Provided the hardware is present, the drivers are installed, and the solution contains the respective code
[Diagram: AISR example]
▪ Runtimes: Tensorflow_Numpy, Sickitlearn_OpenCV
▪ Solutions: Sentiment Analysis, Image Processing, Image Classification, Image To Text
▪ Each Solution contains
  ▪ AISR-generated code for executing from various platforms
  ▪ Solution code, can be in any language
  ▪ Dependencies: files, installed software, etc.
▪ Based on a general Solutions Repository
▪ Languages: CPP, Python, R, Java
▪ Platforms: DEI, Spark, REST, Java
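As a rough illustration of the "A Solution is" bullets, the sketch below models a solution record as code plus metadata, dependencies, and runtime details. The class and field names (`Solution`, `Runtime`, `entry_point`, `needs_gpu`) and the file names are hypothetical and are not AISR's actual schema; only the solution and runtime names are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Runtime:
    """Hypothetical runtime record: the environment a solution needs."""
    name: str                 # e.g. "Tensorflow_Numpy" from the slide
    language: str             # only Python is supported in the current release
    needs_gpu: bool = False   # GPU use also requires hardware and drivers


@dataclass
class Solution:
    """Hypothetical solution record: code, metadata, dependencies, runtime."""
    name: str
    version: str
    entry_point: str                      # assumed: path to the solution code
    runtime: Runtime
    dependencies: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

    def describe(self) -> str:
        """Return a one-line summary of the solution."""
        return (f"{self.name} {self.version} on "
                f"{self.runtime.name} ({self.runtime.language})")


sentiment = Solution(
    name="Sentiment Analysis",
    version="V1",
    entry_point="sentiment/predict.py",          # illustrative path
    runtime=Runtime(name="Tensorflow_Numpy", language="python"),
    dependencies=["model.h5"],                   # illustrative file name
    metadata={"owner": "data-science-team"},
)
print(sentiment.describe())
```

Keeping the runtime as a separate record mirrors the slide's split between Runtimes and Solutions, so that several solutions could point at the same runtime.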

Simplifying AI/ML integration on Spark
Example use-case solution: Collaborating Data Scientists and Data Engineers
[Diagram labels: Data Science Team (Python Developer, Python Developer, R Developer; Python 2.7, Python 3.6); Master, V1, V2; AISR (Runtime-1, Runtime-2, Runtime-3); Data Engineering Team (Informatica DEI); Cluster nodes each holding a V1 Runtime]
Benefits
▪ Minimum Collaboration
  ▪ Between Data Scientist and Data Engineer
▪ Auto Deploy of new Version
  ▪ No Downtime
▪ Multiple Versions Support
  ▪ Different versions of the same solution can be used
▪ Support for Different Runtimes
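One way to picture the "Auto Deploy", "No Downtime", and "Multiple Versions Support" benefits is a registry in which deploying a new version only changes which version new lookups receive, while older versions stay available to callers that pin them. The sketch below is a minimal in-memory illustration under that assumption; `SolutionRegistry`, `deploy`, and `resolve` are hypothetical names, not AISR's API.

```python
import threading
from typing import Callable, Dict, Optional

Predictor = Callable[[str], str]


class SolutionRegistry:
    """Hypothetical in-memory registry; not AISR's real implementation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._versions: Dict[str, Dict[str, Predictor]] = {}
        self._latest: Dict[str, str] = {}

    def deploy(self, name: str, version: str, predictor: Predictor) -> None:
        """Register a version and make it the default for new lookups."""
        with self._lock:
            self._versions.setdefault(name, {})[version] = predictor
            self._latest[name] = version

    def resolve(self, name: str, version: Optional[str] = None) -> Predictor:
        """Return a pinned version, or the most recently deployed one."""
        with self._lock:
            chosen = version or self._latest[name]
            return self._versions[name][chosen]


registry = SolutionRegistry()
registry.deploy("image-classification", "V1", lambda path: f"V1:{path}")
registry.deploy("image-classification", "V2", lambda path: f"V2:{path}")

latest = registry.resolve("image-classification")        # follows new deploys
pinned = registry.resolve("image-classification", "V1")  # stays on V1
print(latest("cat.png"), pinned("cat.png"))
```

In this sketch a consumer that resolves by name picks up V2 on its next lookup without any change on its side, which is the behaviour the slide attributes to the data engineering environment.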

Demo Use Case
Easy Collaboration, No Downtime and CI-CD
[Diagram labels: AISR, DEI, Data Scientist, Data Engineer, Image Classification]

Simplified Integration In Action
[Diagram]
▪ AI Solutions Repo
  ▪ Runtimes: Python + TF + OpenCV, R Eco System
  ▪ Solutions: Image To Text V1, Object Detection V1
  ▪ Each Solution contains
    ▪ DEI-generated Java code for executing at Spark executors
    ▪ INFA wrapper and core code, can be in any language
    ▪ Dependencies: files, installed software, etc.
▪ Informatica: Data Scientist, Data Engineer, Mapping
▪ Cluster: YARN, Spark Job (Executor 1, Executor 2), Node 1, Node 2, Node 3, HDFS, Cached Binaries
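The idea of running a solution at the executors against cached binaries can be sketched with the common per-partition pattern: load the model once per partition, then score each row. This is a pure-Python illustration, not the Java code that DEI generates; `load_cached_model` and `score_partition` are hypothetical names, and the "model" is a trivial stand-in.

```python
from typing import Callable, Iterable, Iterator, Tuple

Model = Callable[[str], str]


def load_cached_model() -> Model:
    """Stand-in for loading solution binaries cached on the node (assumed)."""
    return lambda image_path: "cat" if "cat" in image_path else "unknown"


def score_partition(rows: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Load the model once per partition, then score every row in it."""
    model = load_cached_model()
    for image_path in rows:
        yield image_path, model(image_path)


# Local simulation of two partitions. On a cluster, a function of this shape
# could be passed to PySpark's RDD.mapPartitions instead.
partitions = [["cat_1.png", "dog_1.png"], ["cat_2.png"]]
results = [pair for part in partitions for pair in score_partition(part)]
print(results)
```

Loading once per partition rather than once per row keeps the model start-up cost proportional to the number of partitions, which matters when the runtime is heavy.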

Demo Recap
▪ Easily Created Solution
  ▪ Easily added a new AI Solution from Jupyter Notebook
  ▪ Explored the details of the added Solution
▪ Deployed and Tested
  ▪ Added Solution was deployed
  ▪ Explored various consumption options
  ▪ Created REST Endpoint and used it for testing
▪ Easily Integrated with Spark
  ▪ Created a mapping job using Informatica
  ▪ Created new Transformation to use the Deployed Solution
  ▪ Ran the mapping on Spark with the selected Solution
▪ CI-CD
  ▪ Retrained the Solution with a few clicks
  ▪ Used the retrained Solution without any changes or downtime

Summary
▪ Data Scientist vs Data Engineer
  ▪ Collaboration is challenging and time consuming
▪ Easy Spark Job Creation using DEI
  ▪ Drag and Drop way of Spark Job Creation
▪ Easy Spark-AI Solution Integration using AISR
  ▪ Minimum Collaboration
  ▪ Processing happens at Spark Scale within Spark Cluster
  ▪ Better performance as compared to other serving platforms
▪ Inbuilt CI-CD for AI Solutions
  ▪ No downtime for Solution upgrades
  ▪ No changes required from Data Engineering environment
▪ AISR Framework
  ▪ Based on Generic Solutions Repository Implementation
  ▪ Partners can develop plugins to add or consume AI Solutions
▪ Overall Production Cost Reduction

Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
