MongoDB in Data Science
How to convert a Pandas Proof-of-Concept to a scalable product and why MongoDB is the key to success!
Who I am Software Engineer Compiler Engineer Compiler Engineer LLVM contributor Software Engineer R/D Lead ML Engineer Backend Infrastructure Sr. ML Engineer
What will we learn?
● Understand existing tools for delivering Data Science projects and when to use them.
● Why MongoDB could be crucial for your product and business.
● How to easily productionize a Pandas Proof-of-Concept.
● How to use MongoDB while being open to other technologies.
Motivation
Key factors
● Speed of inference
● Speed of development
Key factors: Speed of inference
Feature Aggregation Model Prediction Service
Key factors: Speed of development
Research: Data Scientist
Productionization: Data/ML Engineer
What is Pandas?
The most popular Python framework for data manipulation and data wrangling in the Data Science community.
Source: numpy.org, scipy.org, matplotlib.org, scikit-learn.org, pandas.pydata.org
Source: Stack Overflow post by David Robinson
Why use Pandas DataFrames?
Drawbacks of Pandas
● Has no persistence layer
● Doesn’t support primary and secondary indexes
  ○ As a result, not efficient for querying
● Doesn’t support multi-threading
Productionization options Real time service Batch Job Slow Inference Fast Inference
Real time service demo (recommendation) Inference 1 Event Store Inference 2 Inference N Model Training Job Model store
Real time service demo (recommendation) Event Store Feature Aggregation Model Inference Inference Service request respond
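The feature aggregation step of the inference service can be pushed into MongoDB as an aggregation pipeline instead of being computed in Pandas on every request. The following is only a minimal sketch of that idea, not the demo code from the talk: the field names `user_id` and `created_at`, the 30-day window, and the helper names are assumptions made for illustration, and `events` is assumed to be a pymongo `Collection` holding the event store.

```python
from datetime import datetime, timedelta, timezone


def build_feature_pipeline(user_id, now, window_days=30):
    """Return a pipeline that counts one user's recent events per event type.

    Field names (user_id, created_at, event_type) are illustrative assumptions.
    """
    since = now - timedelta(days=window_days)
    return [
        # Narrow to one user's recent events first, so that an index on
        # (user_id, created_at) can serve this stage.
        {"$match": {"user_id": user_id, "created_at": {"$gte": since}}},
        # Emit one document per event type with the number of occurrences.
        {"$group": {"_id": "$event_type", "count": {"$sum": 1}}},
    ]


def aggregate_features(events, user_id, now=None):
    """Run the pipeline and flatten the result into {event_type: count}.

    `events` is assumed to be a pymongo Collection; any object exposing
    an `aggregate(pipeline)` method works.
    """
    now = now or datetime.now(timezone.utc)
    rows = events.aggregate(build_feature_pipeline(user_id, now))
    return {row["_id"]: row["count"] for row in rows}
```

The resulting dictionary could then be turned into the feature vector that the model expects, which keeps the request path down to one indexed query plus one model call.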
Things to avoid
● Querying a collection that has no indexes
● Putting indexes on every field
● Reading from and writing to the same replica
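As a rough illustration of these points, the sketch below creates a single compound index that matches one query pattern (rather than indexing every field) and builds a connection string that routes reads to secondary members while writes still go to the primary. The field names, host names, and helper names are hypothetical; `events` is assumed to be a pymongo `Collection`. Keep in mind that reads from secondaries can return slightly stale data.

```python
def ensure_event_indexes(events):
    """Create one compound index matching the service's main query.

    `events` is assumed to be a pymongo Collection. The fields user_id and
    created_at are illustrative assumptions.
    """
    # 1 = ascending, -1 = descending: supports "latest events for a user".
    return events.create_index([("user_id", 1), ("created_at", -1)])


def read_from_secondaries_uri(hosts, replica_set):
    """Build a MongoDB URI whose reads prefer secondary replica set members.

    Writes are always sent to the primary, so inference reads and event
    writes end up on different members when a secondary is available.
    """
    return (
        "mongodb://" + ",".join(hosts)
        + "/?replicaSet=" + replica_set
        + "&readPreference=secondaryPreferred"
    )
```

For example, `read_from_secondaries_uri(["db1:27017", "db2:27017"], "rs0")` yields a URI that could be passed to `pymongo.MongoClient` by the inference service, while the event consumers keep using the default primary read preference.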
But… we generate tons of user events! Is this solution going to work for us?
Typical data pipeline
user events
Consumer 1, Consumer 2, Consumer N
MongoDB, Postgres, DFS
Shrink down the amount of data
● MongoDB: TTL index
● Consumer: Filters (event_type ...)
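Both ways of shrinking the data can be sketched in a few lines: the consumer drops events whose `event_type` the model does not need, and a TTL index lets MongoDB expire old events on its own. This is a sketch under assumptions, not the talk's implementation: the set of kept event types, the `created_at` field, the 30-day retention, and the function names are all hypothetical, and `events` is assumed to be a pymongo `Collection`.

```python
# Hypothetical set of event types that the model actually uses.
KEPT_EVENT_TYPES = frozenset({"view", "add_to_cart", "purchase"})


def keep_event(event, kept_types=KEPT_EVENT_TYPES):
    """Return True if the event's event_type is one the model needs."""
    return event.get("event_type") in kept_types


def ensure_ttl_index(events, days=30):
    """Ask MongoDB to delete events once created_at is older than `days`.

    `events` is assumed to be a pymongo Collection, and created_at must be
    stored as a date for the TTL index to apply.
    """
    return events.create_index("created_at", expireAfterSeconds=days * 24 * 60 * 60)


def consume(stream, events):
    """Filter a batch of events and store only the relevant ones."""
    kept = [event for event in stream if keep_event(event)]
    if kept:  # insert_many rejects an empty list of documents
        events.insert_many(kept)
    return len(kept)
```

Filtering in the consumer reduces what is written in the first place, while the TTL index bounds how much history stays in the collection, so the two complement each other.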
Training Job Inference 1 Event Store Inference 2 Inference N Model Training Job Model store
Source: mongodb.com
Model Training job
MongoDB Connector, Event Store, Model Training Job
Inference as a batch job
MongoDB Connector, Event Store, Inference Job
Flexibility
● Spark DataFrame
● MongoDB Aggregate
● Pandas DataFrame
Batch Job versus Real Time Service
Real Time Service
● Pros: On demand (scales as needed)
● Cons: Harder to develop and maintain
Batch Job
● Pros: Easier to develop and maintain
● Cons: Constantly utilizing resources
Benefits of MongoDB
● Schema-less
● Horizontally scalable
● Available as PaaS from many vendors
● Has a huge community
● Easier to hire people
Summary
● Makes it possible to provide a real-time experience
● Could help save expensive computational resources
● Provides a way to do real-time as well as batch inference
We are hiring!
careers.shopbonsai.ca
References
● https://stackoverflow.blog/2017/09/14/python-growing-quickly/
● https://www.mongodb.com/products/spark-connector
● https://pandas.pydata.org/
● https://scikit-learn.org/
● https://matplotlib.org/
● https://www.scipy.org/
● https://www.numpy.org/
● https://iconscout.com/icon/device-management-mobile-computer-seo-tool-analyze-7
Thanks!

MongoDB World 2019: MongoDB in Data Science: How to Build a Scalable Product Using MongoDB