Simulation in R and Python
The problem
- simulate 10,000,000 users
- simulate 10 scenarios
- total number of simulations: 10 × 10,000,000
- an embarrassingly parallel problem, since there is no need to exchange data between simulations
Things I tried
1. distribute simulations across instances
2. {doRedis}
3. Dask
I learned the best way to solve this problem the hard way.
The answer... for now
Dask was the fastest and easiest way to accomplish this task.
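A minimal sketch of what this can look like with Dask's distributed scheduler (the function simulate_users, the chunk size, and the scheduler setup are my assumptions, not the talk's actual code):

    from dask.distributed import Client

    client = Client()  # or Client("tcp://scheduler:8786") on a real cluster

    def simulate_users(scenario, start, stop):
        """Hypothetical stand-in for the stochastic user model."""
        ...

    chunk = 100_000
    futures = [
        client.submit(simulate_users, s, i, i + chunk)
        for s in range(10)                      # 10 scenarios
        for i in range(0, 10_000_000, chunk)    # 10,000,000 users each
    ]
    results = client.gather(futures)

Because the tasks share no data, the scheduler is free to hand each one to whichever worker is idle.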
Or was it?
Not quite... building the model from scratch would have been more difficult in Python than it was in R.
How it happened
1. I needed to simulate user behavior
2. R was chosen for modelling because it is well suited for development and experimentation
3. Computing the simulations took hours, so I decided to parallelize them
4. As a first step I made an Ansible script to run simulations on many instances
Problems with simple distributed simulations
Stochastic simulations do not finish at the same time, so with a fixed, even split of work, the total runtime is set by the slowest instance while the others sit idle (see the toy example below).
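A toy illustration of that straggler effect (entirely my own, not from the talk):

    import random, statistics

    random.seed(0)
    runtimes = [random.expovariate(1.0) for _ in range(1_000)]  # stochastic task durations
    n_instances = 10
    chunk_totals = [sum(runtimes[i::n_instances]) for i in range(n_instances)]

    print("mean chunk time:   ", statistics.mean(chunk_totals))
    print("slowest chunk time:", max(chunk_totals))  # this governs wall-clock time

A dynamic scheduler that hands out small tasks as workers free up, which is what both {doRedis} and Dask provide, avoids this.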
{doRedis} to the rescue!?
Pros:
- scheduler
- elastic
- easy to set up
- works in R
Cons:
- many things to choose from
- bugs that are difficult to figure out
{doRedis} to the rescue!? (pros)
- scheduler: this lets you use your cluster of instances very efficiently
- elastic: this allows using spot instances, so it is very cost-effective
- easy to set up: just installing Redis was enough
- works in R
{doRedis} to the rescue!? (cons)
- tuning: to get good performance I had to choose a good chunkSize, the best way to combine results with .combine, etc.
- bugs that are difficult to figure out: it was hard to see logs from the workers, and sometimes workers would not start for some reason
- difficulty setting up the cluster: to build a cluster you need to run a script on each instance; I made an Ansible script for this, but it was still work
Using Dask as a scheduler
Pros:
- easy to set up
- fast
Cons:
- aimed at Python scripts
- unmaintained tools
- some gotchas when running a cluster
Using Dask as a scheduler (pros)
- easy to set up: hammerlab/dask-distributed-on-kubernetes was used to run Dask on Kubernetes, and it was very easy. The only thing I had to make was a custom Docker image. The final Kubernetes YAML file can be found here. Also, since I was using Kubernetes, it was easy to change the number of nodes on the fly.
- fast: maybe because I was using Python and had refactored my R code
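The client side of such a setup might look roughly like this (the service name, port, and deployment name are my assumptions about the Kubernetes manifest, not details from the talk):

    from dask.distributed import Client

    # Connect to the scheduler exposed by a Kubernetes service
    # (service name and port are assumptions).
    client = Client("tcp://dask-scheduler:8786")

    # Changing the number of nodes on the fly then amounts to scaling
    # the worker deployment, e.g.:
    #   kubectl scale deployment dask-worker --replicas=20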
Using Dask as a scheduler (cons)
- aimed at Python scripts: when it comes to experimenting with data I think R's syntax is better, and I also like RStudio much more than Spyder. However, many functions I needed, such as those in {truncnorm}, were implemented in Python as well, and some R functions such as base::sample were easy to replicate in Python
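For example, the usual Python counterparts look like this (the concrete parameters are only illustrative, not the model's actual ones):

    import numpy as np
    from scipy.stats import truncnorm

    # R: truncnorm::rtruncnorm(n=1000, a=0, b=1, mean=0.5, sd=0.2)
    # scipy's truncnorm takes its bounds in standard-deviation units:
    a, b, mean, sd = 0.0, 1.0, 0.5, 0.2
    samples = truncnorm.rvs((a - mean) / sd, (b - mean) / sd,
                            loc=mean, scale=sd, size=1000)

    # R: base::sample(x, size=10, replace=TRUE, prob=p)
    x = np.arange(5)
    p = [0.1, 0.2, 0.3, 0.2, 0.2]
    draws = np.random.choice(x, size=10, replace=True, p=p)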
Using Dask as a scheduler (cons)
- unmaintained tools: I first tried using dask-ec2, but it didn't work. I tried to make it work by building my own AMI and looking at a problematic part of the code, but I decided to abandon it.
- some gotchas when running a cluster: Dask was easy to run on a single node, but the same code didn't work on a cluster:
  - you need to import and run multiprocessing.freeze_support
  - you need to upload modules with client.upload_file
  - some packages, such as dask and distributed, have to be of exactly the same version on the client and on the cluster
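Put together, those fixes look roughly like this (a sketch; mymodule.py stands in for whatever local modules the tasks import):

    import multiprocessing
    from dask.distributed import Client

    if __name__ == "__main__":
        multiprocessing.freeze_support()  # must run before any Dask work starts

        client = Client("tcp://dask-scheduler:8786")

        # ship local modules to every worker
        client.upload_file("mymodule.py")

        # verify that dask/distributed versions match across client,
        # scheduler, and workers (raises on mismatch)
        client.get_versions(check=True)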
Using Dask as a scheduler (cons)
- some gotchas when running a cluster (continued):
  - to persist the state of objects you need to make a threading.local variable, like this:

        import threading

        myenv = threading.local()
        # and then, inside a task, store per-thread state on it:
        # myenv.env = MyEnv()

  - you have to execute multiprocessing.freeze_support() before running your code
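Spelled out a little more (my own elaboration of the slide's fragment; MyEnv and simulate are hypothetical):

    import threading

    myenv = threading.local()

    class MyEnv:
        """Hypothetical stand-in for the stateful simulation environment."""
        def run(self, params):
            ...

    def simulate(params):
        # lazily create one MyEnv per worker thread and reuse it across
        # all the tasks that thread executes
        if not hasattr(myenv, "env"):
            myenv.env = MyEnv()
        return myenv.env.run(params)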
Final thoughts
I haven't looked at other parallel backends for R, so maybe {snow} would be satisfactory, although elasticity would be sacrificed.
About myself
Shunichi Otsuka
Working at Opt as an engineer/researcher
