Simulation in R and Python
The problem
- simulate 10,000,000 users
- simulate 10 scenarios
- total number of simulations: 10 × 10,000,000
- an embarrassingly parallel problem, since there is no need to exchange data between simulations
Things I tried
1. distribute simulations across instances
2. {doRedis}
3. Dask
I learned the best way to solve this problem the hard way.
The answer... for now
Dask was the fastest and easiest way to accomplish this task.
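A minimal sketch of what this can look like with Dask's distributed scheduler (the function simulate_users, the chunk size, and the scheduler setup are my assumptions, not the talk's actual code):

    from dask.distributed import Client

    client = Client()  # or Client("tcp://scheduler:8786") on a real cluster

    def simulate_users(scenario, start, stop):
        """Hypothetical stand-in for the stochastic user model."""
        ...

    chunk = 100_000
    futures = [
        client.submit(simulate_users, s, i, i + chunk)
        for s in range(10)                      # 10 scenarios
        for i in range(0, 10_000_000, chunk)    # 10,000,000 users each
    ]
    results = client.gather(futures)

Because the tasks share no data, the scheduler is free to hand each one to whichever worker is idle.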
Or was it?
Not quite... building the model from scratch would have been more difficult in Python than it was in R.
How it happened
1. I needed to simulate user behavior
2. R was chosen for modelling because it is well suited for development and experimentation
3. Computing the simulations took hours, so I decided to parallelize them
4. As a first step I made an Ansible script to run simulations on many instances
Problems with simple distributed simulations
Stochastic simulations do not finish at the same time, so with a fixed, even split of work, the total runtime is set by the slowest instance while the others sit idle (see the toy example below).
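A toy illustration of that straggler effect (entirely my own, not from the talk):

    import random, statistics

    random.seed(0)
    runtimes = [random.expovariate(1.0) for _ in range(1_000)]  # stochastic task durations
    n_instances = 10
    chunk_totals = [sum(runtimes[i::n_instances]) for i in range(n_instances)]

    print("mean chunk time:   ", statistics.mean(chunk_totals))
    print("slowest chunk time:", max(chunk_totals))  # this governs wall-clock time

A dynamic scheduler that hands out small tasks as workers free up, which is what both {doRedis} and Dask provide, avoids this.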
{doRedis} to the rescue!?
Pros:
- scheduler
- elastic
- easy to set up
- works in R
Cons:
- many things to choose from
- bugs that are difficult to figure out
{doRedis} to the rescue!? (pros)
- scheduler: this lets you use your cluster of instances very efficiently
- elastic: this allows using spot instances, so it is very cost-effective
- easy to set up: just installing Redis was enough
- works in R
{doRedis} to the rescue!? (cons)
- tuning: to get good performance I had to choose a good chunkSize, the best way to combine results with .combine, etc.
- bugs that are difficult to figure out: it was hard to see logs from the workers, and sometimes workers would not start for some reason
- difficulty setting up the cluster: to build a cluster you need to run a script on each instance; I made an Ansible script for this, but it was still work
Using Dask as a scheduler
Pros:
- easy to set up
- fast
Cons:
- aimed at Python scripts
- unmaintained tools
- some gotchas when running a cluster
Using Dask as a scheduler (pros)
- easy to set up: hammerlab/dask-distributed-on-kubernetes was used to run Dask on Kubernetes, and it was very easy. The only thing I had to make was a custom Docker image. The final Kubernetes YAML file can be found here. Also, since I was using Kubernetes, it was easy to change the number of nodes on the fly.
- fast: maybe because I was using Python and had refactored my R code
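The client side of such a setup might look roughly like this (the service name, port, and deployment name are my assumptions about the Kubernetes manifest, not details from the talk):

    from dask.distributed import Client

    # Connect to the scheduler exposed by a Kubernetes service
    # (service name and port are assumptions).
    client = Client("tcp://dask-scheduler:8786")

    # Changing the number of nodes on the fly then amounts to scaling
    # the worker deployment, e.g.:
    #   kubectl scale deployment dask-worker --replicas=20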
Using Dask as a scheduler (cons)
- aimed at Python scripts: when it comes to experimenting with data I think R's syntax is better, and I also like RStudio much more than Spyder. However, many functions I needed, such as those in {truncnorm}, were implemented in Python as well, and some R functions such as base::sample were easy to replicate in Python
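For example, the usual Python counterparts look like this (the concrete parameters are only illustrative, not the model's actual ones):

    import numpy as np
    from scipy.stats import truncnorm

    # R: truncnorm::rtruncnorm(n=1000, a=0, b=1, mean=0.5, sd=0.2)
    # scipy's truncnorm takes its bounds in standard-deviation units:
    a, b, mean, sd = 0.0, 1.0, 0.5, 0.2
    samples = truncnorm.rvs((a - mean) / sd, (b - mean) / sd,
                            loc=mean, scale=sd, size=1000)

    # R: base::sample(x, size=10, replace=TRUE, prob=p)
    x = np.arange(5)
    p = [0.1, 0.2, 0.3, 0.2, 0.2]
    draws = np.random.choice(x, size=10, replace=True, p=p)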
Using Dask as a scheduler (cons)
- unmaintained tools: I first tried using dask-ec2, but it didn't work. I tried to make it work by building my own AMI and looking at a problematic part of the code, but I decided to abandon it.
- some gotchas when running a cluster: Dask was easy to run on a single node, but the same code didn't work on a cluster:
  - you need to import and run multiprocessing.freeze_support
  - you need to upload modules with client.upload_file
  - some packages, such as dask and distributed, have to be of exactly the same version on the client and on the cluster
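Put together, those fixes look roughly like this (a sketch; mymodule.py stands in for whatever local modules the tasks import):

    import multiprocessing
    from dask.distributed import Client

    if __name__ == "__main__":
        multiprocessing.freeze_support()  # must run before any Dask work starts

        client = Client("tcp://dask-scheduler:8786")

        # ship local modules to every worker
        client.upload_file("mymodule.py")

        # verify that dask/distributed versions match across client,
        # scheduler, and workers (raises on mismatch)
        client.get_versions(check=True)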
Using Dask as a scheduler (cons)
- some gotchas when running a cluster (continued):
  - to persist the state of objects you need to make a threading.local variable, like this:

        import threading

        myenv = threading.local()
        # and then, inside a task, store per-thread state on it:
        # myenv.env = MyEnv()

  - you have to execute multiprocessing.freeze_support() before running your code
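Spelled out a little more (my own elaboration of the slide's fragment; MyEnv and simulate are hypothetical):

    import threading

    myenv = threading.local()

    class MyEnv:
        """Hypothetical stand-in for the stateful simulation environment."""
        def run(self, params):
            ...

    def simulate(params):
        # lazily create one MyEnv per worker thread and reuse it across
        # all the tasks that thread executes
        if not hasattr(myenv, "env"):
            myenv.env = MyEnv()
        return myenv.env.run(params)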
Final thoughts
I haven't looked at other parallel backends for R, so maybe {snow} would be satisfactory, although elasticity would be sacrificed.
About myself
Shunichi Otsuka
Working at Opt as an engineer/researcher
