Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases
Frank McQuillan, Feb 2020
Nikhil Kak, Ekta Khanna, Orhan Kislal, Domino Valdano, Professor Arun Kumar, Yuhao Zhang
Free and Open Source
Agenda
1. Training deep neural networks in parallel
2. Model hopper parallelism
3. Model hopper on MPP
4. Examples
   – Grid search
   – AutoML (Hyperband)
Deep Learning
• Type of machine learning inspired by the biology of the brain
• Artificial neural networks have multiple layers between input and output
[Figure: a convolutional neural network]
Problem: Training deep nets is painful
Batch size: 8, 16, 64, 256 ...
Model architecture: 3-layer CNN, 5-layer CNN, LSTM ...
Learning rate: 0.1, 0.01, 0.001, 0.0001 ...
Regularization: L2, L1, dropout, batch norm ...
4 × 4 × 4 × 4 = 256 different configurations!
Model accuracy = f(model architecture, hyperparameters, datasets), found by trial and error.
Need for speed → parallelism
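To make the combinatorics concrete, here is a minimal Python sketch that enumerates such a grid; the fourth architecture entry is a hypothetical placeholder, since the slide elides it with "...":

```python
from itertools import product

# Grid mirroring the slide; the fourth architecture ("GRU") is an
# assumed placeholder, since the slide lists only three and an ellipsis.
grid = {
    "batch_size":     [8, 16, 64, 256],
    "architecture":   ["3-layer CNN", "5-layer CNN", "LSTM", "GRU"],
    "learning_rate":  [0.1, 0.01, 0.001, 0.0001],
    "regularization": ["L2", "L1", "dropout", "batch norm"],
}
configs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(configs))  # 4 * 4 * 4 * 4 = 256
```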
1. Training Deep Neural Nets in Parallel
Gradient Descent
[Figure source: https://medium.com/@divakar_239/stochastic-vs-batch-gradient-descent-8820568eada1]
Mini-batch SGD on a Single Machine
Update rule: updated model = model − η × (avg. of gradients ∇ over one mini-batch), where η is the learning rate.
[Figure: a table of training rows (features x1, x2 and label y) with one mini-batch of rows highlighted]
Mini-batch SGD on a Single Machine
[Figure: the same training table; mini-batches are visited sequentially, and one full pass over all mini-batches is one iteration (epoch)]
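A minimal Python sketch of this sequential loop; grad_fn is a placeholder for whatever computes the average gradient of the loss over a batch:

```python
import numpy as np

def minibatch_sgd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=1):
    """Sequentially visit mini-batches; one full pass over the data is one
    epoch. grad_fn(w, Xb, yb) returns the average loss gradient over a batch."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)             # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # one mini-batch
            w = w - lr * grad_fn(w, X[idx], y[idx])  # update per mini-batch
    return w

# Example: least-squares gradient for a linear model.
# grad_fn = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)
```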
How to parallelize?
Task Parallelism
Replicate the dataset on each machine; train multiple models (tasks) at the same time.
Task Parallelism Con: wasted memory and storage
Task Parallelism Con: wasted network bandwidth (every machine pulls the full dataset from a common file system or data repo)
Data Parallelism
Partition the data across machines; train models (tasks) one at a time.
Data Parallelism
Training on one mini-batch or the full partition before each update:
● Update the model every iteration: bulk synchronous parallelism (model averaging)
  ○ Convergence can be poor
● Update per mini-batch:
  ○ Sync parameter server
  ○ Async parameter server
  ○ Decentralized: MPI allreduce (Horovod)
  ○ High communication cost
(A sketch of model averaging follows below.)
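A minimal sketch of the bulk synchronous (model averaging) variant, assuming each worker returns its model as a list of per-layer numpy arrays:

```python
import numpy as np

def bsp_average(worker_weights):
    """Bulk synchronous parallelism: each worker trains on its own partition
    for one iteration, then all workers synchronize on the element-wise
    average of their weights. Cheap to communicate, but averaging distinct
    models only approximates training on the combined data, so convergence
    can be poor."""
    return [np.mean(layers, axis=0) for layers in zip(*worker_weights)]

# e.g. with 4 workers: new_weights = bsp_average([w0, w1, w2, w3])
```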
Task Parallelism vs. Data Parallelism vs. Model Hopper Parallelism
Let’s try to get the best of both worlds.
2. Model Hopper Parallelism
Model Hopper Parallelism Models (tasks) Partition data across machines
Model Hopper Parallelism Training on entire partition
Model Hopper Parallelism Training on entire partition Model hopping & training
Model Hopper Parallelism Training on entire partition Model hopping & training Model hopping & training
Model Hopper Parallelism
Training on entire partition → model hopping & training → model hopping & training ...
Training one model on one partition is one sub-epoch; once every model has visited every partition, that is one iteration.
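A minimal Python sketch of the hopping idea (not the actual Cerebro/MADlib scheduler), assuming for simplicity as many models as partitions:

```python
def model_hopper_schedule(num_models, num_partitions):
    """Yield, per sub-epoch, a {partition: model} assignment. Each model
    trains on one partition's full local data per sub-epoch, then hops,
    so after num_partitions sub-epochs (one iteration) every model has
    trained on all of the data exactly once."""
    for sub_epoch in range(num_partitions):
        yield {p: (p + sub_epoch) % num_models for p in range(num_partitions)}

for sub_epoch, assignment in enumerate(model_hopper_schedule(4, 4)):
    print(f"sub-epoch {sub_epoch}: partition -> model {assignment}")
```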
Model Hopper Parallelism
https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf
● 8-node GPU cluster, Nvidia P100 GPUs
● ImageNet dataset
● 16 model configurations using VGG16 and ResNet50
[Chart: model hopper trains the 16 configurations faster than the alternatives]
3. Model Hopper on Greenplum Database
Greenplum Database with GPUs
[Figure: SQL arrives at the master host and is dispatched to segment hosts; each segment host can have multiple GPUs (GPU 1 … GPU N).]
GPUs only on certain hosts for cost reasons.
API -- load_model_selection_table()

SELECT load_model_selection_table(
    'model_arch_table',       -- model architecture table
    'model_selection_table',  -- output table
    ARRAY[...],               -- model architecture ids
    ARRAY[...],               -- compile hyperparameters
    ARRAY[...]                -- fit hyperparameters
);

Output: model_selection_table
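The model architecture table holds architectures as Keras JSON. A minimal sketch, matching the Keras 2.2.4 setup used later in the deck, of producing that JSON for a small CIFAR-10-style CNN; how the JSON is loaded into the table is elided here:

```python
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

# A small CIFAR-10-style CNN; model.to_json() yields the architecture
# string that the model architecture table stores.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),  # 10 CIFAR-10 classes
])
arch_json = model.to_json()
```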
API -- madlib_keras_fit_multiple_model()

SELECT madlib_keras_fit_multiple_model(
    'data',                   -- data table
    'trained_models',         -- output table name
    'model_selection_table',  -- model selection table
    100,                      -- number of iterations
    True,                     -- use GPUs
    ...                       -- optional parameters
);

Output: trained_models
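A minimal sketch of driving the same call from Python with psycopg2, e.g. from one of the Jupyter notebooks referenced later; the connection string and the madlib schema are assumptions about the cluster setup:

```python
import psycopg2

# Illustrative DSN; adjust host/dbname/user for your cluster. Assumes
# MADlib is installed in the 'madlib' schema.
conn = psycopg2.connect("host=mdw dbname=cifar user=gpadmin")
cur = conn.cursor()
cur.execute("""
    SELECT madlib.madlib_keras_fit_multiple_model(
        'data',                   -- data table
        'trained_models',         -- output table name
        'model_selection_table',  -- model selection table
        100,                      -- number of iterations
        TRUE                      -- use GPUs
    );
""")
conn.commit()
cur.close()
conn.close()
```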
4a. Grid Search Example Using Model Hopper
Test Setup
CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html):
• 60k 32x32 color images in 10 classes, with 6k images per class
• 50k training images and 10k test images
Cluster:
• 4 hosts, each with 32 vCPUs, 150 GB memory, and 4 NVIDIA Tesla P100 GPUs
• Greenplum 5 with 4 segments per host
• Apache MADlib 1.17
• Keras 2.2.4, TensorFlow 1.13.1
Approach to Training (à la Henri Cartier-Bresson): survey broadly first, then concentrate on the area of promise to fine-tune.
Model Configurations (3 architectures × 8 optimizer configs × 4 batch sizes = 96 configs)

Model architectures:
Model id | Type | Weights
1        | CNN  | 1.2M
2        | CNN  | 552K
3        | CNN  | 500K

Optimizers:
Type    | Params
RMSprop | lr=.0001, decay=1e-6
RMSprop | lr=.001, decay=1e-6
Adam    | lr=.0001
Adam    | lr=.001
SGD     | lr=.001, momentum=0.9
SGD     | lr=.01, momentum=0.9
SGD     | lr=.001, momentum=0.95
SGD     | lr=.01, momentum=0.95

Batch sizes: 32, 64, 128, 256
Too much information! Run time: 2:47:57 on 16 workers / 16 GPUs
Why not just use the most accurate? 82.3%
Because it is overfitting the data
[Charts: training accuracy climbs well above validation accuracy (overfit)]
Focus on models with low overfit
Models: 2 & 3; Optimizers: Adam & SGD; Batch sizes: 128 & 256
(These charts only consider models with accuracy > 70%.)
Example with less overfitting 70.3%
Refine Model Configs (2 architectures × 3 optimizer configs × 2 batch sizes = 12 configs, was 96)
Keep model architectures 2 (CNN, 552K weights) and 3 (CNN, 500K weights), the Adam and SGD optimizer configs that showed low overfit, and batch sizes 128 and 256.
Re-run the best 12 with more iterations. Run time: 1:53:06 on 16 workers / 16 GPUs
Example good result: 80.0% with SGD (lr=0.001, momentum=0.95), batch size=256
Inference
[Image: “Who am I ???”]
Model says: dog 0.9532, deer 0.0258, bird 0.0093
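A minimal sketch of scoring one 32x32 image in Keras; arch_json, trained_weights, and image are hypothetical placeholders for values exported from the architecture and output tables:

```python
import numpy as np
from keras.models import model_from_json

CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

def top3(arch_json, trained_weights, image):
    """Score one 32x32 RGB uint8 image; arch_json and trained_weights are
    assumed to come from the architecture and trained-models tables."""
    model = model_from_json(arch_json)        # rebuild the architecture
    model.set_weights(trained_weights)        # install the trained weights
    probs = model.predict(image[np.newaxis].astype('float32') / 255.0)[0]
    return sorted(zip(CIFAR10_CLASSES, probs), key=lambda t: -t[1])[:3]

# e.g. top3(arch_json, weights, img) -> [('dog', 0.9532), ('deer', 0.0258), ('bird', 0.0093)]
```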
4b. AutoML Example Using Model Hopper
Hyperband
Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
https://arxiv.org/pdf/1603.06560.pdf
Successive halving: train the initial models (tasks) for r_i iterations each, keep only the best performers as survivors after each round, train the survivors for more iterations, and repeat.
Hyperband
Inputs:
● R: maximum amount of resource that can be allocated to a single configuration
● η: controls the proportion of configurations discarded in each round of successive halving
Example: R = 81, η = 3
[Table: for each bracket, the number of model configs and the number of iterations per successive-halving round]
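A minimal Python sketch of that bracket schedule, following Algorithm 1 of the Hyperband paper (not MADlib's implementation); integer arithmetic avoids float rounding:

```python
import math

def hyperband_brackets(R=81, eta=3):
    """Per-bracket successive-halving schedule: for each bracket s, a list
    of (num_configs, num_iterations) rounds."""
    s_max = 0
    while eta ** (s_max + 1) <= R:       # s_max = floor(log_eta(R)); 4 here
        s_max += 1
    brackets = {}
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # initial configs
        r = R // eta ** s                                # initial iterations
        brackets[s] = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
    return brackets

for s, rounds in hyperband_brackets().items():
    print(f"bracket {s}: {rounds}")
# bracket 4: [(81, 1), (27, 3), (9, 9), (3, 27), (1, 81)] ... bracket 0: [(5, 81)]
```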
Model Configurations

Model architectures:
Model id | Type | Weights
1        | CNN  | 1.2M
2        | CNN  | 552K
3        | CNN  | 500K

Optimizers (note that ranges are specified, not exact values):
Type    | Params
RMSprop | lr=[.0001, .001], decay=1e-6
Adam    | lr=[.0001, .001]
SGD     | lr=[.001, .01], momentum=[0.9, 0.95]

Batch sizes: 32, 64, 128, 256

The Hyperband schedule determines how many configurations are sampled from these ranges and for how long each trains.
Again, a lot of information! Run time: 3:21:04 on 16 workers / 16 GPUs
Refine Model Configurations
Adjust the Hyperband schedule and narrow the ranges:
Adam: lr=[.0001, .001] → lr=[.0001, .0005]
SGD:  lr=[.001, .01]   → lr=[.001, .005], momentum=[0.9, 0.95]
(Model architectures and batch sizes as before.)
Re-run the best with Hyperband again. Run time: 55:49 on 16 workers / 16 GPUs
Example good result: 78.2% with SGD (lr=0.00485, momentum=0.90919), batch size=128
Summary • Model hopper can efficiently train many deep nets at a time on an MPP database • Can add autoML methods on top of model hopper • Areas of future work: – Improve GPU efficiency – Add more autoML methods
References
• Model hopper
  – Supun Nakandala, Yuhao Zhang, and Arun Kumar, "Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems," DEEM’19, June 30, 2019, Amsterdam, Netherlands. https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf
• Greenplum
  – https://github.com/greenplum-db/gpdb
  – https://greenplum.org/
• Apache MADlib
  – https://github.com/apache/madlib
  – http://madlib.apache.org/
• Jupyter notebooks
  – https://github.com/apache/madlib-site/tree/asf-site/community-artifacts
Thank you!
Backup Slides
Tuning Epochs on a Distributed System
Effect of cluster size
CIFAR-10, 2 CNNs with ~500k and ~1M weights; 16 model configurations
5 passes per sub-epoch × 10 iterations = 50 epochs
[Chart: runtime for 4, 8, and 16 workers]
Effect of cluster size
CIFAR-10, 2 CNNs with ~500k and ~1M weights; 16 model configurations
50 passes per sub-epoch × 1 iteration = 50 epochs
[Chart: runtime for 1, 4, 8, and 16 workers]
Effect of passes per sub-epoch
CIFAR-10, 2 CNNs with ~500k and ~1M weights; 16 model configurations
4 hosts × 4 workers per host = 16 workers; 4 hosts × 4 GPUs per host = 16 GPUs
Other MOP Slides
Comparison of Parallel Training Approaches
Implementation on GPDB
1. Scheduling
2. Dispatch model to specific segment (worker)
3. Run training
4. No data motion
Implementation on GPDB -- Scheduling
Models are assigned to segments round-robin; each sub-epoch every model hops to the next segment, and one full rotation is one iteration.
Implementation on GPDB -- Dispatching
Both the model table and the data table are DISTRIBUTED BY (__dist_key__); a JOIN on __dist_key__ colocates each model with its assigned data partition, so models move between segments while the data stays put.
Implementation on GPDB -- Training
Training runs as a user-defined aggregate (UDA) in a SELECT ... FROM ... GROUP BY __dist_key__ query, so each segment trains its currently assigned model on its local partition.
Other Hyperband Slides
Hyperband Implementation on Greenplum (R = 81, η = 3)
A bracket shrinks from 81 model configurations × 1 iteration each down to 1 model configuration × 81 iterations.
Running 1 bracket at a time will result in idle machines in the later rounds.
Hyperband Implementation on Greenplum (R = 81, η = 3)
The final rounds across brackets total 1+1+1+2+5 = 10 model configurations at 81 iterations each.
Run across brackets “on the diagonal” to keep machines busy, as sketched below.
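A minimal sketch of that diagonal grouping, reusing the hyperband_brackets() helper sketched earlier: rounds with the same per-round iteration budget r_i are run together across brackets.

```python
from collections import defaultdict

# Group successive-halving rounds from all brackets by their per-round
# iteration budget r_i, so rounds that train for the same number of
# iterations execute together and keep all segments busy.
diagonal = defaultdict(list)
for s, rounds in hyperband_brackets().items():   # helper sketched earlier
    for n_i, r_i in rounds:
        diagonal[r_i].append((s, n_i))

for r_i in sorted(diagonal):
    total = sum(n for _, n in diagonal[r_i])
    print(f"{total} configs train for {r_i} iterations: (bracket, configs) = {diagonal[r_i]}")
# The r_i = 81 group holds 1+1+1+2+5 = 10 configurations, as on the slide.
```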