Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases
Frank McQuillan, Feb 2020
Nikhil Kak, Ekta Khanna, Orhan Kislal, Domino Valdano, Professor Arun Kumar, Yuhao Zhang
Free and Open Source
Agenda
1. Training deep neural networks in parallel
2. Model hopper parallelism
3. Model hopper on MPP
4. Examples
   – Grid search
   – AutoML (Hyperband)
Deep Learning
• Type of machine learning inspired by the biology of the brain
• Artificial neural networks have multiple layers between input and output
[Figure: a convolutional neural network]
Problem: Training deep nets is painful
Batch size: 8, 16, 64, 256 ...
Model architecture: 3-layer CNN, 5-layer CNN, LSTM ...
Learning rate: 0.1, 0.01, 0.001, 0.0001 ...
Regularization: L2, L1, dropout, batch norm ...
4 × 4 × 4 × 4 = 256 different configurations!
Model accuracy = f(model architecture, hyperparameters, datasets), found by trial and error.
Need for speed → parallelism
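To make the combinatorics concrete, here is a minimal Python sketch that enumerates such a grid; the fourth architecture entry is a hypothetical placeholder, since the slide elides it with "...":

```python
from itertools import product

# Grid mirroring the slide; the fourth architecture ("GRU") is an
# assumed placeholder, since the slide lists only three and an ellipsis.
grid = {
    "batch_size":     [8, 16, 64, 256],
    "architecture":   ["3-layer CNN", "5-layer CNN", "LSTM", "GRU"],
    "learning_rate":  [0.1, 0.01, 0.001, 0.0001],
    "regularization": ["L2", "L1", "dropout", "batch norm"],
}
configs = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(configs))  # 4 * 4 * 4 * 4 = 256
```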
1. Training Deep Neural Nets in Parallel
Gradient Descent
[Figure source: https://medium.com/@divakar_239/stochastic-vs-batch-gradient-descent-8820568eada1]
Mini-batch SGD on a Single Machine
Update rule: updated model = model − η × (avg. of gradients ∇ over one mini-batch), where η is the learning rate.
[Figure: a table of training rows (features x1, x2 and label y) with one mini-batch of rows highlighted]
Mini-batch SGD on a Single Machine
[Figure: the same training table; mini-batches are visited sequentially, and one full pass over all mini-batches is one iteration (epoch)]
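A minimal Python sketch of this sequential loop; grad_fn is a placeholder for whatever computes the average gradient of the loss over a batch:

```python
import numpy as np

def minibatch_sgd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=1):
    """Sequentially visit mini-batches; one full pass over the data is one
    epoch. grad_fn(w, Xb, yb) returns the average loss gradient over a batch."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)             # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # one mini-batch
            w = w - lr * grad_fn(w, X[idx], y[idx])  # update per mini-batch
    return w

# Example: least-squares gradient for a linear model.
# grad_fn = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)
```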
How to parallelize?
Task Parallelism
Replicate the dataset on each machine; train multiple models (tasks) at the same time.
Task Parallelism Con: wasted memory and storage
Task Parallelism Con: wasted network bandwidth (every machine pulls the full dataset from a common file system or data repo)
Data Parallelism
Partition the data across machines; train models (tasks) one at a time.
Data Parallelism
Training on one mini-batch or the full partition before each update:
● Update the model every iteration: bulk synchronous parallelism (model averaging)
  ○ Convergence can be poor
● Update per mini-batch:
  ○ Sync parameter server
  ○ Async parameter server
  ○ Decentralized: MPI allreduce (Horovod)
  ○ High communication cost
(A sketch of model averaging follows below.)
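A minimal sketch of the bulk synchronous (model averaging) variant, assuming each worker returns its model as a list of per-layer numpy arrays:

```python
import numpy as np

def bsp_average(worker_weights):
    """Bulk synchronous parallelism: each worker trains on its own partition
    for one iteration, then all workers synchronize on the element-wise
    average of their weights. Cheap to communicate, but averaging distinct
    models only approximates training on the combined data, so convergence
    can be poor."""
    return [np.mean(layers, axis=0) for layers in zip(*worker_weights)]

# e.g. with 4 workers: new_weights = bsp_average([w0, w1, w2, w3])
```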
Task Parallelism vs. Data Parallelism vs. Model Hopper Parallelism
Let’s try to get the best of both worlds.
2. Model Hopper Parallelism
Model Hopper Parallelism Models (tasks) Partition data across machines
Model Hopper Parallelism Training on entire partition
Model Hopper Parallelism Training on entire partition Model hopping & training
Model Hopper Parallelism Training on entire partition Model hopping & training Model hopping & training
Model Hopper Parallelism
Training on entire partition → model hopping & training → model hopping & training ...
Training one model on one partition is one sub-epoch; once every model has visited every partition, that is one iteration.
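A minimal Python sketch of the hopping idea (not the actual Cerebro/MADlib scheduler), assuming for simplicity as many models as partitions:

```python
def model_hopper_schedule(num_models, num_partitions):
    """Yield, per sub-epoch, a {partition: model} assignment. Each model
    trains on one partition's full local data per sub-epoch, then hops,
    so after num_partitions sub-epochs (one iteration) every model has
    trained on all of the data exactly once."""
    for sub_epoch in range(num_partitions):
        yield {p: (p + sub_epoch) % num_models for p in range(num_partitions)}

for sub_epoch, assignment in enumerate(model_hopper_schedule(4, 4)):
    print(f"sub-epoch {sub_epoch}: partition -> model {assignment}")
```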
Model Hopper Parallelism
https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf
● 8-node GPU cluster, Nvidia P100 GPUs
● ImageNet dataset
● 16 model configurations using VGG16 and ResNet50
[Chart: model hopper trains the 16 configurations faster than the alternatives]
3. Model Hopper on Greenplum Database
Greenplum Database with GPUs
[Figure: SQL arrives at the master host and is dispatched to segment hosts; each segment host can have multiple GPUs (GPU 1 … GPU N).]
GPUs only on certain hosts for cost reasons.
API -- load_model_selection_table()

SELECT load_model_selection_table(
    'model_arch_table',       -- model architecture table
    'model_selection_table',  -- output table
    ARRAY[...],               -- model architecture ids
    ARRAY[...],               -- compile hyperparameters
    ARRAY[...]                -- fit hyperparameters
);

Output: model_selection_table
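The model architecture table holds architectures as Keras JSON. A minimal sketch, matching the Keras 2.2.4 setup used later in the deck, of producing that JSON for a small CIFAR-10-style CNN; how the JSON is loaded into the table is elided here:

```python
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

# A small CIFAR-10-style CNN; model.to_json() yields the architecture
# string that the model architecture table stores.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),  # 10 CIFAR-10 classes
])
arch_json = model.to_json()
```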
API -- madlib_keras_fit_multiple_model()

SELECT madlib_keras_fit_multiple_model(
    'data',                   -- data table
    'trained_models',         -- output table name
    'model_selection_table',  -- model selection table
    100,                      -- number of iterations
    True,                     -- use GPUs
    ...                       -- optional parameters
);

Output: trained_models
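A minimal sketch of driving the same call from Python with psycopg2, e.g. from one of the Jupyter notebooks referenced later; the connection string and the madlib schema are assumptions about the cluster setup:

```python
import psycopg2

# Illustrative DSN; adjust host/dbname/user for your cluster. Assumes
# MADlib is installed in the 'madlib' schema.
conn = psycopg2.connect("host=mdw dbname=cifar user=gpadmin")
cur = conn.cursor()
cur.execute("""
    SELECT madlib.madlib_keras_fit_multiple_model(
        'data',                   -- data table
        'trained_models',         -- output table name
        'model_selection_table',  -- model selection table
        100,                      -- number of iterations
        TRUE                      -- use GPUs
    );
""")
conn.commit()
cur.close()
conn.close()
```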
4a. Grid Search Example Using Model Hopper
Test Setup
CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html):
• 60k 32x32 color images in 10 classes, with 6k images per class
• 50k training images and 10k test images
Cluster:
• 4 hosts, each with 32 vCPUs, 150 GB memory, and 4 NVIDIA Tesla P100 GPUs
• Greenplum 5 with 4 segments per host
• Apache MADlib 1.17
• Keras 2.2.4, TensorFlow 1.13.1
Approach to Training (à la Henri Cartier-Bresson): survey broadly first, then concentrate on the area of promise to fine-tune.
Model Configurations (3 architectures × 8 optimizer configs × 4 batch sizes = 96 configs)

Model architectures:
Model id | Type | Weights
1        | CNN  | 1.2M
2        | CNN  | 552K
3        | CNN  | 500K

Optimizers:
Type    | Params
RMSprop | lr=.0001, decay=1e-6
RMSprop | lr=.001, decay=1e-6
Adam    | lr=.0001
Adam    | lr=.001
SGD     | lr=.001, momentum=0.9
SGD     | lr=.01, momentum=0.9
SGD     | lr=.001, momentum=0.95
SGD     | lr=.01, momentum=0.95

Batch sizes: 32, 64, 128, 256
Too much information! Run time: 2:47:57 on 16 workers / 16 GPUs
Why not just use the most accurate? 82.3%
Because it is overfitting the data
[Charts: training accuracy climbs well above validation accuracy (overfit)]
Focus on models with low overfit
Models: 2 & 3; Optimizers: Adam & SGD; Batch sizes: 128 & 256
(These charts only consider models with accuracy > 70%.)
Example with less overfitting 70.3%
Refine Model Configs (2 architectures × 3 optimizer configs × 2 batch sizes = 12 configs, was 96)
Keep model architectures 2 (CNN, 552K weights) and 3 (CNN, 500K weights), the Adam and SGD optimizer configs that showed low overfit, and batch sizes 128 and 256.
Re-run the best 12 with more iterations. Run time: 1:53:06 on 16 workers / 16 GPUs
Example good result: 80.0% with SGD (lr=0.001, momentum=0.95), batch size=256
Inference
[Image: “Who am I ???”]
Model says: dog 0.9532, deer 0.0258, bird 0.0093
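A minimal sketch of scoring one 32x32 image in Keras; arch_json, trained_weights, and image are hypothetical placeholders for values exported from the architecture and output tables:

```python
import numpy as np
from keras.models import model_from_json

CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

def top3(arch_json, trained_weights, image):
    """Score one 32x32 RGB uint8 image; arch_json and trained_weights are
    assumed to come from the architecture and trained-models tables."""
    model = model_from_json(arch_json)        # rebuild the architecture
    model.set_weights(trained_weights)        # install the trained weights
    probs = model.predict(image[np.newaxis].astype('float32') / 255.0)[0]
    return sorted(zip(CIFAR10_CLASSES, probs), key=lambda t: -t[1])[:3]

# e.g. top3(arch_json, weights, img) -> [('dog', 0.9532), ('deer', 0.0258), ('bird', 0.0093)]
```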
4b. AutoML Example Using Model Hopper
Hyperband
Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
https://arxiv.org/pdf/1603.06560.pdf
Successive halving: train the initial models (tasks) for r_i iterations each, keep only the best performers as survivors after each round, train the survivors for more iterations, and repeat.
Hyperband
Inputs:
● R: maximum amount of resource that can be allocated to a single configuration
● η: controls the proportion of configurations discarded in each round of successive halving
Example: R = 81, η = 3
[Table: for each bracket, the number of model configs and the number of iterations per successive-halving round]
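A minimal Python sketch of that bracket schedule, following Algorithm 1 of the Hyperband paper (not MADlib's implementation); integer arithmetic avoids float rounding:

```python
import math

def hyperband_brackets(R=81, eta=3):
    """Per-bracket successive-halving schedule: for each bracket s, a list
    of (num_configs, num_iterations) rounds."""
    s_max = 0
    while eta ** (s_max + 1) <= R:       # s_max = floor(log_eta(R)); 4 here
        s_max += 1
    brackets = {}
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # initial configs
        r = R // eta ** s                                # initial iterations
        brackets[s] = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
    return brackets

for s, rounds in hyperband_brackets().items():
    print(f"bracket {s}: {rounds}")
# bracket 4: [(81, 1), (27, 3), (9, 9), (3, 27), (1, 81)] ... bracket 0: [(5, 81)]
```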
Model Configurations

Model architectures:
Model id | Type | Weights
1        | CNN  | 1.2M
2        | CNN  | 552K
3        | CNN  | 500K

Optimizers (note that ranges are specified, not exact values):
Type    | Params
RMSprop | lr=[.0001, .001], decay=1e-6
Adam    | lr=[.0001, .001]
SGD     | lr=[.001, .01], momentum=[0.9, 0.95]

Batch sizes: 32, 64, 128, 256

The Hyperband schedule determines how many configurations are sampled from these ranges and for how long each trains.
Again, a lot of information! Run time: 3:21:04 on 16 workers / 16 GPUs
Refine Model Configurations
Adjust the Hyperband schedule and narrow the ranges:
Adam: lr=[.0001, .001] → lr=[.0001, .0005]
SGD:  lr=[.001, .01]   → lr=[.001, .005], momentum=[0.9, 0.95]
(Model architectures and batch sizes as before.)
Re-run the best with Hyperband again. Run time: 55:49 on 16 workers / 16 GPUs
Example good result: 78.2% with SGD (lr=0.00485, momentum=0.90919), batch size=128
Summary • Model hopper can efficiently train many deep nets at a time on an MPP database • Can add autoML methods on top of model hopper • Areas of future work: – Improve GPU efficiency – Add more autoML methods
References
• Model hopper
  – Supun Nakandala, Yuhao Zhang, and Arun Kumar, "Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems," DEEM’19, June 30, 2019, Amsterdam, Netherlands. https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf
• Greenplum
  – https://github.com/greenplum-db/gpdb
  – https://greenplum.org/
• Apache MADlib
  – https://github.com/apache/madlib
  – http://madlib.apache.org/
• Jupyter notebooks
  – https://github.com/apache/madlib-site/tree/asf-site/community-artifacts
Thank you!
Backup Slides
Tuning Epochs on a Distributed System
Effect of cluster size
CIFAR-10, 2 CNNs with ~500k and ~1M weights; 16 model configurations
5 passes per sub-epoch × 10 iterations = 50 epochs
[Chart: runtime for 4, 8, and 16 workers]
Effect of cluster size
CIFAR-10, 2 CNNs with ~500k and ~1M weights; 16 model configurations
50 passes per sub-epoch × 1 iteration = 50 epochs
[Chart: runtime for 1, 4, 8, and 16 workers]
Effect of passes per sub-epoch
CIFAR-10, 2 CNNs with ~500k and ~1M weights; 16 model configurations
4 hosts × 4 workers per host = 16 workers; 4 hosts × 4 GPUs per host = 16 GPUs
Other MOP Slides
Comparison of Parallel Training Approaches
Implementation on GPDB
1. Scheduling
2. Dispatch model to specific segment (worker)
3. Run training
4. No data motion
Implementation on GPDB -- Scheduling
Models are assigned to segments round-robin; each sub-epoch every model hops to the next segment, and one full rotation is one iteration.
Implementation on GPDB -- Dispatching
Both the model table and the data table are DISTRIBUTED BY (__dist_key__); a JOIN on __dist_key__ colocates each model with its assigned data partition, so models move between segments while the data stays put.
Implementation on GPDB -- Training
Training runs as a user-defined aggregate (UDA) in a SELECT ... FROM ... GROUP BY __dist_key__ query, so each segment trains its currently assigned model on its local partition.
Other Hyperband Slides
Hyperband Implementation on Greenplum (R = 81, η = 3)
A bracket shrinks from 81 model configurations × 1 iteration each down to 1 model configuration × 81 iterations.
Running 1 bracket at a time will result in idle machines in the later rounds.
Hyperband Implementation on Greenplum (R = 81, η = 3)
The final rounds across brackets total 1+1+1+2+5 = 10 model configurations at 81 iterations each.
Run across brackets “on the diagonal” to keep machines busy, as sketched below.
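A minimal sketch of that diagonal grouping, reusing the hyperband_brackets() helper sketched earlier: rounds with the same per-round iteration budget r_i are run together across brackets.

```python
from collections import defaultdict

# Group successive-halving rounds from all brackets by their per-round
# iteration budget r_i, so rounds that train for the same number of
# iterations execute together and keep all segments busy.
diagonal = defaultdict(list)
for s, rounds in hyperband_brackets().items():   # helper sketched earlier
    for n_i, r_i in rounds:
        diagonal[r_i].append((s, n_i))

for r_i in sorted(diagonal):
    total = sum(n for _, n in diagonal[r_i])
    print(f"{total} configs train for {r_i} iterations: (bracket, configs) = {diagonal[r_i]}")
# The r_i = 81 group holds 1+1+1+2+5 = 10 configurations, as on the slide.
```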