DeepSpeed "ZeRO Infinity"

Offload parameters and optimizer to CPU and/or an NVMe drive

Warning: this is experimental. If you have issues, let us know in the Issues section so we can help you fix them or figure them out.

Also, many of these options will not work well, or at all, on anything other than DeepSpeed stage 3. DeepSpeed can be a tough install, and stage 3 is often unsupported on GPUs other than the V100 and A100. Cards that are similar enough in architecture, such as the RTX 2000 and RTX 3000 series, could work, but currently have a tough time with it.
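
If you're not sure which card or compute architecture PyTorch actually sees on your machine, here is a quick check (a minimal sketch, not part of the original setup steps):

import torch

# Compute capability: V100 reports (7, 0), A100 reports (8, 0),
# RTX 2000-series cards report (7, 5), RTX 3000-series cards report (8, 6).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))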

Dependencies:

  • llvm-9-dev
  • cmake
  • gcc
  • python3.8.x
  • deepspeed
  • libaio-dev
  • cudatoolkit=10.2 or 11.1 # Doesn't work on 11.2 unfortunately.
  • pytorch=1.8.*

Debian

apt install -y libaio-dev gcc cmake llvm-9-dev
python -V  # Check your version
# For CUDA 11.1 - change if you have a different version. CUDA 11.2 not supported.
pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip3 install deepspeed
pip3 install dalle-pytorch
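
After installing, it's worth a quick sanity check that the right CUDA build of PyTorch is in place and that DeepSpeed imports cleanly (a sketch, not part of the install steps above; DeepSpeed's ds_report utility gives a more detailed summary of which ops can be built):

import torch
import deepspeed

print(torch.__version__)          # expect something like 1.8.1+cu111 (or +cu102)
print(torch.version.cuda)         # should be '11.1' or '10.2' - not 11.2
print(torch.cuda.is_available())  # should be True
print(deepspeed.__version__)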

Pop!_OS 20.04 (see notes about 20.10)

At the time of this writing, 20.04 still ships system76-cuda-10.2 and system76-cudnn-10.2 as its "latest" release. On 20.10, system76-cuda-latest will give you cuda-toolkit-11.2. As such, if you're on Pop!_OS 20.10 (not 20.04), be sure to install system76-cuda-11.1 and system76-cudnn-11.1 instead.

sudo apt install system76-cuda-latest
sudo apt install system76-cudnn-latest
sudo update-alternatives --config cuda  # Choose the most recent version of cuda-toolkit you see here.
# After you're done - to switch back to your original cuda-toolkit version, just run:
sudo update-alternatives --config cuda
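
DeepSpeed compiles its ops against whatever toolkit CUDA_HOME points to, and it expects that version to match the CUDA version PyTorch was built with. A quick way to check both after switching with update-alternatives (a sketch, assuming the usual /usr/local/cuda symlink setup):

import torch
from torch.utils.cpp_extension import CUDA_HOME

# These two should agree, e.g. CUDA_HOME pointing at an 11.1 toolkit
# while torch.version.cuda == '11.1'.
print(CUDA_HOME)
print(torch.version.cuda)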

Stage 3 Barebones configuration template

In your train_dalle.py there is a dictionary, deepspeed_config, which you need to change. There are far more parameters to tinker with; you can find them in the DeepSpeed ZeRO JSON config documentation.

deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        # Offload the model parameters. If you have an nvme drive - you should use the nvme option.
        # Otherwise, use 'cpu' and remove the `nvme_path` line
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/path/to/nvme/folder",
        },
        # Offload the optimizer of choice. If you have an nvme drive - you should use the nvme option.
        # Otherwise, use 'cpu' and remove the `nvme_path` line
        "offload_optimizer": {
            "device": "nvme",  # options are 'none', 'cpu', 'nvme'
            "nvme_path": "/path/to/nvme/folder",
        },
    },
    # Override pytorch's Adam optim with `FusedAdam` (just called Adam here).
    "optimizer": {
        "type": "Adam",  # You can also use AdamW here
        "params": {
            "lr": LEARNING_RATE,
        },
    },
    'train_batch_size': BATCH_SIZE,
    'gradient_clipping': GRAD_CLIP_NORM,
    'fp16': {
        'enabled': args.fp16,
    },
}
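
For context, a config dictionary like this ultimately gets handed to deepspeed.initialize, which wraps the model and optimizer in a DeepSpeed engine. dalle-pytorch does this wiring for you through its distributed backend; the sketch below only illustrates the general pattern, with placeholder names (args, model, batch) rather than the actual train_dalle.py code:

import deepspeed

# Wrap the model; DeepSpeed builds the (offloaded) optimizer from the config above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,                           # parsed CLI args (DeepSpeed looks for local_rank here)
    model=model,                         # your torch.nn.Module
    model_parameters=model.parameters(),
    config_params=deepspeed_config,      # the dictionary shown above
)

# Training step: the engine replaces the usual loss.backward() / optimizer.step().
loss = model_engine(batch)    # forward pass returning a scalar loss
model_engine.backward(loss)
model_engine.step()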
