DeepSpeed "ZeRO Infinity"

Offload parameters and optimizer to CPU and/or an NVMe drive

Warning: this is experimental. If you have issues, let us know in the Issues section so we can help you fix them or figure them out.

Also, many of these options will not work well, or at all, on anything other than DeepSpeed stage 3. DeepSpeed can be a tough install, and stage 3 is often unsupported on GPUs other than the V100 and A100. Cards that are similar enough in architecture, such as the RTX 2000 and RTX 3000 series, could work, but currently have a tough time with it.
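
If you're not sure which card or compute architecture PyTorch actually sees on your machine, here is a quick check (a minimal sketch, not part of the original setup steps):

import torch

# Compute capability: V100 reports (7, 0), A100 reports (8, 0),
# RTX 2000-series cards report (7, 5), RTX 3000-series cards report (8, 6).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))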

Dependencies:

  • llvm-9-dev
  • cmake
  • gcc
  • python3.8.x
  • deepspeed
  • libaio-dev
  • cudatoolkit=10.2 or 11.1 # Doesn't work on 11.2 unfortunately.
  • pytorch=1.8.*

Debian

apt install -y libaio-dev gcc cmake llvm-9-dev
python -V  # Check your version
# For CUDA 11.1 - change if you have a different version. CUDA 11.2 not supported.
pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip3 install deepspeed
pip3 install dalle-pytorch
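
After installing, it's worth a quick sanity check that the right CUDA build of PyTorch is in place and that DeepSpeed imports cleanly (a sketch, not part of the install steps above; DeepSpeed's ds_report utility gives a more detailed summary of which ops can be built):

import torch
import deepspeed

print(torch.__version__)          # expect something like 1.8.1+cu111 (or +cu102)
print(torch.version.cuda)         # should be '11.1' or '10.2' - not 11.2
print(torch.cuda.is_available())  # should be True
print(deepspeed.__version__)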

Pop!_OS 20.04 (see notes about 20.10)

At the time of this writing, 20.04 still ships system76-cuda-10.2 and system76-cudnn-10.2 as its "latest" release. On 20.10, system76-cuda-latest will give you cuda-toolkit-11.2. As such, if you're on Pop!_OS 20.10 (not 20.04), be sure to install system76-cuda-11.1 and system76-cudnn-11.1 instead.

sudo apt install system76-cuda-latest
sudo apt install system76-cudnn-latest
sudo update-alternatives --config cuda  # Choose the most recent version of cuda-toolkit you see here.
# After you're done - to switch back to your original cuda-toolkit version, just run:
sudo update-alternatives --config cuda
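
DeepSpeed compiles its ops against whatever toolkit CUDA_HOME points to, and it expects that version to match the CUDA version PyTorch was built with. A quick way to check both after switching with update-alternatives (a sketch, assuming the usual /usr/local/cuda symlink setup):

import torch
from torch.utils.cpp_extension import CUDA_HOME

# These two should agree, e.g. CUDA_HOME pointing at an 11.1 toolkit
# while torch.version.cuda == '11.1'.
print(CUDA_HOME)
print(torch.version.cuda)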

Stage 3 Barebones configuration template

In your train_dalle.py there is a dictionary, deepspeed_config, which you need to change. There are far more parameters to tinker with; you can find them in the DeepSpeed ZeRO JSON config documentation.

deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        # Offload the model parameters. If you have an nvme drive - you should use the nvme option.
        # Otherwise, use 'cpu' and remove the `nvme_path` line
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/path/to/nvme/folder",
        },
        # Offload the optimizer of choice. If you have an nvme drive - you should use the nvme option.
        # Otherwise, use 'cpu' and remove the `nvme_path` line
        "offload_optimizer": {
            "device": "nvme",  # options are 'none', 'cpu', 'nvme'
            "nvme_path": "/path/to/nvme/folder",
        },
    },
    # Override pytorch's Adam optim with `FusedAdam` (just called Adam here).
    "optimizer": {
        "type": "Adam",  # You can also use AdamW here
        "params": {
            "lr": LEARNING_RATE,
        },
    },
    'train_batch_size': BATCH_SIZE,
    'gradient_clipping': GRAD_CLIP_NORM,
    'fp16': {
        'enabled': args.fp16,
    },
}
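
For context, a config dictionary like this ultimately gets handed to deepspeed.initialize, which wraps the model and optimizer in a DeepSpeed engine. dalle-pytorch does this wiring for you through its distributed backend; the sketch below only illustrates the general pattern, with placeholder names (args, model, batch) rather than the actual train_dalle.py code:

import deepspeed

# Wrap the model; DeepSpeed builds the (offloaded) optimizer from the config above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,                           # parsed CLI args (DeepSpeed looks for local_rank here)
    model=model,                         # your torch.nn.Module
    model_parameters=model.parameters(),
    config_params=deepspeed_config,      # the dictionary shown above
)

# Training step: the engine replaces the usual loss.backward() / optimizer.step().
loss = model_engine(batch)    # forward pass returning a scalar loss
model_engine.backward(loss)
model_engine.step()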
