Implementation of an autoregressive language model (like GPT) using an improved Transformer and DeepSpeed pipeline parallelism.
The Transformer used in this repository attempts to improve on the vanilla Transformer with the additional modules below.
| Name | Description | Link |
|---|---|---|
| ReZero | ReZero Is All You Need: Fast Convergence at Large Depth | link |
| Explicit Sparse Transformer | Concentrated Attention Through Explicit Selection | link |
| Macaron Architecture | Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View | link |
| RealFormer | Transformer Likes Residual Attention | link |
| ALiBi Position Embedding | Effective relative positional encoding | |
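To make the first item concrete, here is a minimal sketch of a ReZero-style residual wrapper. It is an illustration rather than this repository's actual module (the class name `ReZeroResidual` and the feed-forward example are mine): each sub-layer output is scaled by a single learnable scalar initialized to zero, so every block starts out as the identity and learns how much of the sub-layer to mix in.

```python
import torch
from torch import nn


class ReZeroResidual(nn.Module):
    """Minimal ReZero residual: y = x + alpha * sublayer(x), with alpha initialized to 0."""

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        # Single learnable scalar per block; zero init makes the block start as identity.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.sublayer(x)


# Example: wrap a feed-forward sub-layer of a d_model=2048 decoder block.
ff = nn.Sequential(nn.Linear(2048, 8192), nn.GELU(), nn.Linear(8192, 2048))
block = ReZeroResidual(ff)
out = block(torch.randn(1, 1024, 2048))  # (batch, seq_len, d_model)
```

The 1B GPT-X configuration trained in this repository is: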
| model_name | n_params | n_layer | d_model | n_heads | vocab_size | max_seq_len | learning_rate |
|---|---|---|---|---|---|---|---|
| GPT-X 1B | 1B | 20 | 2048 | 16 | 22000 | 1024 | 2.0 x 10^-4 |
DeepSpeed is a deep learning training optimization library that provides the means to train massive, billion-parameter models at scale.
You can train the 1B-parameter GPT-X model with DeepSpeed pipeline parallelism on 2 V100 GPUs (16GB each), as sketched below.
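Below is a minimal sketch of how a decoder stack can be expressed for DeepSpeed pipeline parallelism, using the GPT-X 1B sizes from the table above. It is an illustration, not the repository's training script: `GPTXBlock`, `train.py`, and `ds_config.json` are hypothetical stand-ins (the real pipeline uses `ReZeroSparseTopkDecoder` layers, as the log below shows), and the DeepSpeed config file is assumed to define batch size, optimizer, and fp16 settings.

```python
# Sketch of a 2-stage DeepSpeed pipeline for the 1B GPT-X configuration.
# Hypothetical names: GPTXBlock stands in for ReZeroSparseTopkDecoder,
# and ds_config.json is assumed to hold batch/optimizer/fp16 settings.
import deepspeed
import torch
from torch import nn
from deepspeed.pipe import PipelineModule, LayerSpec

VOCAB, D_MODEL, N_LAYER = 22000, 2048, 20  # from the GPT-X 1B table above


class GPTXBlock(nn.Module):
    """Placeholder decoder block (attention, ALiBi, causal mask omitted for brevity)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)


def loss_fn(logits, labels):
    # Matches the "loss: cross_entropy" line in the partitioning log.
    return nn.functional.cross_entropy(logits.view(-1, VOCAB), labels.view(-1))


if __name__ == "__main__":
    deepspeed.init_distributed()
    layers = (
        [LayerSpec(nn.Embedding, VOCAB, D_MODEL)]
        + [LayerSpec(GPTXBlock, D_MODEL) for _ in range(N_LAYER)]
        + [LayerSpec(nn.LayerNorm, D_MODEL), LayerSpec(nn.Linear, D_MODEL, VOCAB)]
    )
    # partition_method="parameters" balances parameter counts across stages,
    # matching "Partitioning pipeline stages with method parameters" in the log.
    model = PipelineModule(layers=layers, num_stages=2,
                           partition_method="parameters", loss_fn=loss_fn)
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config="ds_config.json"
    )
    # Each call consumes one batch of micro-batches of (input_ids, labels):
    # engine.train_batch(data_iter=train_iter)
```

With the DeepSpeed launcher this would run as, for example, `deepspeed --num_gpus=2 train.py`, which spawns one process per GPU and assigns one pipeline stage to each, producing output like the following: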
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   42C    P0    44W / 250W |  16076MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   45C    P0   168W / 250W |  16060MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29525      C   /home/ubuntu/anaconda3/bin/python           16065MiB |
|    1     29528      C   /home/ubuntu/anaconda3/bin/python           16049MiB |
+-----------------------------------------------------------------------------+

[2021-12-31 12:24:20,042] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=4 micro_batch_size=1
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=1 STAGE=1 LAYERS=12 [11, 23) STAGE_PARAMS=548560916 (548.561M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=11 [0, 11) STAGE_PARAMS=550653972 (550.654M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:08,793] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=11
     0: Embedding
     1: ReZeroSparseTopkDecoder
     2: ReZeroSparseTopkDecoder
     3: ReZeroSparseTopkDecoder
     4: ReZeroSparseTopkDecoder
     5: ReZeroSparseTopkDecoder
     6: ReZeroSparseTopkDecoder
     7: ReZeroSparseTopkDecoder
     8: ReZeroSparseTopkDecoder
     9: ReZeroSparseTopkDecoder
    10: ReZeroSparseTopkDecoder
stage=1 layers=12
    11: ReZeroSparseTopkDecoder
    12: ReZeroSparseTopkDecoder
    13: ReZeroSparseTopkDecoder
    14: ReZeroSparseTopkDecoder
    15: ReZeroSparseTopkDecoder
    16: ReZeroSparseTopkDecoder
    17: ReZeroSparseTopkDecoder
    18: ReZeroSparseTopkDecoder
    19: ReZeroSparseTopkDecoder
    20: ReZeroSparseTopkDecoder
    21: LayerNorm
    22: Linear
  loss: cross_entropy
```
- ReZero
- RealFormer, Residual Attention
- Macaron architecture
- Macaron architecture - layer scale 0.5
- Explicit Sparse Transformer (see the sketch after this list)
- PyTorch Lightning
- DeepSpeed training on a single GPU with wandb logging
- DeepSpeed pipeline parallel training on 2 V100 GPUs with 16GB memory
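The Explicit Sparse Transformer item above refers to top-k attention: only the k largest attention scores per query survive the softmax, and the rest are masked out. Below is a minimal functional sketch, illustrative only (the function name `sparse_topk_attention`, the shapes, and `topk=8` are my assumptions, and the causal mask a decoder needs is omitted).

```python
import torch
import torch.nn.functional as F


def sparse_topk_attention(q, k, v, topk: int = 8):
    """Scaled dot-product attention that keeps only the top-k scores per query
    (Explicit Sparse Transformer style); all other scores are masked before softmax."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)    # (..., q_len, k_len)
    if topk < scores.size(-1):
        kth_best = scores.topk(topk, dim=-1).values[..., -1:]  # k-th largest score per query
        scores = scores.masked_fill(scores < kth_best, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Example with GPT-X-like shapes: 16 heads, head_dim = 2048 / 16 = 128.
q = k = v = torch.randn(1, 16, 1024, 128)
out = sparse_topk_attention(q, k, v, topk=8)                  # same shape as q
```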
GPT-3 has 175B parameters, and model size is important for few-shot learning. In this repository, I try to pretrain a language model as large as possible on 2 V100 GPUs. The GPT-3 configurations below are shown for reference.
| model_name | n_params | n_layer | d_model | n_heads | d_head | batch_size | learning_rate |
|---|---|---|---|---|---|---|---|
| GPT-3 175B | 175B | 96 | 12288 | 96 | 128 | 3.2M | 0.6 x 10^-4 |
| GPT-3 13B | 13B | 40 | 5140 | 40 | 128 | 2M | 1.0 x 10^-4 |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M | 1.2 x 10^-4 |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M | 1.6 x 10^-4 |
| GPT-3 1.3B | 1.3B | 24 | 2048 | 24 | 128 | 1M | 2.0 x 10^-4 |
- `AttributeError: module 'deepspeed' has no attribute 'zero'`: reinstall DeepSpeed.
- `UserWarning: CUDA initialization: The NVIDIA driver on your system is too old`: reinstall PyTorch to match your CUDA version. My solution (V100 GPU, CUDA 10.1): `pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html`
- Can't find the CUDA_HOME path: reinstall CUDA.
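Before reinstalling anything, a quick check (a small helper of my own, not part of this repository) can confirm whether PyTorch, the CUDA runtime, and `CUDA_HOME` actually disagree:

```python
# Quick environment sanity check (not part of this repository).
import os
import torch

print("torch version        :", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)       # e.g. "10.1"
print("CUDA available       :", torch.cuda.is_available())
print("CUDA_HOME            :", os.environ.get("CUDA_HOME", "<not set>"))
if torch.cuda.is_available():
    print("device 0             :", torch.cuda.get_device_name(0))
```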
- Transformer
- DeepSpeed
- ReZero
- Explicit Sparse Transformer
- Macaron Architecture
- RealFormer: Residual Attention
- DeepSpeed
- Pipeline Parallelism