Implementation of an autoregressive language model (like GPT) using an improved Transformer and DeepSpeed pipeline parallelism.
The Transformer used in this repository attempts to improve on the vanilla Transformer with the additional modules below.
| Name | Description | Link |
|---|---|---|
| ReZero | ReZero Is All You Need: Fast Convergence at Large Depth | link |
| Explicit Sparse Transformer | Concentrated Attention Through Explicit Selection | link |
| Macaron Architecture | Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View | link |
| RealFormer | Transformer Likes Residual Attention | link |
| ALiBi Position Embedding | Effective relative positional encoding | |
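To make the first item concrete, here is a minimal sketch of a ReZero-style residual wrapper. It is an illustration rather than this repository's actual module (the class name `ReZeroResidual` and the feed-forward example are mine): each sub-layer output is scaled by a single learnable scalar initialized to zero, so every block starts out as the identity and learns how much of the sub-layer to mix in.

```python
import torch
from torch import nn


class ReZeroResidual(nn.Module):
    """Minimal ReZero residual: y = x + alpha * sublayer(x), with alpha initialized to 0."""

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        # Single learnable scalar per block; zero init makes the block start as identity.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.sublayer(x)


# Example: wrap a feed-forward sub-layer of a d_model=2048 decoder block.
ff = nn.Sequential(nn.Linear(2048, 8192), nn.GELU(), nn.Linear(8192, 2048))
block = ReZeroResidual(ff)
out = block(torch.randn(1, 1024, 2048))  # (batch, seq_len, d_model)
```

The 1B GPT-X configuration trained in this repository is: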
| model_name | n_params | n_layer | d_model | n_heads | vocab_size | max_seq_len | learning_rate |
|---|---|---|---|---|---|---|---|
| GPT-X 1B | 1B | 20 | 2048 | 16 | 22000 | 1024 | 2.0 x 10^-4 |
DeepSpeed is a deep learning training optimization library that provides the means to train massive, billion-parameter models at scale.
You can train the 1B-parameter GPT-X model with DeepSpeed pipeline parallelism on 2 V100 GPUs (16GB each), as sketched below.
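Below is a minimal sketch of how a decoder stack can be expressed for DeepSpeed pipeline parallelism, using the GPT-X 1B sizes from the table above. It is an illustration, not the repository's training script: `GPTXBlock`, `train.py`, and `ds_config.json` are hypothetical stand-ins (the real pipeline uses `ReZeroSparseTopkDecoder` layers, as the log below shows), and the DeepSpeed config file is assumed to define batch size, optimizer, and fp16 settings.

```python
# Sketch of a 2-stage DeepSpeed pipeline for the 1B GPT-X configuration.
# Hypothetical names: GPTXBlock stands in for ReZeroSparseTopkDecoder,
# and ds_config.json is assumed to hold batch/optimizer/fp16 settings.
import deepspeed
import torch
from torch import nn
from deepspeed.pipe import PipelineModule, LayerSpec

VOCAB, D_MODEL, N_LAYER = 22000, 2048, 20  # from the GPT-X 1B table above


class GPTXBlock(nn.Module):
    """Placeholder decoder block (attention, ALiBi, causal mask omitted for brevity)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)


def loss_fn(logits, labels):
    # Matches the "loss: cross_entropy" line in the partitioning log.
    return nn.functional.cross_entropy(logits.view(-1, VOCAB), labels.view(-1))


if __name__ == "__main__":
    deepspeed.init_distributed()
    layers = (
        [LayerSpec(nn.Embedding, VOCAB, D_MODEL)]
        + [LayerSpec(GPTXBlock, D_MODEL) for _ in range(N_LAYER)]
        + [LayerSpec(nn.LayerNorm, D_MODEL), LayerSpec(nn.Linear, D_MODEL, VOCAB)]
    )
    # partition_method="parameters" balances parameter counts across stages,
    # matching "Partitioning pipeline stages with method parameters" in the log.
    model = PipelineModule(layers=layers, num_stages=2,
                           partition_method="parameters", loss_fn=loss_fn)
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config="ds_config.json"
    )
    # Each call consumes one batch of micro-batches of (input_ids, labels):
    # engine.train_batch(data_iter=train_iter)
```

With the DeepSpeed launcher this would run as, for example, `deepspeed --num_gpus=2 train.py`, which spawns one process per GPU and assigns one pipeline stage to each, producing output like the following: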
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   42C    P0    44W / 250W |  16076MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   45C    P0   168W / 250W |  16060MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29525      C   /home/ubuntu/anaconda3/bin/python           16065MiB |
|    1     29528      C   /home/ubuntu/anaconda3/bin/python           16049MiB |
+-----------------------------------------------------------------------------+

[2021-12-31 12:24:20,042] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=4 micro_batch_size=1
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=1 STAGE=1 LAYERS=12 [11, 23) STAGE_PARAMS=548560916 (548.561M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=11 [0, 11) STAGE_PARAMS=550653972 (550.654M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:08,793] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=11
     0: Embedding
     1: ReZeroSparseTopkDecoder
     2: ReZeroSparseTopkDecoder
     3: ReZeroSparseTopkDecoder
     4: ReZeroSparseTopkDecoder
     5: ReZeroSparseTopkDecoder
     6: ReZeroSparseTopkDecoder
     7: ReZeroSparseTopkDecoder
     8: ReZeroSparseTopkDecoder
     9: ReZeroSparseTopkDecoder
    10: ReZeroSparseTopkDecoder
stage=1 layers=12
    11: ReZeroSparseTopkDecoder
    12: ReZeroSparseTopkDecoder
    13: ReZeroSparseTopkDecoder
    14: ReZeroSparseTopkDecoder
    15: ReZeroSparseTopkDecoder
    16: ReZeroSparseTopkDecoder
    17: ReZeroSparseTopkDecoder
    18: ReZeroSparseTopkDecoder
    19: ReZeroSparseTopkDecoder
    20: ReZeroSparseTopkDecoder
    21: LayerNorm
    22: Linear
  loss: cross_entropy
```
- ReZero
- RealFormer, Residual Attention
- Macaron architecture
- Macaron architecture - layer scale 0.5
- Explicit Sparse Transformer (see the sketch after this list)
- PyTorch Lightning
- DeepSpeed training on a single GPU with wandb logging
- DeepSpeed pipeline parallel training on 2 V100 GPUs with 16GB memory
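The Explicit Sparse Transformer item above refers to top-k attention: only the k largest attention scores per query survive the softmax, and the rest are masked out. Below is a minimal functional sketch, illustrative only (the function name `sparse_topk_attention`, the shapes, and `topk=8` are my assumptions, and the causal mask a decoder needs is omitted).

```python
import torch
import torch.nn.functional as F


def sparse_topk_attention(q, k, v, topk: int = 8):
    """Scaled dot-product attention that keeps only the top-k scores per query
    (Explicit Sparse Transformer style); all other scores are masked before softmax."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)    # (..., q_len, k_len)
    if topk < scores.size(-1):
        kth_best = scores.topk(topk, dim=-1).values[..., -1:]  # k-th largest score per query
        scores = scores.masked_fill(scores < kth_best, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Example with GPT-X-like shapes: 16 heads, head_dim = 2048 / 16 = 128.
q = k = v = torch.randn(1, 16, 1024, 128)
out = sparse_topk_attention(q, k, v, topk=8)                  # same shape as q
```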
GPT-3 has 175B parameters, and model size is important for few-shot learning. In this repository, I try to pretrain a language model as large as possible on 2 V100 GPUs. The GPT-3 configurations below are shown for reference.
| model_name | n_params | n_layer | d_model | n_heads | d_head | batch_size | learning_rate |
|---|---|---|---|---|---|---|---|
| GPT-3 175B | 175B | 96 | 12288 | 96 | 128 | 3.2M | 0.6 x 10^-4 |
| GPT-3 13B | 13B | 40 | 5140 | 40 | 128 | 2M | 1.0 x 10^-4 |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M | 1.2 x 10^-4 |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M | 1.6 x 10^-4 |
| GPT-3 1.3B | 1.3B | 24 | 2048 | 24 | 128 | 1M | 2.0 x 10^-4 |
- `AttributeError: module 'deepspeed' has no attribute 'zero'`: reinstall DeepSpeed.
- `UserWarning: CUDA initialization: The NVIDIA driver on your system is too old`: reinstall PyTorch to match your CUDA version. My solution (V100 GPU, CUDA 10.1): `pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html`
- Can't find the CUDA_HOME path: reinstall CUDA.
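Before reinstalling anything, a quick check (a small helper of my own, not part of this repository) can confirm whether PyTorch, the CUDA runtime, and `CUDA_HOME` actually disagree:

```python
# Quick environment sanity check (not part of this repository).
import os
import torch

print("torch version        :", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)       # e.g. "10.1"
print("CUDA available       :", torch.cuda.is_available())
print("CUDA_HOME            :", os.environ.get("CUDA_HOME", "<not set>"))
if torch.cuda.is_available():
    print("device 0             :", torch.cuda.get_device_name(0))
```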
- Transformer
- DeepSpeed
- ReZero
- Explicit Sparse Transformer
- Macaron Architecture
- RealFormer: Residual Attention
- DeepSpeed
- Pipeline Parallelism