

jlamypoirier commented Jan 30, 2023

Fixes #12 (uses bigcode-project/transformers#5)

  • Add a pre_allocate_cache option (sketch below). Big speedup on the GPU side, but it doesn't help with the CPU bottleneck.
  • Use torch inference mode (sketch below). Speeds up the CPU-side calls; marginal GPU speedup.
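
The sketch below is not the PR's actual code; it just illustrates the idea behind pre_allocate_cache: allocate the key/value cache once at the maximum sequence length and copy each step's keys/values into it in place, instead of concatenating (and re-allocating) the whole cache at every generation step. The helper names (`make_cache`, `update_cache`) and the tensor layout are assumptions for illustration only.

```python
import torch

def make_cache(batch_size, n_head, max_positions, head_dim, dtype, device):
    # Allocate the full-size key/value buffers once, up front.
    shape = (batch_size, n_head, max_positions, head_dim)
    key_cache = torch.empty(shape, dtype=dtype, device=device)
    value_cache = torch.empty(shape, dtype=dtype, device=device)
    return key_cache, value_cache

def update_cache(key_cache, value_cache, key, value, seq_len):
    # Write the new step's keys/values into the pre-allocated slots
    # (no concatenation, no re-allocation of the whole cache).
    new_len = seq_len + key.size(2)
    key_cache[:, :, seq_len:new_len] = key
    value_cache[:, :, seq_len:new_len] = value
    # Attention reads only the filled prefix of the buffers.
    return key_cache[:, :, :new_len], value_cache[:, :, :new_len]
```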
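Similarly, a minimal sketch of the torch.inference_mode() change: wrapping the generation loop in inference mode disables autograd tracking and its bookkeeping, which mostly reduces per-call CPU overhead. The greedy-decoding loop below assumes a generic Hugging Face-style model interface (past_key_values, use_cache); it is not the benchmark's actual pipeline.

```python
import torch

@torch.inference_mode()  # no autograd tracking inside this function
def generate(model, input_ids, max_new_tokens):
    past_key_values = None
    for _ in range(max_new_tokens):
        # After the first step, only the newest token needs to be fed in;
        # earlier positions are covered by the cached keys/values.
        inputs = input_ids if past_key_values is None else input_ids[:, -1:]
        outputs = model(inputs, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1:].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids
```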
Benchmark command:

```
python3 src/main.py --hidden_size=2048 --n_head=16 --n_layer=24 --pipeline_class=HF_Pipeline --model_class=GPT2 --dtype=float16 --device=cuda --cycles=5 --batch_size=256 --max_new_tokens=100 --n_positions=512 --attention_type=[1/2] --max_log_outputs=1 --activation_function=gelu_new_python [--pre_allocate_cache] [--profile]
```

| Configuration | e2e (ms) | cuda (ms) |
| --- | --- | --- |
| MQA, before | 1242 | 762 |
| MQA, pre-allocate | 1248 | 693 |
| MQA, inference mode | 1130 | 690 |
| MQA, inference mode, no pre-allocate | 1121 | 760 |
| MHA, before | 2241 | 2201 |
| MHA, pre-allocate | 1442 | 1048 |
| MHA, inference mode | 1249 | 1046 |

Pre-allocation alone gives no e2e speedup for the MQA case because it is CPU-bottlenecked; cases that aren't CPU-bottlenecked do see an e2e improvement.

