
Conversation


@AnMakc (Contributor) commented Oct 29, 2025

Add inference optimisations

  • Use torch SDPA (`torch.nn.functional.scaled_dot_product_attention`); see the first sketch below
  • Do not collect tokenizer metrics during inference
  • Remove CUDA cache emptying on each step, as the memory leak is not reproducible with this PR
  • Reduce memory allocations in the autoregressive loop (second sketch below)
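For context on the SDPA bullet: the change boils down to replacing an explicit softmax(QK^T / sqrt(d)) V implementation with `torch.nn.functional.scaled_dot_product_attention`, which dispatches to fused kernels when the backend supports them. A minimal sketch, assuming (B, H, L, D)-shaped tensors and a boolean mask, not this repo's actual module code:

```python
import math
import torch.nn.functional as F

def attention_manual(q, k, v, attn_mask=None):
    # Before: explicit score matrix + softmax, materialising a (B, H, L, L) tensor.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if attn_mask is not None:
        scores = scores.masked_fill(~attn_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

def attention_sdpa(q, k, v, attn_mask=None):
    # After: one fused call; uses flash/memory-efficient kernels where available
    # and falls back to the math path otherwise. Tiny numerical differences vs.
    # the manual path are expected, hence the fixture refresh.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

For the allocation reduction in the autoregressive loop, the usual pattern is to write generated tokens into a preallocated buffer instead of growing a tensor with `torch.cat` every step. Again just a sketch under assumptions (greedy decoding, a hypothetical `model` returning (batch, len, vocab) logits), not the code in this PR:

```python
import torch

@torch.inference_mode()
def generate(model, prompt_ids, max_new_tokens):
    batch, prompt_len = prompt_ids.shape
    # Single up-front allocation instead of a torch.cat (alloc + copy) per step.
    out = torch.empty(batch, prompt_len + max_new_tokens,
                      dtype=prompt_ids.dtype, device=prompt_ids.device)
    out[:, :prompt_len] = prompt_ids
    for step in range(max_new_tokens):
        # Re-runs the full prefix each step (no KV cache yet).
        logits = model(out[:, :prompt_len + step])
        out[:, prompt_len + step] = logits[:, -1].argmax(dim=-1)
    return out
```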

Test fixtures were updated after switching to SDPA. No regressions observed, just a new zero baseline set.

On an A100 with context length 512 and batch size 32 (num samples), throughput improved from 3.34 to 4.58 batches/sec (107 to 147 samples/sec).

I will probably look into adding a KV cache in a separate PR.

