
Conversation


@AnMakc (Contributor) commented Oct 29, 2025

Add inference optimisations

  • Use torch SDPA (`torch.nn.functional.scaled_dot_product_attention`); see the first sketch below
  • Do not collect tokenizer metrics during inference
  • Remove CUDA cache emptying on each step, as the memory leak is not reproducible with this PR
  • Reduce memory allocations in the autoregressive loop (second sketch below)
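For context on the SDPA bullet: the change boils down to replacing an explicit softmax(QK^T / sqrt(d)) V implementation with `torch.nn.functional.scaled_dot_product_attention`, which dispatches to fused kernels when the backend supports them. A minimal sketch, assuming (B, H, L, D)-shaped tensors and a boolean mask, not this repo's actual module code:

```python
import math
import torch.nn.functional as F

def attention_manual(q, k, v, attn_mask=None):
    # Before: explicit score matrix + softmax, materialising a (B, H, L, L) tensor.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if attn_mask is not None:
        scores = scores.masked_fill(~attn_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

def attention_sdpa(q, k, v, attn_mask=None):
    # After: one fused call; uses flash/memory-efficient kernels where available
    # and falls back to the math path otherwise. Tiny numerical differences vs.
    # the manual path are expected, hence the fixture refresh.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

For the allocation reduction in the autoregressive loop, the usual pattern is to write generated tokens into a preallocated buffer instead of growing a tensor with `torch.cat` every step. Again just a sketch under assumptions (greedy decoding, a hypothetical `model` returning (batch, len, vocab) logits), not the code in this PR:

```python
import torch

@torch.inference_mode()
def generate(model, prompt_ids, max_new_tokens):
    batch, prompt_len = prompt_ids.shape
    # Single up-front allocation instead of a torch.cat (alloc + copy) per step.
    out = torch.empty(batch, prompt_len + max_new_tokens,
                      dtype=prompt_ids.dtype, device=prompt_ids.device)
    out[:, :prompt_len] = prompt_ids
    for step in range(max_new_tokens):
        # Re-runs the full prefix each step (no KV cache yet).
        logits = model(out[:, :prompt_len + step])
        out[:, prompt_len + step] = logits[:, -1].argmax(dim=-1)
    return out
```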

Test fixtures were updated after switching to SDPA. No regressions observed, just a new zero baseline set.

On an A100 with context length 512 and batch size 32 (num samples), throughput improved from 3.34 to 4.58 batches/sec (107 to 147 samples/sec).

I will probably look into adding a KV cache in a separate PR.

