[XPU] Implemented 32bit optimizers in triton #1710
   Merged  
 
Depends on #1692.
Implemented 32-bit optimizers in Triton for use on XPU devices.
The PR includes two implementations: a pure-Torch version and a pure-Triton version.
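For illustration, below is a minimal sketch of what a 32-bit Adam step looks like as a Triton kernel. This is not the PR's actual kernel: the names `adam32bit_kernel`/`adam32bit_step` and the host-side launch are hypothetical, and it assumes contiguous fp32 tensors.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def adam32bit_kernel(
    p_ptr, g_ptr, m_ptr, v_ptr, n_elements,
    lr, beta1, beta2, eps, bc1, bc2,
    BLOCK_SIZE: tl.constexpr,
):
    # One program updates one BLOCK_SIZE-wide slice of the flattened tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    p = tl.load(p_ptr + offsets, mask=mask)
    g = tl.load(g_ptr + offsets, mask=mask)
    m = tl.load(m_ptr + offsets, mask=mask)
    v = tl.load(v_ptr + offsets, mask=mask)

    # Standard Adam moment updates, all in fp32.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g

    # Parameter update with host-precomputed bias corrections bc1/bc2.
    p = p - lr * (m / bc1) / (tl.sqrt(v / bc2) + eps)

    tl.store(p_ptr + offsets, p, mask=mask)
    tl.store(m_ptr + offsets, m, mask=mask)
    tl.store(v_ptr + offsets, v, mask=mask)


def adam32bit_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Expects contiguous fp32 tensors of identical shape on the same device.
    n = p.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    adam32bit_kernel[grid](
        p, g, m, v, n,
        lr, beta1, beta2, eps,
        1.0 - beta1**step, 1.0 - beta2**step,
        BLOCK_SIZE=1024,
    )
```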
Benchmarking on `4096*4096` shapes, pure Torch implementation vs. pure Triton implementation: [benchmark plots]
Benchmarking on `1024*1024` shapes, pure Torch implementation vs. pure Triton implementation: [benchmark plots]
The test platform is an Intel(R) Data Center GPU Max 1550, and the test script follows the one referenced in #1692. "Torch (eager)" is the 32-bit optimizer from torch; "BNB" is the bitsandbytes 32-bit optimizer.
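A rough sketch of the benchmarking approach, assuming a PyTorch build with XPU support; the `bench` helper and its parameters are illustrative, not the actual script from #1692:

```python
import time
import torch
import bitsandbytes as bnb

def bench(make_opt, shape=(4096, 4096), steps=100, device="xpu"):
    # Time the average optimizer step after a warm-up; the warm-up keeps
    # one-time kernel compilation out of the measurement.
    p = torch.randn(shape, device=device, requires_grad=True)
    p.grad = torch.randn_like(p)
    opt = make_opt([p])
    for _ in range(10):
        opt.step()
    torch.xpu.synchronize()
    t0 = time.perf_counter()
    for _ in range(steps):
        opt.step()
    torch.xpu.synchronize()
    return (time.perf_counter() - t0) / steps

print("Torch (eager):", bench(torch.optim.Adam))
print("BNB 32-bit:   ", bench(lambda ps: bnb.optim.Adam(ps, optim_bits=32)))
```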
Considering that the performance gap between the torch.compile and Triton implementations is not significant, that the Triton implementation compiles faster, and that #1692 was already implemented with Triton, this PR adopts the Triton version.
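For comparison, a hedged sketch of the pure-Torch update that `torch.compile` would fuse (names are illustrative; this is not the PR's actual implementation):

```python
import torch

@torch.compile
def adam_step_eager(p, g, m, v, lr, beta1, beta2, eps, bc1, bc2):
    # The same fp32 Adam update expressed with plain tensor ops;
    # torch.compile fuses these into one kernel, much like the Triton version.
    m.mul_(beta1).add_(g, alpha=1.0 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1.0 - beta2)
    p.addcdiv_(m / bc1, (v / bc2).sqrt_().add_(eps), value=-lr)
```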
Note: XPU does not currently support allocating memory buffers through a paging mechanism, so the paged cases are skipped in `tests/test_optim.py::test_optimizer32bit`. Paged-buffer support will be developed in the future to enable the full set of optimizer capabilities.
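For reference, the skip could look roughly like the following inside the test (a hypothetical sketch; `is_paged` stands in for however the test parametrizes paged optimizers):

```python
import pytest

def maybe_skip_paged(device: str, is_paged: bool) -> None:
    # Hypothetical helper mirroring the skip in test_optimizer32bit:
    # paged optimizer states need pageable buffers, which XPU lacks today.
    if is_paged and device == "xpu":
        pytest.skip("XPU does not yet support paged memory buffers")
```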