Skip to content

Conversation

@YangKai0616
Copy link
Contributor

@YangKai0616 YangKai0616 commented Jul 16, 2025

Depends on #1692.

Implemented 32bit optimizers in triton to use of XPU devices.

The PR includes two implementations:

  1. Pure Torch implementation: utilizing torch.compile
  2. Pure Triton implementation: utilizing triton.jit

For the benchmarking on 4096*4096 shapes, the results are as follows:

Pure Torch implementation:

Torch step (eager): 1.075ms BNB step: 0.516ms Torch step (eager): 1.058ms BNB step: 0.517ms Torch step (eager): 1.080ms BNB step: 0.527ms Torch step (eager): 1.069ms BNB step: 0.539ms Torch step (eager): 1.034ms BNB step: 0.526ms 

Pure Triton implementation:

Torch step (eager): 1.034ms BNB step: 0.524ms Torch step (eager): 1.054ms BNB step: 0.488ms Torch step (eager): 1.031ms BNB step: 0.526ms Torch step (eager): 1.047ms BNB step: 0.538ms Torch step (eager): 1.045ms BNB step: 0.489ms 

For the benchmarking on 1024*1024 shapes, the results are as follows:
Pure Torch implementation:

Torch step (eager): 0.345ms BNB step: 0.335ms Torch step (eager): 0.354ms BNB step: 0.226ms Torch step (eager): 0.347ms BNB step: 0.227ms Torch step (eager): 0.358ms BNB step: 0.232ms Torch step (eager): 0.349ms BNB step: 0.225ms 

Pure Triton implementation:

Torch step (eager): 0.346ms BNB step: 0.226ms Torch step (eager): 0.337ms BNB step: 0.216ms Torch step (eager): 0.338ms BNB step: 0.215ms Torch step (eager): 0.333ms BNB step: 0.226ms Torch step (eager): 0.349ms BNB step: 0.235ms 

The test platform is Intel(R) Data Center GPU Max 1550. Test script reference #1692. Torch(eager) is 32bit optimizer from torch, BNB is 32bit optimizer.

Considering that the performance gap between torch.compile and Triton implementations is not significant, but triton's implementation compiles faster, and #1692 was implemented with Triton, this PR adopts the Triton version for submission.

Note:Currently, XPU does not support the allocation of memory buffers using a paging mechanism. Therefore, these tests are skipped in tests/test_optim.py::test_optimizer32bit. This functionality will be developed in the future to support full optimizer capabilities.

@YangKai0616 YangKai0616 changed the title [XPU] Implemented 32bit optimizers in triton [Draft][XPU] Implemented 32bit optimizers in triton Jul 16, 2025
@YangKai0616 YangKai0616 changed the title [Draft][XPU] Implemented 32bit optimizers in triton [XPU] Implemented 32bit optimizers in triton Jul 17, 2025
@YangKai0616 YangKai0616 marked this pull request as ready for review July 17, 2025 10:55
@jiqing-feng
Copy link
Contributor

Hi @matthewdouglas , would you please review this PR? Thanks!

@christoph-koehncke christoph-koehncke added Intel Optimizers Issues or feature requests relating to optimizers labels Jul 29, 2025
@yao-matrix
Copy link

@matthewdouglas , could you pls help review this PR and #1692, thx very much

@matthewdouglas matthewdouglas added this to the v0.49.0 milestone Sep 2, 2025
@matthewdouglas matthewdouglas modified the milestones: v0.49.0, v0.48.0 Sep 15, 2025
Copy link
Member

@matthewdouglas matthewdouglas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

@matthewdouglas matthewdouglas merged commit 275671b into bitsandbytes-foundation:main Sep 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Intel Optimizers Issues or feature requests relating to optimizers

6 participants