[XPU] Implemented 32bit optimizers in triton #1710
   Merged  
 
Depends on #1692.
Implemented 32-bit optimizers in Triton for use on XPU devices.
The PR includes two implementations: a pure-Torch version and a pure-Triton version.
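For illustration, below is a minimal sketch of what a 32-bit Adam step looks like as a Triton kernel. This is not the PR's actual kernel: the names `adam32bit_kernel`/`adam32bit_step` and the host-side launch are hypothetical, and it assumes contiguous fp32 tensors.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def adam32bit_kernel(
    p_ptr, g_ptr, m_ptr, v_ptr, n_elements,
    lr, beta1, beta2, eps, bc1, bc2,
    BLOCK_SIZE: tl.constexpr,
):
    # One program updates one BLOCK_SIZE-wide slice of the flattened tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    p = tl.load(p_ptr + offsets, mask=mask)
    g = tl.load(g_ptr + offsets, mask=mask)
    m = tl.load(m_ptr + offsets, mask=mask)
    v = tl.load(v_ptr + offsets, mask=mask)

    # Standard Adam moment updates, all in fp32.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g

    # Parameter update with host-precomputed bias corrections bc1/bc2.
    p = p - lr * (m / bc1) / (tl.sqrt(v / bc2) + eps)

    tl.store(p_ptr + offsets, p, mask=mask)
    tl.store(m_ptr + offsets, m, mask=mask)
    tl.store(v_ptr + offsets, v, mask=mask)


def adam32bit_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Expects contiguous fp32 tensors of identical shape on the same device.
    n = p.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    adam32bit_kernel[grid](
        p, g, m, v, n,
        lr, beta1, beta2, eps,
        1.0 - beta1**step, 1.0 - beta2**step,
        BLOCK_SIZE=1024,
    )
```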
Benchmarking on `4096*4096` shapes, pure Torch implementation vs. pure Triton implementation: [benchmark plots]
Benchmarking on `1024*1024` shapes, pure Torch implementation vs. pure Triton implementation: [benchmark plots]
The test platform is an Intel(R) Data Center GPU Max 1550, and the test script follows the one referenced in #1692. "Torch (eager)" is the 32-bit optimizer from torch; "BNB" is the bitsandbytes 32-bit optimizer.
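A rough sketch of the benchmarking approach, assuming a PyTorch build with XPU support; the `bench` helper and its parameters are illustrative, not the actual script from #1692:

```python
import time
import torch
import bitsandbytes as bnb

def bench(make_opt, shape=(4096, 4096), steps=100, device="xpu"):
    # Time the average optimizer step after a warm-up; the warm-up keeps
    # one-time kernel compilation out of the measurement.
    p = torch.randn(shape, device=device, requires_grad=True)
    p.grad = torch.randn_like(p)
    opt = make_opt([p])
    for _ in range(10):
        opt.step()
    torch.xpu.synchronize()
    t0 = time.perf_counter()
    for _ in range(steps):
        opt.step()
    torch.xpu.synchronize()
    return (time.perf_counter() - t0) / steps

print("Torch (eager):", bench(torch.optim.Adam))
print("BNB 32-bit:   ", bench(lambda ps: bnb.optim.Adam(ps, optim_bits=32)))
```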
Considering that the performance gap between the torch.compile and Triton implementations is not significant, that the Triton implementation compiles faster, and that #1692 was already implemented with Triton, this PR adopts the Triton version.
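For comparison, a hedged sketch of the pure-Torch update that `torch.compile` would fuse (names are illustrative; this is not the PR's actual implementation):

```python
import torch

@torch.compile
def adam_step_eager(p, g, m, v, lr, beta1, beta2, eps, bc1, bc2):
    # The same fp32 Adam update expressed with plain tensor ops;
    # torch.compile fuses these into one kernel, much like the Triton version.
    m.mul_(beta1).add_(g, alpha=1.0 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1.0 - beta2)
    p.addcdiv_(m / bc1, (v / bc2).sqrt_().add_(eps), value=-lr)
```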
Note: XPU does not currently support allocating memory buffers through a paging mechanism, so the paged cases are skipped in `tests/test_optim.py::test_optimizer32bit`. Paged-buffer support will be developed in the future to enable the full set of optimizer capabilities.
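For reference, the skip could look roughly like the following inside the test (a hypothetical sketch; `is_paged` stands in for however the test parametrizes paged optimizers):

```python
import pytest

def maybe_skip_paged(device: str, is_paged: bool) -> None:
    # Hypothetical helper mirroring the skip in test_optimizer32bit:
    # paged optimizer states need pageable buffers, which XPU lacks today.
    if is_paged and device == "xpu":
        pytest.skip("XPU does not yet support paged memory buffers")
```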