[Caffe2] Improve AddMomentsVec and UpdateMomentsVec #167664

Nicoshev · 2025-11-12T19:08:49Z

Summary:
RowwiseMomentsImpl accounts for about 0.4% of AdRanker's CPU time: https://fburl.com/strobelight/ywf79nw3. It primarily calls AddMomentsVec and UpdateMomentsVec.

These two routines are written with PyTorch's VecLib, so the operators they use translate into intrinsics.

Unfortunately, the compiler performs fewer transformations and optimizations when intrinsics are used.

Therefore, if we carefully decouple and re-order the operations, the emitted instruction sequence improves.

The disassembly for the old and new AddMomentsVec can be compared here: https://godbolt.org/z/83fxYvKfv

The new implementation yields a much better instruction sequence.
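
To illustrate what decoupling and re-ordering can look like, here is a minimal scalar sketch of the pairwise moments merge that a routine like AddMomentsVec appears to perform per lane. It is not the code from this PR: the `Moments` struct and the `add_moments` name are hypothetical, the real routines operate on vectorized values rather than plain doubles, and hoisting the scalar factors ahead of the per-lane arithmetic is only an assumed example of the kind of reordering described above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-in for one lane of the vectorized moments state.
struct Moments {
  std::int64_t m0;  // number of samples seen so far
  double m1;        // running mean
  double m2;        // sum of squared deviations from the mean
};

// Merge the moments in `add` into `acc` using the pairwise update formula.
// Assumption: the count-only factors are computed first, so the data-dependent
// part reduces to a short chain of multiplies and adds.
inline void add_moments(const Moments& add, Moments& acc) {
  assert(add.m0 >= 0 && acc.m0 >= 0);
  const std::int64_t n = acc.m0 + add.m0;

  // Step 1: factors that depend only on the counts.
  const double c =
      n == 0 ? 0.0 : static_cast<double>(add.m0) / static_cast<double>(n);
  const double w = c * static_cast<double>(acc.m0);  // m0 * m0_add / n

  // Step 2: data-dependent arithmetic, using the precomputed factors.
  const double delta = add.m1 - acc.m1;
  acc.m1 += c * delta;
  acc.m2 += add.m2 + delta * delta * w;
  acc.m0 = n;
}
```

Merging the moments of {1, 2} with those of {3, 4} this way gives the moments of {1, 2, 3, 4}. In a vectorized version, `c` and `w` would each be broadcast once instead of being recomputed inside the per-lane expression.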

Test Plan: AdRanker ServiceLab

Differential Revision: D86805648

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

pytorch-bot · 2025-11-12T19:08:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167664


✅ No Failures

As of commit c23615b with merge base 0d7ba97:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2025-11-12T19:08:56Z

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86805648.

Summary:
RowwiseMomentsImpl accounts for about 0.4% of AdRanker's CPU time: https://fburl.com/strobelight/ywf79nw3. It primarily calls AddMomentsVec and UpdateMomentsVec.

These two routines are written with PyTorch's VecLib, so the operators they use translate into intrinsics.

Unfortunately, the compiler performs fewer transformations and optimizations when intrinsics are used.

Therefore, if we carefully decouple and re-order the operations, the emitted instruction sequence improves.

The disassembly for the old and new AddMomentsVec can be compared here: https://godbolt.org/z/83fxYvKfv

The new implementation yields a much better instruction sequence.

Test Plan:
We ran three AdRanker ServiceLabs, one on x86 and two on aarch64. We are also launching an x86 prod canary.

The three ServiceLabs showed reduced CPU time in RowwiseMoments and no numerical errors.

https://www.internalfb.com/servicelab/experiment/6402421173/complete
https://www.internalfb.com/servicelab/experiment/6102396760/complete
https://www.internalfb.com/servicelab/experiment/6202396546/complete

Prod canary: https://www.internalfb.com/intern/ads/canary/473290286595708266

Reviewed By: mcfi

Differential Revision: D86805648

facebook-github-bot · 2025-11-19T22:19:22Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2025-11-19T22:21:13Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorch-bot added the "module: cpu" label Nov 12, 2025

meta-codesync added the fb-exported and meta-exported labels Nov 12, 2025

Nicoshev force-pushed the export-D86805648 branch from a595fb5 to d9aa729 November 13, 2025 19:16

Nicoshev added the "topic: improvements", ciflow/trunk, ciflow/linux-aarch64 and "release notes: not needed" labels Nov 13, 2025

Nicoshev force-pushed the export-D86805648 branch from d9aa729 to 6f164b8 November 14, 2025 17:57

Nicoshev force-pushed the export-D86805648 branch from 6f164b8 to 14dd41f November 14, 2025 21:56

Nicoshev force-pushed the export-D86805648 branch from 14dd41f to b034270 November 17, 2025 14:02

Nicoshev force-pushed the export-D86805648 branch 2 times, most recently from 997a550 to 76aac10 November 18, 2025 17:40

Nicoshev force-pushed the export-D86805648 branch from 76aac10 to b900fd5 November 18, 2025 17:41

mcfi requested review from CaoE, mcfi and mingfeima November 18, 2025 18:13

mcfi approved these changes Nov 18, 2025

Nicoshev force-pushed the export-D86805648 branch 2 times, most recently from 1da2777 to 94ff965 November 19, 2025 14:53

Nicoshev force-pushed the export-D86805648 branch from 94ff965 to d3b6d28 November 19, 2025 15:33

Nicoshev force-pushed the export-D86805648 branch from d3b6d28 to c23615b November 19, 2025 16:40

pytorchmergebot added the merging label Nov 19, 2025

pytorchmergebot closed this in 3ecc137 Nov 19, 2025

pytorchmergebot added the Merged label and removed the merging label Nov 19, 2025
