@Nicoshev (Contributor) commented Nov 12, 2025

Summary:
RowwiseMomentsImpl accounts for about 0.4% of AdRanker's CPU time: https://fburl.com/strobelight/ywf79nw3. It primarily calls AddMomentsVec and UpdateMomentsVec.

These two routines are written using PyTorch's VecLib, so the operators they use translate into intrinsics.

Unfortunately, the compiler applies fewer transformations and optimizations to code written with intrinsics.

Therefore, carefully decoupling and reordering the operations by hand improves the emitted instruction sequence.

The disassembly of the old and new AddMomentsVec can be compared here: https://godbolt.org/z/83fxYvKfv

The new implementation produces a much better instruction sequence.
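
To make the idea concrete, here is a minimal scalar sketch; it is not the code from this PR. It assumes that AddMomentsVec merges two partial results of the form (count, mean, sum of squared deviations), as its name suggests. The `Moments` struct and all function names are hypothetical, and the real routines operate on vector types rather than on `double`. The first function writes the merge as a single chain of multiplications; the second computes the shared product once and keeps the independent sum separate, which is the kind of manual decoupling and reordering described above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical container for a partial result:
// m0 = element count, m1 = mean, m2 = sum of squared deviations from the mean.
struct Moments {
  int64_t m0;
  double m1;
  double m2;
};

// Straightforward pairwise merge of two partial results.
// The m2 update is one left-to-right chain: ((delta * delta) * c) * a.m0.
inline Moments merge_moments(const Moments& a, const Moments& b) {
  const int64_t n = a.m0 + b.m0;
  const double c =
      n == 0 ? 0.0 : static_cast<double>(b.m0) / static_cast<double>(n);
  const double delta = b.m1 - a.m1;
  const double m1 = a.m1 + c * delta;
  const double m2 = a.m2 + b.m2 + delta * delta * c * static_cast<double>(a.m0);
  return Moments{n, m1, m2};
}

// Same math with the operations decoupled and reordered by hand:
// c * delta is computed once and reused for both outputs, and a.m2 + b.m2
// does not wait on delta, so it can be scheduled independently.
inline Moments merge_moments_reordered(const Moments& a, const Moments& b) {
  const int64_t n = a.m0 + b.m0;
  const double c =
      n == 0 ? 0.0 : static_cast<double>(b.m0) / static_cast<double>(n);
  const double delta = b.m1 - a.m1;
  const double scaled = c * delta;                             // shared term
  const double m2_sum = a.m2 + b.m2;                           // independent of delta
  const double weighted = delta * static_cast<double>(a.m0);   // independent of c
  return Moments{n, a.m1 + scaled, m2_sum + weighted * scaled};
}

// Helper for comparing two results up to an absolute tolerance.
inline bool moments_close(const Moments& x, const Moments& y, double tol) {
  return x.m0 == y.m0 && std::fabs(x.m1 - y.m1) <= tol &&
         std::fabs(x.m2 - y.m2) <= tol;
}
```

Because floating-point arithmetic is not associative, the two variants can differ in the last bits, which is also why a compiler will generally not perform this kind of reassociation on its own unless relaxed floating-point options are enabled. Any such rewrite should therefore be checked for numerical differences, as the test plan for this change does.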

Test Plan: AdRanker ServiceLab

Differential Revision: D86805648

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

pytorch-bot bot commented Nov 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167664

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c23615b with merge base 0d7ba97:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the "module: cpu" label (CPU specific problem, e.g., perf, algorithm) Nov 12, 2025
meta-codesync bot commented Nov 12, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86805648.

Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 13, 2025
@Nicoshev added the "topic: improvements", "ciflow/trunk", "ciflow/linux-aarch64", and "release notes: not needed" labels Nov 13, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 14, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 14, 2025
Summary:
RowwiseMomentsImpl accounts for about 0.4% of AdRanker's CPU time: https://fburl.com/strobelight/ywf79nw3. It primarily calls AddMomentsVec and UpdateMomentsVec.

These two routines are written using PyTorch's VecLib, so the operators they use translate into intrinsics.

Unfortunately, the compiler applies fewer transformations and optimizations to code written with intrinsics.

Therefore, carefully decoupling and reordering the operations by hand improves the emitted instruction sequence.

The disassembly of the old and new AddMomentsVec can be compared here: https://godbolt.org/z/83fxYvKfv

The new implementation produces a much better instruction sequence.

Test Plan:
We ran three AdRanker ServiceLabs, one x86-based and two on aarch64. We are also launching an x86 prod canary.

The three ServiceLabs showed reduced CPU time in RowwiseMoments and no numerical errors.

https://www.internalfb.com/servicelab/experiment/6402421173/complete
https://www.internalfb.com/servicelab/experiment/6102396760/complete
https://www.internalfb.com/servicelab/experiment/6202396546/complete

Reviewed By: mcfi

Differential Revision: D86805648
pytorch-bot bot pushed a commit that referenced this pull request Nov 17, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 17, 2025
@Nicoshev force-pushed the export-D86805648 branch 2 times, most recently from 997a550 to 76aac10 November 18, 2025 17:40
pytorch-bot bot pushed a commit that referenced this pull request Nov 18, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 18, 2025
@mcfi requested review from CaoE, mcfi and mingfeima November 18, 2025 18:13
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 18, 2025
@Nicoshev force-pushed the export-D86805648 branch 2 times, most recently from 1da2777 to 94ff965 November 19, 2025 14:53
pytorch-bot bot pushed a commit that referenced this pull request Nov 19, 2025
Summary:
RowwiseMomentsImpl accounts for about 0.4% of AdRanker's CPU time: https://fburl.com/strobelight/ywf79nw3. It primarily calls AddMomentsVec and UpdateMomentsVec.

These two routines are written using PyTorch's VecLib, so the operators they use translate into intrinsics.

Unfortunately, the compiler applies fewer transformations and optimizations to code written with intrinsics.

Therefore, carefully decoupling and reordering the operations by hand improves the emitted instruction sequence.

The disassembly of the old and new AddMomentsVec can be compared here: https://godbolt.org/z/83fxYvKfv

The new implementation produces a much better instruction sequence.

Test Plan:
We ran three AdRanker ServiceLabs, one x86-based and two on aarch64. We are also launching an x86 prod canary.

The three ServiceLabs showed reduced CPU time in RowwiseMoments and no numerical errors.

https://www.internalfb.com/servicelab/experiment/6402421173/complete
https://www.internalfb.com/servicelab/experiment/6102396760/complete
https://www.internalfb.com/servicelab/experiment/6202396546/complete

Prod canary: https://www.internalfb.com/intern/ads/canary/473290286595708266

Reviewed By: mcfi

Differential Revision: D86805648
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 19, 2025
@facebook-github-bot (Contributor) commented

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here


Labels

ciflow/linux-aarch64 (linux aarch64 CI workflow)
ciflow/trunk (Trigger trunk jobs on your pull request)
fb-exported
Merged
meta-exported
module: cpu (CPU specific problem, e.g., perf, algorithm)
release notes: not needed
topic: improvements (topic category)

4 participants