[Caffe2] Improve AddMomentsVec and UpdateMomentsVec #167664
Dr. CI: ✅ No failures as of commit c23615b with merge base 0d7ba97. Artifacts and rendered test results: hud.pytorch.org/pr/167664
94ff965 to d3b6d28 Compare
Test Plan:
We ran three AdRanker ServiceLabs, one x86-based and two on aarch64. We are also launching an x86 prod canary.
The three ServiceLabs showed reduced CPU time in RowwiseMoments and no numerical errors.
https://www.internalfb.com/servicelab/experiment/6402421173/complete
https://www.internalfb.com/servicelab/experiment/6102396760/complete
https://www.internalfb.com/servicelab/experiment/6202396546/complete
Prod canary: https://www.internalfb.com/intern/ads/canary/473290286595708266
Reviewed By: mcfi
Differential Revision: D86805648
d3b6d28 to c23615b Compare
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Summary:
RowwiseMomentsImpl accounts for about 0.4% of AdRanker CPU time: https://fburl.com/strobelight/ywf79nw3. It primarily calls AddMomentsVec and UpdateMomentsVec.
These two routines are written using PyTorch's VecLib, so the operators they use translate into intrinsics.
Unfortunately, the compiler applies fewer transformations and optimizations when intrinsics are used.
Therefore, if we carefully decouple and reorder operations, the emitted instruction sequence improves.
The disassembly of the old and new AddMomentsVec can be compared here: https://godbolt.org/z/83fxYvKfv
The new implementation yields a much better instruction sequence.
Test Plan: AdRanker ServiceLab
Differential Revision: D86805648
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01