
Conversation

@sneaxiy (Collaborator) commented Feb 24, 2022

PR types

Performance optimization

PR changes

OPs

Describe

Use MultiTensorApply to improve the L2-Norm calculation in DistributedFusedLamb optimizer.

Before this change, DistributedFusedLamb called cub::DeviceSegmentedReduce to compute the Parameter L2-Norm and the Trust Ratio Div L2-Norm. This PR switches to the MultiTensorApply approach and tunes the per-launch parameters such as the maximum number of tensors and the maximum number of chunks per kernel launch.
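For readers unfamiliar with the pattern, here is a minimal sketch of the MultiTensorApply idea applied to squared L2-norms: the parameter list is cut into fixed-size chunks, one kernel launch covers a bounded number of tensors and chunks, and a small cleanup kernel finishes the per-tensor reduction. All identifiers (`TensorChunkMeta`, `kChunkSize`, the `atomicAdd`-based accumulation, etc.) are illustrative assumptions, not the actual Paddle implementation.

```cuda
// Minimal sketch of a MultiTensorApply-style L2-norm computation.
// Assumptions: blockDim.x == kBlockDim (a power of two) and every launch
// respects the caps below. Not the Paddle implementation.
#include <cuda_runtime.h>

constexpr int kBlockDim = 512;
constexpr int kChunkSize = 65536;            // elements per chunk
constexpr int kMaxTensorNumPerLaunch = 110;  // tensors packed into one launch
constexpr int kMaxChunkNumPerLaunch = 320;   // chunks packed into one launch

// Per-launch metadata: which tensor and which chunk each thread block handles.
struct TensorChunkMeta {
  const float *ptrs[kMaxTensorNumPerLaunch];  // base pointer of each tensor
  int sizes[kMaxTensorNumPerLaunch];          // element count of each tensor
  int tensor_ids[kMaxChunkNumPerLaunch];      // chunk -> local tensor index
  int chunk_ids[kMaxChunkNumPerLaunch];       // chunk -> chunk index in tensor
};

// One block reduces one chunk and accumulates the partial sum of squares into
// the per-tensor slot of `partial_sums` (length = tensor count, zero-initialized).
__global__ void MultiTensorL2NormKernel(TensorChunkMeta meta, float *partial_sums) {
  int tensor_id = meta.tensor_ids[blockIdx.x];
  int chunk_id = meta.chunk_ids[blockIdx.x];
  const float *x = meta.ptrs[tensor_id] + chunk_id * kChunkSize;
  int n = min(kChunkSize, meta.sizes[tensor_id] - chunk_id * kChunkSize);

  float sum = 0.0f;
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    sum += x[i] * x[i];
  }

  __shared__ float smem[kBlockDim];  // block-wide tree reduction
  smem[threadIdx.x] = sum;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) atomicAdd(&partial_sums[tensor_id], smem[0]);
}

// "Cleanup" kernel: one thread per tensor turns the summed squares into norms.
__global__ void L2NormCleanupKernel(const float *partial_sums, float *norms,
                                    int tensor_num) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < tensor_num) norms[i] = sqrtf(partial_sums[i]);
}
```

Compared with calling a segmented reduce per norm, this shape of kernel lets many tensors of very different sizes be reduced in a handful of launches, which is the effect measured in the benchmarks below.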

BERT Large (batch_size = 56, max_seq_len = 512, pure_fp16):

- Paddle Baseline (using cub::DeviceSegmentedReduce):

  | Kernel | Total calls | Total time | Calls per batch | Total time per batch |
  | --- | --- | --- | --- | --- |
  | cub::DeviceSegmentedReduce | 2648 | 7182050678 | 2 | 5424509.576 |
- NV's performance data (MaxTensorNumPerLaunch=110, MaxChunkNumPerLaunch=320):

  | Kernel | Total calls | Total time | Calls per batch | Total time per batch |
  | --- | --- | --- | --- | --- |
  | MultiTensorApply Kernel | 1962 + 1962 | 90753225 + 67747444 | 6 | 242355.763 |
  | Cleanup Kernel | 1308 | 7537783 | 2 | 11525.6621 |
  | Total | - | - | - | 253881.425 |
- Paddle's performance data with MaxTensorNumPerLaunch=110, MaxChunkNumPerLaunch=320 (aligned with the NV configuration): 0.7%-2.2% better than NV, essentially on par; about a 95% improvement over the Paddle Baseline.

  | Kernel | Total calls | Total time | Calls per batch | Total time per batch |
  | --- | --- | --- | --- | --- |
  | GPU 0 MultiTensorApply Kernel | 3972 | 158703257 | 6 | 239733.017 |
  | GPU 0 Cleanup Kernel | 1324 | 6489006 | 2 | 9802.12387 |
  | GPU 0 Total | - | - | - | 249535.14 |
  | GPU 1 MultiTensorApply Kernel | 3972 | 158083541 | 6 | 238796.89 |
  | GPU 1 Cleanup Kernel | 1324 | 6246289 | 2 | 9435.48187 |
  | GPU 1 Total | - | - | - | 248232.372 |
  | GPU 2 MultiTensorApply Kernel | 3972 | 158550442 | 6 | 239502.178 |
  | GPU 2 Cleanup Kernel | 1324 | 6351557 | 2 | 9594.49698 |
  | GPU 2 Total | - | - | - | 249096.675 |
  | GPU 3 MultiTensorApply Kernel | 3972 | 160482464 | 6 | 242420.64 |
  | GPU 3 Cleanup Kernel | 1324 | 6396205 | 2 | 9661.94109 |
  | GPU 3 Total | - | - | - | 252082.582 |
  | GPU 4 MultiTensorApply Kernel | 3972 | 157911387 | 6 | 238536.838 |
  | GPU 4 Cleanup Kernel | 1324 | 6313354 | 2 | 9536.78852 |
  | GPU 4 Total | - | - | - | 248073.627 |
  | GPU 5 MultiTensorApply Kernel | 3972 | 158206590 | 6 | 238982.764 |
  | GPU 5 Cleanup Kernel | 1324 | 6352998 | 2 | 9596.67372 |
  | GPU 5 Total | - | - | - | 248579.438 |
  | GPU 6 MultiTensorApply Kernel | 3972 | 158026604 | 6 | 238710.882 |
  | GPU 6 Cleanup Kernel | 1324 | 6412573 | 2 | 9686.66616 |
  | GPU 6 Total | - | - | - | 248397.548 |
  | GPU 7 MultiTensorApply Kernel | 3972 | 158762293 | 6 | 239822.195 |
  | GPU 7 Cleanup Kernel | 1324 | 6391204 | 2 | 9654.38671 |
  | GPU 7 Total | - | - | - | 249476.582 |
- Paddle's performance data with MaxTensorNumPerLaunch=50, MaxChunkNumPerLaunch=680: 14% better than NV; about a 96% improvement over the Paddle Baseline. (A rough sketch of how these per-launch caps pack the work is given after the tables.)

  | Kernel | Total calls | Total time | Calls per batch | Total time per batch |
  | --- | --- | --- | --- | --- |
  | GPU 0 MultiTensorApply Kernel | 1324 | 137146700 | 2 | 207170.242 |
  | GPU 0 Cleanup Kernel | 1324 | 6688593 | 2 | 10103.6148 |
  | GPU 0 Total | - | - | - | 217273.856 |
  | GPU 1 MultiTensorApply Kernel | 1324 | 137585724 | 2 | 207833.42 |
  | GPU 1 Cleanup Kernel | 1324 | 6483174 | 2 | 9793.3142 |
  | GPU 1 Total | - | - | - | 217626.734 |
  | GPU 2 MultiTensorApply Kernel | 1324 | 137302011 | 2 | 207404.85 |
  | GPU 2 Cleanup Kernel | 1324 | 6562869 | 2 | 9913.6994 |
  | GPU 2 Total | - | - | - | 217318.55 |
  | GPU 3 MultiTensorApply Kernel | 1324 | 137531462 | 2 | 207751.453 |
  | GPU 3 Cleanup Kernel | 1324 | 6615911 | 2 | 9993.82326 |
  | GPU 3 Total | - | - | - | 217745.276 |
  | GPU 4 MultiTensorApply Kernel | 1324 | 137312205 | 2 | 207420.249 |
  | GPU 4 Cleanup Kernel | 1324 | 6511524 | 2 | 9836.13897 |
  | GPU 4 Total | - | - | - | 217256.388 |
  | GPU 5 MultiTensorApply Kernel | 1324 | 137457972 | 2 | 207640.441 |
  | GPU 5 Cleanup Kernel | 1324 | 6539170 | 2 | 9877.9003 |
  | GPU 5 Total | - | - | - | 217518.341 |
  | GPU 6 MultiTensorApply Kernel | 1324 | 137166650 | 2 | 207200.378 |
  | GPU 6 Cleanup Kernel | 1324 | 6617388 | 2 | 9996.05438 |
  | GPU 6 Total | - | - | - | 217196.432 |
  | GPU 7 MultiTensorApply Kernel | 1324 | 137427733 | 2 | 207594.763 |
  | GPU 7 Cleanup Kernel | 1324 | 6563153 | 2 | 9914.1284 |
  | GPU 7 Total | - | - | - | 217508.891 |
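The two Paddle configurations above differ only in how the launch caps pack the work: MaxTensorNumPerLaunch and MaxChunkNumPerLaunch bound how many tensors and chunks one kernel launch may cover, so a larger chunk cap can cover the same parameter list in fewer launches (2 instead of 6 MultiTensorApply calls per batch in the tables above). The host-side sketch below illustrates this packing under assumed tensor sizes; `CountLaunches` and all constants are hypothetical, and unlike the real implementation it keeps each tensor's chunks within a single launch.

```cuda
// Rough host-side sketch of how per-launch caps determine the launch count.
// Hypothetical helper, not Paddle code; tensors are kept whole per launch.
#include <cstdio>
#include <vector>

int CountLaunches(const std::vector<int> &tensor_sizes, int chunk_size,
                  int max_tensor_num, int max_chunk_num) {
  int launches = 0, cur_tensors = 0, cur_chunks = 0;
  for (int size : tensor_sizes) {
    int chunks = (size + chunk_size - 1) / chunk_size;
    // Close the current launch if adding this tensor would exceed either cap.
    if (cur_tensors + 1 > max_tensor_num || cur_chunks + chunks > max_chunk_num) {
      ++launches;
      cur_tensors = 0;
      cur_chunks = 0;
    }
    ++cur_tensors;
    cur_chunks += chunks;
  }
  if (cur_tensors > 0) ++launches;  // flush the last partially filled launch
  return launches;
}

int main() {
  // Hypothetical parameter list: 300 tensors of 1M elements, 64K-element chunks.
  std::vector<int> sizes(300, 1 << 20);
  printf("110 tensors / 320 chunks per launch: %d launches\n",
         CountLaunches(sizes, 1 << 16, 110, 320));
  printf("50 tensors / 680 chunks per launch:  %d launches\n",
         CountLaunches(sizes, 1 << 16, 50, 680));
  return 0;
}
```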
@paddle-bot-old commented

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@sneaxiy sneaxiy changed the title [WIP] Add MultiTensorApply to calculate L2-Norm Add MultiTensorApply to calculate L2-Norm Feb 24, 2022
@sneaxiy sneaxiy requested a review from limin2021 February 25, 2022 02:47
@sneaxiy sneaxiy changed the title Add MultiTensorApply to calculate L2-Norm Add MultiTensorApply to calculate L2-Norm in DistributedFusedLamb optimizer Feb 25, 2022
@limin2021 (Contributor) left a comment

LGTM.

@sneaxiy sneaxiy merged commit d32a010 into PaddlePaddle:develop Feb 25, 2022
@sneaxiy sneaxiy deleted the add_multi_tensor_apply_l2_norm branch February 25, 2022 10:57