
Conversation

@dzhwinter
Contributor

fix #7862

@dzhwinter dzhwinter changed the title "add reduce functor" "accelerate elementwise_add_grad, add reduce functor" Jan 30, 2018
@tonyyang-svail

@dzhwinter could you briefly explain why the original version is slow?

@dzhwinter
Contributor Author

dzhwinter commented Feb 2, 2018

The previous version used broadcast, which is a quite inefficient operation.
After the enhancement, the profile below shows that elementwise_add_grad now costs less than convolution.

-------------------------> Profiling Report <-------------------------

Place: CPU &nbsp; Total Time: 41762.5ms &nbsp; Total Memory: 17689.2MB &nbsp; Sorted by total time in descending order in the same thread

| Event | Calls | Total | Min. | Max. | Ave. | Total Memory | Min Memory | Max Memory |
|---|---|---|---|---|---|---|---|---|
| thread0::conv2d_grad | 13 | 15602.1 | 276.188 | 5248.83 | 1200.16 | 9069.18 | 0.0078125 | 392.148 |
| thread0::conv2d | 13 | 8975.5 | 219.889 | 3366.09 | 690.423 | 338.152 | 12.2539 | 392.004 |
| thread0::dropout | 10 | 3036.16 | 0.300329 | 1213.68 | 303.616 | 1906.18 | 0.132812 | 784.008 |
| thread0::elementwise_add_grad | 16 | 2279.28 | 0.030858 | 544.775 | 142.455 | 8960.14 | 0.0195312 | 392.008 |
| thread0::batch_norm_grad | 14 | 1926.31 | 0.390424 | 462.51 | 137.594 | 8961.79 | 0.0742188 | 392.012 |
| thread0::relu_grad | 14 | 1689.34 | 0.087101 | 397.12 | 120.667 | 8961.71 | 0.0664062 | 392.004 |
| thread0::pool2d_grad | 5 | 1247.38 | 22.8108 | 617.521 | 249.475 | 9020.14 | 12.2539 | 392.004 |
| thread0::batch_norm | 14 | 1125.46 | 0.435152 | 264.698 | 80.39 | 1122.16 | 0.0742188 | 392.012 |
| thread0::elementwise_add | 16 | 991.63 | 0.031354 | 248.922 | 61.9769 | 730.156 | 0.015625 | 392.004 |
| thread0::dropout_grad | 10 | 808.702 | 0.048284 | 379.893 | 80.8702 | 8961.64 | 0.0664062 | 392.004 |
| thread0::relu | 14 | 795.466 | 0.042214 | 189.24 | 56.819 | 1514.17 | 0.0664062 | 392.004 |
| thread0::adam | 60 | 516.043 | 0.013363 | 238.108 | 8.60072 | 17688.7 | 0 | 0 |
| thread0::fill_zeros_like | 66 | 445.392 | 0.003328 | 185.057 | 6.74837 | 8961.57 | 0.00390625 | 392.004 |
| thread0::pool2d | 5 | 342.792 | 7.17049 | 180.343 | 68.5584 | 4258.21 | 3.06641 | 98.0039 |
| thread0::mul_grad | 3 | 74.9344 | 0.593411 | 71.8969 | 24.9781 | 8960.16 | 0.269531 | 52.0703 |
| thread0::mul | 3 | 44.4806 | 0.180829 | 43.4023 | 14.8269 | 8959.48 | 0.015625 | 0.0664062 |
| thread0::elementwise_mul | 60 | 0.363544 | 0.004298 | 0.022641 | 0.00605907 | 17688.7 | 0.00390625 | 0.00390625 |
| thread0::softmax | 1 | 0.278477 | 0.278477 | 0.278477 | 0.278477 | 8960.05 | 0.015625 | 0.015625 |
| thread0::fill_constant | 61 | 0.260437 | 0.002431 | 0.025904 | 0.00426946 | 8960.11 | 0.00390625 | 0.00390625 |
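For context on why a reduce primitive is needed here: when elementwise_add broadcasts the smaller input up to the larger shape, the backward pass for the smaller input is a sum (reduce) over the broadcast axes. A minimal NumPy sketch of that gradient rule (function and variable names are illustrative, not PaddlePaddle's actual API):

```python
import numpy as np

def elementwise_add_grad(dout, x_shape, y_shape):
    """Gradients of z = x + y, where y was broadcast up to x's shape.

    dX is simply dOut. dY is dOut reduced (summed) over the axes that
    broadcasting expanded -- this reduction is the hot spot a dedicated
    reduce functor can accelerate versus a broadcast-based formulation.
    """
    dx = dout                                     # same shape as x
    # Left-pad y's shape with 1s, then find the axes that were expanded.
    padded = (1,) * (dout.ndim - len(y_shape)) + tuple(y_shape)
    axes = tuple(i for i, d in enumerate(padded)
                 if d == 1 and dout.shape[i] != 1)
    dy = dout.sum(axis=axes).reshape(y_shape)     # reduce over broadcast axes
    return dx, dy

dout = np.ones((2, 3, 4))
dx, dy = elementwise_add_grad(dout, (2, 3, 4), (3, 4))
print(dx.shape, dy.shape)  # (2, 3, 4) (3, 4)
print(dy[0, 0])            # 2.0: the two broadcast rows were summed
```

On CPU this reduction is a simple loop; the GPU version is the hard part, as discussed below in the thread.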
@dzhwinter
Contributor Author

dzhwinter commented Feb 24, 2018

This solution relies on the reduce primitive. Its CPU implementation is quite easy, and the test results shown above demonstrate the performance gain well. The GPU kernel, however, is hard to implement because of the GPU thread-overwriting problem (multiple threads racing to accumulate into the same output element).

I have dug into the reduce kernel in https://github.com/zchee/cuda-sample/blob/master/6_Advanced/reduction/reduction_kernel.cu, but on my local machine the reduce kernel is still not implemented correctly, which is why this PR has been delayed for such a long time.
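For readers unfamiliar with that sample: the tree-based (pairwise) reduction it implements avoids the overwriting problem by making every active thread write a distinct slot in each synchronized step. A small Python sketch of the access pattern (illustrative only, not the actual CUDA kernel):

```python
def tree_reduce(values):
    """Pairwise (tree) reduction as in NVIDIA's reduction_kernel.cu sample.

    In CUDA, each iteration of the while-loop corresponds to one
    __syncthreads()-separated step. Within a step, every active thread
    owns a distinct slot buf[i], so no two threads ever write the same
    location concurrently -- which is exactly the property a correct
    GPU reduce kernel must guarantee.
    """
    buf = list(values)
    stride = 1
    while stride < len(buf):
        # Active "threads" i = 0, 2*stride, 4*stride, ... each fold in
        # their partner buf[i + stride]; all writes go to distinct i.
        for i in range(0, len(buf) - stride, 2 * stride):
            buf[i] += buf[i + stride]
        stride *= 2
    return buf[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

The sequential inner loop stands in for what the GPU does in parallel; the point is the indexing scheme, which keeps concurrent writes disjoint.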

Currently, this issue has been partly fixed without using the reduce kernel. Please see the details in #8402.

@paddle-bot-old paddle-bot-old bot closed this May 22, 2020
@paddle-bot-old

Since you haven't replied for a long time, we have closed this issue/PR.
If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.

