
Conversation

@dzhwinter (Contributor) commented Mar 19, 2018

fix #9099
In every minibatch, the sequence_pool and sequence_pool_grad operators get roughly an 8x speedup.
For example, the average time of the sequence_pool op improved from 0.815583 to 0.119373,
and sequence_pool_grad improved from 0.579614 to 0.0830757.

Before optimization:

Event                           Calls  Total    Min.      Max.     Ave.
thread0::sum                    72772  6579.74  0.013088  3.43046  0.0904158
thread0::mul_grad               25928  4135.29  0.049952  4.4888   0.159491
thread0::sequence_softmax_grad  2344   3067.88  0.05872   95.0493  1.30882
thread0::sequence_softmax       2344   2617.72  0.04976   17.122   1.11677
thread0::mul                    25928  2260.75  0.038624  8.36944  0.0871933
thread0::sequence_pool          2380   1941.09  0.045984  89.9217  0.815583
thread0::sequence_expand_grad   2344   1730.34  0.05296   8.10054  0.738201
thread0::sequence_pool_grad     2380   1379.48  0.03824   137.793  0.579614

After optimization:

Event                        Calls  Total    Min.      Max.      Ave.
thread0::sigmoid_grad        7035   304.461  0.024032  89.5161   0.0432781
thread0::sequence_pool       2381   284.226  0.053984  56.5243   0.119373
thread0::sigmoid             7035   214.732  0.02448   3.57146   0.0305233
thread0::tanh                7071   214.441  0.023712  1.59734   0.0303269
thread0::tanh_grad           7071   206.762  0.023328  0.1432    0.0292408
thread0::sequence_pool_grad  2381   197.803  0.057408  0.934464  0.0830757
thread0::adam                936    187.986  0.024384  1.28653   0.20084
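
For readers skimming the diff, here is a minimal, illustrative CUDA sketch of the general idea behind this kind of fused sequence pooling: a single kernel launch handles all sequences, with one block per LoD sequence and threads striding over the feature dimension. The kernel name, signature, and LoD layout below are hypothetical, not the exact code in this PR.

```cuda
#include <cuda_runtime.h>

// Illustrative fused SUM pooling over LoD sequences: one block per sequence,
// threads stride over the feature dimension, so the whole pool is one launch.
// `lod` holds sequence boundaries, e.g. {0, 3, 7, 10} for three sequences.
template <typename T>
__global__ void SeqSumPoolKernel(const T* in, const size_t* lod,
                                 size_t item_dim, T* out) {
  const int seq = blockIdx.x;                // this block's sequence
  const size_t begin = lod[seq];
  const size_t end = lod[seq + 1];
  for (size_t d = threadIdx.x; d < item_dim; d += blockDim.x) {
    T sum = static_cast<T>(0);
    for (size_t i = begin; i < end; ++i) {   // walk the rows of the sequence
      sum += in[i * item_dim + d];
    }
    out[seq * item_dim + d] = sum;           // one pooled row per sequence
  }
}
```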
@dzhwinter force-pushed the speed/sequence_op1 branch from 22f414c to 2524003 on March 25, 2018 at 11:56
@QiJune (Member) commented Mar 26, 2018

Please add benchmark details comparing these two versions.

@luotao1 (Contributor) left a comment
Is the 8x speedup for the GPU only, with CPU performance unchanged?

if (i == index[tid]) {
  in_grad[item_dim * i + tid] = out_grad[tid];
} else {
  in_grad[item_dim * i + tid] = static_cast<T>(0);

Contributor:

If we first set everything to 0 and then, based on the if condition, assign in_grad[item_dim * i + tid] = out_grad[tid], would that be even faster? LastPool and FirstPool are similar.

Contributor Author:

No. Setting everything to 0 first would add an extra CUDA kernel call. Reducing the number of kernel calls greatly speeds up execution.
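
To make the kernel-call argument concrete, here is a hedged sketch (with hypothetical names and an assumed index layout, not the PR's exact kernel): every element of in_grad is written exactly once inside the gradient kernel itself, either the incoming gradient or zero, so no separate zero-fill launch such as cudaMemsetAsync is needed beforehand.

```cuda
#include <cuda_runtime.h>

// Illustrative MAX-pool gradient written as a single kernel: each element of
// in_grad is stored exactly once (either the gradient or zero), so there is
// no separate zero-fill launch before it.
// `index[seq * item_dim + d]` is assumed to hold the row that won the max for
// feature d of sequence `seq`; the real indexing may differ.
template <typename T>
__global__ void SeqMaxPoolGradKernel(const T* out_grad, const int* index,
                                     const size_t* lod, size_t item_dim,
                                     T* in_grad) {
  const int seq = blockIdx.x;
  const size_t begin = lod[seq];
  const size_t end = lod[seq + 1];
  for (size_t d = threadIdx.x; d < item_dim; d += blockDim.x) {
    const size_t max_row = static_cast<size_t>(index[seq * item_dim + d]);
    for (size_t i = begin; i < end; ++i) {
      // Write the gradient to the max row and zero elsewhere in one pass.
      in_grad[i * item_dim + d] =
          (i == max_row) ? out_grad[seq * item_dim + d] : static_cast<T>(0);
    }
  }
}
```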

# return x, lod, out

# def compute(self, x, lod, out):
# self.attrs = {'pooltype': "FIRST"}

Contributor:

The unused code on lines 32-42 can be deleted.

Contributor Author:

done.

self.attrs = {'pooltype': "SUM"}
for i in range(4):
sub_x = x[lod[0][i]:lod[0][i + 1], :]
out[i] = sub_x.sum(axis=0)

Contributor:

Does the test_seq_pool.py unit test only change the order of some test cases?

Contributor Author:

Yes.

T, MaxPoolGradFunctor<T>><<<grid, threads, 0, context.stream()>>>(
    MaxPoolGradFunctor<T>(), out_grad.data<T>(),
    lod.CUDAData(context.GetPlace()), lod.size(), item_dim,
    in_grad->mutable_data<T>(context.GetPlace()), index->data<int>());

Contributor:

Could lod.CUDAData(context.GetPlace()), in_grad->mutable_data<T>(context.GetPlace()), etc. be assigned to temporary variables before the if condition? That way the code on lines 303-338 could be shorter. Lines 141-178 are similar.

Contributor Author:

I don't think that is a good way to save lines of code. One rule in the Google style guide is to keep a declaration as close as possible to where it is used. If we forward-declare it as a temporary variable, the reader has to go back and find it when reading the following code.

@luotao1 (Contributor) commented Mar 29, 2018

LGTM @qingqing01 Do you have any suggestions?

@dzhwinter dzhwinter merged commit 8425c2c into PaddlePaddle:develop Mar 29, 2018
mikeseven added a commit to mikeseven/Paddle that referenced this pull request Mar 30, 2018
* commit '33b8b3d22034423455a493712955e419aac7b19b': (251 commits)
  - Remove redundant commands in build.sh and build_doc.sh
  - Add dependencies
  - Move v2/api/fluid to fluid/api and Adjust doc build commands
  - Plain LRN op throws an exception when is_test is set in backward pass
  - fix compiler error of profiler_test in ONLY_CPU mode
  - fix server shutdown
  - Translation for Model Configuration (PaddlePaddle#9513)
  - Fix data transform when inplace (PaddlePaddle#9450)
  - refine parallel
  - add FAQ (PaddlePaddle#9494)
  - Fix dist error with lr decay layer (PaddlePaddle#9489)
  - add prefetch_op (PaddlePaddle#9495)
  - Fix some errors (PaddlePaddle#9403)
  - hookup WITH_FLUID_ONLY in TeamCity build.sh (PaddlePaddle#9509)
  - Fix the order of reads and write from buffered channel (PaddlePaddle#9423)
  - change WITH_FLUID to WITH_FLUID_ONLY (PaddlePaddle#9427)
  - fix block num
  - Revert "make append activation in place by default (PaddlePaddle#9417)"
  - Speed/sequence op1 (PaddlePaddle#9217)
  - fix a compile error
  - ...
blacksheep-Aristotle pushed a commit to blacksheep-Aristotle/Paddle that referenced this pull request Nov 22, 2024
PaddlePaddle#9217)
  * [Auto Parallel] fix bugs for split_batches_for_accumulation && fix bugs for enable_delay_scale_loss
  * add enable_delay_scale_loss flag for auto_parallel
  * fix ut
  * Update ci_case_auto.sh
