
Conversation

@dzhwinter (Contributor) commented Mar 19, 2018

fix #9099
In every minibatch, the sequence_pool and sequence_pool_grad operators get roughly an 8x speedup.
For example, the average time of the sequence_pool op improved from 0.815583 to 0.119373,
and sequence_pool_grad improved from 0.579614 to 0.0830757.

Before optimization:

Event                           Calls  Total    Min.      Max.     Ave.
thread0::sum                    72772  6579.74  0.013088  3.43046  0.0904158
thread0::mul_grad               25928  4135.29  0.049952  4.4888   0.159491
thread0::sequence_softmax_grad  2344   3067.88  0.05872   95.0493  1.30882
thread0::sequence_softmax       2344   2617.72  0.04976   17.122   1.11677
thread0::mul                    25928  2260.75  0.038624  8.36944  0.0871933
thread0::sequence_pool          2380   1941.09  0.045984  89.9217  0.815583
thread0::sequence_expand_grad   2344   1730.34  0.05296   8.10054  0.738201
thread0::sequence_pool_grad     2380   1379.48  0.03824   137.793  0.579614

After optimization:

Event                        Calls  Total    Min.      Max.      Ave.
thread0::sigmoid_grad        7035   304.461  0.024032  89.5161   0.0432781
thread0::sequence_pool       2381   284.226  0.053984  56.5243   0.119373
thread0::sigmoid             7035   214.732  0.02448   3.57146   0.0305233
thread0::tanh                7071   214.441  0.023712  1.59734   0.0303269
thread0::tanh_grad           7071   206.762  0.023328  0.1432    0.0292408
thread0::sequence_pool_grad  2381   197.803  0.057408  0.934464  0.0830757
thread0::adam                936    187.986  0.024384  1.28653   0.20084
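
For readers skimming the diff, here is a minimal, illustrative CUDA sketch of the general idea behind this kind of fused sequence pooling: a single kernel launch handles all sequences, with one block per LoD sequence and threads striding over the feature dimension. The kernel name, signature, and LoD layout below are hypothetical, not the exact code in this PR.

```cuda
#include <cuda_runtime.h>

// Illustrative fused SUM pooling over LoD sequences: one block per sequence,
// threads stride over the feature dimension, so the whole pool is one launch.
// `lod` holds sequence boundaries, e.g. {0, 3, 7, 10} for three sequences.
template <typename T>
__global__ void SeqSumPoolKernel(const T* in, const size_t* lod,
                                 size_t item_dim, T* out) {
  const int seq = blockIdx.x;                // this block's sequence
  const size_t begin = lod[seq];
  const size_t end = lod[seq + 1];
  for (size_t d = threadIdx.x; d < item_dim; d += blockDim.x) {
    T sum = static_cast<T>(0);
    for (size_t i = begin; i < end; ++i) {   // walk the rows of the sequence
      sum += in[i * item_dim + d];
    }
    out[seq * item_dim + d] = sum;           // one pooled row per sequence
  }
}
```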
@dzhwinter force-pushed the speed/sequence_op1 branch from 22f414c to 2524003 on March 25, 2018 at 11:56
@QiJune (Member) commented Mar 26, 2018

Please add benchmark details comparing these two versions.

@luotao1 (Contributor) left a comment
Is the 8x speedup for the GPU only, with CPU performance unchanged?

if (i == index[tid]) {
  in_grad[item_dim * i + tid] = out_grad[tid];
} else {
  in_grad[item_dim * i + tid] = static_cast<T>(0);

Contributor:

If we first set everything to 0 and then, based on the if condition, assign in_grad[item_dim * i + tid] = out_grad[tid], would that be even faster? LastPool and FirstPool are similar.

Contributor Author:

No. Setting everything to 0 first would add an extra CUDA kernel call. Reducing the number of kernel calls greatly speeds up execution.
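
To make the kernel-call argument concrete, here is a hedged sketch (with hypothetical names and an assumed index layout, not the PR's exact kernel): every element of in_grad is written exactly once inside the gradient kernel itself, either the incoming gradient or zero, so no separate zero-fill launch such as cudaMemsetAsync is needed beforehand.

```cuda
#include <cuda_runtime.h>

// Illustrative MAX-pool gradient written as a single kernel: each element of
// in_grad is stored exactly once (either the gradient or zero), so there is
// no separate zero-fill launch before it.
// `index[seq * item_dim + d]` is assumed to hold the row that won the max for
// feature d of sequence `seq`; the real indexing may differ.
template <typename T>
__global__ void SeqMaxPoolGradKernel(const T* out_grad, const int* index,
                                     const size_t* lod, size_t item_dim,
                                     T* in_grad) {
  const int seq = blockIdx.x;
  const size_t begin = lod[seq];
  const size_t end = lod[seq + 1];
  for (size_t d = threadIdx.x; d < item_dim; d += blockDim.x) {
    const size_t max_row = static_cast<size_t>(index[seq * item_dim + d]);
    for (size_t i = begin; i < end; ++i) {
      // Write the gradient to the max row and zero elsewhere in one pass.
      in_grad[i * item_dim + d] =
          (i == max_row) ? out_grad[seq * item_dim + d] : static_cast<T>(0);
    }
  }
}
```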

# return x, lod, out

# def compute(self, x, lod, out):
# self.attrs = {'pooltype': "FIRST"}

Contributor:

The unused code on lines 32-42 can be deleted.

Contributor Author:

done.

self.attrs = {'pooltype': "SUM"}
for i in range(4):
sub_x = x[lod[0][i]:lod[0][i + 1], :]
out[i] = sub_x.sum(axis=0)

Contributor:

Does the test_seq_pool.py unit test only change the order of some test cases?

Contributor Author:

Yes.

T, MaxPoolGradFunctor<T>><<<grid, threads, 0, context.stream()>>>(
    MaxPoolGradFunctor<T>(), out_grad.data<T>(),
    lod.CUDAData(context.GetPlace()), lod.size(), item_dim,
    in_grad->mutable_data<T>(context.GetPlace()), index->data<int>());

Contributor:

Could lod.CUDAData(context.GetPlace()), in_grad->mutable_data<T>(context.GetPlace()), etc. be assigned to temporary variables before the if condition? That way the code on lines 303-338 could be shorter. Lines 141-178 are similar.

Contributor Author:

I don't think that is a good way to save lines of code. One rule in the Google style guide is to keep a declaration as close as possible to where it is used. If we forward-declare it as a temporary variable, the reader has to go back and find it when reading the following code.

@luotao1 (Contributor) commented Mar 29, 2018

LGTM @qingqing01 Do you have any suggestions?

@dzhwinter dzhwinter merged commit 8425c2c into PaddlePaddle:develop Mar 29, 2018
mikeseven added a commit to mikeseven/Paddle that referenced this pull request Mar 30, 2018
* commit '33b8b3d22034423455a493712955e419aac7b19b': (251 commits)
  - Remove redundant commands in build.sh and build_doc.sh
  - Add dependencies
  - Move v2/api/fluid to fluid/api and Adjust doc build commands
  - Plain LRN op throws an exception when is_test is set in backward pass
  - fix compiler error of profiler_test in ONLY_CPU mode
  - fix server shutdown
  - Translation for Model Configuration (PaddlePaddle#9513)
  - Fix data transform when inplace (PaddlePaddle#9450)
  - refine parallel
  - add FAQ (PaddlePaddle#9494)
  - Fix dist error with lr decay layer (PaddlePaddle#9489)
  - add prefetch_op (PaddlePaddle#9495)
  - Fix some errors (PaddlePaddle#9403)
  - hookup WITH_FLUID_ONLY in TeamCity build.sh (PaddlePaddle#9509)
  - Fix the order of reads and write from buffered channel (PaddlePaddle#9423)
  - change WITH_FLUID to WITH_FLUID_ONLY (PaddlePaddle#9427)
  - fix block num
  - Revert "make append activation in place by default (PaddlePaddle#9417)"
  - Speed/sequence op1 (PaddlePaddle#9217)
  - fix a compile error
  - ...
blacksheep-Aristotle pushed a commit to blacksheep-Aristotle/Paddle that referenced this pull request Nov 22, 2024
PaddlePaddle#9217)
  * [Auto Parallel] fix bugs for split_batches_for_accumulation && fix bugs for enable_delay_scale_loss
  * add enable_delay_scale_loss flag for auto_parallel
  * fix ut
  * Update ci_case_auto.sh
