
Conversation

@zyfncg
Contributor

@zyfncg zyfncg commented Jan 5, 2022

PR types

Others

PR changes

Others

Describe

Migrate the first-, second-, and third-order backward kernels of dot and matmul to pten.

To adapt the PTen backward kernels to the framework, this PR also includes the following changes:

  1. In the original Op system, backward Ops have no OpProto information and are handled differently from forward Ops, so this PR adjusts the corresponding handling logic and configures GetExpectedPtenKernelArgs for the Op of every migrated backward kernel; this solution may be replaced later.
  2. Some input DenseTensors of the backward kernels can be empty, and the kernels contain branch logic for this case. To express this condition, such possibly-empty inputs are wrapped in paddle::optional<const DenseTensor&>, and support for the paddle::optional<const DenseTensor&> input type is added to pten (see the sketch after this list).
  3. Added support for kernel output DenseTensors that may be NULL.
  4. Added complex-number conversion logic for calling PTen backward kernels from dynamic graph execution.
  5. Added a move assignment operator to DenseTensor: DenseTensor& operator=(DenseTensor&& other)
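Below is a minimal, self-contained C++ sketch of the patterns in items 2, 3, and 5. It uses std::optional and a toy Tensor type instead of paddle::optional and DenseTensor; none of this is the code merged in this PR.

#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Toy stand-in for DenseTensor, only so the sketch compiles on its own.
struct Tensor {
  std::vector<float> data;
  // Item 5: analogue of the new move assignment operator.
  Tensor& operator=(Tensor&& other) noexcept {
    data = std::move(other.data);
    return *this;
  }
};

// Analogue of a migrated backward kernel: an input that may be absent is
// passed as an optional reference (item 2), and an output that is not needed
// is passed as nullptr and simply skipped (item 3).
void BackwardKernelSketch(
    const Tensor& out_grad,
    std::optional<std::reference_wrapper<const Tensor>> maybe_ddx,
    Tensor* dx /* may be nullptr */) {
  if (dx == nullptr) return;  // nullable output: nothing to compute
  if (maybe_ddx.has_value()) {
    dx->data = maybe_ddx->get().data;  // branch taken only when the input exists
  } else {
    dx->data = out_grad.data;
  }
}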
@paddle-bot-old

paddle-bot-old bot commented Jan 5, 2022

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.


// TODO(chenweihang): add debug flags later
// TODO(chenweihang): deal with complex cases later
if (framework::IsComplexType(kernel_type.data_type_)) {
Contributor


Could the pten_kernel's data type be used here?

Contributor Author


Neither the KernelSignature nor the Kernel data structure passed in here carries data_type information, so we have to use the data from kernel_type.
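A tiny standalone illustration of that fallback; the structs below are simplified stand-ins, not Paddle's real KernelSignature / OpKernelType definitions.

#include <string>
#include <vector>

// The signature object only names the kernel and its arguments; it has no
// dtype field, so the dtype for the complex-number check must come from the
// op-level kernel type.
struct KernelSignatureLike {
  std::string kernel_name;
  std::vector<std::string> input_names;
};

enum class DataTypeLike { kFloat32, kComplex64, kComplex128 };

struct OpKernelTypeLike {
  DataTypeLike data_type_;  // the only place the dtype is available here
};

bool NeedsComplexGradTransform(const KernelSignatureLike& /*sig*/,
                               const OpKernelTypeLike& kernel_type) {
  return kernel_type.data_type_ == DataTypeLike::kComplex64 ||
         kernel_type.data_type_ == DataTypeLike::kComplex128;
}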

Comment on lines +1893 to +1898
if (current_vector_size > start_idx) {
pt_kernel_context_->SetOutputWithoutSetRange(start_idx, {nullptr});
} else {
pt_kernel_context_->EmplaceBackOutputWithoutSetRange({nullptr});
}
end_idx = start_idx + 1;
Contributor


Please add some comments here.

Contributor Author


Done

Comment on lines 360 to 365
} else {
kernel_ctx->SetOutputWithoutSetRange(
start_idx + offset,
experimental::MakePtenTensorBaseFromVar(
outs_vector[offset]->MutableVar(), out_def));
}
Contributor


Is this branch ever used?

Contributor Author


Yes, it is reached in dynamic graph mode.

Comment on lines 374 to 381
} else {
if (current_vector_size > start_idx) {
kernel_ctx->SetOutputWithoutSetRange(start_idx, {nullptr});
} else {
kernel_ctx->EmplaceBackOutputWithoutSetRange(
experimental::MakePtenTensorBaseFromVar(
outs_vector[offset]->MutableVar(), out_def));
kernel_ctx->EmplaceBackOutputWithoutSetRange({nullptr});
}
kernel_ctx->AssignOutputRange(std::make_pair(start_idx, start_idx + 1),
i);
Contributor


Suggestion: move this block to the beginning, check iter == outs.end() and continue directly after handling that case. This flattens the code structure and reduces the nested if/else logic, making the code easier to maintain and understand.
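A standalone sketch of the control flow this suggestion points to; the types are toy stand-ins, not the real kernel-context code from this PR.

#include <map>
#include <string>
#include <vector>

struct CtxSketch {
  std::vector<const int*> outputs;
  void EmplaceOutput(const int* t) { outputs.push_back(t); }
};

// Handle the "output not found" case up front and continue, so the main path
// is not nested inside an else branch.
void BuildOutputs(const std::map<std::string, int>& outs,
                  const std::vector<std::string>& names,
                  CtxSketch* ctx) {
  for (const auto& name : names) {
    auto iter = outs.find(name);
    if (iter == outs.end()) {
      ctx->EmplaceOutput(nullptr);  // placeholder output, mirroring {nullptr}
      continue;                     // early exit, no further nesting
    }
    ctx->EmplaceOutput(&iter->second);  // main path
  }
}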

Contributor Author


Done

paddle::platform::complex<float>,
paddle::platform::complex<double>) {}

PT_REGISTER_CTX_KERNEL(matmul_grad_grad,
Contributor


Suggestion: keep the name consistent with the function, i.e. matmul_double_grad; same for the alias_name.

Contributor Author


Done

Contributor

@chenwhql chenwhql left a comment


LGTM

@zyfncg zyfncg merged commit be81771 into PaddlePaddle:develop Jan 11, 2022
@Xreki
Contributor

Xreki commented Jan 13, 2022

I suspect this PR caused the backward performance of linear to degrade by roughly 2x:

  1. OP Benchmark data from Jan 6:
    (benchmark screenshot omitted)

The nvprof results for linear_2 are as follows:

run command: nvprof --profile-from-start off /work/.virtualenvs_cuda11.4/paddle_py38/bin/python /work/benchmark/api/dynamic_tests_v2/linear.py --api_name linear --task speed --framework paddle --testing_mode dynamic --json_file /work/benchmark/api/tests_v2/configs/linear.json --config_id 2 --backward True --use_gpu True --repeat 1000 --allow_adaptive_repeat True --profiler nvprof

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  36.07%   199.88ms  2000   99.938us  93.696us  136.29us  volta_sgemm_64x32_sliced1x4_tn
                 30.52%   169.10ms  2000   84.548us  81.408us  92.960us  volta_sgemm_64x32_sliced1x4_nn
                 27.48%   152.30ms  2000   76.148us  71.040us  86.752us  volta_sgemm_128x32_nt
                  1.96%   10.845ms  2000   5.4220us  5.1520us  10.528us  void splitKreduce_kernel<float, float, float, float, bool=1, bool=0>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *, float const *, void*, long, float*, int*)
                  1.41%   7.8399ms  2000   3.9190us  3.7430us  10.912us  void pten::ElementwiseBroadcastKernel<float, float, pten::funcs::AddFunctor<float>, int=2, int=1, int=4, int=2>(...)
                  1.34%   7.4067ms  2000   3.7030us  3.5510us  9.0560us  void pten::kernels::ReduceHigherDimKernel<float, float, float, paddle::operators::kernel_primitives::AddFunctor<float>, paddle::operators::kernel_primitives::IdentityFunctor<float, float>>(float const *, float*, float, paddle::operators::kernel_primitives::AddFunctor<float>, float, int, int, int, paddle::operators::kernel_primitives::DimConfig)
                  1.22%   6.7504ms  2000   3.3750us  3.2000us  9.4730us  [CUDA memcpy DtoD]
total gpu_time: 554.1447 ms
  2. OP Benchmark data from Jan 12:
    (benchmark screenshot omitted)

The nvprof results for linear_2 are as follows:

run command: nvprof --profile-from-start off /work/.virtualenvs_cuda11.4/paddle_py38/bin/python /work/benchmark/api/dynamic_tests_v2/linear.py --api_name linear --task speed --framework paddle --testing_mode dynamic --json_file /work/benchmark/api/tests_v2/configs/linear.json --config_id 2 --backward True --use_gpu True --repeat 1000 --allow_adaptive_repeat True --profiler nvprof

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  33.13%   275.88ms  4000   68.968us  5.6000us  136.93us  void paddle::platform::ForRangeElemwiseOp<paddle::operators::math::ConjFunctor<float, void>>(float, unsigned long)
                 24.55%   204.44ms  2000   102.22us  96.480us  125.73us  volta_sgemm_64x32_sliced1x4_tn
                 20.20%   168.18ms  2000   84.089us  80.863us  93.632us  volta_sgemm_64x32_sliced1x4_nn
                 18.17%   151.31ms  2000   75.654us  70.720us  81.568us  volta_sgemm_128x32_nt
                  1.31%   10.906ms  2000   5.4530us  5.1510us  11.425us  void splitKreduce_kernel<float, float, float, float, bool=1, bool=0>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *, float const *, void*, long, float*, int*)
                  0.94%   7.8099ms  2000   3.9040us  3.7110us  8.8640us  void pten::ElementwiseBroadcastKernel<float, float, pten::funcs::AddFunctor<float>, int=2, int=1, int=4, int=2>(...)
                  0.89%   7.3781ms  2000   3.6890us  3.5190us  9.3440us  void pten::kernels::ReduceHigherDimKernel<float, float, float, paddle::operators::kernel_primitives::AddFunctor<float>, paddle::operators::kernel_primitives::IdentityFunctor<float, float>>(float const *, float*, float, paddle::operators::kernel_primitives::AddFunctor<float>, float, int, int, int, paddle::operators::kernel_primitives::DimConfig)
                  0.82%   6.8048ms  2000   3.4020us  3.2000us  8.9600us  [CUDA memcpy DtoD]
total gpu_time: 832.7196 ms

The new linear backward pass has an extra call to paddle::platform::ForRangeElemwiseOp<paddle::operators::math::ConjFunctor<float, void>>(float, unsigned long), yet none of the linear configurations use complex numbers. Please check the matmul computation logic.
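For illustration only, the kind of dtype guard this report implies can be sketched in standalone C++ (std::complex stands in for paddle::platform::complex; this is not the actual fix in Paddle):

#include <complex>
#include <type_traits>
#include <vector>

// Only perform the conjugate pass when the element type is complex, so that
// float/double backward paths never launch the extra elementwise kernel seen
// in the profile above. For real types this is a no-op.
template <typename T>
void ConjInPlaceIfComplex(std::vector<T>& data) {
  if constexpr (std::is_same_v<T, std::complex<float>> ||
                std::is_same_v<T, std::complex<double>>) {
    for (auto& v : data) v = std::conj(v);
  }
}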

@zyfncg
Contributor Author

zyfncg commented Jan 13, 2022

Got it, I'll look into it.

@zyfncg zyfncg deleted the pten_matmul_grad branch January 13, 2022 06:46