[Speed up compiling]: reduce the NVCC compiling (some .cu operators can be compiled by G++) #5491

@qingqing01


Compiling time comparison between NVCC and G++

  1. Conclusion:

    • NVCC is much slower than G++, by more than 1 minute per file. For example, for elementwise_mul_op, the compile time is 13s (G++) vs. 1m41s (NVCC).
    • The more CUDA gencodes (gencode: sm_xx) are enabled, the slower NVCC compiles: 1m41s with four gencodes vs. 34s with sm_35 only.
  2. Experiment 1: elementwise_mul_op, this op uses Eigen to compute

    • G++

      • compile :
      time /home/dangqingqing/.jumbo/opt/gcc48/bin/c++ -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_DISABLE_PROFILER -DPADDLE_DISABLE_RDMA -DPADDLE_DISABLE_TIMER -DPADDLE_USE_DSO -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_VERSION=0.10.0rc4 -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_WITH_TESTING -mavx -std=c++11 -fPIC -fno-omit-frame-pointer -Wall -Wextra -Werror -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wno-unused-parameter -Wno-unused-function -Wno-error=literal-suffix -Wno-error=sign-compare -Wno-error=unused-local-typedefs -O2 -g -DNDEBUG -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -o CMakeFiles/elementwise_mul_op.dir/elementwise_mul_op.cc.o -c /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cc
      • time:
      real	0m13.116s user	0m12.264s sys	0m0.849s
    • NVCC

      • gencode: sm_30, sm_35, sm_50,sm_52
      • compile :
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/elementwise_mul_op.dir//./elementwise_mul_op_generated_elementwise_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include 
-I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include 
      • time:
      real	1m41.708s user	1m30.757s sys 0m10.992s
    • NVCC, gencode: only sm_35

      • compile :
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/elementwise_mul_op.dir//./elementwise_mul_op_generated_elementwise_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_35,code=sm_35 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle 
-I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include
      • time:
       real	0m34.035s user	0m30.629s sys	0m3.414s 
  3. Experiment 2: mul_op, this op uses math::matmul to compute.

    • G++
      • compile :
       time /home/dangqingqing/.jumbo/opt/gcc48/bin/c++ -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_DISABLE_PROFILER -DPADDLE_DISABLE_RDMA -DPADDLE_DISABLE_TIMER -DPADDLE_USE_DSO -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_VERSION=0.10.0rc4 -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_WITH_TESTING -mavx -std=c++11 -fPIC -fno-omit-frame-pointer -Wall -Wextra -Werror -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wno-unused-parameter -Wno-unused-function -Wno-error=literal-suffix -Wno-error=sign-compare -Wno-error=unused-local-typedefs -O2 -g -DNDEBUG -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -o CMakeFiles/mul_op.dir/mul_op.cc.o -c /home/dangqingqing/github/myfork/Paddle/paddle/operators/mul_op.cc
      • time:
      real	0m11.383s user	0m10.568s sys	0m0.825s
    • NVCC: gencode: sm_30, sm_35, sm_50,sm_52
      • compile:
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/mul_op.dir//./mul_op_generated_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build 
-I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include 
      • time:
      real	1m31.902s user	1m21.839s sys	0m10.026s
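
The slowdown factors behind the conclusion can be worked out directly from the `real` times reported above. The following is a minimal sketch of that arithmetic; the helper names `parse_time` and `slowdown` and the configuration labels are illustrative assumptions, while the timings are copied from the two experiments.

```python
import re

def parse_time(text: str) -> float:
    """Convert a `time`-style duration such as '1m41.708s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

# 'real' (wall-clock) times copied from the experiments above.
# The configuration labels are illustrative names, not build options.
REAL = {
    ("elementwise_mul_op", "g++"): "0m13.116s",
    ("elementwise_mul_op", "nvcc, 4 gencodes"): "1m41.708s",
    ("elementwise_mul_op", "nvcc, sm_35 only"): "0m34.035s",
    ("mul_op", "g++"): "0m11.383s",
    ("mul_op", "nvcc, 4 gencodes"): "1m31.902s",
}

def slowdown(op: str, config: str) -> float:
    """Wall-clock time of `config` divided by the G++ time for the same op."""
    return parse_time(REAL[(op, config)]) / parse_time(REAL[(op, "g++")])

if __name__ == "__main__":
    for op, config in REAL:
        print(f"{op:20s} {config:18s} {slowdown(op, config):.1f}x")
```

With these numbers, NVCC with four gencodes is roughly 7.8x slower than G++ for elementwise_mul_op and roughly 8.1x slower for mul_op, while restricting NVCC to sm_35 brings elementwise_mul_op down to roughly 2.6x.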

The .cu operators which can be compiled by G++

The following .cu operators can be compiled by G++, since the CUDA kernels they depend on have already been compiled in the math libraries (paddle/operator/math/ directory). Operators that call cuDNN can also be compiled by G++.

batch_norm_op.cu concat_op.cu conv2d_transpose_cudnn_op.cu conv_cudnn_op.cu conv_op.cu conv_transpose_op.cu fill_constant_batch_size_like_op.cu fill_constant_op.cu fill_zeros_like_op.cu gru_op.cu linear_chain_crf_op.cu lstm_op.cu matmul_op.cu mul_op.cu nccl_op.cu nccl_op_test.cu pool_cudnn_op.cu pool_op.cu pool_with_index_op.cu sequence_conv_op.cu reshape_op.cu sequence_concat_op.cu sequence_softmax_op.cu softmax_op.cu split_op.cu 

However, having different compiling rules for different operators may be a little confusing for developers.

Also related to #5413
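
One way to keep the per-operator rule from becoming confusing is to hold it in a single place. The sketch below is only an illustration of that idea, not part of the Paddle build system: the names `GXX_COMPILABLE_CU` and `compiler_for` are hypothetical, and the allowlist is simply the file list given above.

```python
# .cu operators listed above as compilable by G++ (copied from this issue).
GXX_COMPILABLE_CU = frozenset("""
batch_norm_op.cu concat_op.cu conv2d_transpose_cudnn_op.cu conv_cudnn_op.cu
conv_op.cu conv_transpose_op.cu fill_constant_batch_size_like_op.cu
fill_constant_op.cu fill_zeros_like_op.cu gru_op.cu linear_chain_crf_op.cu
lstm_op.cu matmul_op.cu mul_op.cu nccl_op.cu nccl_op_test.cu pool_cudnn_op.cu
pool_op.cu pool_with_index_op.cu sequence_conv_op.cu reshape_op.cu
sequence_concat_op.cu sequence_softmax_op.cu softmax_op.cu split_op.cu
""".split())

def compiler_for(filename: str) -> str:
    """Pick a compiler for an operator source file (hypothetical helper).

    .cc files always go to G++; .cu files go to G++ only when they appear
    in the allowlist, and to NVCC otherwise.
    """
    if filename.endswith(".cc"):
        return "g++"
    if filename.endswith(".cu"):
        return "g++" if filename in GXX_COMPILABLE_CU else "nvcc"
    raise ValueError(f"not an operator source file: {filename!r}")

if __name__ == "__main__":
    for name in ("mul_op.cu", "elementwise_mul_op.cu", "mul_op.cc"):
        print(name, "->", compiler_for(name))
```

Under this sketch, mul_op.cu is routed to G++, while elementwise_mul_op.cu, which is not in the list, still goes to NVCC. Developers would then only need to consult or update one list rather than remember a rule per operator.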
