Description
I'm working on optimizing the elementwise_add operator for CPU. The operator adds two tensors x and y element by element and stores the result in tensor z. I'm currently focusing on the case where both operands x and y have equal dimensions.
The optimization uses MKL VML's v?Add routine, which performs elementwise addition:
https://software.intel.com/en-us/mkl-developer-reference-c-v-add
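For reference, a minimal sketch of how the VML calls look for float and double data (assuming a standard MKL installation; vsAdd/vdAdd are the single/double precision instances of v?Add, and the buffer names here are only illustrative):

```cpp
#include <mkl.h>

// Sketch: elementwise z = x + y with MKL VML.
// x, y, z point to contiguous buffers of length n.
void vml_add_float(MKL_INT n, const float* x, const float* y, float* z) {
  vsAdd(n, x, y, z);   // single precision: z[i] = x[i] + y[i]
}

void vml_add_double(MKL_INT n, const double* x, const double* y, double* z) {
  vdAdd(n, x, y, z);   // double precision: z[i] = x[i] + y[i]
}
```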
When elementwise_add runs on GPU, or when x and y have different dimensions, the algorithm falls back to the default implementation.
To implement the optimization, I extended the interface of the PaddlePaddle BLAS code:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/math/blas.h
with two operations: VADD, which performs the elementwise add via the VML v?Add routine, and VCOPY, which copies one vector into another using the BLAS level 1 routine cblas_?copy. For the non-MKL case, I implement VADD using VCOPY together with the already available SAXPY routine.
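A rough sketch of what the non-MKL fallback boils down to in terms of standard CBLAS level 1 calls (this is not the actual Paddle Blas interface, just an illustration; the function and variable names are my own):

```cpp
#include <cblas.h>

// Sketch of the non-MKL fallback for z = x + y (single precision).
// VCOPY maps to cblas_scopy and SAXPY maps to cblas_saxpy.
void vadd_fallback(int n, const float* x, const float* y, float* z) {
  cblas_scopy(n, y, 1, z, 1);        // z = y
  cblas_saxpy(n, 1.0f, x, 1, z, 1);  // z = 1.0 * x + z  =>  z = x + y
}
```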
Would it be OK with you to extend the interface of the Blas routines in PaddlePaddle for CPU?
Currently the algorithm is as follows. What do you think about it?
```cpp
x = ctx.Input<T>("X");
y = ctx.Input<T>("Y");
z = ctx.Output<T>("Z");
if (ctx.is_cpu_place() && x.dims() == y.dims()) {
  flatten(x);
  flatten(y);
  flatten(z);
  if (MKL_is_used()) {
    VADD(x->numel(), x, y, z);
  } else {
    // SAXPY implements y = alpha * x + y,
    // so the content of y is first copied to z
    // and then x is added to z.
    VCOPY(y, z);
    SAXPY(x->numel(), 1.0 /*alpha*/, x, z);
  }
} else {
  // fall back to the default implementation
}
```
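For clarity, here is a small self-contained check that the copy-plus-SAXPY path gives the same result as a plain elementwise loop (assumes any CBLAS implementation is available; purely illustrative):

```cpp
#include <cblas.h>
#include <cassert>
#include <cmath>
#include <vector>

int main() {
  const int n = 4;
  std::vector<float> x{1.f, 2.f, 3.f, 4.f};
  std::vector<float> y{10.f, 20.f, 30.f, 40.f};
  std::vector<float> z(n);

  // Fallback path: z = y, then z = 1.0 * x + z.
  cblas_scopy(n, y.data(), 1, z.data(), 1);
  cblas_saxpy(n, 1.0f, x.data(), 1, z.data(), 1);

  // Reference: plain elementwise addition.
  for (int i = 0; i < n; ++i) {
    assert(std::fabs(z[i] - (x[i] + y[i])) < 1e-6f);
  }
  return 0;
}
```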