Optimizing elementwise_add for CPU with MKL #10786

@tpatejko

Description

I am working on optimizing the elementwise_add operator for CPU. The operator adds two tensors x and y element by element and stores the result in tensor z. I'm currently focusing on the case where both operands x and y have equal dimensions.

The optimization uses the MKL VML v?Add routine, which performs elementwise addition:
https://software.intel.com/en-us/mkl-developer-reference-c-v-add
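
For reference, this is what the MKL path boils down to for float data, in a minimal standalone sketch (vsAdd is the single-precision variant, vdAdd the double-precision one):

```cpp
#include <vector>
#include "mkl.h"  // MKL VML: vsAdd / vdAdd

int main() {
  const MKL_INT n = 8;
  std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);

  // z[i] = x[i] + y[i] for all i, in one vectorized VML call.
  vsAdd(n, x.data(), y.data(), z.data());

  return 0;
}
```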

When elementwise_add is performed on GPU, or x and y have different dimensions, the algorithm falls back to the default implementation.

To implement the optimization, I extended the interface of the PaddlePaddle BLAS wrapper:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/math/blas.h

with two operations: VADD, which performs elementwise addition with the VML v?Add routine, and VCOPY, which copies one vector to another using the BLAS level 1 ?copy routine. In the non-MKL case, I implement VADD with VCOPY followed by the already available SAXPY routine.
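
For the non-MKL case, the VCOPY + SAXPY combination corresponds to the following plain CBLAS sequence (a minimal sketch with raw cblas calls, not the actual PaddlePaddle wrapper code):

```cpp
#include <vector>
#include <cblas.h>  // any CBLAS implementation (OpenBLAS, reference BLAS, ...)

int main() {
  const int n = 8;
  std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);

  cblas_scopy(n, y.data(), 1, z.data(), 1);        // z = y
  cblas_saxpy(n, 1.0f, x.data(), 1, z.data(), 1);  // z = 1.0 * x + z, i.e. z = x + y

  return 0;
}
```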

Is it OK with you to extend the interface of the Blas routines in PaddlePaddle for CPU?
Currently the algorithm is as follows. What do you think about it?

```cpp
x = ctx.Input<T>("X");
y = ctx.Input<T>("Y");
z = ctx.Output<T>("Z");

if (ctx.is_cpu_place() && x.dims() == y.dims()) {
  flatten(x); flatten(y); flatten(z);
  if (MKL_is_used()) {
    VADD(x->numel(), x, y, z);
  } else {
    // SAXPY implements y = alpha * x + y,
    // so the content of y is first copied to z
    // and then x is added to z.
    VCOPY(y, z);
    SAXPY(x->numel(), 1.0 /*alpha*/, x, z);
  }
} else {
  // fall back to the default implementation
}
```
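
For context, the extension to blas.h could look roughly like this; the template structure and signatures below are only my approximation of the existing Blas wrapper style, not a final interface:

```cpp
// Hypothetical sketch of the two new methods on the Blas wrapper in
// paddle/fluid/operators/math/blas.h; names and signatures are assumptions.
template <typename DeviceContext>
class Blas {
 public:
  // z = x + y elementwise; dispatches to MKL VML v?Add when MKL is available,
  // otherwise falls back to VCOPY + AXPY.
  template <typename T>
  void VADD(int n, const T* x, const T* y, T* z) const;

  // y = x; thin wrapper over the BLAS level 1 ?copy routine.
  template <typename T>
  void VCOPY(int n, const T* x, T* y) const;
};
```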
