MKLDNN design doc #3241

tensor-tang · 2017-08-04T08:15:25Z

resolve #3200

luotao1

需要补充下新方案的原因，解决了什么问题，目前阶段不解决什么问题。评论里的文字可以重新组织。

luotao1 · 2017-08-04T08:44:36Z

doc/design/mkldnn/README.MD

+
+## KeyPoints
+
+为了更好的符合PaddlePaddle的代码风格，同时又尽可能少的牺牲MKLDNN的性能。


这里需要详细写一下原因：可以放上 #3096 链接。然后总结下这个PR讨论的几点：

原来这个PR是怎么做的：

为了不牺牲MKLDNN的性能，是采用每个layer需要知道next_layer的方式。

为什么存在牺牲MKLDNN性能的地方，因为MKLDNN采用的内存排布和paddle的NCHW不一样，这样每个mkldnnlayer需要计算前先把paddle的格式转成mkldnn的，计算后再转成paddle的，有两次转换开销。

同样是第三方库，cudnn/cuda为什么不存在mkldnn这样的问题，也需要简单说一下。

原来这个PR和paddle的冲突：

paddle中，无论是重构前的layer还是重构后的op，都不会知道next layer/op的信息。

嗯，你总结的已经非常好了！thx

luotao1 · 2017-08-04T09:02:18Z

doc/design/mkldnn/README.MD

+4. 创建**MkldnnBase**，定义一些除了layer和memory相关的类和函数。包括MKLDNN会用到Stream和CpuEngine，和未来可能还会用到FPGAEngine等。
+5. 在**Argument**里添加两个MkldnnMatrixPtr，取名为mkldnnValue和mkldnnGrad，用于存放MkldnnLayer会用到的memory buffer。 并且添加函数cvt(会修改为一个更加合适的函数名)，用于处理"CPU device"和"MKLDNN device"之间memory的相互转化。
+6. 在父类Layer中的**getOutput**函数中添加一段逻辑，用于判断`deviceId`，并针对device在MKLDNN和CPU之间不统一的情况，做一个前期转换。 也就是调用`Argument`的cvt函数把output统一到需要的device上。
+7. 在原来的`FLAGS`中添加一个`use_mkldnn`的flag，用于选择是否使用MKLDNN的相关功能。


需要再写一下新方案解决了问题：

解决了每次进入一个mkldnnlayer时，两次格式转换的性能开销问题。

mkldnn有很多种数据格式，但对paddle程序透明（不对的地方请指正）。

新方案首先解决全部是mkldnnlayer的网络，暂不解决mkldnnlayer+普通layer的网络？即cvt函数只做一种形式的转换？

mkldnn有很多种数据格式，但对paddle程序透明（不对的地方请指正）。

这里的具体格式内容不准备写在文档中，会说下包含在MkldnnMatrix中。

新方案首先解决全部是mkldnnlayer的网络，暂不解决mkldnnlayer+普通layer的网络？即cvt函数只做一种形式的转换？

新方案就是都一起解决。只不过提PR的时候会先分开提。

hedaoyuan · 2017-08-04T09:21:10Z

doc/design/mkldnn/README.MD

+### Layer
+所有的layer相关的C++代码，都会在按照PaddlePaddle的目录结构存放在
+`paddle\gserver\layers`中，文件名以*Mkldnn*开头。
+并且有可能会在Layer.h和Layer.cpp里面添加少量的code，用宏定义`PADDLE_USE_MKLDNN`隔开。


并且有可能会在Layer.h和Layer.cpp里面添加少量的code，用宏定义PADDLE_USE_MKLDNN隔开。

我觉的这句话可以先删掉，是否能够添加取决于具体的PR里的逻辑逻辑。另外，我觉的是应该尽量避免在Layer.h/cpp里面引入需要用PADDLE_USE_MKLDNN隔开的代码。

同意，这部分其实是之前老的方案，其实如果可以，我们本来觉得是可以用删除符号划掉的。

不需要用删除符号，直接删了就行。

hedaoyuan · 2017-08-04T09:25:53Z

doc/design/mkldnn/README.MD

+会考虑添加部分逻辑在`benchmark\paddle\image\run.sh`，添加使用mkldnn的测试。
+
+### Others
+1. 如果在使用MKLDNN的情况下，会把CPU的Buffer对齐为64。


MKLDNN是需要用自己的allocator，还是可以用posix_memalign，然后修改这里?

目前应该还没到需要自己的allocator的时候，是可以直接posix_memalign的。但是以后要是有FPGA的engine，就需要到时候再看了。

其实如果可以的话，我也希望直接把

CHECK_EQ(posix_memalign(&ptr, 32ul, size), 0);

里的32 换为64。
不过也可以做到别的地方。

但是，我看到caffe2里面用的是dnnAllocateBuffer

那个其实不是MKL-DNN，而是MKL大包里面的dnn，具体可以看这里

噢，我明白了，多谢~

luotao1

LGTM

wangkuiyi · 2017-08-04T17:01:12Z

doc/design/mkldnn/README.MD

+## Details
+我们把集成方案大致分为了如下几个方面。
+
+### Cmake


Cmake => CMake

CMake这个产品的名字在 cmake.org 里有

wangkuiyi · 2017-08-04T17:01:39Z

doc/design/mkldnn/README.MD

+- [Others](#others)
+- [KeyPoints](#keypoints)
+
+## Overall


Overall => Overview

wangkuiyi · 2017-08-04T17:01:52Z

doc/design/mkldnn/README.MD

+Figure 1. PaddlePaddle on IA.
+</div>
+
+## Details


Details => Actions

wangkuiyi · 2017-08-04T17:02:14Z

doc/design/mkldnn/README.MD

+我们把集成方案大致分为了如下几个方面。
+
+### Cmake
+我们会在`CMakeLists.txt`中会添加`WITH_MKLDNN`的选项，当设置这个值为`ON`的时候会启用编译MKLDNN功能。同时会自动开启`OpenMP`用于提高MKLDNN的性能。


`OpenMP` => OpenMP

wangkuiyi · 2017-08-04T17:02:30Z

doc/design/mkldnn/README.MD

+### Cmake
+我们会在`CMakeLists.txt`中会添加`WITH_MKLDNN`的选项，当设置这个值为`ON`的时候会启用编译MKLDNN功能。同时会自动开启`OpenMP`用于提高MKLDNN的性能。
+
+为了让PaddlePaddle更好的发挥MKLDNN的性能，我们还会引入`WITH_MKLML`的选项，用于选择是否用MKLDNN自带的MKLML的安装包。这个安装包可以独立于MKLDNN使用，但是建议在开启MKLDNN的同时也打开MKLML的开关，这样才能发挥最好的性能。


“为了。。。。”这个短句不是加一个CMake选项的原因。可以删了。

wangkuiyi · 2017-08-04T17:11:45Z

doc/design/mkldnn/README.MD

+
+会在`v1_api_demo`目录下添加一个`mkldnn`的文件夹，里面放入一些用于mkldnn测试的demo脚本。
+
+### Benchmark


Benchmark => Benchmarking

wangkuiyi · 2017-08-04T17:12:41Z

doc/design/mkldnn/README.MD

+1. 如果在使用MKLDNN的情况下，会把CPU的Buffer对齐为64。
+2. 深入PaddlePaddle，寻找有没有其他可以优化的可能，进一步优化。比如可能会用`OpenMP`改进SGD的更新性能。
+
+## KeyPoints


英语里没有一个叫 KeyPoints 的单词。这个标题从下面的内容看应该叫 Design Concerns

wangkuiyi · 2017-08-04T17:13:06Z

doc/design/mkldnn/README.MD

+
+1. 使用**deviceId_**。为了尽可能少的在父类Layer中添加变量或者函数，我们决定使用已有的`deviceId_`变量来区分layer的属性，定义`-2`为**MkldnnLayer**特有的设备ID。
+2. 重写父类Layer的**init**函数，修改`deviceId_`为`-2`，代表这个layer是用于跑在MKLDNN的环境下。
+3. 创建**MkldnnMatrix**，用于管理MKLDNN会用到的相关memory函数、接口以及会用的到格式信息。


MkldnnMatrix => MKLDNNMatrix

是否真需要创建这个class？在目前正在重构中的PaddlePaddle会删掉Matrix这个class，以及所有相关的class，而是使用一个新的class 叫 Tensor。 Tensor只负责管理内存和访问元素，不管计算。这样不同的operators、layers可以调用不同的计算库（MLKDNN、CUDNN、等）来做计算。如果MKLDNN的功能通过加入一个Matrix-dervied class加入PaddlePaddle，一旦新版本在一两个月里上线，这些代码就要被删掉了。

MkldnnMatrix => MKLDNNMatrix

这里，我们本来的意思写code中名字，参考的是CpuMatrix，然后又想加粗，所以才会这样。我还是统一为code格式吧。

是否真需要创建这个class？在目前正在重构中的PaddlePaddle会删掉Matrix这个class，以及所有相关的class，而是使用一个新的class 叫 Tensor。

那到时候会有区分GPU的tensor还是CPU的tensor吗？还是Layer自己去区分？
因为MkldnnMatrix 的存在就是为了实现区分不同设备，以及负责设备之间buffer的转换。假如把MKL-DNN也看做是某种设备的话，这样PaddlePaddle总共有3种设备：MKL-DNN、CPU和GPU。

经过上次和几位Baidu同事的讨论，这个方法是在目前架构下比较可行的。
由于我们对新的框架不是很了解，不过我想原有的CpuMatrix和GpuMatrix那些类应该也会有相应的替代吧，或者，请问有没有新框架的说明，我们也好有一个提前准备。

那到时候会有区分GPU的tensor还是CPU的tensor吗？还是Layer自己去区分？

Tensor自带了place信息，即Tensor本身可以区分GPU/CPU/其他设备。

wangkuiyi · 2017-08-04T17:15:52Z

doc/design/mkldnn/README.MD

+1. 使用**deviceId_**。为了尽可能少的在父类Layer中添加变量或者函数，我们决定使用已有的`deviceId_`变量来区分layer的属性，定义`-2`为**MkldnnLayer**特有的设备ID。
+2. 重写父类Layer的**init**函数，修改`deviceId_`为`-2`，代表这个layer是用于跑在MKLDNN的环境下。
+3. 创建**MkldnnMatrix**，用于管理MKLDNN会用到的相关memory函数、接口以及会用的到格式信息。
+4. 创建**MkldnnBase**，定义一些除了layer和memory相关的类和函数。包括MKLDNN会用到Stream和CpuEngine，和未来可能还会用到FPGAEngine等。


MkldnnBase => MKLDNN

同上，thx

wangkuiyi · 2017-08-04T17:16:06Z

doc/design/mkldnn/README.MD

+1. 使用**deviceId_**。为了尽可能少的在父类Layer中添加变量或者函数，我们决定使用已有的`deviceId_`变量来区分layer的属性，定义`-2`为**MkldnnLayer**特有的设备ID。
+2. 重写父类Layer的**init**函数，修改`deviceId_`为`-2`，代表这个layer是用于跑在MKLDNN的环境下。
+3. 创建**MkldnnMatrix**，用于管理MKLDNN会用到的相关memory函数、接口以及会用的到格式信息。
+4. 创建**MkldnnBase**，定义一些除了layer和memory相关的类和函数。包括MKLDNN会用到Stream和CpuEngine，和未来可能还会用到FPGAEngine等。


CpuEngine => CPUEngine 英语里缩写要大写

下面的代码命名里有不少这样的问题。

谢谢，其实我是想写CpuEngine，参考的是CpuMatrix和CpuEngine。

会统一改掉，谢谢~

tensor-tang · 2017-08-06T09:55:37Z

感谢 @wangkuiyi 的宝贵意见。
做了相应的改动，请查阅。

luotao1 · 2017-08-07T01:33:09Z

请merge最新的develop分支，来过teamcity

tensor-tang · 2017-08-07T01:44:12Z

Done, thx

tensor-tang added 3 commits August 3, 2017 21:52

add mkldnn design doc

fee7c8b

refine doc

0fb0484

refine and add overview

450ac64

tensor-tang requested a review from luotao1 August 4, 2017 08:15

luotao1 requested review from Xreki, hedaoyuan and kuke August 4, 2017 08:31

luotao1 reviewed Aug 4, 2017

View reviewed changes

hedaoyuan reviewed Aug 4, 2017

View reviewed changes

add why new design counts

6ee16af

luotao1 approved these changes Aug 4, 2017

View reviewed changes

wangkuiyi requested changes Aug 4, 2017

View reviewed changes

change some names and refine

6405438

tensor-tang force-pushed the merge-doc branch from 54c8c00 to 6405438 Compare August 6, 2017 09:52

Merge remote-tracking branch 'upstream/develop' into merge-doc

4bc1dc2

wangkuiyi approved these changes Aug 7, 2017

View reviewed changes

wangkuiyi merged commit 2db001d into PaddlePaddle:develop Aug 7, 2017

tensor-tang deleted the merge-doc branch August 15, 2017 03:49

tensor-tang mentioned this pull request Nov 30, 2017

polish mkldnn doc #5910

Merged


		## KeyPoints

		为了更好的符合PaddlePaddle的代码风格，同时又尽可能少的牺牲MKLDNN的性能。


		会在`v1_api_demo`目录下添加一个`mkldnn`的文件夹，里面放入一些用于mkldnn测试的demo脚本。

		### Benchmark

MKLDNN design doc #3241

MKLDNN design doc #3241

Conversation

tensor-tang commented Aug 4, 2017

luotao1 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

luotao1 Aug 4, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

luotao1 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tensor-tang commented Aug 6, 2017

luotao1 commented Aug 7, 2017

tensor-tang commented Aug 7, 2017

Labels

4 participants

luotao1 Aug 4, 2017 •

edited

Loading