[Paddle-Inference] Matmul_int8_convert: tensor*tensor #37285
Merged
PR types
Others
PR changes
Others
Describe
Add an op_convert and plugin for int8-quantized matmul inference. The plugin calls the Tensor Cores on NVIDIA GPUs to speed up the matrix multiplication, and provides int8, fp16, and fp32 implementations. By passing alpha into the plugin so the scaling is computed together with the matrix multiplication, matmul+scale is fused into a single kernel, accelerating inference. Also adds a dynload implementation that dynamically loads libcublasLt.so, and adds unit tests for the corresponding quantization.
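For illustration only, here is a minimal sketch (not the actual plugin code) of how an int8 GEMM with the scale folded into alpha can be issued through cuBLASLt; the function name and the int32-accumulate configuration are assumptions, and the real plugin additionally handles the input re-layout and quantization scales:

```cpp
// Minimal sketch, assuming CUDA >= 11 and a Tensor-Core-capable GPU.
// Not the actual Paddle plugin code: the function name and the
// int32-accumulate configuration are illustrative assumptions.
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdint>

// Computes D = alpha * (A^T @ B) with int8 inputs and int32 accumulation.
// A is stored k x m column-major (the int8 Tensor Core path with plain
// column-major layouts requires the TN transpose combination), B is
// k x n, D is m x n; m, n, k are assumed to be multiples of 4, as the
// int8 kernels require. Error checking is omitted for brevity.
void Int8MatmulWithScale(cublasLtHandle_t handle,
                         const int8_t* A, const int8_t* B, int32_t* D,
                         int m, int n, int k, int32_t alpha,
                         cudaStream_t stream) {
  int32_t beta = 0;

  cublasLtMatmulDesc_t op_desc;
  cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32I, CUDA_R_32I);
  cublasOperation_t trans = CUBLAS_OP_T;
  cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_TRANSA,
                                 &trans, sizeof(trans));

  cublasLtMatrixLayout_t a_desc, b_desc, d_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_8I, k, m, k);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_8I, k, n, k);
  cublasLtMatrixLayoutCreate(&d_desc, CUDA_R_32I, m, n, m);

  // The scale rides along as alpha, so no separate scale kernel runs:
  // this is the matmul+scale fusion described above.
  cublasLtMatmul(handle, op_desc, &alpha,
                 A, a_desc, B, b_desc, &beta,
                 D, d_desc, D, d_desc,
                 /*algo=*/nullptr, /*workspace=*/nullptr, 0, stream);

  cublasLtMatrixLayoutDestroy(d_desc);
  cublasLtMatrixLayoutDestroy(b_desc);
  cublasLtMatrixLayoutDestroy(a_desc);
  cublasLtMatmulDescDestroy(op_desc);
}
```

And a similarly minimal sketch of loading libcublasLt.so at runtime, in the spirit of the dynload addition (the wrapper name is illustrative):

```cpp
#include <dlfcn.h>
#include <cstdio>

// Opens libcublasLt.so at runtime so the inference library does not
// carry a hard link-time dependency on it.
void* LoadCublasLtSo() {
  void* handle = dlopen("libcublasLt.so", RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "failed to load libcublasLt.so: %s\n", dlerror());
  }
  return handle;  // symbols are then resolved with dlsym(handle, "...")
}
```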
Performance test: A(1, 28, 256, 1024) * B(1, 28, 1024, 256)
Execution time of the kernel (matmul fused with scale):
Execution time of a single-op network (matmul fused with scale): (the int8 matmul has to re-lay-out its input data to use the Tensor Cores, which actually adds overhead; the speedup from int8 matrix math only shows once the matrices are very large. This op pre-analyzes the tensors and automatically selects whichever of the int8, fp16, and fp32 plugins performs best; a hypothetical sketch of that dispatch appears at the end of this description.)
Execution time of the kernel:
Execution time of a single-op network:
Summary: when the matrices are large, the matmul int8 op delivers a clear speedup; when a scale op can be fused in, the speedup is also pronounced.
Note: matmul int8 also reduces GPU memory usage slightly, by about 5%.
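As a rough illustration of the precision dispatch mentioned above, here is a hypothetical sketch; the names, the FLOP-based heuristic, and the threshold are assumptions for illustration, not the actual pre-analysis logic:

```cpp
// Hypothetical sketch of the precision dispatch described above; the
// names, threshold, and structure are illustrative assumptions, not
// Paddle's actual implementation.
enum class MatmulPrecision { kInt8, kFp16, kFp32 };

// Picks a plugin precision from a pre-analysis of the problem size.
// int8 only pays off once the matrices are large enough to amortize
// the input re-layout required by the Tensor Core int8 path.
MatmulPrecision ChoosePluginPrecision(int m, int n, int k,
                                      bool int8_available,
                                      bool fp16_available) {
  const long long flops = 2LL * m * n * k;
  const long long kInt8Threshold = 1LL << 30;  // illustrative cutoff
  if (int8_available && flops >= kInt8Threshold) {
    return MatmulPrecision::kInt8;
  }
  if (fp16_available) {
    return MatmulPrecision::kFp16;
  }
  return MatmulPrecision::kFp32;
}
```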