Update deep_ep intranode & internode kernels #74284
Merged
PR Category
Communication Library
PR Types
Performance
Description
Update the underlying intranode & internode kernels to the upstream commit deepseek-ai/DeepEP@079c5a4 (July 14).
That commit already includes the TMA optimization for internode performance.
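Changes in this PR
- Copy intranode.cu, internode.cu, configs.cuh, and ibgda_device.cuh over directly.
- Copy launch.cuh and utils.cuh over, but keep the deprecated functions that low_latency still depends on (low_latency is maintained by the inference team and is left unchanged).
- Copy runtime.cu and layout.cu over and merge them into a single runtime.cu (they were merged the same way before).
- Copy over the intranode & internode parts of api.cuh.
- Make minor changes to the member variables of Buffer in deep_ep.hpp.
- In deep_ep.cpp, modify the Buffer constructor and sync method, as well as the places that call intranode & internode, so that the newly added member variables are set correctly and the code matches the new CUDA-layer interface.
- Add a helper method in types.h.
Correctness testing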
Ran unit tests with test_intranode.py and test_internode.py (2, 4, and 8 nodes); all passed.
Ran end-to-end convergence tests with DeepseekV3 under several PP and EP configurations; all passed.
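Performance changes
The advantage of the new version is that it reaches the same communication bandwidth with fewer SMs, which leaves more SMs for computation.
For example, on DeepseekV3: deepep SMs 20 -> 14, deepgemm SMs 112 -> 118, for an end-to-end improvement of 1-2%.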
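The SM trade-off can be sketched as simple budget arithmetic. This is only an illustration, not code from this PR: `TOTAL_SMS` and `compute_sms` are hypothetical names, and the total of 132 is inferred from the numbers above (20 + 112 = 14 + 118 = 132), on the assumption that the communication and GEMM kernels share one fixed SM pool.

```python
# Hypothetical illustration of the SM budget split described above.
# Assumption: communication (deepep) and compute (deepgemm) share a fixed
# pool of SMs; 132 is inferred from 20 + 112 = 14 + 118.
TOTAL_SMS = 132


def compute_sms(total_sms: int, comm_sms: int) -> int:
    """Return the SMs left for compute after reserving comm_sms for communication."""
    if not 0 <= comm_sms <= total_sms:
        raise ValueError("comm_sms must be between 0 and total_sms")
    return total_sms - comm_sms


old_compute = compute_sms(TOTAL_SMS, 20)  # old kernels: 20 SMs for deepep
new_compute = compute_sms(TOTAL_SMS, 14)  # new kernels: 14 SMs for deepep

# Relative increase in SMs available to compute kernels.
relative_gain = (new_compute - old_compute) / old_compute

print(f"compute SMs: {old_compute} -> {new_compute} (+{relative_gain:.1%})")
```

The roughly 5% increase in SMs available for compute is larger than the reported 1-2% end-to-end gain. The description does not explain the gap; one plausible reading is that only part of the step time is spent in kernels that benefit from the extra SMs.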
Pcard-85711