Support hybrid_parallel_topo_order for auto parallel Llama #8011
PR types
New features
PR changes
Models
Description
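Adapt the static-graph semi-auto parallel Llama model to the hybrid_parallel_topo_order parameter. The default hybrid-parallel topology order is changed to ["pp", "dp", "mp"], which matches the dynamic-graph behavior when hybrid_parallel_topo_order=="pp_first". The original order ["dp", "pp", "mp"] is kept only when hybrid_parallel_topo_order=="sharding_first".
【Precision impact of changing the topology order】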
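This change had the unintended side effect of changing the seeds used for random parameter initialization in both the legacy static semi-auto network and the unified dynamic-static network, which in turn changed the training loss.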
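A minimal sketch of the effect, assuming a simplified three-axis mesh (the real topology also has sharding and sep axes). The two axis orders come from this description; `resolve_topo_order`, `rank_to_coord`, and the `toy_seed` formula are hypothetical illustrations, not PaddleNLP code. It shows that decoding a rank with a hard-coded order while the mesh actually uses another order gives some (dp, pp, mp) positions a different seed.

```python
def resolve_topo_order(hybrid_parallel_topo_order="pp_first"):
    """Hypothetical helper: map the config value to the axis order named in this PR."""
    if hybrid_parallel_topo_order == "sharding_first":
        return ["dp", "pp", "mp"]  # original order, kept only for sharding_first
    return ["pp", "dp", "mp"]      # new default, aligned with pp_first


def rank_to_coord(rank, order, degrees):
    """Decode a global rank into named coordinates; the last axis in `order` varies fastest."""
    coord = {}
    for axis in reversed(order):
        coord[axis] = rank % degrees[axis]
        rank //= degrees[axis]
    return coord


def toy_seed(base_seed, coord):
    """Hypothetical seed formula that depends only on the named pp and mp coordinates."""
    return base_seed + 100 * coord["pp"] + coord["mp"]


degrees = {"dp": 2, "pp": 2, "mp": 2}
actual_order = resolve_topo_order("pp_first")
hard_coded_order = ["dp", "pp", "mp"]  # order assumed by the seed code before the fix

seeds_hard_coded, seeds_following = {}, {}
for rank in range(8):
    actual = rank_to_coord(rank, actual_order, degrees)
    key = (actual["dp"], actual["pp"], actual["mp"])
    # Seed code that ignores the real order decodes the rank with the wrong layout.
    seeds_hard_coded[key] = toy_seed(1234, rank_to_coord(rank, hard_coded_order, degrees))
    # Seed code that follows the real order sees the true coordinates.
    seeds_following[key] = toy_seed(1234, actual)

mismatched = sorted(k for k in seeds_following if seeds_hard_coded[k] != seeds_following[k])
print(mismatched)
```

In this toy setup the seeds differ exactly for the positions where the dp and pp coordinates differ, which is the kind of silent change that shows up as a different loss.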
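Legacy static semi-auto network: random seeds are generated with the _get_distributed_seeds method in PaddleNLP, and the topology order passed to it was hard-coded as [dp,pp]. This PR upgrades the Topology type used to initialize the random seeds from supporting only ["dp", "pp", "sharding", "mp", "sep"] to accepting an arbitrary topology order, so that the loss stays the same across topology orders.
Unified dynamic-static network: random seeds are generated with the determinate_rng method in the framework, which builds the seed from the mesh's global auto-increment id. get_mesh_with_dim, the interface used to take the pp-dimension mesh, introduces an offset in that global auto-increment id, which changes the seeds. This PR, together with framework PR PaddlePaddle/Paddle#62125, rewrites get_mesh_with_dim so that a change in the global auto-increment id no longer changes the loss. See the framework PR description for details.
【Convergence verification for the unified dynamic-static network and CI loss baseline update】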
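As background for why recorded baselines are tied to mesh ids, here is a toy model of the id-offset effect. `ToyMesh`, `take_stage_meshes`, and `toy_seed` are hypothetical stand-ins and do not reflect Paddle's actual implementation. The only assumption taken from this description is that the seed is derived from a global auto-increment mesh id.

```python
import itertools


class ToyMesh:
    """Hypothetical stand-in for a process mesh that draws a global auto-increment id."""

    _next_id = itertools.count()

    def __init__(self, ranks):
        self.ranks = ranks
        self.unique_id = next(ToyMesh._next_id)

    @classmethod
    def reset(cls):
        """Restart the global counter, as if a fresh process had started."""
        cls._next_id = itertools.count()


def take_stage_meshes(ranks_per_stage):
    """Taking each pp stage by constructing a new mesh consumes one id per stage."""
    return [ToyMesh(ranks) for ranks in ranks_per_stage]


def toy_seed(base_seed, mesh):
    """Hypothetical seed formula that depends on the mesh's global id."""
    return base_seed + mesh.unique_id


# Scenario A: the mesh used for initialization is created first.
ToyMesh.reset()
seed_a = toy_seed(1234, ToyMesh([0, 1]))

# Scenario B: two stage sub-meshes are taken first, shifting every later id by 2.
ToyMesh.reset()
take_stage_meshes([[0, 1], [2, 3]])
seed_b = toy_seed(1234, ToyMesh([0, 1]))

print(seed_a, seed_b)
```

The same ranks end up with a different seed purely because of how many mesh objects were created earlier, so a baseline recorded under one id sequence does not carry over to another.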
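For the new static semi-auto network: after the get_mesh_with_dim interface is rewritten, swapping the topology order yields the same mesh auto-increment ids, so the loss does not change. However, the baseline losses that the dynamic semi-auto models on CI previously produced with the old ids have to be changed, and this PR updates them as well. Among the cases currently monitored on CI,
the following three dynamic semi-auto fp32 tasks, dp2, dp2-mp2 and dp2-mp2-pp2, were selected for convergence verification (the legacy network keeps its precision after the fix and needs no verification; the changes do not touch the fp16 logic, so only fp32 needs to be verified):
The convergence curves are as follows: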


dp2-mp2-pp2:
dp2-mp2:


dp2:


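While this PR was being tested, two other PRs introduced expected precision changes that CI did not catch. This PR also updates the loss for the affected cases.
SwiGLU: #8038
Affected cases:
master_grad change: PaddlePaddle/Paddle#62276
Affected cases: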