Add this suggestion to a batch that can be applied as a single commit. This suggestion is invalid because no changes were made to the code. Suggestions cannot be applied while the pull request is closed. Suggestions cannot be applied while viewing a subset of changes. Only one suggestion per line can be applied in a batch. Add this suggestion to a batch that can be applied as a single commit. Applying suggestions on deleted lines is not supported. You must change the existing code in this line in order to create a valid suggestion. Outdated suggestions cannot be applied. This suggestion has been applied or marked resolved. Suggestions cannot be applied from pending reviews. Suggestions cannot be applied on multi-line comments. Suggestions cannot be applied while the pull request is queued to merge. Suggestion cannot be applied right now. Please check back later.
PR Category
Auto Parallel
PR Types
Bug fixes
Description
动半pp的save_load精度对齐,目前热启动存在两个问题需要修复,描述如下:

报错直接原因是:
热启动的时候,在
float16精度下(bf16也会如此),为了防止数据下溢,反向过程中,对loss和梯度做scale,在参数更新时,即opt阶段,需要做unscale将数据缩放回来,此时会调用auto_parallel.api中的unscale_method方法,在该方法中,只要param的grad不为None,就会被添加到处理列表中,而对于非本rank的梯度,此时处于定义了但未初始化的状态,因此是未分配内存的,此时访问这些grad则会报非法访问的内存错误。报错直接原因是:
在save的时候,没有用到state_dict,而是直接保存了模型参数,未保存optimizer的参数,一方面保存的checkpoint中没有optimizer参数的信息,另一方面,导致加载时,key的名称对不上,以state_dict保存会有model.和optimizer.的前缀,后者没有。