[WIP] pd: support dpa2/dpa3 inference #4846
Conversation
Pull Request Overview
This PR adds support for DPA2/DPA3 C++ inference with LAMMPS by implementing Paddle-based communication operations. The implementation addresses missing MPI communication functionality required for these models in distributed inference scenarios.
- Implements new communication operations for border exchange between MPI processes
- Adds C++ Paddle extensions with custom operators for message passing
- Updates C++ inference API to handle DPA2/DPA3 model requirements with proper tensor naming
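For reviewers unfamiliar with the pattern, the border exchange the new op implements is a standard halo exchange between MPI ranks. A minimal single-process NumPy stand-in (illustrative names, no MPI, not the PR's actual code) of the data movement looks like:

```python
import numpy as np

def halo_exchange(local_rows, sendlists):
    """Single-process stand-in for an MPI border (halo) exchange.

    local_rows[r]   : (nlocal_r, nfeat) array owned by simulated rank r.
    sendlists[s][d] : indices of rank s's local rows to ship to rank d.
    Returns, per rank, its local rows followed by the ghost rows
    received from every other rank.
    """
    nranks = len(local_rows)
    out = []
    for dst in range(nranks):
        # gather the rows every other rank would send to `dst`
        ghosts = [local_rows[src][sendlists[src][dst]]
                  for src in range(nranks) if src != dst]
        out.append(np.concatenate([local_rows[dst], *ghosts], axis=0))
    return out
```

In the real op the gather/concatenate step is replaced by `MPI_Send`/`MPI_Recv` pairs per swap, with the ghost region appended after the local atoms.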
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description | 
|---|---|
| source/op/pd/comm.cc | New custom Paddle operator implementation for MPI border communication | 
| source/op/pd/CMakeLists.txt | Build configuration for Paddle custom operators with MPI support | 
| source/api_cc/src/DeepPotPD.cc | Updates to C++ inference API for DPA2/DPA3 tensor handling | 
| source/api_cc/include/DeepPotPD.h | Header additions for communication tensor management | 
| source/CMakeLists.txt | Integration of Paddle operator build system | 
| deepmd/pd/model/descriptor/*.py | Python model updates to support communication dictionaries | 
| deepmd/pd/entrypoints/main.py | Model freezing updates for new tensor specifications | 
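Since several of the files above thread a communication dictionary through the descriptors, a small sketch of the invariants such a dict must satisfy may help reviewers. The key names below are hypothetical stand-ins for the send/recv index lists, peer ranks, and counts this PR passes around, not necessarily the implementation's actual tensor names:

```python
import numpy as np

def validate_comm_dict(comm_dict, nswap):
    """Sanity-check a border-communication dictionary before running the op.

    Key names are illustrative (send/recv lists, peer ranks, counts), not
    the exact names used in this PR.
    """
    assert len(comm_dict["send_list"]) == nswap, "one index list per swap"
    for key in ("send_proc", "recv_proc", "send_num", "recv_num"):
        arr = comm_dict[key]
        assert arr.shape == (nswap,) and np.issubdtype(arr.dtype, np.integer)
    # each swap must send exactly as many indices as advertised
    for idx, n in zip(comm_dict["send_list"], comm_dict["send_num"]):
        assert len(idx) == int(n)
    return True
```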
Comments suppressed due to low confidence (1)
source/op/pd/comm.cc (Outdated)

```cpp
#include <mpi-ext.h>
#endif
#endif
// #include <paddle/include/paddle_inference_api.h>
```
     Copilot AI  Jul 22, 2025  
Remove commented-out include directive or add a comment explaining why it's preserved.
```diff
-// #include <paddle/include/paddle_inference_api.h>
+// #include <paddle/include/paddle_inference_api.h> // Preserved for potential future use in inference-related functionality.
```
source/op/pd/comm.cc (Outdated)

```cpp
std::vector<int64_t> g1_tensor_shape,
std::vector<int64_t> communicator_tensor_shape,
std::vector<int64_t> nlocal_tensor_shape,
std::vector<int64_t> nghost_tenso_shape) {
```
     Copilot AI  Jul 22, 2025  
Parameter name has a typo: 'nghost_tenso_shape' should be 'nghost_tensor_shape'.
```diff
-std::vector<int64_t> nghost_tenso_shape) {
+std::vector<int64_t> nghost_tensor_shape) {
```
source/op/pd/comm.cc (Outdated)

```cpp
paddle::DataType g1_tensor_dtype,
paddle::DataType communicator_tensor_dtype,
paddle::DataType nlocal_tensor_dtype,
paddle::DataType nghost_tenso_dtype) {
```
     Copilot AI  Jul 22, 2025  
Parameter name has a typo: 'nghost_tenso_dtype' should be 'nghost_tensor_dtype'.
```diff
-paddle::DataType nghost_tenso_dtype) {
+paddle::DataType nghost_tensor_dtype) {
```
source/op/pd/comm.cc (Outdated)

```cpp
const paddle::Tensor& nghost_tensor,
paddle::Tensor& recv_g1_tensor_grad) {
bool type_flag =
    (sendlist_tensor.dtype() == paddle::DataType::FLOAT64) ? true : false;
```
     Copilot AI  Jul 22, 2025  
Type checking uses sendlist_tensor.dtype() instead of recv_g1_tensor_grad.dtype(), which is inconsistent with the template instantiation logic.
```diff
-    (sendlist_tensor.dtype() == paddle::DataType::FLOAT64) ? true : false;
+    (recv_g1_tensor_grad.dtype() == paddle::DataType::FLOAT64) ? true : false;
```
source/op/pd/CMakeLists.txt (Outdated)

```cmake
file(GLOB OP_SRC comm.cc)
add_library(deepmd_op_pd MODULE ${OP_SRC})

# ========== 新增:Paddle 配置(路径由用户外部传入) ==========
```
     Copilot AI  Jul 22, 2025  
Comments should be in English to maintain consistency with the rest of the codebase.
```diff
-# ========== 新增:Paddle 配置(路径由用户外部传入) ==========
+# ========== New: Paddle configuration (path provided externally by the user) ==========
```

```cmake
# Paddle lib
set(PADDLE_INFER_LIB ${PADDLE_LIB}/paddle/lib/libpaddle_inference.so)
# Math lib(默认用MKL)
```
     Copilot AI  Jul 22, 2025  
Comments should be in English to maintain consistency with the rest of the codebase.
```diff
-# Math lib(默认用MKL)
+# Math lib (default: MKL)
```
deepmd/pt/train/training.py (Outdated)

```python
# load
self.wrapper.load_state_dict(torch.load("./wrapper_dict.pt"))
print("model loaded")
inp = np.load("./input_dict.npz", allow_pickle=True)
for k, v in inp.items():
    if isinstance(v, np.ndarray):
        # print(k, type(v), v.shape, v.dtype)
        try:
            input_dict[k] = torch.tensor(v)
        except TypeError:
            pass
        if isinstance(input_dict[k], torch.Tensor):
            input_dict[k] = input_dict[k].cuda()
print("input_dict loaded")
lab = np.load("./label_dict.npz", allow_pickle=True)
for k, v in lab.items():
    if isinstance(v, np.ndarray):
        # print(k, type(v), v.shape, v.dtype)
        try:
            label_dict[k] = torch.tensor(v)
        except TypeError:
            pass
        if isinstance(label_dict[k], torch.Tensor):
            label_dict[k] = label_dict[k].cuda()
print("label_dict loaded")
```
     Copilot AI  Jul 22, 2025  
Debug code should be removed before merging to production. This includes hard-coded file loading and early exit statements.
```diff
-# load
-self.wrapper.load_state_dict(torch.load("./wrapper_dict.pt"))
-print("model loaded")
-inp = np.load("./input_dict.npz", allow_pickle=True)
-for k, v in inp.items():
-    if isinstance(v, np.ndarray):
-        # print(k, type(v), v.shape, v.dtype)
-        try:
-            input_dict[k] = torch.tensor(v)
-        except TypeError:
-            pass
-        if isinstance(input_dict[k], torch.Tensor):
-            input_dict[k] = input_dict[k].cuda()
-print("input_dict loaded")
-lab = np.load("./label_dict.npz", allow_pickle=True)
-for k, v in lab.items():
-    if isinstance(v, np.ndarray):
-        # print(k, type(v), v.shape, v.dtype)
-        try:
-            label_dict[k] = torch.tensor(v)
-        except TypeError:
-            pass
-        if isinstance(label_dict[k], torch.Tensor):
-            label_dict[k] = label_dict[k].cuda()
-print("label_dict loaded")
+# Load model state
+model_path = config.get("model_path", "./wrapper_dict.pt")
+self.wrapper.load_state_dict(torch.load(model_path))
+log.info("Model state loaded from %s", model_path)
+# Load input dictionary
+input_path = config.get("input_path", "./input_dict.npz")
+inp = np.load(input_path, allow_pickle=True)
+for k, v in inp.items():
+    if isinstance(v, np.ndarray):
+        try:
+            input_dict[k] = torch.tensor(v)
+        except TypeError:
+            pass
+        if isinstance(input_dict[k], torch.Tensor):
+            input_dict[k] = input_dict[k].cuda()
+log.info("Input dictionary loaded from %s", input_path)
+# Load label dictionary
+label_path = config.get("label_path", "./label_dict.npz")
+lab = np.load(label_path, allow_pickle=True)
+for k, v in lab.items():
+    if isinstance(v, np.ndarray):
+        try:
+            label_dict[k] = torch.tensor(v)
+        except TypeError:
+            pass
+        if isinstance(label_dict[k], torch.Tensor):
+            label_dict[k] = label_dict[k].cuda()
+log.info("Label dictionary loaded from %s", label_path)
```
deepmd/pd/train/training.py (Outdated)

```python
with nvprof_context(enable_profiling, "Backward pass"):
    loss.backward()

exit()
```
     Copilot AI  Jul 22, 2025  
Debug code should be removed before merging to production. This early exit will prevent normal training execution.
```diff
-exit()
```

```python
    extended_virial = extended_virial.view(list(extended_virial.shape[:-2]) + [9])  # noqa:RUF005
else:
    extended_virial = None
print(
```
     Copilot AI  Jul 22, 2025  
Debug print statements should be removed before merging to production.
📝 Walkthrough

This update introduces a custom PaddlePaddle operator (`border_op`) for MPI-based border communication, enabling DPA2/DPA3 distributed inference with LAMMPS.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python
    participant Paddle
    participant CustomOp
    participant MPI
    Python->>Paddle: Load model/descriptor with comm_dict (list of tensors)
    Paddle->>CustomOp: Call border_op with send/recv lists, communicator, etc.
    CustomOp->>MPI: Perform send/recv (CPU/GPU, CUDA-aware if possible)
    CustomOp-->>Paddle: Return communicated tensor
    Paddle-->>Python: Continue forward/backward pass with updated data
```

Estimated code review effort: 4 (~90 minutes)
Actionable comments posted: 13
♻️ Duplicate comments (1)
deepmd/pd/model/descriptor/repflows.py (1)
72-75: Improve error message clarity
🧹 Nitpick comments (17)
.pre-commit-config.yaml (2)
68-75: Avoid blanket disabling of the Prettier hook

Commenting out the entire Prettier block removes automated formatting for Markdown/YAML/CSS across the repo, which quickly drifts style consistency, especially with multiple contributors. If the alpha release is the blocker, prefer downgrading/pinning to a stable tag or narrowing the `exclude`/`files` pattern instead of disabling altogether.

Example fix (pin to stable 3.x and keep existing exclusions):

```diff
-  # - repo: https://github.com/pre-commit/mirrors-prettier
-  #   rev: v4.0.0-alpha.8
-  #   hooks:
-  #     - id: prettier
-  #       types_or: [markdown, yaml, css]
-  #       exclude: ^(source/3rdparty|\.github/workflows|\.clang-format)
+  - repo: https://github.com/pre-commit/mirrors-prettier
+    rev: 3.2.5
+    hooks:
+      - id: prettier
+        types_or: [markdown, yaml, css]
+        exclude: ^(source/3rdparty|\.github/workflows|\.clang-format)
```
86-104: Re-evaluate removal of bibtex-tidy

Disabling `bibtex-tidy` means bibliography files will no longer be normalized, increasing the chance of merge conflicts and inconsistent field ordering. Unless the hook is demonstrably blocking work, consider re-enabling with a tighter pattern (e.g., only under `docs/refs/*.bib`) or pinning to a version known to work.

No immediate code change required, but please confirm this is an intentional long-term decision.
deepmd/pd/model/descriptor/repformer_layer.py (1)
113-113: Remove commented debugging code before finalizing the PR.

The commented print statement appears to be leftover debugging code that should be cleaned up before this WIP PR is finalized.

```diff
- # print(g1_ext.shape, index.shape)
```

deepmd/pt/train/wrapper.py (1)
178-178: Remove debugging print statement before finalizing the PR.

This print statement appears to be debugging code that should be removed before this WIP PR is finalized. Debug prints in the forward pass will clutter logs and impact performance.

```diff
- print(self.loss)
```

deepmd/pd/train/wrapper.py (1)
176-176: Remove debugging print statement before finalizing the PR.

This print statement appears to be debugging code that should be removed before this WIP PR is finalized. Debug prints in the forward pass will clutter logs and impact performance.

```diff
- print(self.loss)
```

deepmd/pt/model/model/transform_output.py (1)
99-101: Remove debugging print statement before finalizing the PR.

This print statement outputs detailed tensor statistics and appears to be debugging code. It should be removed before this WIP PR is finalized as it will generate excessive log output and impact performance.

```diff
- print(
-     f"extended_force: {extended_force.min().item():.10f} {extended_force.max().item():.10f} {extended_force.mean().item():.10f} {extended_force.std().item():.10f}"
- )
```

source/op/pd/CMakeLists.txt (6)
4-4: Replace Chinese comments with English for consistency.

Chinese comments are inconsistent with the rest of the codebase, which uses English comments.

```diff
-# ========== 新增:Paddle 配置(路径由用户外部传入) ==========
+# ========== Paddle Configuration (path provided externally by user) ==========
```
19-28: Remove commented-out legacy code.

Large blocks of commented-out code should be removed to improve readability. If this code might be needed later, it should be documented in a separate issue or commit message.

```diff
-# link_directories(${PADDLE_LIB_THIRD_PARTY_PATH}/protobuf/lib)
-# link_directories(${PADDLE_LIB_THIRD_PARTY_PATH}/glog/lib)
-# link_directories(${PADDLE_LIB_THIRD_PARTY_PATH}/gflags/lib)
-# link_directories(${PADDLE_LIB_THIRD_PARTY_PATH}/xxhash/lib)
-# link_directories(${PADDLE_LIB}/paddle/lib) target_link_libraries(deepmd_op_pd
-# PRIVATE ${PADDLE_LIB_THIRD_PARTY_PATH}/glog/lib/libglog.a
-# ${PADDLE_LIB_THIRD_PARTY_PATH}/gflags/lib/libgflags.a
-# ${PADDLE_LIB_THIRD_PARTY_PATH}/xxhash/lib/libxxhash.a
-# ${PADDLE_LIB_THIRD_PARTY_PATH}/protobuf/lib/libprotobuf.a dl pthread)
```
31-31: Replace Chinese comment with English.

```diff
-# Math lib(默认用MKL)
+# Math lib (default MKL)
```
35-45: Remove large commented-out legacy code block.

This extensive commented-out code should be removed to improve maintainability.

```diff
-# ========== 原有逻辑:保持不动 ==========
-# target_link_libraries(deepmd_op_pd PRIVATE ${PADDLE_LIBRARIES})
-# if(${OP_CXX_ABI_PT} EQUAL ${OP_CXX_ABI}) target_link_libraries(deepmd_op_pd
-# PRIVATE ${LIB_DEEPMD}) else() target_link_libraries(deepmd_op_pd PRIVATE
-# ${LIB_DEEPMD}_compat_cxxabi) endif()
-# remove_definitions(-D_GLIBCXX_USE_CXX11_ABI=${OP_CXX_ABI})
-# target_compile_definitions( deepmd_op_pd PUBLIC
-# "$<$<COMPILE_LANGUAGE:CXX>:_GLIBCXX_USE_CXX11_ABI=${OP_CXX_ABI_PT}>")
-# if(APPLE) set_target_properties(deepmd_op_pd PROPERTIES INSTALL_RPATH
-# "@loader_path") else() set_target_properties(deepmd_op_pd PROPERTIES
-# INSTALL_RPATH "$ORIGIN") endif()
```
66-66: Replace Chinese comment with English.

```diff
-# ========== 新增:链接 Paddle 所需库 ==========
+# ========== Link Required Paddle Libraries ==========
```

77-77: Replace Chinese comment with English.

```diff
-# ========== 保持原有安装路径逻辑 ==========
+# ========== Maintain Original Installation Path Logic ==========
```

deepmd/pt/train/training.py (2)
761-764: Use `contextlib.suppress` for cleaner exception handling.

Following the static analysis suggestion to improve code readability.

```diff
+import contextlib
+
-        try:
-            input_dict[k] = torch.tensor(v)
-        except TypeError:
-            pass
+        with contextlib.suppress(TypeError):
+            input_dict[k] = torch.tensor(v)
```
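For reference, `contextlib.suppress(TypeError)` is behaviorally identical to the try/except-pass it replaces. A self-contained sketch with stand-in data (not the trainer's actual dicts):

```python
import contextlib

raw = {"coord": [0.0, 1.0], "comm": object()}  # object() is not iterable
converted = {}
for k, v in raw.items():
    with contextlib.suppress(TypeError):
        # the TypeError raised for the non-iterable entry is silently skipped,
        # just like the original try/except/pass
        converted[k] = list(v)
```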
772-775: Use `contextlib.suppress` for cleaner exception handling.

Following the static analysis suggestion to improve code readability.

```diff
-        try:
-            label_dict[k] = torch.tensor(v)
-        except TypeError:
-            pass
+        with contextlib.suppress(TypeError):
+            label_dict[k] = torch.tensor(v)
```

deepmd/pd/model/descriptor/repformers.py (2)
68-71: Improve error message clarity

The current error message mentions "when freezing the model", which might be confusing. Consider making it clearer about the build requirement.

```diff
  raise NotImplementedError(
-     "border_op is not available since customized Paddle OP library is not built when freezing the model. "
-     "See documentation for DPA3 for details."
+     "border_op is not available because the customized Paddle OP library was not built. "
+     "Please rebuild with Paddle custom op support enabled. "
+     "See documentation for distributed training setup."
  )
```
490-495: Simplify tensor padding logic

The current implementation manually constructs shapes and concatenates tensors. This can be simplified.

```diff
- _shapes = g1.shape[1:]
- _shapes[1] = n_padding
- g1 = paddle.concat(
-     [g1.squeeze(0), paddle.zeros(_shapes, dtype=g1.dtype)],
-     axis=1,
- )
+ g1 = paddle.nn.functional.pad(
+     g1.squeeze(0), [0, 0, 0, n_padding], value=0.0
+ )
```

source/op/pd/comm.cc (1)
56-56: Clarify unusual pointer arithmetic

The expression `(sendlist_tensor + 0).data<int>()` is unusual. If this is intentional, please add a comment explaining why. Otherwise, simplify it.

```diff
- int** sendlist = reinterpret_cast<int**>((sendlist_tensor + 0).data<int>());
+ int** sendlist = reinterpret_cast<int**>(sendlist_tensor.data<int>());
```
📜 Review details
Configuration used: CodeRabbit UI
 Review profile: CHILL
 Plan: Pro
📒 Files selected for processing (17)
- .pre-commit-config.yaml(2 hunks)
- deepmd/pd/entrypoints/main.py(1 hunks)
- deepmd/pd/model/descriptor/dpa2.py(1 hunks)
- deepmd/pd/model/descriptor/repflows.py(8 hunks)
- deepmd/pd/model/descriptor/repformer_layer.py(1 hunks)
- deepmd/pd/model/descriptor/repformers.py(7 hunks)
- deepmd/pd/train/training.py(2 hunks)
- deepmd/pd/train/wrapper.py(1 hunks)
- deepmd/pd/utils/env.py(1 hunks)
- deepmd/pt/model/model/transform_output.py(1 hunks)
- deepmd/pt/train/training.py(1 hunks)
- deepmd/pt/train/wrapper.py(1 hunks)
- source/CMakeLists.txt(1 hunks)
- source/api_cc/include/DeepPotPD.h(1 hunks)
- source/api_cc/src/DeepPotPD.cc(2 hunks)
- source/op/pd/CMakeLists.txt(1 hunks)
- source/op/pd/comm.cc(1 hunks)
🧠 Learnings (9)
📓 Common learnings
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-10-08T15:32:11.479Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.

Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-09-19T04:25:12.408Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.

Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4226
File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
Timestamp: 2024-10-16T21:49:57.401Z
Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in `deepmd/dpmodel/atomic_model/base_atomic_model.py`, do not suggest simplifying `for kk in ret_dict.keys()` to `for kk in ret_dict` unless it's relevant to the PR.

deepmd/pd/model/descriptor/repformer_layer.py (2)
Learnt from: njzjz
 PR: #4160
 File: deepmd/dpmodel/utils/env_mat.py:52-64
 Timestamp: 2024-09-24T01:59:37.973Z
 Learning: Negative indices in nlist are properly handled by masking later in the computation, so they do not cause issues in indexing operations.
Learnt from: njzjz
 PR: #4160
 File: deepmd/dpmodel/utils/env_mat.py:52-64
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Negative indices in nlist are properly handled by masking later in the computation, so they do not cause issues in indexing operations.
deepmd/pd/model/descriptor/dpa2.py (2)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
deepmd/pt/train/training.py (3)
Learnt from: njzjz
 PR: #4302
 File: deepmd/pd/infer/inference.py:35-38
 Timestamp: 2024-11-25T07:42:55.735Z
 Learning: In the file deepmd/pd/infer/inference.py, when loading the model checkpoint in the Tester class, it's acceptable to not include additional error handling for loading the model state dictionary.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
deepmd/pd/model/descriptor/repflows.py (2)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
source/api_cc/src/DeepPotPD.cc (4)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: 1azyking
 PR: #4169
 File: deepmd/pt/loss/ener_hess.py:341-348
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: In deepmd/pt/loss/ener_hess.py, the label uses the key "atom_ener" intentionally to maintain consistency with the forked version.
Learnt from: 1azyking
 PR: #4169
 File: deepmd/pt/loss/ener_hess.py:341-348
 Timestamp: 2024-10-05T03:11:02.922Z
 Learning: In deepmd/pt/loss/ener_hess.py, the label uses the key "atom_ener" intentionally to maintain consistency with the forked version.
deepmd/pd/model/descriptor/repformers.py (2)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
deepmd/pd/train/training.py (2)
Learnt from: njzjz
 PR: #4302
 File: deepmd/pd/infer/inference.py:35-38
 Timestamp: 2024-11-25T07:42:55.735Z
 Learning: In the file deepmd/pd/infer/inference.py, when loading the model checkpoint in the Tester class, it's acceptable to not include additional error handling for loading the model state dictionary.
Learnt from: HydrogenSulfate
 PR: #4414
 File: deepmd/pd/train/training.py:66-66
 Timestamp: 2024-11-29T12:15:22.226Z
 Learning: The function nvprof_context is defined in deepmd/pd/utils/utils.py, so importing it in deepmd/pd/train/training.py is correct.
source/api_cc/include/DeepPotPD.h (2)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
🧬 Code Graph Analysis (2)
deepmd/pt/model/model/transform_output.py (1)
source/api_cc/include/DeepTensor.h (1)
std(139-338)
deepmd/pd/model/descriptor/dpa2.py (2)
source/api_cc/src/DeepPotPD.cc (1)
mapping(245-245)source/api_cc/include/DeepPotPD.h (1)
numel(285-293)
🪛 Ruff (0.12.2)
deepmd/pt/train/training.py
761-764: Use contextlib.suppress(TypeError) instead of try-except-pass
Replace with contextlib.suppress(TypeError)
(SIM105)
772-775: Use contextlib.suppress(TypeError) instead of try-except-pass
Replace with contextlib.suppress(TypeError)
(SIM105)
deepmd/pd/train/training.py
769-773: Use contextlib.suppress(Exception) instead of try-except-pass
(SIM105)
781-785: Use contextlib.suppress(Exception) instead of try-except-pass
(SIM105)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: njzjz PR: deepmodeling/deepmd-kit#4144 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246 Timestamp: 2024-10-08T15:32:11.479Z Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905. Learnt from: njzjz PR: deepmodeling/deepmd-kit#4144 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246 Timestamp: 2024-09-19T04:25:12.408Z Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905. Learnt from: njzjz PR: deepmodeling/deepmd-kit#4226 File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202 Timestamp: 2024-10-16T21:49:57.401Z Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in `deepmd/dpmodel/atomic_model/base_atomic_model.py`, do not suggest simplifying `for kk in ret_dict.keys()` to `for kk in ret_dict` unless it's relevant to the PR. deepmd/pd/model/descriptor/repformer_layer.py (2)
Learnt from: njzjz
 PR: #4160
 File: deepmd/dpmodel/utils/env_mat.py:52-64
 Timestamp: 2024-09-24T01:59:37.973Z
 Learning: Negative indices in nlist are properly handled by masking later in the computation, so they do not cause issues in indexing operations.
Learnt from: njzjz
 PR: #4160
 File: deepmd/dpmodel/utils/env_mat.py:52-64
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Negative indices in nlist are properly handled by masking later in the computation, so they do not cause issues in indexing operations.
deepmd/pd/model/descriptor/dpa2.py (2)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
deepmd/pt/train/training.py (3)
Learnt from: njzjz
 PR: #4302
 File: deepmd/pd/infer/inference.py:35-38
 Timestamp: 2024-11-25T07:42:55.735Z
 Learning: In the file deepmd/pd/infer/inference.py, when loading the model checkpoint in the Tester class, it's acceptable to not include additional error handling for loading the model state dictionary.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
deepmd/pd/model/descriptor/repflows.py (2)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
source/api_cc/src/DeepPotPD.cc (4)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: 1azyking
 PR: #4169
 File: deepmd/pt/loss/ener_hess.py:341-348
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: In deepmd/pt/loss/ener_hess.py, the label uses the key "atom_ener" intentionally to maintain consistency with the forked version.
Learnt from: 1azyking
 PR: #4169
 File: deepmd/pt/loss/ener_hess.py:341-348
 Timestamp: 2024-10-05T03:11:02.922Z
 Learning: In deepmd/pt/loss/ener_hess.py, the label uses the key "atom_ener" intentionally to maintain consistency with the forked version.
deepmd/pd/model/descriptor/repformers.py (2)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
deepmd/pd/train/training.py (2)
Learnt from: njzjz
 PR: #4302
 File: deepmd/pd/infer/inference.py:35-38
 Timestamp: 2024-11-25T07:42:55.735Z
 Learning: In the file deepmd/pd/infer/inference.py, when loading the model checkpoint in the Tester class, it's acceptable to not include additional error handling for loading the model state dictionary.
Learnt from: HydrogenSulfate
 PR: #4414
 File: deepmd/pd/train/training.py:66-66
 Timestamp: 2024-11-29T12:15:22.226Z
 Learning: The function nvprof_context is defined in deepmd/pd/utils/utils.py, so importing it in deepmd/pd/train/training.py is correct.
source/api_cc/include/DeepPotPD.h (2)
Learnt from: njzjz
 PR: #4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes TestInferDeepPotDpaPt and TestInferDeepPotDpaPtNopbc is addressed in PR #3905.
🧬 Code Graph Analysis (2)
deepmd/pt/model/model/transform_output.py (1)
source/api_cc/include/DeepTensor.h (1)
std(139-338)
deepmd/pd/model/descriptor/dpa2.py (2)
source/api_cc/src/DeepPotPD.cc (1)
mapping (245-245)
source/api_cc/include/DeepPotPD.h (1)
numel (285-293)
🪛 Ruff (0.12.2)
deepmd/pt/train/training.py
761-764: Use contextlib.suppress(TypeError) instead of try-except-pass
Replace with contextlib.suppress(TypeError)
(SIM105)
772-775: Use contextlib.suppress(TypeError) instead of try-except-pass
Replace with contextlib.suppress(TypeError)
(SIM105)
deepmd/pd/train/training.py
769-773: Use contextlib.suppress(Exception) instead of try-except-pass
(SIM105)
781-785: Use contextlib.suppress(Exception) instead of try-except-pass
(SIM105)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test C++ (true)
- GitHub Check: Test C++ (false)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Analyze (python)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
🔇 Additional comments (6)
source/CMakeLists.txt (1)
529-531: LGTM! The Paddle backend subdirectory is correctly added following the same pattern as the TensorFlow and PyTorch backends.
deepmd/pd/model/descriptor/dpa2.py (1)
801-802: Improved validation logic. Good improvements to the validation:
- Treating an empty comm_dict the same as None is more robust
- Checking mapping.numel() > 0 ensures the tensor contains elements, not just that it exists
source/api_cc/include/DeepPotPD.h (1)
397-404: Clear documentation of comm_vec structure. The comment clearly documents the purpose and contents of comm_vec. This helps future maintainers understand the communication tensor structure.
deepmd/pd/entrypoints/main.py (1)
391-409: Improved input specifications for static graph conversion. The changes enhance clarity and type safety:
- Explicit naming of inputs ("extended_coord", "extended_atype")
- Detailed specification of comm_dict components with proper types and shapes
This makes the interface more explicit and helps with validation.
source/api_cc/src/DeepPotPD.cc (2)
195-196: Input tensor names aligned with Python interface. The tensor names are correctly updated to match the Python interface changes in deepmd/pd/entrypoints/main.py.
Also applies to: 199-200
228-230: Verify the communicator tensor handling. The conditional copy and casting of lmp_list.world needs verification:
- The cast from void* to int* assumes world contains an integer value
- The condition lmp_list.world > 0 treats a pointer as a boolean/integer
Please verify:
- What type is lmp_list.world and what does it contain?
- Should this be checking if the pointer is non-null instead?
assert len(comm_dict) >= 6
has_spin = len(comm_dict) >= 7
if not has_spin:
    n_padding = nall - nloc
    # node_ebd = paddle.nn.functional.pad(
    #     node_ebd.squeeze(0), [0, 0, 0, n_padding], value=0.0
    # )
    # [nframes, nloc, tebd_dim]
    _shapes = node_ebd.shape[1:]
    _shapes[1] = n_padding
    node_ebd = paddle.concat(
        [
            node_ebd.squeeze(0),
            paddle.zeros(_shapes, dtype=node_ebd.dtype),
        ],
        axis=1,
    )
    real_nloc = nloc
    real_nall = nall
else:
    # for spin
    real_nloc = nloc // 2
    real_nall = nall // 2
    real_n_padding = real_nall - real_nloc
    node_ebd_real, node_ebd_virtual = paddle.split(
        node_ebd, [real_nloc, real_nloc], axis=1
    )
    # mix_node_ebd: nb x real_nloc x (n_dim * 2)
    mix_node_ebd = paddle.concat(
        [node_ebd_real, node_ebd_virtual], axis=2
    )
    # nb x real_nall x (n_dim * 2)
    node_ebd = paddle.nn.functional.pad(
        mix_node_ebd.squeeze(0), (0, 0, 0, real_n_padding), value=0.0
    )

assert len(comm_dict) >= 6
# assert "send_list" in comm_dict
# assert "send_proc" in comm_dict
# assert "recv_proc" in comm_dict
# assert "send_num" in comm_dict
# assert "recv_num" in comm_dict
# assert "communicator" in comm_dict
ret = paddle_ops_deepmd_border_op(
    comm_dict[0],
    comm_dict[1],
    comm_dict[2],
    comm_dict[3],
    comm_dict[4],
    node_ebd,
    comm_dict[5],
    paddle.to_tensor(
        real_nloc,
        dtype=paddle.int32,
        place=paddle.CPUPlace(),
    ),  # should be int of c++, placed on cpu
    paddle.to_tensor(
        real_nall - real_nloc,
        dtype=paddle.int32,
        place=paddle.CPUPlace(),
    ),  # should be int of c++, placed on cpu
)
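To make the shape bookkeeping above concrete, here is a pure-Python sketch of the nloc/nall arithmetic (no paddle involved; the function name is invented for illustration):

```python
def comm_shapes(nloc: int, nall: int, has_spin: bool):
    """Mimic the real_nloc/real_nall bookkeeping from the snippet above."""
    if not has_spin:
        real_nloc, real_nall = nloc, nall
    else:
        # Spin models stack real and virtual atoms along the atom axis,
        # so the effective counts are halved before padding.
        real_nloc, real_nall = nloc // 2, nall // 2
    n_padding = real_nall - real_nloc  # ghost-atom rows to pad before the border op
    return real_nloc, real_nall, n_padding

print(comm_shapes(10, 16, has_spin=False))  # → (10, 16, 6)
print(comm_shapes(10, 16, has_spin=True))   # → (5, 8, 3)
```

In both branches the padding count is real_nall - real_nloc, which is exactly the second CPU int tensor passed to the border op.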
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Replace hardcoded indices with named keys for comm_dict
Similar to repformers.py, this implementation uses hardcoded numeric indices to access comm_dict, which is fragile and error-prone.
Apply the same pattern suggested for repformers.py - use named dictionary keys or a dataclass to make the code more maintainable and self-documenting. This will also make it easier to maintain consistency between the two descriptor implementations.
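As a hedged sketch of that suggestion (the class name BorderCommData and the from_legacy_list helper are illustrative assumptions, not the actual deepmd-kit API), the six entries could be bundled in a small dataclass:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class BorderCommData:
    """Named container for the six border-communication entries.

    Field names mirror the commented-out assertion keys in the snippet
    above; the class itself is hypothetical.
    """

    send_list: Any
    send_proc: Any
    recv_proc: Any
    send_num: Any
    recv_num: Any
    communicator: Any

    @classmethod
    def from_legacy_list(cls, comm_dict):
        # Adapter for the existing positional-list convention (indices 0-5).
        return cls(*comm_dict[:6])


# Usage with the legacy positional layout:
comm = BorderCommData.from_legacy_list(
    ["send_list", "send_proc", "recv_proc", "send_num", "recv_num", "mpi_comm"]
)
print(comm.communicator)  # → mpi_comm
```

Call sites would then pass comm.send_list, comm.communicator, etc., keeping repflows.py and repformers.py consistent without relying on index positions.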
🤖 Prompt for AI Agents
In deepmd/pd/model/descriptor/repflows.py around lines 596 to 657, the code accesses comm_dict using hardcoded numeric indices, which is fragile and reduces readability. Refactor the code to replace these numeric indices with named dictionary keys or use a dataclass to represent comm_dict, similar to the pattern used in repformers.py. This will improve maintainability and make the code more self-documenting by clearly indicating what each accessed element represents.

has_spin = len(comm_dict) >= 7
if not has_spin:
    n_padding = nall - nloc
    # g1 = paddle.nn.functional.pad(
    #     g1.squeeze(0), (0, 0, 0, n_padding), value=0.0
    # )
    _shapes = g1.shape[1:]
    _shapes[1] = n_padding
    g1 = paddle.concat(
        [g1.squeeze(0), paddle.zeros(_shapes, dtype=g1.dtype)],
        axis=1,
    )
    real_nloc = nloc
    real_nall = nall
else:
    # for spin
    real_nloc = nloc // 2
    real_nall = nall // 2
    real_n_padding = real_nall - real_nloc
    g1_real, g1_virtual = paddle.split(
        g1, [real_nloc, real_nloc], axis=1
    )
    # mix_g1: nb x real_nloc x (ng1 * 2)
    mix_g1 = paddle.concat([g1_real, g1_virtual], axis=2)
    # nb x real_nall x (ng1 * 2)
    g1 = paddle.nn.functional.pad(
        mix_g1.squeeze(0), (0, 0, 0, real_n_padding), value=0.0
    )

# assert "send_list" in comm_dict
# assert "send_proc" in comm_dict
# assert "recv_proc" in comm_dict
# assert "send_num" in comm_dict
# assert "recv_num" in comm_dict
# assert "communicator" in comm_dict
# print(f"g1.shape = ", g1.shape)
ret = paddle_ops_deepmd_border_op(
    comm_dict[0],
    comm_dict[1],
    comm_dict[2],
    comm_dict[3],
    comm_dict[4],
    g1,
    comm_dict[5],
    paddle.to_tensor(
🛠️ Refactor suggestion
Replace hardcoded indices with named keys for comm_dict
The current implementation uses hardcoded numeric indices (0-5) to access comm_dict, which is fragile and error-prone. Additionally, the spin detection logic relies on the length of comm_dict (line 484), which is not robust.
Consider using a dictionary with named keys or a dataclass:
- has_spin = len(comm_dict) >= 7
+ # comm_dict should contain named keys
+ has_spin = "spin_comm" in comm_dict if isinstance(comm_dict, dict) else len(comm_dict) >= 7
  if not has_spin:
      # ... existing code ...
- # assert "send_list" in comm_dict
- # assert "send_proc" in comm_dict
- # assert "recv_proc" in comm_dict
- # assert "send_num" in comm_dict
- # assert "recv_num" in comm_dict
- # assert "communicator" in comm_dict
- # print(f"g1.shape = ", g1.shape)
+ # Use named access if comm_dict is a dict
+ if isinstance(comm_dict, dict):
+     send_list = comm_dict["send_list"]
+     send_proc = comm_dict["send_proc"]
+     recv_proc = comm_dict["recv_proc"]
+     send_num = comm_dict["send_num"]
+     recv_num = comm_dict["recv_num"]
+     communicator = comm_dict["communicator"]
+ else:
+     # Legacy list-based access (deprecated)
+     send_list = comm_dict[0]
+     send_proc = comm_dict[1]
+     recv_proc = comm_dict[2]
+     send_num = comm_dict[3]
+     recv_num = comm_dict[4]
+     communicator = comm_dict[5]
  ret = paddle_ops_deepmd_border_op(
-     comm_dict[0],
-     comm_dict[1],
-     comm_dict[2],
-     comm_dict[3],
-     comm_dict[4],
+     send_list,
+     send_proc,
+     recv_proc,
+     send_num,
+     recv_num,
      g1,
-     comm_dict[5],
+     communicator,

This makes the code more maintainable and self-documenting.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement for the span above:

# comm_dict should contain named keys
has_spin = "spin_comm" in comm_dict if isinstance(comm_dict, dict) else len(comm_dict) >= 7
if not has_spin:
    n_padding = nall - nloc
    # g1 = paddle.nn.functional.pad(
    #     g1.squeeze(0), (0, 0, 0, n_padding), value=0.0
    # )
    _shapes = g1.shape[1:]
    _shapes[1] = n_padding
    g1 = paddle.concat(
        [g1.squeeze(0), paddle.zeros(_shapes, dtype=g1.dtype)],
        axis=1,
    )
    real_nloc = nloc
    real_nall = nall
else:
    # for spin
    real_nloc = nloc // 2
    real_nall = nall // 2
    real_n_padding = real_nall - real_nloc
    g1_real, g1_virtual = paddle.split(
        g1, [real_nloc, real_nloc], axis=1
    )
    # mix_g1: nb x real_nloc x (ng1 * 2)
    mix_g1 = paddle.concat([g1_real, g1_virtual], axis=2)
    # nb x real_nall x (ng1 * 2)
    g1 = paddle.nn.functional.pad(
        mix_g1.squeeze(0), (0, 0, 0, real_n_padding), value=0.0
    )

# Use named access if comm_dict is a dict
if isinstance(comm_dict, dict):
    send_list = comm_dict["send_list"]
    send_proc = comm_dict["send_proc"]
    recv_proc = comm_dict["recv_proc"]
    send_num = comm_dict["send_num"]
    recv_num = comm_dict["recv_num"]
    communicator = comm_dict["communicator"]
else:
    # Legacy list-based access (deprecated)
    send_list = comm_dict[0]
    send_proc = comm_dict[1]
    recv_proc = comm_dict[2]
    send_num = comm_dict[3]
    recv_num = comm_dict[4]
    communicator = comm_dict[5]
ret = paddle_ops_deepmd_border_op(
    send_list,
    send_proc,
    recv_proc,
    send_num,
    recv_num,
    g1,
    communicator,
    paddle.to_tensor(
🤖 Prompt for AI Agents
In deepmd/pd/model/descriptor/repformers.py around lines 484 to 528, the code accesses comm_dict using hardcoded numeric indices and determines spin presence by checking the length of comm_dict, which is fragile and unclear. Refactor comm_dict to be a dictionary with descriptive named keys or use a dataclass to represent its structure. Replace all numeric index accesses with these named keys to improve code clarity and maintainability, and update the spin detection logic to use explicit keys or flags instead of relying on the length of comm_dict.

    "expand_grad",
    "floor_grad",
    "gather_grad",
    # "gather_grad",
💡 Verification agent
🧩 Analysis chain
Document the reason for removing "gather_grad" from the blacklist.
Please add a comment explaining why "gather_grad" was removed from the blacklist and how this change supports the DPA2/DPA3 inference functionality. This helps future maintainers understand the rationale.
Additionally, please verify that this change improves performance or functionality for the target use case:
- # "gather_grad",
+ # "gather_grad",  # Removed to support DPA2/DPA3 distributed communication operations

🏁 Script executed:
#!/bin/bash
# Description: Search for gather_grad usage in the codebase to understand the impact
# Expected: Find references to gather_grad operations that might benefit from this change

# Search for gather_grad usage patterns
rg -A 3 -B 3 "gather_grad"

# Search for gather operations that might be related
rg -A 2 "\.gather\("

# Look for communication or distributed operations that might use gather
rg -A 2 -B 2 "comm_dict|border_op"

Length of output: 55951
Document removal of “gather_grad” from the eager‐mode blacklist
Please add a brief comment in deepmd/pd/utils/env.py (line ~140) explaining why “gather_grad” was removed—namely, to allow the composite‐op implementation of the gather gradient (instead of the kernel fallback) when running distributed inference under DPA2/DPA3. This will help future maintainers understand that it’s intentional and tied to parallel communication support.
• File: deepmd/pd/utils/env.py
 Line: 140
Proposed diff:
- # "gather_grad",
+ # "gather_grad",  # enabled composite gather_grad to support DPA2/DPA3 distributed inference

Also, please add or update a small test or benchmark to verify that using the composite path for gather_grad in eager mode does not regress performance or correctness in distributed setups.
🤖 Prompt for AI Agents
In deepmd/pd/utils/env.py at line 140, add a brief comment explaining that "gather_grad" was removed from the eager-mode blacklist to enable the composite-op implementation of the gather gradient during distributed inference with DPA2/DPA3, supporting parallel communication. Additionally, create or update a small test or benchmark to verify that using the composite path for gather_grad in eager mode maintains performance and correctness in distributed environments.    source/api_cc/src/DeepPotPD.cc  Outdated    
this->comm_vec.emplace_back(sendlist_tensor);
this->comm_vec.emplace_back(sendproc_tensor);
this->comm_vec.emplace_back(recvproc_tensor);
this->comm_vec.emplace_back(sendnum_tensor);
this->comm_vec.emplace_back(recvnum_tensor);
this->comm_vec.emplace_back(communicator_tensor);
Potential lifetime issue with tensor handles.
The tensor handles obtained from GetInputHandle are being stored in comm_vec, but these handles might not persist after the function scope ends. This could lead to undefined behavior.
Consider storing the tensors differently or ensuring their lifetime matches the usage pattern. You might need to use std::unique_ptr<paddle_infer::Tensor> or ensure the handles remain valid.
🤖 Prompt for AI Agents
In source/api_cc/src/DeepPotPD.cc around lines 237 to 242, the tensor handles obtained from GetInputHandle are stored directly in comm_vec, but these handles may become invalid after the function scope ends, causing undefined behavior. To fix this, change comm_vec to store smart pointers such as std::unique_ptr<paddle_infer::Tensor> to manage the lifetime of the tensor handles properly, or otherwise ensure that the tensor handles remain valid for as long as comm_vec uses them by extending their lifetime appropriately. | if (lmp_list.mapping) { | ||
| std::vector<std::int64_t> mapping(nall_real); | ||
| for (size_t ii = 0; ii < nall_real; ii++) { | ||
| mapping[ii] = lmp_list.mapping[fwd_map[ii]]; | ||
| } | ||
| this->mapping_tensor->Reshape({1, nall_real}); | ||
| this->mapping_tensor->CopyFromCpu(mapping.data()); | 
Fix usage of uninitialized mapping_tensor.
This code uses this->mapping_tensor which was problematically initialized in the header file. Combined with the initialization issue in DeepPotPD.h, this will likely crash.
After fixing the initialization in the header, ensure mapping_tensor is properly initialized before use here:
+ if (!this->mapping_tensor) {
+   this->mapping_tensor = predictor_fl->GetInputHandle("mapping");
+ }
  std::vector<std::int64_t> mapping(nall_real);
  for (size_t ii = 0; ii < nall_real; ii++) {
    mapping[ii] = lmp_list.mapping[fwd_map[ii]];
  }
  this->mapping_tensor->Reshape({1, nall_real});
  this->mapping_tensor->CopyFromCpu(mapping.data());

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement for the span above:

if (lmp_list.mapping) {
  // Ensure mapping_tensor is initialized before use
  if (!this->mapping_tensor) {
    this->mapping_tensor = predictor_fl->GetInputHandle("mapping");
  }
  std::vector<std::int64_t> mapping(nall_real);
  for (size_t ii = 0; ii < nall_real; ii++) {
    mapping[ii] = lmp_list.mapping[fwd_map[ii]];
  }
  this->mapping_tensor->Reshape({1, nall_real});
  this->mapping_tensor->CopyFromCpu(mapping.data());
}
🤖 Prompt for AI Agents
In source/api_cc/src/DeepPotPD.cc around lines 244 to 250, the code uses this->mapping_tensor without ensuring it is properly initialized, which can cause a crash. After correcting the initialization in DeepPotPD.h, add a check or initialization step before this usage to confirm mapping_tensor is allocated and ready. This may involve allocating the tensor if it is null or resetting it appropriately before calling Reshape and CopyFromCpu.

target_link_libraries(
  deepmd_op_pd
  PRIVATE ${PADDLE_LIB_THIRD_PARTY_PATH}/glog/lib/libglog.a
          ${PADDLE_LIB_THIRD_PARTY_PATH}/gflags/lib/libgflags.a
          ${PADDLE_LIB_THIRD_PARTY_PATH}/xxhash/lib/libxxhash.a
          ${PADDLE_LIB_THIRD_PARTY_PATH}/protobuf/lib/libprotobuf.a
          ${PADDLE_INFER_LIB}
          dl
          pthread)
💡 Verification agent
🧩 Analysis chain
Verify hardcoded library paths exist during build.
The build script assumes specific library paths exist but doesn't validate them. Consider adding file existence checks for critical libraries.
🏁 Script executed:
#!/bin/bash
# Check if similar hardcoded library path patterns exist elsewhere in the codebase
rg -A 5 -B 5 "lib/lib.*\.a" --type cmake

Length of output: 2019
Ensure hardcoded library paths are validated in CMakeLists.txt
The target_link_libraries call in source/op/pd/CMakeLists.txt (lines 67–75) links directly to .a files under ${PADDLE_LIB_THIRD_PARTY_PATH} without any existence checks. If one of these paths is missing or incorrect, the build will fail with an obscure linker error. Add explicit guards—either if(EXISTS …) checks or find_library—to surface a clear, early error.
Files to update:
- source/op/pd/CMakeLists.txt (around lines 67–75)
Suggested diff:
Note that CMake generator expressions cannot test for file existence at configure time, so wrapping each library path in a conditional generator expression inside target_link_libraries is not a workable fix. Instead, resolve each library with find_library and fail fast when it is missing: use find_library(GLOG_LIB NAMES glog PATHS ${PADDLE_LIB_THIRD_PARTY_PATH}/glog/lib NO_DEFAULT_PATH) (and similarly for the others) followed by:
if(NOT GLOG_LIB)
  message(FATAL_ERROR "glog not found in ${PADDLE_LIB_THIRD_PARTY_PATH}/glog/lib")
endif()

This will ensure any missing or mis-configured third-party library is caught immediately.
🤖 Prompt for AI Agents
In source/op/pd/CMakeLists.txt around lines 67 to 75, the target_link_libraries command links directly to static library files without verifying their existence, which can cause unclear linker errors if paths are wrong. To fix this, replace the hardcoded paths with find_library calls for each library (glog, gflags, xxhash, protobuf) specifying the appropriate search paths under ${PADDLE_LIB_THIRD_PARTY_PATH}. After each find_library, add an if(NOT ...) check to produce a clear fatal error message if the library is not found. Then link the found library variables in target_link_libraries instead of the hardcoded paths.

if (sendproc[iswap] != me) {
  if (nrecv) {
    MPI_Irecv(recv_g1, nrecv * tensor_size, mpi_type, recvproc[iswap], 0,
              world, &request);
  }
  if (nsend) {
    MPI_Send(send_g1, nsend * tensor_size, mpi_type, sendproc[iswap], 0,
             world);
  }
  if (nrecv) {
    MPI_Wait(&request, MPI_STATUS_IGNORE);
  }
🛠️ Refactor suggestion
Add error handling for MPI operations
MPI operations can fail but the code doesn't check return values. Consider adding error checking to make the code more robust.
Example for MPI_Send:
 if (nsend) {
-  MPI_Send(send_g1, nsend * tensor_size, mpi_type, sendproc[iswap], 0,
-           world);
+  int mpi_err = MPI_Send(send_g1, nsend * tensor_size, mpi_type,
+                         sendproc[iswap], 0, world);
+  if (mpi_err != MPI_SUCCESS) {
+    PD_THROW("MPI_Send failed with error code: ", mpi_err);
+  }
 }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement for the span above:

if (sendproc[iswap] != me) {
  if (nrecv) {
    MPI_Irecv(recv_g1, nrecv * tensor_size, mpi_type, recvproc[iswap], 0,
              world, &request);
  }
  if (nsend) {
    int mpi_err = MPI_Send(send_g1, nsend * tensor_size, mpi_type,
                           sendproc[iswap], 0, world);
    if (mpi_err != MPI_SUCCESS) {
      PD_THROW("MPI_Send failed with error code: ", mpi_err);
    }
  }
  if (nrecv) {
    MPI_Wait(&request, MPI_STATUS_IGNORE);
  }
🤖 Prompt for AI Agents
In source/op/pd/comm.cc around lines 120 to 131, the MPI calls (MPI_Irecv, MPI_Send, MPI_Wait) do not check their return values for errors. Modify the code to capture the return value of each MPI call and add conditional checks to handle errors appropriately, such as logging an error message or aborting the operation if an MPI call fails. This will make the code more robust by detecting and responding to MPI communication failures.    source/op/pd/comm.cc  Outdated    
bool type_flag =
    (sendlist_tensor.dtype() == paddle::DataType::FLOAT64) ? true : false;
Fix dtype check to use the correct tensor
The dtype check is performed on sendlist_tensor which contains integer indices, but should check the floating-point gradient tensor instead.
 bool type_flag =
-    (sendlist_tensor.dtype() == paddle::DataType::FLOAT64) ? true : false;
+    (recv_g1_tensor_grad.dtype() == paddle::DataType::FLOAT64) ? true : false;

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement for the span above:

bool type_flag =
    (recv_g1_tensor_grad.dtype() == paddle::DataType::FLOAT64) ? true : false;
🤖 Prompt for AI Agents
In source/op/pd/comm.cc around lines 355 to 356, the dtype check is incorrectly performed on sendlist_tensor, which holds integer indices. Update the dtype check to use the floating-point gradient tensor instead to correctly determine if the data type is FLOAT64.
85c8ad5 to 38fb926 Compare
Actionable comments posted: 2
♻️ Duplicate comments (1)
source/op/pd/CMakeLists.txt (1)
67-75: Hardcoded library paths need validation. The build script links directly to hardcoded library paths without validating their existence, which can cause unclear linker errors if paths are incorrect.
As noted in previous reviews, consider adding existence checks for these critical libraries:
find_library(GLOG_LIB NAMES glog PATHS ${PADDLE_INFERENCE_DIR_THIRD_PARTY_PATH}/glog/lib NO_DEFAULT_PATH)
if(NOT GLOG_LIB)
  message(FATAL_ERROR "glog library not found in ${PADDLE_INFERENCE_DIR_THIRD_PARTY_PATH}/glog/lib")
endif()

Apply similar checks for gflags, xxhash, and protobuf libraries.
🧹 Nitpick comments (2)
source/op/pd/CMakeLists.txt (2)
23-23: Replace Chinese comments with English for consistency. Several comments are in Chinese and should be translated to English to maintain codebase consistency.
-# Math lib(默认用MKL)
+# Math lib (default: MKL)

-# link: libdeepmd paddle - 添加 CXX ABI 兼容性处理 if(${OP_CXX_ABI_PT} EQUAL
+# link: libdeepmd paddle - Add CXX ABI compatibility handling if(${OP_CXX_ABI_PT} EQUAL

-# 添加 CXX ABI 编译定义 remove_definitions(-D_GLIBCXX_USE_CXX11_ABI=${OP_CXX_ABI})
+# Add CXX ABI compile definitions remove_definitions(-D_GLIBCXX_USE_CXX11_ABI=${OP_CXX_ABI})

-# 添加 RPATH 设置
+# Add RPATH settings

-# 链接 Paddle 所需库
+# Link required Paddle libraries

Also applies to: 28-36, 65-65
28-36: Clean up commented code. Remove or properly handle the commented-out CXX ABI compatibility code. If this functionality is needed, implement it properly; otherwise, remove the dead code.
-# link: libdeepmd paddle - 添加 CXX ABI 兼容性处理 if(${OP_CXX_ABI_PT} EQUAL
-# ${OP_CXX_ABI}) target_link_libraries(deepmd_op_pd PRIVATE ${LIB_DEEPMD})
-# else() target_link_libraries(deepmd_op_pd PRIVATE ${LIB_DEEPMD}_compat_cxxabi)
-# endif()
-
-# 添加 CXX ABI 编译定义 remove_definitions(-D_GLIBCXX_USE_CXX11_ABI=${OP_CXX_ABI})
-# target_compile_definitions( deepmd_op_pd PUBLIC
-# "$<$<COMPILE_LANGUAGE:CXX>:_GLIBCXX_USE_CXX11_ABI=${OP_CXX_ABI_PT}>")
📜 Review details
Configuration used: CodeRabbit UI
 Review profile: CHILL
 Plan: Pro
📒 Files selected for processing (17)
- deepmd/pd/cxx_op.py (1 hunks)
- deepmd/pd/model/atomic_model/base_atomic_model.py (2 hunks)
- deepmd/pd/model/atomic_model/dp_atomic_model.py (1 hunks)
- deepmd/pd/model/descriptor/dpa1.py (1 hunks)
- deepmd/pd/model/descriptor/dpa2.py (3 hunks)
- deepmd/pd/model/descriptor/dpa3.py (1 hunks)
- deepmd/pd/model/descriptor/repflows.py (8 hunks)
- deepmd/pd/model/descriptor/repformers.py (7 hunks)
- deepmd/pd/model/descriptor/se_a.py (1 hunks)
- deepmd/pd/model/descriptor/se_t_tebd.py (1 hunks)
- deepmd/pd/model/model/ener_model.py (1 hunks)
- deepmd/pd/model/model/make_model.py (1 hunks)
- source/CMakeLists.txt (1 hunks)
- source/api_cc/include/DeepPotPD.h (1 hunks)
- source/api_cc/src/DeepPotPD.cc (2 hunks)
- source/op/pd/CMakeLists.txt (1 hunks)
- source/op/pd/comm.cc (1 hunks)
🧠 Learnings (5)
📓 Common learnings
Learnt from: njzjz
 PR: deepmodeling/deepmd-kit#4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
Learnt from: njzjz
 PR: deepmodeling/deepmd-kit#4144
 File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
 Timestamp: 2024-09-19T04:25:12.408Z
 Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
Learnt from: njzjz
 PR: deepmodeling/deepmd-kit#4226
 File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
 Timestamp: 2024-10-16T21:49:57.401Z
 Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in `deepmd/dpmodel/atomic_model/base_atomic_model.py`, do not suggest simplifying `for kk in ret_dict.keys()` to `for kk in ret_dict` unless it's relevant to the PR.
deepmd/pd/model/model/ener_model.py (2)
Learnt from: 1azyking
 PR: #4169
 File: deepmd/pt/loss/ener_hess.py:341-348
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: In deepmd/pt/loss/ener_hess.py, the label uses the key "atom_ener" intentionally to maintain consistency with the forked version.
Learnt from: 1azyking
 PR: #4169
 File: deepmd/pt/loss/ener_hess.py:341-348
 Timestamp: 2024-10-05T03:11:02.922Z
 Learning: In deepmd/pt/loss/ener_hess.py, the label uses the key "atom_ener" intentionally to maintain consistency with the forked version.
source/op/pd/CMakeLists.txt (2)
Learnt from: njzjz
 PR: #3875
 File: doc/model/train-fitting-dos.md:107-107
 Timestamp: 2024-06-13T16:32:13.786Z
 Learning: For code blocks in doc/model/train-fitting-dos.md that display commands, use 'txt' as the language specification as per user njzjz's preference.
Learnt from: njzjz
 PR: #3875
 File: doc/model/train-fitting-dos.md:107-107
 Timestamp: 2024-10-08T15:32:11.479Z
 Learning: For code blocks in doc/model/train-fitting-dos.md that display commands, use 'txt' as the language specification as per user njzjz's preference.
deepmd/pd/model/atomic_model/dp_atomic_model.py (1)
Learnt from: njzjz
 PR: #4226
 File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
 Timestamp: 2024-10-16T21:49:57.401Z
 Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in deepmd/dpmodel/atomic_model/base_atomic_model.py, do not suggest simplifying for kk in ret_dict.keys() to for kk in ret_dict unless it's relevant to the PR.
deepmd/pd/model/atomic_model/base_atomic_model.py (1)
Learnt from: njzjz
 PR: #4226
 File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
 Timestamp: 2024-10-16T21:49:57.401Z
 Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in deepmd/dpmodel/atomic_model/base_atomic_model.py, do not suggest simplifying for kk in ret_dict.keys() to for kk in ret_dict unless it's relevant to the PR.
🧬 Code Graph Analysis (1)
deepmd/pd/model/atomic_model/dp_atomic_model.py (1)
deepmd/pd/model/network/network.py (1)
Tensor(30-33)
✅ Files skipped from review due to trivial changes (1)
- deepmd/pd/model/descriptor/se_t_tebd.py
🚧 Files skipped from review as they are similar to previous changes (7)
- source/CMakeLists.txt
- source/api_cc/include/DeepPotPD.h
- source/api_cc/src/DeepPotPD.cc
- deepmd/pd/model/descriptor/dpa2.py
- deepmd/pd/model/descriptor/repflows.py
- deepmd/pd/model/descriptor/repformers.py
- source/op/pd/comm.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (python)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test C++ (false)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test C++ (true)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-macosx_x86_64
🔇 Additional comments (7)
deepmd/pd/model/model/make_model.py (1)
241-241: LGTM! Type annotation updated for Paddle PIR compatibility. The change from `Optional[dict[str, paddle.Tensor]]` to `Optional[list[paddle.Tensor]]` aligns with the PR objective to work around Paddle PIR's lack of Dict support. This is part of a coordinated update across the codebase to standardize communication tensor handling for distributed inference.
deepmd/pd/model/descriptor/dpa1.py (1)
599-599: LGTM! Consistent type annotation update for distributed communication. The `comm_dict` parameter type change from dictionary to list maintains consistency with the broader codebase refactoring for Paddle PIR compatibility and distributed inference support.
deepmd/pd/model/descriptor/se_a.py (1)
263-263: LGTM! Type annotation consistently updated. The `comm_dict` parameter type change aligns with the coordinated refactoring across descriptor classes to support the new communication tensor interface for distributed inference.
deepmd/pd/model/descriptor/dpa3.py (1)
460-460: LGTM! Type annotation updated consistently with other descriptors. The `comm_dict` parameter type change from dictionary to list maintains consistency across descriptor classes and supports the new distributed communication interface for DPA3 inference.
deepmd/pd/model/model/ener_model.py (1)
102-102: LGTM! Final consistent type annotation update. The `comm_dict` parameter type change completes the coordinated refactoring across model classes, ensuring consistent communication tensor handling for distributed energy model inference.
deepmd/pd/model/atomic_model/base_atomic_model.py (1)
194-194: LGTM: Type annotation changes are consistent with the broader refactor. The changes from `Optional[dict[str, paddle.Tensor]]` to `Optional[list[paddle.Tensor]]` for the `comm_dict` parameter align correctly with the codebase-wide transition to support the new distributed communication approach using the custom Paddle operator.
Also applies to: 277-277
deepmd/pd/model/atomic_model/dp_atomic_model.py (1)
253-253: LGTM: Type annotation change is consistent. The change from `Optional[dict[str, paddle.Tensor]]` to `Optional[list[paddle.Tensor]]` for the `comm_dict` parameter correctly aligns with the broader codebase refactor to support the new communication structure.
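The dict-to-list transition these comments describe can be sketched as below. This is an illustrative assumption, not deepmd-kit's actual convention: the key names and their ordering (`COMM_KEYS`) are hypothetical, chosen only to show why a fixed ordering makes the flat list losslessly convertible back to a dict on the consumer side.

```python
# Hypothetical sketch: Paddle PIR export does not support dict-typed inputs,
# so the communication tensors are passed as a flat list in a fixed, agreed
# order. The key order below is illustrative, not deepmd-kit's real order.
COMM_KEYS = ["send_list", "send_proc", "recv_proc", "send_num", "recv_num", "communicator"]


def comm_dict_to_list(comm_dict: dict) -> list:
    """Flatten the communication dict into a list using the fixed key order."""
    return [comm_dict[k] for k in COMM_KEYS]


def comm_list_to_dict(comm_list: list) -> dict:
    """Recover the dict view on the consumer (e.g. C++ inference) side."""
    return dict(zip(COMM_KEYS, comm_list))


# Round trip: the fixed ordering makes the flattening lossless.
d = {k: i for i, k in enumerate(COMM_KEYS)}
assert comm_list_to_dict(comm_dict_to_list(d)) == d
```

Because both producer and consumer share `COMM_KEYS`, the list carries exactly the same information as the dict while staying within what PIR can trace.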
   deepmd/pd/cxx_op.py  Outdated    
```python
def load_library(module_name: str) -> tuple[bool, ModuleType]:
    """Load OP library and return the module if success.

    Parameters
    ----------
    module_name : str
        Name of the module

    Returns
    -------
    bool
        Whether the library is loaded successfully
    ModuleType
        loaded custom operator module
    """
    current_file = Path(__file__).resolve()
    base_dir = current_file.parents[2]

    module_file = base_dir / "source" / "op" / "pd" / "comm.cc"
    paddle_ops_deepmd = load(
        name="deepmd_op_pd",
        sources=[str(module_file)],
    )
    return True, paddle_ops_deepmd
```
🛠️ Refactor suggestion
Address unused parameter and improve error handling.
The `load_library` function has several issues:
- The `module_name` parameter is unused - the function hardcodes `"deepmd_op_pd"`
- Always returns `True` for success, even though `paddle.utils.cpp_extension.load` could fail
- No error handling for compilation failures

```diff
-def load_library(module_name: str) -> tuple[bool, ModuleType]:
+def load_library(module_name: str) -> tuple[bool, Optional[ModuleType]]:
     """Load OP library and return the module if success.

     Parameters
     ----------
     module_name : str
         Name of the module

     Returns
     -------
     bool
         Whether the library is loaded successfully
-    ModuleType
+    Optional[ModuleType]
         loaded custom operator module
     """
+    try:
         current_file = Path(__file__).resolve()
         base_dir = current_file.parents[2]

         module_file = base_dir / "source" / "op" / "pd" / "comm.cc"
         paddle_ops_deepmd = load(
-            name="deepmd_op_pd",
+            name=module_name,
             sources=[str(module_file)],
         )
         return True, paddle_ops_deepmd
+    except Exception as e:
+        import logging
+        logging.getLogger(__name__).warning(f"Failed to load custom operator {module_name}: {e}")
+        return False, None
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In deepmd/pd/cxx_op.py around lines 14 to 37, the load_library function does not use the module_name parameter, always returns True regardless of success, and lacks error handling for compilation failures. Modify the function to use the module_name parameter dynamically when calling load, wrap the load call in a try-except block to catch exceptions, return False and None if loading fails, and only return True and the module if loading succeeds.

```python
    return True, paddle_ops_deepmd


ENABLE_CUSTOMIZED_OP, paddle_ops_deepmd = load_library("deepmd_op_pd")
```
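The pattern suggested in this review can be shown as a minimal, runnable sketch. Here `importlib.import_module` stands in for `paddle.utils.cpp_extension.load` (which JIT-compiles C++ and needs a full Paddle toolchain); the function name and signature mirror the one under review only for illustration.

```python
import importlib
import logging
from types import ModuleType
from typing import Callable, Optional


def load_library(
    module_name: str,
    loader: Callable[[str], ModuleType],
) -> tuple[bool, Optional[ModuleType]]:
    """Load a custom-operator library, returning (success, module).

    Any exception raised by the loader (e.g. a C++ compilation failure)
    is caught and logged instead of breaking the caller's import.
    """
    try:
        # Pass the requested name through instead of hardcoding it.
        module = loader(module_name)
        return True, module
    except Exception as exc:
        logging.getLogger(__name__).warning(
            "Failed to load custom operator %s: %s", module_name, exc
        )
        return False, None


# Stand-in loader: importlib.import_module plays the role of
# paddle.utils.cpp_extension.load in this sketch.
ok, mod = load_library("math", importlib.import_module)
assert ok and mod is not None

ok, mod = load_library("no_such_extension_xyz", importlib.import_module)
assert (ok, mod) == (False, None)
```

Because the failure is reported through the returned flag rather than an exception, module-level code such as `ENABLE_CUSTOMIZED_OP, paddle_ops_deepmd = load_library(...)` degrades gracefully when the extension cannot be built.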
🛠️ Refactor suggestion
Handle potential loading failures gracefully.
The immediate loading on import could cause import failures if Paddle is not properly configured or if the C++ compilation fails.
```diff
-ENABLE_CUSTOMIZED_OP, paddle_ops_deepmd = load_library("deepmd_op_pd")
+try:
+    ENABLE_CUSTOMIZED_OP, paddle_ops_deepmd = load_library("deepmd_op_pd")
+except Exception:
+    ENABLE_CUSTOMIZED_OP, paddle_ops_deepmd = False, None
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
try:
    ENABLE_CUSTOMIZED_OP, paddle_ops_deepmd = load_library("deepmd_op_pd")
except Exception:
    ENABLE_CUSTOMIZED_OP, paddle_ops_deepmd = False, None
```
🤖 Prompt for AI Agents
In deepmd/pd/cxx_op.py at line 40, the code loads the custom Paddle library immediately on import, which can cause import failures if Paddle is not configured correctly or if the C++ compilation fails. Modify this to handle loading errors gracefully by wrapping the load_library call in a try-except block, catching exceptions, and providing a clear error message or fallback behavior to prevent the entire module from failing to import. d1369a7 to 03a09d6   Compare   | config->EnableNewIR(true); | ||
| config->EnableCustomPasses({"add_shadow_output_after_dead_parameter_pass"}, | ||
| true); | ||
| // config->SwitchIrOptim(false); | 
Check notice
Code scanning / CodeQL
Commented-out code Note
| config_fl->EnableNewIR(true); | ||
| config_fl->EnableCustomPasses({"add_shadow_output_after_dead_parameter_pass"}, | ||
| true); | ||
| // config_fl->SwitchIrOptim(false); | 
Check notice
Code scanning / CodeQL
Commented-out code Note
Support DPA2/DPA3 C++ inference with LAMMPS
Note
We use a workaround: `Dict` is not supported in Paddle PIR and inference, so it is replaced by `List` when using `comm_dict`.

Other related work:
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Chores