[ML] Fix double-counting of inference memory in the assignment rebalancer #133919
Conversation
Hi @valeriy42, I've created a changelog YAML for you.
Pinging @elastic/ml-core (Team:ML)
- // We subtract native inference memory as the planner expects available memory for
- // native inference including current assignments.
- getNodeFreeMemoryExcludingPerNodeOverheadAndNativeInference(load),
+ load.getFreeMemoryExcludingPerNodeOverhead(),
I don't understand the method name getFreeMemoryExcludingPerNodeOverhead. Shouldn't that just be getFreeMemory()?
It seems that those numbers should be the same most of the time. Maybe there is a corner case where the ML node is used only for inference in Java code, so that we don't account for the 30MB of native code overhead, and NodeLoad.assignedNativeCodeOverheadMemory is 0.
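As a sketch of that relationship (semantics inferred from the method names in this thread, not copied from the actual NodeLoad source), the two free-memory getters would differ only by the assigned native code overhead, and so coincide exactly when that overhead is 0:

```java
// Hypothetical illustration of the corner case discussed above; the real
// NodeLoad class in Elasticsearch is more involved.
class NodeLoadCornerCase {
    long maxMemory;
    long assignedJobMemory;                // assumed to include native code overhead
    long assignedNativeCodeOverheadMemory; // e.g. 30MB when a native process runs, 0 otherwise

    long getFreeMemory() {
        return Math.max(0, maxMemory - assignedJobMemory);
    }

    // Assumed: leaves the native code overhead out of the assigned total,
    // so a caller (e.g. the planner) can account for the overhead itself.
    long getFreeMemoryExcludingPerNodeOverhead() {
        return Math.max(0, maxMemory - (assignedJobMemory - assignedNativeCodeOverheadMemory));
    }

    // When assignedNativeCodeOverheadMemory == 0 (e.g. inference runs only
    // in Java code), both getters return the same value.
}
```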
LGTM
LGTM
[ML] Fix double-counting of inference memory in the assignment rebalancer (elastic#133919)
[ML] Fix double-counting of inference memory in the assignment rebalancer (#133919) (#134053)
[ML] Fix double-counting of inference memory in the assignment rebalancer (#133919) (#134052)
[ML] Fix double-counting of inference memory in the assignment rebalancer (#133919) (#134051)
[ML] Fix double-counting of inference memory in the assignment rebalancer (#133919) (#134054)
The static method TrainedModelAssignmentRebalancer.getNodeFreeMemoryExcludingPerNodeOverheadAndNativeInference was used to subtract load.getAssignedNativeInferenceMemory() from load.getFreeMemoryExcludingPerNodeOverhead(). However, in NodeLoad.getFreeMemoryExcludingPerNodeOverhead(), native inference memory was already subtracted as part of the getAssignedJobMemoryExcludingPerNodeOverhead() calculation. This led to double-counting of the native inference memory. Avoiding this double-counting allows us to remove the private method getNodeFreeMemoryExcludingPerNodeOverheadAndNativeInference() entirely.
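For illustration, a minimal sketch of the double-counting (hypothetical field names and numbers, modeled on the getters named above; not the actual NodeLoad source):

```java
// Hypothetical sketch of the double-counting described in this PR.
class NodeLoadSketch {
    long maxMemory = 1000;                     // total ML memory on the node
    long assignedAnomalyDetectionMemory = 200; // other assigned job memory
    long assignedNativeInferenceMemory = 300;  // native inference assignments

    // Native inference memory is already part of the assigned-job total here,
    // while the per-node native code overhead is deliberately left out.
    long getFreeMemoryExcludingPerNodeOverhead() {
        long assignedJobMemoryExcludingPerNodeOverhead =
            assignedAnomalyDetectionMemory + assignedNativeInferenceMemory;
        return maxMemory - assignedJobMemoryExcludingPerNodeOverhead;
    }

    // The removed helper subtracted native inference memory a second time:
    // 1000 - (200 + 300) - 300 = 200, instead of the correct 500.
    long removedHelperValue() {
        return getFreeMemoryExcludingPerNodeOverhead() - assignedNativeInferenceMemory;
    }
}
```

With these numbers the planner would have seen only 200 units of free memory instead of 500, making nodes appear fuller than they are; the fix passes load.getFreeMemoryExcludingPerNodeOverhead() directly.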