Description
Elasticsearch Version
8.7.1 and above, 7.17.10 and above
Installed Plugins
No response
Java Version
JDK 20 or above.
The issue depends on the JDK version rather than the ES version.
For reference, these are the bundled JDK versions (see the Dependencies and versions section of the Elasticsearch docs for each stack version):

Stack | JDK
--- | ---
7.17.10 | 20.0.1+9
7.17.11 | 20.0.1+9
7.17.12 | 20.0.2+9
7.17.13 | 20.0.2+9
7.17.14 | 21+35
7.17.15 | 21.0.1+12
8.7.1 | 20.0.1+9
8.8.2 | 20.0.1+9
8.9.0 | 20.0.2+9
8.9.1 | 20.0.2+9
8.9.2 | 20.0.2+9
8.10.0 | 20.0.2+9
8.10.1 | 20.0.2+9
8.10.2 | 20.0.2+9
8.10.3 | 21+35
8.10.4 | 21+35
8.11.0 | 21.0.1+12
8.11.1 | 21.0.1+12

The last stack versions that allow the escape-hatch workaround of re-enabling the disabled JVM setting (`-XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC` - see Problem Description below) are:
- Elasticsearch v8.10.2 for the 8.x branch
- Elasticsearch v7.17.13 for the 7.x branch
since later versions bundle JDK 21 or newer, which has the setting removed from the JVM entirely.
Adding or leaving this JVM setting in place and then starting or upgrading to 8.10.3+ will fail to start the JVM, because the setting is unknown to the later JDK versions.
Switching to a non-bundled JDK < 20 (where possible) should also work as a workaround.
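For reference, a minimal sketch of that escape hatch on a stack version that still bundles JDK 20; the file name `g1-preventive.options` is arbitrary (Elasticsearch picks up any `.options` file under `config/jvm.options.d/`):

```
# config/jvm.options.d/g1-preventive.options
# Re-enables preventive GC, which JDK 20 disabled by default and JDK 21 removed.
# Remove this file before upgrading to a release that bundles JDK 21+ (8.10.3+ / 7.17.14+),
# otherwise the JVM refuses to start on the unknown option.
-XX:+UnlockDiagnosticVMOptions
-XX:+G1UsePreventiveGC
```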
OS Version
N/A
Problem Description
In 8.7.1 the bundled JDK was changed to JDK 20.0.1 in this PR: #95373
When retrieving large documents from Elasticsearch we see high memory pressure on the data node returning the documents.
There seems to be a distinct difference in how allocated memory is cleaned up between JDK 19 and JDK 20.
The graphs below show memory usage when allocating many ~5 MB byte arrays to transfer a ~400 MB PyTorch model from a data node to an ML node. The PyTorch model is chunked and stored in separate documents that are retrieved one at a time by the ML node. When repeatedly allocating these large arrays, we see memory pressure increase distinctly on JDK 20.0.1. The graphs show memory usage over time while repeatedly starting and stopping the PyTorch model.
Memory pressure for 8.7.0
Memory pressure for 8.7.1
Here's one from VisualVM monitoring heap usage on a local deployment
Memory pressure for 8.7.1 when using the -XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC options
If the data node is started with these JVM options enabled, we see memory usage closer to what it looks like on JDK 19.
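For illustration, a minimal standalone sketch of the allocation pattern described above (not the actual Elasticsearch code path); the chunk size, chunk count, and cycle count are assumptions chosen to roughly mirror transferring a ~400 MB model in ~5 MB chunks 20-30 times. Run it with `-Xms4g -Xmx4g` on JDK 19 vs JDK 20 and watch heap usage in VisualVM:

```java
// Hypothetical reproduction sketch, not Elasticsearch code: repeatedly allocate
// ~5 MB byte arrays to mimic chunked transfer of a ~400 MB model. With a 4 GB
// heap (2 MB G1 regions) each array is a humongous object, the allocation
// pattern preventive GC was meant to help with.
public class LargeArrayChurn {
    private static final int CHUNK_SIZE = 5 * 1024 * 1024; // ~5 MB per chunk
    private static final int CHUNKS_PER_MODEL = 80;        // ~400 MB per "model transfer"

    public static void main(String[] args) throws InterruptedException {
        for (int run = 1; run <= 30; run++) {               // ~"start/stop the model 20-30 times"
            byte[][] model = new byte[CHUNKS_PER_MODEL][];
            for (int i = 0; i < CHUNKS_PER_MODEL; i++) {
                model[i] = new byte[CHUNK_SIZE];            // humongous allocation
            }
            System.out.println("run " + run + ": allocated " + model.length + " chunks");
            model = null;                                   // drop references, as when a transfer completes
            Thread.sleep(1_000);                            // pause between runs, like the UI start/stop cycle
        }
    }
}
```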
Steps to Reproduce
The issue can be reproduced easily in Cloud, but I'll describe the steps for running Elasticsearch locally too.
Setup
Cloud
In Cloud, deploy a cluster with:
- 2 zones and 4 GB data nodes
- 2 zones and 4 GB ML nodes
- Enable monitoring so you can see the heap usage
Locally
Run two nodes (1 data node and 1 ML node).
Download and install Elasticsearch 8.7.1: https://www.elastic.co/downloads/past-releases/elasticsearch-8-7-1
Download and install Kibana 8.7.1: https://www.elastic.co/downloads/past-releases/kibana-8-7-1
Download and install eland: https://github.com/elastic/eland
An easy way to run two nodes is simply to decompress the bundle in two places.
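For example, a sketch assuming the Linux x86_64 tarball (adjust the archive name for your platform):

```sh
# Unpack the same 8.7.1 archive into two directories, one per node.
mkdir data-node ml-node
tar -xzf elasticsearch-8.7.1-linux-x86_64.tar.gz -C data-node --strip-components=1
tar -xzf elasticsearch-8.7.1-linux-x86_64.tar.gz -C ml-node --strip-components=1
```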
Configuration
- Create a file under `config/jvm.options.d` and add the following JVM options for both the data node and the ML node

  ```
  -Xms4g
  -Xmx4g
  ```

- Add the following settings to the data node's `config/elasticsearch.yml` file

  ```yaml
  node.roles: ["master", "data", "data_content", "ingest", "data_hot", "data_warm", "data_cold", "data_frozen", "transform"]
  xpack.security.enabled: true
  xpack.license.self_generated.type: "trial"
  ```

- Add the following settings to the ML node's `config/elasticsearch.yml` file

  ```yaml
  node.roles: ["ml"]
  xpack.security.enabled: true
  xpack.license.self_generated.type: "trial"
  ```

- Reset the `elastic` password on the data node. From `bin`:

  ```
  ./elasticsearch-reset-password -i -u elastic --url http://localhost:9200
  ```

- Create a service token for kibana. From `bin`:

  ```
  ./elasticsearch-service-tokens create elastic/kibana <token name>
  ```

- Add this token to kibana's `kibana.yml` file

  ```yaml
  elasticsearch.serviceAccountToken: "<token>"
  ```

  and ensure that the `elasticsearch.username` and `elasticsearch.password` settings are disabled
- Start both elasticsearch nodes and kibana
- Connect VisualVM to the data node and observe memory usage over time
Reproducing the bug in cloud and locally
- Upload a PyTorch model of around 400 MB

  Locally:

  ```
  docker run -it --rm --network host elastic/eland \
      eland_import_hub_model \
      --url http://elastic:changeme@host.docker.internal:9200/ \
      --hub-model-id sentence-transformers/all-distilroberta-v1 \
      --clear-previous
  ```

  For cloud:

  ```
  docker run -it --rm elastic/eland \
      eland_import_hub_model \
      --url https://elastic:<cloud password>@<cloud es url>:9243 \
      --hub-model-id sentence-transformers/all-distilroberta-v1 \
      --clear-previous
  ```

- Repeatedly start and stop the uploaded model (20 - 30 times), either through the Kibana UI as described below or by scripting the deployment APIs as sketched after this list
- Navigate to Machine Learning -> Trained models
- Click the start and stop buttons repeatedly
- Observe memory usage over time
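A sketch of scripting the start/stop cycle instead of clicking through Kibana; the `_ml/trained_models/<model_id>/deployment/_start` and `_stop` endpoints are the standard trained model deployment APIs, but the model ID `sentence-transformers__all-distilroberta-v1` (eland's usual naming of the imported hub model), credentials, and URL are assumptions to adjust for your deployment:

```sh
#!/usr/bin/env bash
# Assumed model ID and local node URL; adjust for your deployment.
MODEL_ID="sentence-transformers__all-distilroberta-v1"
ES="http://elastic:changeme@localhost:9200"

for i in $(seq 1 30); do
  # Start a deployment and wait until it is started.
  curl -s -X POST "$ES/_ml/trained_models/$MODEL_ID/deployment/_start?wait_for=started" > /dev/null
  # Stop it again so the next iteration re-allocates and re-transfers the model.
  curl -s -X POST "$ES/_ml/trained_models/$MODEL_ID/deployment/_stop?force=true" > /dev/null
  echo "start/stop cycle $i complete"
done
```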
Logs (if relevant)
No response



