
High memory pressure for Elasticsearch versions using JDK 20+ #99592

@jonathan-buttner

Description


Elasticsearch Version

8.7.1 and above, 7.17.10 and above

Installed Plugins

No response

Java Version

JDK 20 or above.

The issue depends on the JDK version rather than the ES version.

For reference, these are the bundled JDK versions (see the Dependencies and versions section of the Elasticsearch docs for each stack version):

Stack   - JDK
7.17.10 - 20.0.1+9
7.17.11 - 20.0.1+9
7.17.12 - 20.0.2+9
7.17.13 - 20.0.2+9
7.17.14 - 21+35
7.17.15 - 21.0.1+12
8.7.1   - 20.0.1+9
8.8.2   - 20.0.1+9
8.9.0   - 20.0.2+9
8.9.1   - 20.0.2+9
8.9.2   - 20.0.2+9
8.10.0  - 20.0.2+9
8.10.1  - 20.0.2+9
8.10.2  - 20.0.2+9
8.10.3  - 21+35
8.10.4  - 21+35
8.11.0  - 21.0.1+12
8.11.1  - 21.0.1+12
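To confirm which JDK a given node is actually running, the nodes info API reports the JVM version and whether the bundled JDK is in use. A quick check (credentials and URL are placeholders for your cluster):

curl -s -u elastic:<password> \
  "http://localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.version,nodes.*.jvm.using_bundled_jdk&pretty"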

The last stack versions that allow the escape-hatch workaround of re-enabling the disabled JVM setting (-XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC - see Problem Description below) are:

  • Elasticsearch v8.10.2 for the 8.x branch
  • Elasticsearch v7.17.13 for the 7.x branch

since later versions bundle JDK 21 or newer, which removes the setting from the JVM entirely. Adding or leaving the JVM setting in place and starting or upgrading to 8.10.3+ will cause the JVM to fail to start because the setting is unknown to those later JDK versions.

Switching to a non-bundled JDK < 20 (where possible) should also work as a workaround.
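For reference, a minimal sketch of both workarounds on a tar.gz install of an affected version (8.7.1 - 8.10.2 or 7.17.10 - 7.17.13); the install path, options file name, and JDK location are placeholders:

# Escape hatch: re-enable preventive GC (only valid while the bundled JDK is 20.x)
cat > /path/to/elasticsearch/config/jvm.options.d/preventive-gc.options <<'EOF'
-XX:+UnlockDiagnosticVMOptions
-XX:+G1UsePreventiveGC
EOF
# Remove this file before upgrading to 8.10.3+ / 7.17.14+, whose bundled JDK 21
# no longer recognizes the flag and will refuse to start.

# Alternative: run a supported non-bundled JDK < 20, e.g. JDK 17
export ES_JAVA_HOME=/path/to/jdk-17
/path/to/elasticsearch/bin/elasticsearch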

OS Version

N/A

Problem Description

In 8.7.1 the bundled JDK was changed to JDK 20.0.1 in this PR: #95373

When retrieving large documents from Elasticsearch we see high memory pressure on the data node returning the documents.

There seems to be a distinct difference in how allocated memory is cleaned up between JDK 19 and JDK 20.

The graphs below show memory usage when allocating many ~5 MB byte arrays to transfer a ~400 MB PyTorch model from a data node to an ML node. The PyTorch model is chunked and stored in separate documents that are retrieved one at a time by the ML node. When repeatedly allocating these large arrays, we see memory pressure increase distinctly on JDK 20.0.1. The graphs show memory usage over time while repeatedly starting and stopping the PyTorch model.

Memory pressure for 8.7.0

[image: memory pressure graph]

Memory pressure for 8.7.1

[image: memory pressure graph]

Here's one from VisualVM monitoring heap usage on a local deployment

[image: VisualVM heap usage graph]

Memory pressure for 8.7.1 when using the -XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC options

If the data node is started with these JVM options enabled, we see memory usage much closer to what it looks like on JDK 19

[image: memory pressure graph]
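If you want numbers rather than graphs, the node stats API exposes heap usage and G1 collector activity; for example (credentials and URL are placeholders for a local cluster):

curl -s -u elastic:<password> \
  "http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors&pretty"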

Steps to Reproduce

The issue can be reproduced easily in Cloud, but I'll describe the steps for running Elasticsearch locally too.

Setup

Cloud
In Cloud, deploy a cluster with:

  • 2 zones and 4 GB data nodes
  • 2 zones and 4 GB ML nodes
  • Enable monitoring so you can see the heap usage

Locally

Run two nodes (1 data node, and 1 ML node).

  • Download and install Elasticsearch 8.7.1: https://www.elastic.co/downloads/past-releases/elasticsearch-8-7-1
  • Download and install Kibana 8.7.1: https://www.elastic.co/downloads/past-releases/kibana-8-7-1
  • Download and install eland: https://github.com/elastic/eland

An easy way to run two nodes is simply to decompress the bundle in two places.
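For example, on Linux (the artifact URL and directory names are assumptions; adjust for your platform):

curl -sO https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.7.1-linux-x86_64.tar.gz
mkdir es-data-node es-ml-node
tar -xzf elasticsearch-8.7.1-linux-x86_64.tar.gz -C es-data-node --strip-components=1
tar -xzf elasticsearch-8.7.1-linux-x86_64.tar.gz -C es-ml-node --strip-components=1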

Configuration

  • Create a file under config/jvm.options.d and add the following JVM options (one per line) for both the data node and ML node; the full data-node sequence is consolidated in the sketch after this list
-Xms4g
-Xmx4g
  • Add the following settings to the data node's config/elasticsearch.yml file
node.roles: ["master", "data", "data_content", "ingest", "data_hot", "data_warm", "data_cold", "data_frozen", "transform"]
xpack.security.enabled: true
xpack.license.self_generated.type: "trial"
  • Add the following settings to the ML node's config/elasticsearch.yml file
node.roles: ["ml"]
xpack.security.enabled: true
xpack.license.self_generated.type: "trial"
  • Reset the elastic password on the data node

From bin

./elasticsearch-reset-password -i -u elastic --url http://localhost:9200 
  • Create a service token for kibana

From bin

./elasticsearch-service-tokens create elastic/kibana <token name> 
  • Add this token to kibana's kibana.yml file
elasticsearch.serviceAccountToken: "<token>" 

and ensure that the elasticsearch.username and elasticsearch.password settings are disabled (removed or commented out)

  • Start both elasticsearch nodes and kibana
  • Connect VisualVM to the data node and observe memory usage over time
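A consolidated sketch of the data node steps above, assuming the tarball layout from the setup section (directory, file, and token names are placeholders):

cd es-data-node

cat > config/jvm.options.d/heap.options <<'EOF'
-Xms4g
-Xmx4g
EOF

cat >> config/elasticsearch.yml <<'EOF'
node.roles: ["master", "data", "data_content", "ingest", "data_hot", "data_warm", "data_cold", "data_frozen", "transform"]
xpack.security.enabled: true
xpack.license.self_generated.type: "trial"
EOF

./bin/elasticsearch -d                                                      # start the data node
./bin/elasticsearch-reset-password -i -u elastic --url http://localhost:9200
./bin/elasticsearch-service-tokens create elastic/kibana kibana-token       # copy the token into kibana.yml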

Reproducing the bug in cloud and locally

  • Upload a PyTorch model of around 400 MB

Locally

docker run -it --rm --network host elastic/eland \
  eland_import_hub_model \
  --url http://elastic:changeme@host.docker.internal:9200/ \
  --hub-model-id sentence-transformers/all-distilroberta-v1 \
  --clear-previous

For cloud

docker run -it --rm elastic/eland \
  eland_import_hub_model \
  --url https://elastic:<cloud password>@<cloud es url>:9243 \
  --hub-model-id sentence-transformers/all-distilroberta-v1 \
  --clear-previous
  • Repeatedly start and stop the uploaded model (20 - 30 times), either via the Kibana UI as below or with the scripted alternative sketched after this list
    • Navigate to Machine Learning -> Trained models
    • Click the start and stop buttons repeatedly
  • Observe memory usage over time
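If clicking through Kibana 20 - 30 times is tedious, the ML trained model deployment APIs can drive the same loop from a script. A sketch, assuming local credentials and the model ID that eland derives from sentence-transformers/all-distilroberta-v1 (both are assumptions to adjust):

ES_URL="http://elastic:changeme@localhost:9200"           # placeholder credentials/URL
MODEL_ID="sentence-transformers__all-distilroberta-v1"    # eland-derived model ID; verify under Trained Models

for i in $(seq 1 25); do
  echo "iteration $i"
  curl -s -XPOST "$ES_URL/_ml/trained_models/$MODEL_ID/deployment/_start?wait_for=started" > /dev/null
  curl -s -XPOST "$ES_URL/_ml/trained_models/$MODEL_ID/deployment/_stop" > /dev/null
  # check the data node's heap between iterations
  curl -s "$ES_URL/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent"
  echo
done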

Logs (if relevant)

No response
