
High memory pressure for Elasticsearch versions using JDK 20+ #99592

@jonathan-buttner

Description


Elasticsearch Version

8.7.1 and above, 7.17.10 and above

Installed Plugins

No response

Java Version

JDK 20 or above.

The issue depends on the JDK version rather than the ES version.

For reference, these are the bundled JDK versions (see the Dependencies and versions section of the Elasticsearch docs for each stack version):

Stack   - JDK
7.17.10 - 20.0.1+9
7.17.11 - 20.0.1+9
7.17.12 - 20.0.2+9
7.17.13 - 20.0.2+9
7.17.14 - 21+35
7.17.15 - 21.0.1+12
8.7.1   - 20.0.1+9
8.8.2   - 20.0.1+9
8.9.0   - 20.0.2+9
8.9.1   - 20.0.2+9
8.9.2   - 20.0.2+9
8.10.0  - 20.0.2+9
8.10.1  - 20.0.2+9
8.10.2  - 20.0.2+9
8.10.3  - 21+35
8.10.4  - 21+35
8.11.0  - 21.0.1+12
8.11.1  - 21.0.1+12
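To confirm which JDK a given node is actually running, the nodes info API reports the JVM version and whether the bundled JDK is in use. A quick check (credentials and URL are placeholders for your cluster):

curl -s -u elastic:<password> \
  "http://localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.version,nodes.*.jvm.using_bundled_jdk&pretty"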

The last stack versions that allow the escape-hatch workaround of re-enabling the disabled JVM setting (-XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC - see Problem Description below) are:

  • Elasticsearch v8.10.2 for the 8.x branch
  • Elasticsearch v7.17.13 for the 7.x branch

since later versions bundle JDK 21 or newer, which removes the setting from the JVM entirely. Adding or leaving the JVM setting in place and starting or upgrading to 8.10.3+ will cause the JVM to fail to start because the setting is unknown to those later JDK versions.

Switching to a non-bundled JDK < 20 (where possible) should also work as a workaround.
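For reference, a minimal sketch of both workarounds on a tar.gz install of an affected version (8.7.1 - 8.10.2 or 7.17.10 - 7.17.13); the install path, options file name, and JDK location are placeholders:

# Escape hatch: re-enable preventive GC (only valid while the bundled JDK is 20.x)
cat > /path/to/elasticsearch/config/jvm.options.d/preventive-gc.options <<'EOF'
-XX:+UnlockDiagnosticVMOptions
-XX:+G1UsePreventiveGC
EOF
# Remove this file before upgrading to 8.10.3+ / 7.17.14+, whose bundled JDK 21
# no longer recognizes the flag and will refuse to start.

# Alternative: run a supported non-bundled JDK < 20, e.g. JDK 17
export ES_JAVA_HOME=/path/to/jdk-17
/path/to/elasticsearch/bin/elasticsearch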

OS Version

N/A

Problem Description

In 8.7.1 the bundled JDK was changed to JDK 20.0.1 in this PR: #95373

When retrieving large documents from Elasticsearch we see high memory pressure on the data node returning the documents.

There seems to be a distinct difference in how allocated memory is cleaned up between JDK 19 and JDK 20.

The graphs below show memory usage when allocating many ~5 MB byte arrays to transfer a ~400 MB PyTorch model from a data node to an ML node. The PyTorch model is chunked and stored in separate documents that are retrieved one at a time by the ML node. When repeatedly allocating these large arrays, we see memory pressure increase distinctly on JDK 20.0.1. The graphs show memory usage over time while repeatedly starting and stopping the PyTorch model.

Memory pressure for 8.7.0

[image: memory pressure graph]

Memory pressure for 8.7.1

[image: memory pressure graph]

Here's one from VisualVM monitoring heap usage on a local deployment

[image: VisualVM heap usage graph]

Memory pressure for 8.7.1 when using the -XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC options

If the data node is started with these JVM options enabled, we see memory usage much closer to what it looks like on JDK 19

[image: memory pressure graph]
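If you want numbers rather than graphs, the node stats API exposes heap usage and G1 collector activity; for example (credentials and URL are placeholders for a local cluster):

curl -s -u elastic:<password> \
  "http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors&pretty"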

Steps to Reproduce

The issue can be reproduced easily in Cloud, but I'll describe the steps for running Elasticsearch locally too.

Setup

Cloud
In Cloud, deploy a cluster with:

  • 2 zones and 4 GB data nodes
  • 2 zones and 4 GB ML nodes
  • Enable monitoring so you can see the heap usage

Locally

Run two nodes (1 data node, and 1 ML node).

  • Download and install Elasticsearch 8.7.1: https://www.elastic.co/downloads/past-releases/elasticsearch-8-7-1
  • Download and install Kibana 8.7.1: https://www.elastic.co/downloads/past-releases/kibana-8-7-1
  • Download and install eland: https://github.com/elastic/eland

An easy way to run two nodes is simply to decompress the bundle in two places.
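For example, on Linux (the artifact URL and directory names are assumptions; adjust for your platform):

curl -sO https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.7.1-linux-x86_64.tar.gz
mkdir es-data-node es-ml-node
tar -xzf elasticsearch-8.7.1-linux-x86_64.tar.gz -C es-data-node --strip-components=1
tar -xzf elasticsearch-8.7.1-linux-x86_64.tar.gz -C es-ml-node --strip-components=1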

Configuration

  • Create a file under config/jvm.options.d and add the following JVM options (one per line) for both the data node and ML node; the full data-node sequence is consolidated in the sketch after this list
-Xms4g
-Xmx4g
  • Add the following settings to the data node's config/elasticsearch.yml file
node.roles: ["master", "data", "data_content", "ingest", "data_hot", "data_warm", "data_cold", "data_frozen", "transform"]
xpack.security.enabled: true
xpack.license.self_generated.type: "trial"
  • Add the following settings to the ML node's config/elasticsearch.yml file
node.roles: ["ml"]
xpack.security.enabled: true
xpack.license.self_generated.type: "trial"
  • Reset the elastic password on the data node

From bin

./elasticsearch-reset-password -i -u elastic --url http://localhost:9200 
  • Create a service token for kibana

From bin

./elasticsearch-service-tokens create elastic/kibana <token name> 
  • Add this token to kibana's kibana.yml file
elasticsearch.serviceAccountToken: "<token>" 

and ensure that the elasticsearch.username and elasticsearch.password settings are disabled (removed or commented out)

  • Start both elasticsearch nodes and kibana
  • Connect VisualVM to the data node and observe memory usage over time
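A consolidated sketch of the data node steps above, assuming the tarball layout from the setup section (directory, file, and token names are placeholders):

cd es-data-node

cat > config/jvm.options.d/heap.options <<'EOF'
-Xms4g
-Xmx4g
EOF

cat >> config/elasticsearch.yml <<'EOF'
node.roles: ["master", "data", "data_content", "ingest", "data_hot", "data_warm", "data_cold", "data_frozen", "transform"]
xpack.security.enabled: true
xpack.license.self_generated.type: "trial"
EOF

./bin/elasticsearch -d                                                      # start the data node
./bin/elasticsearch-reset-password -i -u elastic --url http://localhost:9200
./bin/elasticsearch-service-tokens create elastic/kibana kibana-token       # copy the token into kibana.yml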

Reproducing the bug in cloud and locally

  • Upload a PyTorch model of around 400 MB

Locally

docker run -it --rm --network host elastic/eland \
  eland_import_hub_model \
  --url http://elastic:changeme@host.docker.internal:9200/ \
  --hub-model-id sentence-transformers/all-distilroberta-v1 \
  --clear-previous

For cloud

docker run -it --rm elastic/eland \
  eland_import_hub_model \
  --url https://elastic:<cloud password>@<cloud es url>:9243 \
  --hub-model-id sentence-transformers/all-distilroberta-v1 \
  --clear-previous
  • Repeatedly start and stop the uploaded model (20 - 30 times), either via the Kibana UI as below or with the scripted alternative sketched after this list
    • Navigate to Machine Learning -> Trained models
    • Click the start and stop buttons repeatedly
  • Observe memory usage over time
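If clicking through Kibana 20 - 30 times is tedious, the ML trained model deployment APIs can drive the same loop from a script. A sketch, assuming local credentials and the model ID that eland derives from sentence-transformers/all-distilroberta-v1 (both are assumptions to adjust):

ES_URL="http://elastic:changeme@localhost:9200"           # placeholder credentials/URL
MODEL_ID="sentence-transformers__all-distilroberta-v1"    # eland-derived model ID; verify under Trained Models

for i in $(seq 1 25); do
  echo "iteration $i"
  curl -s -XPOST "$ES_URL/_ml/trained_models/$MODEL_ID/deployment/_start?wait_for=started" > /dev/null
  curl -s -XPOST "$ES_URL/_ml/trained_models/$MODEL_ID/deployment/_stop" > /dev/null
  # check the data node's heap between iterations
  curl -s "$ES_URL/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent"
  echo
done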

Logs (if relevant)

No response
