Change Elasticsearch timeouts to 429 response instead of 5xx #116026

rjernst · 2024-10-31T15:51:17Z

When a user passes a timeout to an Elasticsearch request, the user is telling the system to limit how long the request should take. If that timeout is breached, it is because the user selected a timeout less than the time the request needs. Therefore, it is up to the user to select a long enough time. If they don't, it's on the user to adjust their request.

Given the above, a breached timeout is an issue with the user request, and the http response code should refelect that. This commit changes timeout exceptions to give a 400 response code instead of the default 5xx.

When a user passes a timeout to an Elasticsearch request, the user is telling the system to limit how long the request should take. If that timeout is breached, it is because the user selected a timeout less than the time the request needs. Therefore, it is up to the user to select a long enough time. If they don't, it's on the user to adjust their request. Given the above, a breached timeout is an issue with the user request, and the http response code should refelect that. This commit changes timeout exceptions to give a 429 response code instead of the default 500.

elasticsearchmachine · 2024-10-31T15:51:41Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine · 2024-10-31T15:52:04Z

Hi @rjernst, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

henningandersen

I wonder if we should include changing ProcessClusterEventTimeoutException here?

henningandersen · 2024-11-06T14:16:58Z

docs/changelog/116026.yaml

+breaking:
+ title: Change Elasticsearch timeouts to 429 response instead of 500
+ area: REST API
+ details: When a user supplied timeout is sent with a REST request, and the request


I wonder if we should be explicit about this also covering cases where the timeout is not explicitly sent, rather has a default in ES like for bulk requests?

server/src/main/java/org/elasticsearch/search/query/SearchTimeoutException.java

rjernst · 2024-11-12T23:25:59Z

@henningandersen I believe I've addressed your comments.

server/src/main/java/org/elasticsearch/search/query/SearchTimeoutException.java

henningandersen

LGTM.

henningandersen · 2024-11-13T17:26:33Z

docs/changelog/116026.yaml

+breaking:
+ title: Change Elasticsearch timeouts to 429 response instead of 5xx
+ area: REST API
+ details: When a REST request times out, whether via a per-request timeout, or a system default, the request now returns a 429 response code.


I wonder if we need to mention search as an exception (unless I misunderstood how that works)?

I can try to reword to exclude search requests, but I think that begs the question why search is different. @javanna having to explain the difference both here and presumably in docs (though I don't know if/where we have any docs explaining timeouts now) really complicates this change. Can you suggest some text/explanation for why "non-search REST requests" can't use a 4xx response for timeout?

prdoyle · 2024-11-18T15:51:55Z

@elasticmachine update branch

rudolf · 2024-11-22T14:07:14Z

docs/changelog/116026.yaml

+breaking:
+ title: Change Elasticsearch timeouts to 400 response instead of 5xx
+ area: REST API
+ details: When a REST request times out, whether via a per-request timeout, or a system default, the request now returns a 400 response code.


Can we enumerate all the API endpoints that are impacted?

I've tried several APIs with the ?timeout param but was only able to produce two 5xx so far:

curl -v -XGET "http://localhost:9200/_cluster/health/my-index?wait_for_status=green&timeout=0s --> status code: 408 res body: {"timed_out":true, ...} curl -v -XPUT "http://localhost:9200/target-index/_mapping?timeout=0s" -H "Content-Type: application/json" -u elastic:changeme -d '{"properties": {"field3": {"type": "text"}}}' status code: 200 res body: {"acknowledged": false} curl -v -XPUT "http://localhost:9200/target-index-1?timeout=0" -H "Content-Type: application/json" -u elastic:changeme status code: 200 res body: {"acknowledged": false, "shards_acknowledged": false} curl -v -XPOST "http://localhost:9200/source-index/_clone/target-index?timeout=0" -H "Content-Type: application/json" -u elastic:changeme status code: 200 res body: {"acknowledged": false, "shards_acknowledged": false} GET /_tasks?actions=*&wait_for_completion=true&timeout=0s&pretty status code: 200 res body: { "node_failures": [ { "type": "failed_node_exception", "reason": "Failed node [01Z7rp-1QAKosLmFu2jOTQ]", "node_id": "01Z7rp-1QAKosLmFu2jOTQ", "caused_by": { "type": "receive_timeout_transport_exception", "reason": "[instance-0000000090][172.22.238.78:19761][cluster:monitor/tasks/lists[n]] request_id [396673064] timed out after [0ms]" } }, ... ] } GET /_tasks/QkCi4Xy3RTaSAaELiMSpTQ:3023?wait_for_completion=true&timeout=0ms&pretty status code: 500 res body { "error" : { "root_cause" : [ { "type" : "timeout_exception", "reason" : "timed out after [0s/0ms]" } ], "type" : "timeout_exception", "reason" : "timed out after [0s/0ms]" }, "status" : 500 } POST /test-index/_update_by_query?wait_for_completion=true&timeout=1ms&pretty status code: 500 res body: { "took" : 85, "timed_out" : false, <-- misleading if you ask me "total" : 100, "updated" : 99, "deleted" : 0, "batches" : 1, "version_conflicts" : 0, "noops" : 0, "retries" : { "bulk" : 0, "search" : 0 }, "throttled_millis" : 0, "requests_per_second" : -1.0, "throttled_until_millis" : 0, "failures" : [ { "index" : "test-index", "id" : "jIFEVJMBIY_Yr_cUNKcL", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 } ] }

Right now as a client of Elasticsearch it's really hard to gain confidence that this breaking change isn't impacting Kibana and I suspect our users will have similar anxiety when reading the release notes.

rjernst · 2024-12-27T17:19:34Z

Thanks for taking a look from the Kibana perspective, @rudolf. I think your examples demonstrate how inconsistent ES is about timeouts right now. The purpose of this PR is not to convert all timeouts to a 400 response. Instead, the intent is to convert existing 5xx timeouts to a 400. I think the changelog entry describes that, but if you have ideas for how to tweak the text to make that purpose more clear I'm happy to change it.

Given that this is only to change 5xx to 400, do you still have concerns? Your examples also show me that Kibana must already be handling all these different status codes for timeouts, right?

rudolf · 2025-01-08T12:19:02Z

@rjernst To be clear, I don't have any concerns with the change itself, I just feel that the changelog entry isn't explicit enough to be useful.

It would be very valuable for Kibana and I think anyone integrating with Elasticsearch to have a list of all the endpoints whose response codes are changing as a result. Otherwise users have to test each API endpoint to discover if the behavior might have changed.

rjernst · 2025-01-08T22:18:23Z

An explicit list of endpoints would be cumbersome to generate, and far too large for a breaking change entry; we have hundreds of endpoints. Many of the examples you gave are not the timeout returning 200, but a result of the request not waiting for the resulting action to complete (hence why you see acknowledged: false). I believe cluster health is something special (though I don't know why).

The intent of this change is that any timeout should no longer return 5xx. The specific response code is less important. Whether the timeout produces a 400 or 500, it's still an error, and our docs only say that a breaching the timeout will result in an error (for example, master_timeout says If the master node is not available before the timeout expires, the request fails and returns an error.).

So, IMO the changelog is accurate. We are changing timeouts that previously responded with 5xx to 400.

rudolf · 2025-01-12T08:23:24Z

I'm not saying it's inaccurate, it's unhelpful to our users. What do I do after reading this breaking change entry? How do I ensure this change doesn't impact me?

rjernst · 2025-01-22T22:35:51Z

@rudolf I adjusted the changelog wording a little bit, can you take a look? The message still does not give an exhaustive list of the APIs that would change. However, I narrowed the wording to be more clear: this affects most APIs, and the change for users is that if you have code looking for 5xx to retry, look for 400.

dpifke-elastic · 2025-01-27T21:06:32Z

I realize I'm a little late here, but shouldn't this be 408 instead of 400?

https://datatracker.ietf.org/doc/html/rfc9110#section-15.5.9

DaveCTurner · 2025-01-27T21:36:19Z

shouldn't this be 408 instead of 400?

No, definitely not. The spec for 408 reads thusly:

The 408 (Request Timeout) status code indicates that the server did not receive a complete request message within the time that it was prepared to wait.

That's the other way round from what we're discussing here: the server received a complete request and started to process it, but did not respond within the time the client indicated (in the request) that it was prepared to wait.

dpifke-elastic · 2025-01-27T21:52:15Z

That's the other way round from what we're discussing here: the server received a complete request and started to process it, but did not respond within the time the client indicated (in the request) that it was prepared to wait.

400 also feels like an inexact fit, since an identical request (same timeout) may succeed at a different time when the server is less busy.

From a client perspective, it seems important to be able to distinguish a request that failed due to current load, versus a request that will never succeed because of a syntax error.

I agree that there isn't an exact match for this condition in the current HTTP spec. To me, it seems better to slightly misuse 408 than 400, since I believe Elasticsearch doesn't currently have a use for 408.

DaveCTurner · 2025-01-27T22:28:54Z

400 also feels like an inexact fit

Such is life with HTTP status codes. There's not many of them, and they don't map at all well onto the semantics we want. We just have to do the best we can.

it seems important to be able to distinguish a request that failed due to current load, versus a request that will never succeed because of a syntax error.

Sure, but that distinction cannot be made using the status code alone. That's not what HTTP status codes are for. You have to look at the response body for this sort of issue.

it seems better to slightly misuse 408

It's not a slight misuse, it's completely wrong to use 408 in this context.

dpifke-elastic · 2025-01-27T22:30:32Z

Sure, but that distinction cannot be made using the status code alone. That's not what HTTP status codes are for. You have to look at the response body for this sort of issue.

That distinction can be made if we leave this as 504.

DaveCTurner · 2025-01-27T23:00:57Z

That distinction can be made if we leave this as 504.

We can't use a 5xx code for timeouts which are under the client's control (which should be basically all of them in practice).

rjernst · 2025-01-28T17:34:08Z

After discussion, we've agreed to go back to 429. In followups we will also add an indication in the error response body, as well as a header, that the request timed out. But for this PR, the http code is changed to 429 as a first step, signaling intent in the ES 9.0 beta that timeouts will be handled and documented more exactly in the future.

simitt · 2025-01-29T08:14:49Z

@rjernst is it guaranteed in this case that the events sent were not indexed to Elasticsearch? For at most once delievery usecases we currently retry on 429, as it was safe to assume no events were indexed, but I am not sure this holds true in this case.

henningandersen · 2025-01-29T08:37:49Z

I am not sure this holds true in this case.

It should hold true also in this case. We've done preliminary checking and will do a bit more prior to actual release. We'll want to treat it as a bug if a timeout occurs and we report this back at a higher level than item level.

A couple potential conflicts is:

refresh: if the refresh times out we may need to report this back to the client as a 429 timed-out. We may also choose something else for this case. Today, it should be unmodified by this change.
retry on shard failure: if a shard fails we may retry, but then time out. This is also unmodified by this change. We may choose to distinguish the response code more in those cases.

All in all, this change should not affect 429 retry'ability for bulk, but subsequent work may need more care.

leemthompo · 2025-04-11T10:37:01Z

@rjernst is this PR relevant to the serverless changelog? [FYI this question is based on 9.0 breaking changes]

rjernst · 2025-04-11T13:13:23Z

Yes, it affects serverless, though note that in the context of serverless we do not consider this breaking.

leemthompo · 2025-04-14T15:14:34Z

@rjernst thanks, could you explain why this isn't considered breaking for serverless?

rjernst · 2025-04-21T16:18:34Z

It's a bit of a gray area. These requests were already errors, it's just a question of which error response code is returned. Errors are not a part of the request/response specs that make up the API surface area which we aim to maintain compatibility for. That's not to say that we freely change errors, but in this case the benefits (more clarity on the retryability, and better ability to distinguish user caused vs server side errors) made the change worthwhile. Arguably this shouldn't have been considered a breaking change in 9.0.

rjernst added >breaking :Core/Infra/Core Core issues without another label labels Oct 31, 2024

elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Oct 31, 2024

elasticsearchmachine added the v9.0.0 label Oct 31, 2024

Update docs/changelog/116026.yaml

3ee1f22

rjernst added 2 commits October 31, 2024 09:55

update changelog

57cfc32

iter

23be30b

rjernst requested review from DaveCTurner, henningandersen and javanna November 1, 2024 14:03

henningandersen reviewed Nov 6, 2024

View reviewed changes

rjernst added 3 commits November 12, 2024 15:22

revert search timeout, switch cluster event timeout to 429

54d6642

Merge branch 'main' into api/timeout_4xx

f2e60cb

adjust changelog

9641133

javanna reviewed Nov 13, 2024

View reviewed changes

server/src/main/java/org/elasticsearch/search/query/SearchTimeoutException.java Outdated Show resolved Hide resolved

rjernst added 2 commits November 13, 2024 07:49

revert javadoc too

9f564fe

Merge branch 'main' into api/timeout_4xx

41ef23c

henningandersen approved these changes Nov 13, 2024

View reviewed changes

rjernst added 3 commits November 20, 2024 14:03

Merge branch 'main' into api/timeout_4xx

5f42c83

update to 400

74953f1

update changelog

f2e45ff

rjernst changed the title ~~Change Elasticsearch timeouts to 429 response instead of 500~~ Change Elasticsearch timeouts to 400 response instead of 5xx Nov 20, 2024

rjernst requested review from a team as code owners November 20, 2024 22:39

Merge branch 'main' into api/timeout_4xx

5207e8e

rudolf reviewed Nov 22, 2024

View reviewed changes

Merge branch 'main' into api/timeout_4xx

0b88853

rjernst added 2 commits January 22, 2025 14:26

Merge branch 'main' into api/timeout_4xx

8d4cc5a

adjust changelog wording

0a42f46

Fix transforms to still retry timeouts

77b4f47

rjernst and others added 3 commits January 28, 2025 08:09

Merge branch 'main' into api/timeout_4xx

932d97d

update to 429 code

a638488

[CI] Auto commit changes from spotless

9e06b84

rjernst changed the title ~~Change Elasticsearch timeouts to 400 response instead of 5xx~~ Change Elasticsearch timeouts to 429 response instead of 5xx Jan 28, 2025

rjernst merged commit dc92c83 into elastic:main Jan 28, 2025
16 checks passed

rjernst deleted the api/timeout_4xx branch January 28, 2025 17:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Change Elasticsearch timeouts to 429 response instead of 5xx #116026

Change Elasticsearch timeouts to 429 response instead of 5xx #116026

rjernst commented Oct 31, 2024 •

edited

Loading

elasticsearchmachine commented Oct 31, 2024

elasticsearchmachine commented Oct 31, 2024

henningandersen left a comment

henningandersen Nov 6, 2024

Uh oh!

rjernst commented Nov 12, 2024

Uh oh!

henningandersen left a comment

henningandersen Nov 13, 2024

rjernst Nov 13, 2024

prdoyle commented Nov 18, 2024

rudolf Nov 22, 2024

rjernst commented Dec 27, 2024

rudolf commented Jan 8, 2025

rjernst commented Jan 8, 2025

rudolf commented Jan 12, 2025 •

edited

Loading

rjernst commented Jan 22, 2025

dpifke-elastic commented Jan 27, 2025 •

edited

Loading

DaveCTurner commented Jan 27, 2025

dpifke-elastic commented Jan 27, 2025

DaveCTurner commented Jan 27, 2025

dpifke-elastic commented Jan 27, 2025

DaveCTurner commented Jan 27, 2025

rjernst commented Jan 28, 2025

Uh oh!

simitt commented Jan 29, 2025

henningandersen commented Jan 29, 2025

leemthompo commented Apr 11, 2025

rjernst commented Apr 11, 2025

leemthompo commented Apr 14, 2025

rjernst commented Apr 21, 2025

Labels

12 participants

Change Elasticsearch timeouts to 429 response instead of 5xx #116026

Change Elasticsearch timeouts to 429 response instead of 5xx #116026

Conversation

rjernst commented Oct 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

elasticsearchmachine commented Oct 31, 2024

elasticsearchmachine commented Oct 31, 2024

henningandersen left a comment

Choose a reason for hiding this comment

henningandersen Nov 6, 2024

Choose a reason for hiding this comment

Uh oh!

rjernst commented Nov 12, 2024

Uh oh!

henningandersen left a comment

Choose a reason for hiding this comment

henningandersen Nov 13, 2024

Choose a reason for hiding this comment

rjernst Nov 13, 2024

Choose a reason for hiding this comment

prdoyle commented Nov 18, 2024

rudolf Nov 22, 2024

Choose a reason for hiding this comment

rjernst commented Dec 27, 2024

rudolf commented Jan 8, 2025

rjernst commented Jan 8, 2025

rudolf commented Jan 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

rjernst commented Jan 22, 2025

dpifke-elastic commented Jan 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

DaveCTurner commented Jan 27, 2025

dpifke-elastic commented Jan 27, 2025

DaveCTurner commented Jan 27, 2025

dpifke-elastic commented Jan 27, 2025

DaveCTurner commented Jan 27, 2025

rjernst commented Jan 28, 2025

Uh oh!

simitt commented Jan 29, 2025

henningandersen commented Jan 29, 2025

leemthompo commented Apr 11, 2025

rjernst commented Apr 11, 2025

leemthompo commented Apr 14, 2025

rjernst commented Apr 21, 2025

Labels

12 participants

rjernst commented Oct 31, 2024 •

edited

Loading

rudolf commented Jan 12, 2025 •

edited

Loading

dpifke-elastic commented Jan 27, 2025 •

edited

Loading