Skip to content

Conversation

rjernst
Copy link
Member

@rjernst rjernst commented Oct 31, 2024

When a user passes a timeout to an Elasticsearch request, the user is telling the system to limit how long the request should take. If that timeout is breached, it is because the user selected a timeout less than the time the request needs. Therefore, it is up to the user to select a long enough time. If they don't, it's on the user to adjust their request.

Given the above, a breached timeout is an issue with the user request, and the http response code should refelect that. This commit changes timeout exceptions to give a 400 response code instead of the default 5xx.

When a user passes a timeout to an Elasticsearch request, the user is telling the system to limit how long the request should take. If that timeout is breached, it is because the user selected a timeout less than the time the request needs. Therefore, it is up to the user to select a long enough time. If they don't, it's on the user to adjust their request. Given the above, a breached timeout is an issue with the user request, and the http response code should refelect that. This commit changes timeout exceptions to give a 429 response code instead of the default 500.
@rjernst rjernst added >breaking :Core/Infra/Core Core issues without another label labels Oct 31, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Oct 31, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Copy link
Collaborator

Hi @rjernst, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should include changing ProcessClusterEventTimeoutException here?

breaking:
title: Change Elasticsearch timeouts to 429 response instead of 500
area: REST API
details: When a user supplied timeout is sent with a REST request, and the request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should be explicit about this also covering cases where the timeout is not explicitly sent, rather has a default in ES like for bulk requests?

@rjernst
Copy link
Member Author

rjernst commented Nov 12, 2024

@henningandersen I believe I've addressed your comments.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

breaking:
title: Change Elasticsearch timeouts to 429 response instead of 5xx
area: REST API
details: When a REST request times out, whether via a per-request timeout, or a system default, the request now returns a 429 response code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to mention search as an exception (unless I misunderstood how that works)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try to reword to exclude search requests, but I think that begs the question why search is different. @javanna having to explain the difference both here and presumably in docs (though I don't know if/where we have any docs explaining timeouts now) really complicates this change. Can you suggest some text/explanation for why "non-search REST requests" can't use a 4xx response for timeout?

@prdoyle
Copy link
Contributor

prdoyle commented Nov 18, 2024

@elasticmachine update branch

@rjernst rjernst changed the title Change Elasticsearch timeouts to 429 response instead of 500 Change Elasticsearch timeouts to 400 response instead of 5xx Nov 20, 2024
@rjernst rjernst requested review from a team as code owners November 20, 2024 22:39
breaking:
title: Change Elasticsearch timeouts to 400 response instead of 5xx
area: REST API
details: When a REST request times out, whether via a per-request timeout, or a system default, the request now returns a 400 response code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we enumerate all the API endpoints that are impacted?

I've tried several APIs with the ?timeout param but was only able to produce two 5xx so far:

curl -v -XGET "http://localhost:9200/_cluster/health/my-index?wait_for_status=green&timeout=0s --> status code: 408 res body: {"timed_out":true, ...} curl -v -XPUT "http://localhost:9200/target-index/_mapping?timeout=0s" -H "Content-Type: application/json" -u elastic:changeme -d '{"properties": {"field3": {"type": "text"}}}' status code: 200 res body: {"acknowledged": false} curl -v -XPUT "http://localhost:9200/target-index-1?timeout=0" -H "Content-Type: application/json" -u elastic:changeme status code: 200 res body: {"acknowledged": false, "shards_acknowledged": false} curl -v -XPOST "http://localhost:9200/source-index/_clone/target-index?timeout=0" -H "Content-Type: application/json" -u elastic:changeme status code: 200 res body: {"acknowledged": false, "shards_acknowledged": false} GET /_tasks?actions=*&wait_for_completion=true&timeout=0s&pretty status code: 200 res body: { "node_failures": [ { "type": "failed_node_exception", "reason": "Failed node [01Z7rp-1QAKosLmFu2jOTQ]", "node_id": "01Z7rp-1QAKosLmFu2jOTQ", "caused_by": { "type": "receive_timeout_transport_exception", "reason": "[instance-0000000090][172.22.238.78:19761][cluster:monitor/tasks/lists[n]] request_id [396673064] timed out after [0ms]" } }, ... ] } GET /_tasks/QkCi4Xy3RTaSAaELiMSpTQ:3023?wait_for_completion=true&timeout=0ms&pretty status code: 500 res body { "error" : { "root_cause" : [ { "type" : "timeout_exception", "reason" : "timed out after [0s/0ms]" } ], "type" : "timeout_exception", "reason" : "timed out after [0s/0ms]" }, "status" : 500 } POST /test-index/_update_by_query?wait_for_completion=true&timeout=1ms&pretty status code: 500 res body: { "took" : 85, "timed_out" : false, <-- misleading if you ask me "total" : 100, "updated" : 99, "deleted" : 0, "batches" : 1, "version_conflicts" : 0, "noops" : 0, "retries" : { "bulk" : 0, "search" : 0 }, "throttled_millis" : 0, "requests_per_second" : -1.0, "throttled_until_millis" : 0, "failures" : [ { "index" : "test-index", "id" : "jIFEVJMBIY_Yr_cUNKcL", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 } ] } 

Right now as a client of Elasticsearch it's really hard to gain confidence that this breaking change isn't impacting Kibana and I suspect our users will have similar anxiety when reading the release notes.

@rjernst
Copy link
Member Author

rjernst commented Dec 27, 2024

Thanks for taking a look from the Kibana perspective, @rudolf. I think your examples demonstrate how inconsistent ES is about timeouts right now. The purpose of this PR is not to convert all timeouts to a 400 response. Instead, the intent is to convert existing 5xx timeouts to a 400. I think the changelog entry describes that, but if you have ideas for how to tweak the text to make that purpose more clear I'm happy to change it.

Given that this is only to change 5xx to 400, do you still have concerns? Your examples also show me that Kibana must already be handling all these different status codes for timeouts, right?

@rudolf
Copy link
Contributor

rudolf commented Jan 8, 2025

@rjernst To be clear, I don't have any concerns with the change itself, I just feel that the changelog entry isn't explicit enough to be useful.

It would be very valuable for Kibana and I think anyone integrating with Elasticsearch to have a list of all the endpoints whose response codes are changing as a result. Otherwise users have to test each API endpoint to discover if the behavior might have changed.

@rjernst
Copy link
Member Author

rjernst commented Jan 8, 2025

An explicit list of endpoints would be cumbersome to generate, and far too large for a breaking change entry; we have hundreds of endpoints. Many of the examples you gave are not the timeout returning 200, but a result of the request not waiting for the resulting action to complete (hence why you see acknowledged: false). I believe cluster health is something special (though I don't know why).

The intent of this change is that any timeout should no longer return 5xx. The specific response code is less important. Whether the timeout produces a 400 or 500, it's still an error, and our docs only say that a breaching the timeout will result in an error (for example, master_timeout says If the master node is not available before the timeout expires, the request fails and returns an error.).

So, IMO the changelog is accurate. We are changing timeouts that previously responded with 5xx to 400.

@rudolf
Copy link
Contributor

rudolf commented Jan 12, 2025

I'm not saying it's inaccurate, it's unhelpful to our users. What do I do after reading this breaking change entry? How do I ensure this change doesn't impact me?

@rjernst
Copy link
Member Author

rjernst commented Jan 22, 2025

@rudolf I adjusted the changelog wording a little bit, can you take a look? The message still does not give an exhaustive list of the APIs that would change. However, I narrowed the wording to be more clear: this affects most APIs, and the change for users is that if you have code looking for 5xx to retry, look for 400.

@dpifke-elastic
Copy link
Contributor

dpifke-elastic commented Jan 27, 2025

I realize I'm a little late here, but shouldn't this be 408 instead of 400?

https://datatracker.ietf.org/doc/html/rfc9110#section-15.5.9

@DaveCTurner
Copy link
Contributor

shouldn't this be 408 instead of 400?

No, definitely not. The spec for 408 reads thusly:

The 408 (Request Timeout) status code indicates that the server did not receive a complete request message within the time that it was prepared to wait.

That's the other way round from what we're discussing here: the server received a complete request and started to process it, but did not respond within the time the client indicated (in the request) that it was prepared to wait.

@dpifke-elastic
Copy link
Contributor

That's the other way round from what we're discussing here: the server received a complete request and started to process it, but did not respond within the time the client indicated (in the request) that it was prepared to wait.

400 also feels like an inexact fit, since an identical request (same timeout) may succeed at a different time when the server is less busy.

From a client perspective, it seems important to be able to distinguish a request that failed due to current load, versus a request that will never succeed because of a syntax error.

I agree that there isn't an exact match for this condition in the current HTTP spec. To me, it seems better to slightly misuse 408 than 400, since I believe Elasticsearch doesn't currently have a use for 408.

@DaveCTurner
Copy link
Contributor

400 also feels like an inexact fit

Such is life with HTTP status codes. There's not many of them, and they don't map at all well onto the semantics we want. We just have to do the best we can.

it seems important to be able to distinguish a request that failed due to current load, versus a request that will never succeed because of a syntax error.

Sure, but that distinction cannot be made using the status code alone. That's not what HTTP status codes are for. You have to look at the response body for this sort of issue.

it seems better to slightly misuse 408

It's not a slight misuse, it's completely wrong to use 408 in this context.

@dpifke-elastic
Copy link
Contributor

Sure, but that distinction cannot be made using the status code alone. That's not what HTTP status codes are for. You have to look at the response body for this sort of issue.

That distinction can be made if we leave this as 504.

@DaveCTurner
Copy link
Contributor

That distinction can be made if we leave this as 504.

We can't use a 5xx code for timeouts which are under the client's control (which should be basically all of them in practice).

@rjernst rjernst changed the title Change Elasticsearch timeouts to 400 response instead of 5xx Change Elasticsearch timeouts to 429 response instead of 5xx Jan 28, 2025
@rjernst
Copy link
Member Author

rjernst commented Jan 28, 2025

After discussion, we've agreed to go back to 429. In followups we will also add an indication in the error response body, as well as a header, that the request timed out. But for this PR, the http code is changed to 429 as a first step, signaling intent in the ES 9.0 beta that timeouts will be handled and documented more exactly in the future.

@rjernst rjernst merged commit dc92c83 into elastic:main Jan 28, 2025
16 checks passed
@rjernst rjernst deleted the api/timeout_4xx branch January 28, 2025 17:35
@simitt
Copy link
Contributor

simitt commented Jan 29, 2025

@rjernst is it guaranteed in this case that the events sent were not indexed to Elasticsearch? For at most once delievery usecases we currently retry on 429, as it was safe to assume no events were indexed, but I am not sure this holds true in this case.

@henningandersen
Copy link
Contributor

I am not sure this holds true in this case.

It should hold true also in this case. We've done preliminary checking and will do a bit more prior to actual release. We'll want to treat it as a bug if a timeout occurs and we report this back at a higher level than item level.

A couple potential conflicts is:

  1. refresh: if the refresh times out we may need to report this back to the client as a 429 timed-out. We may also choose something else for this case. Today, it should be unmodified by this change.
  2. retry on shard failure: if a shard fails we may retry, but then time out. This is also unmodified by this change. We may choose to distinguish the response code more in those cases.

All in all, this change should not affect 429 retry'ability for bulk, but subsequent work may need more care.

@leemthompo
Copy link
Contributor

@rjernst is this PR relevant to the serverless changelog? [FYI this question is based on 9.0 breaking changes]

@rjernst
Copy link
Member Author

rjernst commented Apr 11, 2025

Yes, it affects serverless, though note that in the context of serverless we do not consider this breaking.

@leemthompo
Copy link
Contributor

@rjernst thanks, could you explain why this isn't considered breaking for serverless?

@rjernst
Copy link
Member Author

rjernst commented Apr 21, 2025

It's a bit of a gray area. These requests were already errors, it's just a question of which error response code is returned. Errors are not a part of the request/response specs that make up the API surface area which we aim to maintain compatibility for. That's not to say that we freely change errors, but in this case the benefits (more clarity on the retryability, and better ability to distinguish user caused vs server side errors) made the change worthwhile. Arguably this shouldn't have been considered a breaking change in 9.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>breaking :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team v9.0.0