- Notifications
You must be signed in to change notification settings - Fork 25.5k
Change Elasticsearch timeouts to 429 response instead of 5xx #116026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When a user passes a timeout to an Elasticsearch request, the user is telling the system to limit how long the request should take. If that timeout is breached, it is because the user selected a timeout less than the time the request needs. Therefore, it is up to the user to select a long enough time. If they don't, it's on the user to adjust their request. Given the above, a breached timeout is an issue with the user request, and the http response code should refelect that. This commit changes timeout exceptions to give a 429 response code instead of the default 500.
Pinging @elastic/es-core-infra (Team:Core/Infra) |
Hi @rjernst, I've created a changelog YAML for you. Note that since this PR is labelled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should include changing ProcessClusterEventTimeoutException
here?
docs/changelog/116026.yaml Outdated
breaking: | ||
title: Change Elasticsearch timeouts to 429 response instead of 500 | ||
area: REST API | ||
details: When a user supplied timeout is sent with a REST request, and the request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should be explicit about this also covering cases where the timeout is not explicitly sent, rather has a default in ES like for bulk requests?
server/src/main/java/org/elasticsearch/search/query/SearchTimeoutException.java Show resolved Hide resolved
@henningandersen I believe I've addressed your comments. |
server/src/main/java/org/elasticsearch/search/query/SearchTimeoutException.java Outdated Show resolved Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
docs/changelog/116026.yaml Outdated
breaking: | ||
title: Change Elasticsearch timeouts to 429 response instead of 5xx | ||
area: REST API | ||
details: When a REST request times out, whether via a per-request timeout, or a system default, the request now returns a 429 response code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need to mention search as an exception (unless I misunderstood how that works)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try to reword to exclude search requests, but I think that begs the question why search is different. @javanna having to explain the difference both here and presumably in docs (though I don't know if/where we have any docs explaining timeouts now) really complicates this change. Can you suggest some text/explanation for why "non-search REST requests" can't use a 4xx response for timeout?
@elasticmachine update branch |
docs/changelog/116026.yaml Outdated
breaking: | ||
title: Change Elasticsearch timeouts to 400 response instead of 5xx | ||
area: REST API | ||
details: When a REST request times out, whether via a per-request timeout, or a system default, the request now returns a 400 response code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we enumerate all the API endpoints that are impacted?
I've tried several APIs with the ?timeout
param but was only able to produce two 5xx so far:
curl -v -XGET "http://localhost:9200/_cluster/health/my-index?wait_for_status=green&timeout=0s --> status code: 408 res body: {"timed_out":true, ...} curl -v -XPUT "http://localhost:9200/target-index/_mapping?timeout=0s" -H "Content-Type: application/json" -u elastic:changeme -d '{"properties": {"field3": {"type": "text"}}}' status code: 200 res body: {"acknowledged": false} curl -v -XPUT "http://localhost:9200/target-index-1?timeout=0" -H "Content-Type: application/json" -u elastic:changeme status code: 200 res body: {"acknowledged": false, "shards_acknowledged": false} curl -v -XPOST "http://localhost:9200/source-index/_clone/target-index?timeout=0" -H "Content-Type: application/json" -u elastic:changeme status code: 200 res body: {"acknowledged": false, "shards_acknowledged": false} GET /_tasks?actions=*&wait_for_completion=true&timeout=0s&pretty status code: 200 res body: { "node_failures": [ { "type": "failed_node_exception", "reason": "Failed node [01Z7rp-1QAKosLmFu2jOTQ]", "node_id": "01Z7rp-1QAKosLmFu2jOTQ", "caused_by": { "type": "receive_timeout_transport_exception", "reason": "[instance-0000000090][172.22.238.78:19761][cluster:monitor/tasks/lists[n]] request_id [396673064] timed out after [0ms]" } }, ... ] } GET /_tasks/QkCi4Xy3RTaSAaELiMSpTQ:3023?wait_for_completion=true&timeout=0ms&pretty status code: 500 res body { "error" : { "root_cause" : [ { "type" : "timeout_exception", "reason" : "timed out after [0s/0ms]" } ], "type" : "timeout_exception", "reason" : "timed out after [0s/0ms]" }, "status" : 500 } POST /test-index/_update_by_query?wait_for_completion=true&timeout=1ms&pretty status code: 500 res body: { "took" : 85, "timed_out" : false, <-- misleading if you ask me "total" : 100, "updated" : 99, "deleted" : 0, "batches" : 1, "version_conflicts" : 0, "noops" : 0, "retries" : { "bulk" : 0, "search" : 0 }, "throttled_millis" : 0, "requests_per_second" : -1.0, "throttled_until_millis" : 0, "failures" : [ { "index" : "test-index", "id" : "jIFEVJMBIY_Yr_cUNKcL", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 } ] }
Right now as a client of Elasticsearch it's really hard to gain confidence that this breaking change isn't impacting Kibana and I suspect our users will have similar anxiety when reading the release notes.
Thanks for taking a look from the Kibana perspective, @rudolf. I think your examples demonstrate how inconsistent ES is about timeouts right now. The purpose of this PR is not to convert all timeouts to a 400 response. Instead, the intent is to convert existing 5xx timeouts to a 400. I think the changelog entry describes that, but if you have ideas for how to tweak the text to make that purpose more clear I'm happy to change it. Given that this is only to change 5xx to 400, do you still have concerns? Your examples also show me that Kibana must already be handling all these different status codes for timeouts, right? |
@rjernst To be clear, I don't have any concerns with the change itself, I just feel that the changelog entry isn't explicit enough to be useful. It would be very valuable for Kibana and I think anyone integrating with Elasticsearch to have a list of all the endpoints whose response codes are changing as a result. Otherwise users have to test each API endpoint to discover if the behavior might have changed. |
An explicit list of endpoints would be cumbersome to generate, and far too large for a breaking change entry; we have hundreds of endpoints. Many of the examples you gave are not the timeout returning 200, but a result of the request not waiting for the resulting action to complete (hence why you see acknowledged: false). I believe cluster health is something special (though I don't know why). The intent of this change is that any timeout should no longer return 5xx. The specific response code is less important. Whether the timeout produces a 400 or 500, it's still an error, and our docs only say that a breaching the timeout will result in an error (for example, master_timeout says So, IMO the changelog is accurate. We are changing timeouts that previously responded with 5xx to 400. |
I'm not saying it's inaccurate, it's unhelpful to our users. What do I do after reading this breaking change entry? How do I ensure this change doesn't impact me? |
@rudolf I adjusted the changelog wording a little bit, can you take a look? The message still does not give an exhaustive list of the APIs that would change. However, I narrowed the wording to be more clear: this affects most APIs, and the change for users is that if you have code looking for 5xx to retry, look for 400. |
I realize I'm a little late here, but shouldn't this be 408 instead of 400? https://datatracker.ietf.org/doc/html/rfc9110#section-15.5.9 |
No, definitely not. The spec for 408 reads thusly:
That's the other way round from what we're discussing here: the server received a complete request and started to process it, but did not respond within the time the client indicated (in the request) that it was prepared to wait. |
400 also feels like an inexact fit, since an identical request (same timeout) may succeed at a different time when the server is less busy. From a client perspective, it seems important to be able to distinguish a request that failed due to current load, versus a request that will never succeed because of a syntax error. I agree that there isn't an exact match for this condition in the current HTTP spec. To me, it seems better to slightly misuse 408 than 400, since I believe Elasticsearch doesn't currently have a use for 408. |
Such is life with HTTP status codes. There's not many of them, and they don't map at all well onto the semantics we want. We just have to do the best we can.
Sure, but that distinction cannot be made using the status code alone. That's not what HTTP status codes are for. You have to look at the response body for this sort of issue.
It's not a slight misuse, it's completely wrong to use 408 in this context. |
That distinction can be made if we leave this as 504. |
We can't use a 5xx code for timeouts which are under the client's control (which should be basically all of them in practice). |
After discussion, we've agreed to go back to 429. In followups we will also add an indication in the error response body, as well as a header, that the request timed out. But for this PR, the http code is changed to 429 as a first step, signaling intent in the ES 9.0 beta that timeouts will be handled and documented more exactly in the future. |
@rjernst is it guaranteed in this case that the events sent were not indexed to Elasticsearch? For at most once delievery usecases we currently retry on |
It should hold true also in this case. We've done preliminary checking and will do a bit more prior to actual release. We'll want to treat it as a bug if a timeout occurs and we report this back at a higher level than item level. A couple potential conflicts is:
All in all, this change should not affect 429 retry'ability for bulk, but subsequent work may need more care. |
@rjernst is this PR relevant to the serverless changelog? [FYI this question is based on 9.0 breaking changes] |
Yes, it affects serverless, though note that in the context of serverless we do not consider this breaking. |
@rjernst thanks, could you explain why this isn't considered breaking for serverless? |
It's a bit of a gray area. These requests were already errors, it's just a question of which error response code is returned. Errors are not a part of the request/response specs that make up the API surface area which we aim to maintain compatibility for. That's not to say that we freely change errors, but in this case the benefits (more clarity on the retryability, and better ability to distinguish user caused vs server side errors) made the change worthwhile. Arguably this shouldn't have been considered a breaking change in 9.0. |
When a user passes a timeout to an Elasticsearch request, the user is telling the system to limit how long the request should take. If that timeout is breached, it is because the user selected a timeout less than the time the request needs. Therefore, it is up to the user to select a long enough time. If they don't, it's on the user to adjust their request.
Given the above, a breached timeout is an issue with the user request, and the http response code should refelect that. This commit changes timeout exceptions to give a 400 response code instead of the default 5xx.