
Conversation

prwhelan
Member

Add es.inference.requests.time metric around infer API.

As recommended by the OTel spec, errors are determined by the presence or absence of the error.type attribute on the metric. "error.type" will be the HTTP status code (as a string) if one is available; otherwise it will be the name of the exception (e.g. NullPointerException).

Additional notes:

  • ApmInferenceStats is merged into InferenceStats. Originally we planned to have multiple implementations, but now we're only using APM.
  • Request count is now always recorded, even when there are failures loading the endpoint configuration.
  • Added a hook in streaming for cancel messages, so we can close the metrics when a user cancels the stream.
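For context, here is a minimal, self-contained sketch of how the error.type convention described above could be derived when building the metric attributes. All class, method, and parameter names here are illustrative assumptions, not the actual Elasticsearch implementation:

import java.util.HashMap;
import java.util.Map;

// Sketch: error.type is present only on failure; it is the HTTP status code (as a string)
// when one is available, otherwise the exception's class name.
final class InferenceMetricAttributes {

    static Map<String, Object> attributes(String service, String taskType, String modelId,
                                           Throwable error, Integer statusCode) {
        Map<String, Object> attrs = new HashMap<>();
        attrs.put("service", service);      // e.g. "openai"
        attrs.put("task_type", taskType);   // e.g. "completion"
        attrs.put("model_id", modelId);     // e.g. "gpt-4o-mini"
        if (statusCode != null) {
            attrs.put("status_code", statusCode);
        }
        if (error != null) {
            // OTel: the presence of error.type marks the request as failed
            attrs.put("error.type", statusCode != null
                ? String.valueOf(statusCode)
                : error.getClass().getSimpleName()); // e.g. "NullPointerException"
        }
        return attrs;
    }
}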

Example document from a local node to APM (heavily redacted):

{ "_index": ".ds-metrics-apm.app.elasticsearch-default-2024.10.25-000001", "data_stream": { "dataset": "apm.app.elasticsearch", "namespace": "default", "type": "metrics" }, "es": { "inference": { "requests": { "time": { "values": [ 6992.5 ], "counts": [ 1 ] } } } }, "labels": { "model_id": "gpt-4o-mini", "otel_instrumentation_scope_name": "elasticsearch", "service": "openai", "task_type": "completion" }, "numeric_labels": { "status_code": 200 }, ... } } 
@prwhelan prwhelan added the >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), and v9.0.0 labels on Oct 29, 2024
@elasticsearchmachine
Collaborator

Hi @prwhelan, I've created a changelog YAML for you.

@prwhelan
Member Author

@elasticmachine update branch

@prwhelan prwhelan marked this pull request as ready for review October 31, 2024 12:34
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

Contributor

@maxhniebergall maxhniebergall left a comment


LGTM, just a couple of questions

Comment on lines 128 to 129
listener.onFailure(e);
recordMetrics(model, timer, e);
Contributor


I think we need to call listener.onFailure(e) after we recordMetrics. I don't think any code after calling listener.onFailure is guaranteed to run.
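A sketch of the suggested ordering, using the names from the snippet under review: record the metric first, then complete the listener, since nothing after the listener call is guaranteed to execute.

recordMetrics(model, timer, e); // record before notifying, so the metric is not skipped
listener.onFailure(e);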

};
}

protected void onCancel() {}
Contributor


I'm curious why this isn't an abstract method. Do we not want to implement it in all subclasses?

Member Author


Most subclasses don't have any actions to take when the stream is canceled; they're mostly just forwarding the request/cancel and modifying the response payload, so I figured we should just keep onCancel as a no-op for them.
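A minimal sketch of that design, with illustrative class names (assumptions, not the actual subclasses): the hook defaults to a no-op and is overridden only where there is cleanup to do, such as closing the metrics on cancel.

abstract class DelegatingStreamProcessor {
    // Most subclasses only forward the request/cancel, so the default is a no-op.
    protected void onCancel() {}
}

class MetricsRecordingProcessor extends DelegatingStreamProcessor {
    private final Runnable closeMetrics;

    MetricsRecordingProcessor(Runnable closeMetrics) {
        this.closeMetrics = closeMetrics;
    }

    @Override
    protected void onCancel() {
        closeMetrics.run(); // close the inference metrics when the user cancels the stream
    }
}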

Contributor

@maxhniebergall maxhniebergall left a comment


LGTM

@prwhelan
Member Author

prwhelan commented Nov 5, 2024

@elasticmachine update branch

@prwhelan prwhelan merged commit 26870ef into elastic:main Nov 5, 2024
16 checks passed
jozala pushed a commit that referenced this pull request Nov 13, 2024
Add `es.inference.requests.time` metric around `infer` API.
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
Add `es.inference.requests.time` metric around `infer` API.
@jonathan-buttner
Contributor

💚 All backports created successfully

Branch: 8.x (backport created successfully)

Questions?

Please refer to the Backport tool documentation

jonathan-buttner pushed a commit to jonathan-buttner/elasticsearch that referenced this pull request Dec 13, 2024
Add `es.inference.requests.time` metric around `infer` API. (cherry picked from commit 26870ef)
maxhniebergall pushed a commit that referenced this pull request Dec 13, 2024
* [ML] Inference duration and error metrics (#115876): Add `es.inference.requests.time` metric around `infer` API. (cherry picked from commit 26870ef)
* fixing switch with class issue
Co-authored-by: Pat Whelan <pat.whelan@elastic.co>
maxhniebergall pushed a commit to maxhniebergall/elasticsearch that referenced this pull request Dec 16, 2024
…stic#118700)
* [ML] Inference duration and error metrics (elastic#115876): Add `es.inference.requests.time` metric around `infer` API. (cherry picked from commit 26870ef)
* fixing switch with class issue
Co-authored-by: Pat Whelan <pat.whelan@elastic.co>
