Commit f6d4c0d

author
Liudmila Molkova
committed
add migration section, update trade-offs and mitigations, clean up
1 parent e3db414 commit f6d4c0d

1 file changed

+64
-47
lines changed

oteps/4333-recording-exceptions-on-logs.md

Lines changed: 64 additions & 47 deletions
@@ -6,6 +6,7 @@
66
- [Guidance](#guidance)
77
* [Details](#details)
88
- [API changes](#api-changes)
9+
- [Migrating instrumentations](#migrating-instrumentations)
910
- [Examples](#examples)
1011
* [Logging errors from client library in a user application](#logging-errors-from-client-library-in-a-user-application)
1112
* [Logging errors inside the natively instrumented Library](#logging-errors-inside-the-natively-instrumented-library)
@@ -19,7 +20,7 @@
1920

2021
<!-- tocstop -->
2122

22-
This OTEP provides guidance on how to record errors using OpenTelemetry Logs
23+
This OTEP provides guidance on how to record errors using OpenTelemetry Logs,
2324
focusing on minimizing duplication and providing context to reduce the noise.
2425

2526
In the long term, errors recorded on logs **will replace span events**
@@ -35,7 +36,7 @@ In the long term, errors recorded on logs **will replace span events**
3536
3637
## Motivation
3738

38-
Today OTel supports recording *exceptions* using span events available through the Trace API. Outside the OTel world,
39+
Today, OTel supports recording *exceptions* using span events available through the Trace API. Outside the OTel world,
3940
*errors* are usually recorded by user apps and libraries by using logging libraries,
4041
and may be recorded as OTel logs via a logging bridge.
4142

@@ -48,20 +49,20 @@ Using logs to record errors has the following advantages over using span events:
4849

4950
Recording errors is essential for troubleshooting, but regardless of how they are recorded, they could be noisy:
5051

51-
- distributed applications experience transient errors at the rate proportional to their scale and
52-
errors in logs could be misleading - individual occurrences of transient errors
52+
- distributed applications experience transient errors at a rate proportional to their scale, and
53+
errors in logs could be misleading. Individual occurrences of transient errors
5354
are not necessarily indicative of a problem.
54-
- exception stack traces can be huge. The corresponding attribute value can frequently reach several KBs resulting in high costs
55+
- exception stack traces can be huge. The corresponding attribute value can frequently reach several KBs, resulting in high costs
5556
associated with ingesting and storing them. It's also common to log errors multiple times
56-
as they bubble up leading to duplication and aggravating the verbosity problem.
57+
as they bubble up, leading to duplication and aggravating the verbosity problem.
5758
- severity depends on the context and, in the general case, is not known at the time the error
5859
occurs since errors are frequently handled (suppressed, retried, ignored) by the caller.
5960

6061
In this OTEP, we'll provide guidance around recording errors that minimizes duplication,
6162
allows reducing noise with configuration, and allows capturing errors in the
6263
absence of a recorded span.
6364

64-
This guidance applies to general-purpose instrumentations including natively
65+
This guidance applies to general-purpose instrumentations, including natively
6566
instrumented libraries.
6667

6768
Application developers should consider following it as a starting point, but
@@ -75,7 +76,7 @@ Instrumentations SHOULD record error information along with relevant context as
7576
a log record with appropriate severity.
7677

7778
Instrumentations SHOULD set severity to `Error` or higher only when the log describes a
78-
problem affecting application functionality, availability, performance, security or
79+
problem affecting application functionality, availability, performance, security, or
7980
another aspect that is important for the given type of application.
8081

8182
When instrumentation records an exception, it SHOULD provide
@@ -109,14 +110,14 @@ be to record exception stack traces when logging exceptions at `Error` or higher
109110
severity not higher than `Info`.
110111

111112
Such errors can be used to control application logic and have a minor impact, if any,
112-
on application functionality, availability, or performance (beyond performance hit introduced
113-
if exception is used to control application logic).
113+
on application functionality, availability, or performance (beyond the performance hit introduced
114+
if an exception is used to control application logic).
114115

115116
Examples:
116117

117118
- an error is returned when checking optional dependency or resource existence.
118-
- an exception is thrown on the server when client disconnects before reading
119-
full response from the server
119+
- an exception is thrown on the server when the client disconnects before reading
120+
the full response from the server.
120121

121122
- Errors that are expected to be retried or handled by the caller or another
122123
layer of the component SHOULD be recorded with severity not higher than `Warn`.
@@ -127,11 +128,11 @@ be to record exception stack traces when logging exceptions at `Error` or higher
127128

128129
Examples:
129130

130-
- an attempt to connect to the required remote dependency times out
131-
- a remote dependency returns 401 "Unauthorized" response code
132-
- writing data to a file results in an IO exception
133-
- a remote dependency returned 503 "Service Unavailable" response for 5 times in a row,
134-
retry attempts are exhausted and the corresponding operation has failed.
131+
- an attempt to connect to the required remote dependency times out.
132+
- a remote dependency returns a 401 "Unauthorized" response code.
133+
- writing data to a file results in an IO exception.
134+
- a remote dependency returned a 503 "Service Unavailable" response 5 times in a row,
135+
retry attempts are exhausted, and the corresponding operation has failed.
135136

136137
- Unhandled (by the application code) errors that don't result in application
137138
shutdown SHOULD be recorded with severity `Error`
@@ -148,8 +149,8 @@ be to record exception stack traces when logging exceptions at `Error` or higher
148149
Note: some frameworks use exceptions as a communication mechanism when a request fails. For example,
149150
Spring users can throw a [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html)
150151
to return an unsuccessful status code. Such exceptions represent errors already handled by the application code.
151-
Application code, in this case, is expected to log this at appropriate severity.
152-
General-purpose instrumentation MAY record such errors, but at severity not higher than `Warn`.
152+
Application code, in this case, is expected to log this at the appropriate severity.
153+
General-purpose instrumentation MAY record such errors, but at a severity not higher than `Warn`.
153154

154155
- Errors that result in application shutdown SHOULD be recorded with severity `Fatal`.
155156

@@ -169,33 +170,53 @@ be to record exception stack traces when logging exceptions at `Error` or higher
169170
records stack trace conditionally.
170171

171172
8. Instrumentation libraries that record exceptions using span events SHOULD gracefully migrate
172-
to log-based exceptions offering it as an opt-in feature first and then switching to log-based exceptions
173+
to log-based exceptions, offering it as an opt-in feature first and then switching to log-based exceptions
173174
in the next major version update.
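
The severity rules in the list above can be read as a lookup from how an error ends up being handled to the highest severity an instrumentation would use. The following is a minimal sketch of that reading, not part of the OTEP: `Sev`, `ErrorOutcome`, and `SeverityGuidance` are hypothetical names, and `Sev` only stands in for the OTel severity levels so that the snippet has no dependencies.

```java
// Stand-in for the severity levels named in the guidance (hypothetical type,
// not the OTel API class).
enum Sev { INFO, WARN, ERROR, FATAL }

// Hypothetical classification of how an error is ultimately handled.
enum ErrorOutcome {
    EXPECTED_MINOR_IMPACT,         // e.g. optional resource does not exist
    RETRIED_OR_HANDLED_BY_CALLER,  // e.g. a connection attempt that will be retried
    UNHANDLED_NO_SHUTDOWN,         // unhandled by application code, app keeps running
    CAUSES_SHUTDOWN                // error that terminates the application
}

final class SeverityGuidance {
    private SeverityGuidance() {}

    // Returns the highest severity the guidance allows for the given outcome.
    // For the first two outcomes this is an upper bound ("not higher than");
    // for the last two it is the severity the guidance names directly.
    static Sev highestSeverityFor(ErrorOutcome outcome) {
        switch (outcome) {
            case EXPECTED_MINOR_IMPACT:
                return Sev.INFO;
            case RETRIED_OR_HANDLED_BY_CALLER:
                return Sev.WARN;
            case UNHANDLED_NO_SHUTDOWN:
                return Sev.ERROR;
            case CAUSES_SHUTDOWN:
                return Sev.FATAL;
            default:
                throw new IllegalArgumentException("Unknown outcome: " + outcome);
        }
    }
}
```

An instrumentation that cannot tell which outcome applies at the time the error occurs would typically pick the lower of the plausible severities and let the caller or a global handler log at a higher one.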
174175

175176
## API changes
176177

177178
> [!NOTE]
178179
>
179-
> It should not be an instrumentation concern to decide whether exception stack trace
180+
> It should not be an instrumentation concern to decide whether an exception stack trace
180181
> should be recorded or not.
181182
>
182-
> A natively instrumented library may write logs providing exception instance
183+
> A natively instrumented library may write logs providing an exception instance
183184
> through a log bridge and not be aware of this guidance.
184185
>
185186
> It also may be desirable for some vendors/apps to record all exception details at all levels.
186187
187-
OTel Logs API SHOULD provide methods that enrich log record with exception details such as
188-
`setException(exception)` and similar to [RecordException](../specification/trace/api.md#record-exception) method on span.
188+
The OTel Logs API SHOULD provide methods that enrich log records with exception details such as
189+
`setException(exception)`, similar to the [RecordException](../specification/trace/api.md#record-exception) method on span.
189190

190-
OTel SDK, based on the log severity and configuration, SHOULD record exception details fully or partially.
191+
The OTel SDK, based on the log severity and configuration, SHOULD record exception details fully or partially.
191192

192193
The signature of the method is to be determined by each language
193-
and can be overloaded as appropriate including ability to customize stack trace
194+
and can be overloaded as appropriate, including the ability to customize stack trace
194195
collection.
195196

196197
It MUST be possible to efficiently set exception information on a log record based on configuration
197198
and without using the `setException` method.
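
One way to picture "record exception details fully or partially" is an SDK-side policy that always keeps the exception type and message and adds the stack trace only at or above a configurable severity. The sketch below is an illustration under assumptions, not the SDK design: `Level` and `ExceptionDetailPolicy` are hypothetical names, and the threshold is left to the caller because the OTEP does not fix one in the text shown here. The attribute keys follow the existing `exception.*` semantic conventions.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for log severity (hypothetical type, not the OTel API class).
enum Level { INFO, WARN, ERROR, FATAL }

// Hypothetical policy: decides how much exception detail ends up on a log record.
final class ExceptionDetailPolicy {
    private final Level stackTraceThreshold;

    ExceptionDetailPolicy(Level stackTraceThreshold) {
        this.stackTraceThreshold = stackTraceThreshold;
    }

    // Builds the exception attributes for a record logged at the given severity.
    Map<String, String> attributesFor(Throwable error, Level severity) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("exception.type", error.getClass().getName());
        if (error.getMessage() != null) {
            attrs.put("exception.message", error.getMessage());
        }
        // The stack trace is the expensive part, so it is added only when the
        // record's severity reaches the configured threshold.
        if (severity.compareTo(stackTraceThreshold) >= 0) {
            StringWriter buffer = new StringWriter();
            error.printStackTrace(new PrintWriter(buffer, true));
            attrs.put("exception.stacktrace", buffer.toString());
        }
        return attrs;
    }
}
```

Keeping this decision in the SDK rather than in the instrumentation matches the note above: the instrumentation hands over the exception instance and the severity, and configuration decides what is recorded.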
198199

200+
## Migrating instrumentations
201+
202+
> [!NOTE]
203+
> New instrumentations or existing ones that do not record exceptions on span events SHOULD
204+
> NOT start recording exceptions on span events. They SHOULD NOT implement the migration plan
205+
> described below.
206+
>
207+
> This section covers migration recommendations for existing instrumentations that already
208+
> report exceptions using span events.
209+
210+
We will define a configuration option to let users choose if they want instrumentations to record exceptions
211+
on span events or logs.
212+
213+
Specific instrumentation SHOULD default to recording exceptions on span events in its current major version
214+
and record them on logs only when the user opts in.
215+
216+
In the next major version, this instrumentation SHOULD stop recording exceptions on span events.
217+
218+
This is a simplified version of [stability opt-in migration](https://github.com/open-telemetry/semantic-conventions/blob/727700406f9e6cc3f4e4680a81c4c28f2eb71569/docs/http/README.md?plain=1#L13-L37) used in semantic conventions.
219+
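
The migration steps above could be implemented inside an existing instrumentation roughly as follows. This is a hedged sketch: the configuration option is not yet defined by the OTEP, so the opt-in value `"logs"` and the names `ExceptionTarget` and `ExceptionTargetResolver` are assumptions made only for illustration.

```java
import java.util.Locale;

// Where an instrumentation records exceptions (hypothetical type).
enum ExceptionTarget { SPAN_EVENTS, LOGS }

final class ExceptionTargetResolver {
    // Hypothetical opt-in value; the actual option name and values are not defined yet.
    static final String LOGS_OPT_IN = "logs";

    private ExceptionTargetResolver() {}

    // optInValue: raw value of the (hypothetical) user configuration, may be null.
    // isNextMajorVersion: true once the instrumentation has shipped its next major version.
    static ExceptionTarget resolve(String optInValue, boolean isNextMajorVersion) {
        if (isNextMajorVersion) {
            // After the major version bump, span events are no longer used.
            return ExceptionTarget.LOGS;
        }
        if (optInValue != null
                && optInValue.trim().toLowerCase(Locale.ROOT).equals(LOGS_OPT_IN)) {
            // The user explicitly opted in to log-based exceptions.
            return ExceptionTarget.LOGS;
        }
        // Current major version without opt-in: keep the existing behavior.
        return ExceptionTarget.SPAN_EVENTS;
    }
}
```

New instrumentations would not need this switch at all, since they are expected to record exceptions on logs from the start.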
199220
## Examples
200221

201222
### Logging errors from client library in a user application
@@ -236,7 +257,7 @@ try {
236257
### Logging errors inside the natively instrumented Library
237258

238259
It's a common practice to record errors using logging libraries. Client libraries that are natively instrumented with OpenTelemetry should
239-
leverage OTel Events/Logs API for their exception logging purposes.
260+
leverage the OTel Events/Logs API for their exception logging purposes.
240261

241262
```java
242263
public class StorageClient {
@@ -250,9 +271,9 @@ public class StorageClient {
250271
}
251272

252273
logger.logRecordBuilder()
253-
// In general we don't know if it's an error - we expect caller
254-
// to handle it and decide. So this is warning (at most).
255-
// If exception thrown below remains unhandled, it'd be logged by the global handler.
274+
// In general we don't know if it's an error - we expect the caller
275+
// to handle it and decide. So this is a warning (at most).
276+
// If the exception thrown below remains unhandled, it'd be logged by the global handler.
256277
.setSeverity(Severity.WARN)
257278
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
258279
.addAttribute(AttributeKey.stringKey("http.response.status_code"), response.statusCode())
@@ -268,7 +289,7 @@ public class StorageClient {
268289
}
269290
```
270291

271-
Network level errors are part of normal life, we should consider using low severity for them
292+
Network-level errors are part of normal life; we should consider using low severity for them.
272293

273294
```java
274295
public class NetworkClient {
@@ -281,7 +302,7 @@ public class NetworkClient {
281302
} catch (SocketException ex) {
282303
logger.logRecordBuilder()
283304
// we'll retry it, so it's info or lower.
284-
// we'll write a warn for overall operation if retries are exhausted.
305+
// we'll write a warn for the overall operation if retries are exhausted.
285306
.setSeverity(Severity.INFO)
286307
.addAttribute("connection.id", this.getId())
287308
.addException(ex)
@@ -318,9 +339,9 @@ this example:
318339
```java
319340
MessageContext context = retrieveNext();
320341
try {
321-
processMessage.accept(context)
342+
processMessage.accept(context);
322343
} catch (Throwable t) {
323-
// This natively instrumented library may use OTel log API or another logging library such as slf4j.
344+
// This natively instrumented library may use the OTel log API or another logging library such as slf4j.
324345
// Here we use Error severity since this exception was not handled by the application code.
325346
logger.atError()
326347
.addKeyValuePair("messaging.message.id", context.getMessageId())
@@ -336,7 +357,7 @@ span.
336357

337358
#### Instrumentation library
338359

339-
In this example we leverage Spring Kafka `RecordInterceptor` extensibility point that allows to
360+
In this example, we leverage the Spring Kafka `RecordInterceptor` extensibility point that allows us to
340361
listen to exceptions that remained unhandled.
341362

342363
```java
@@ -366,27 +387,23 @@ See [corresponding Java (tracing) instrumentation](https://github.com/open-telem
366387
1. Switching from recording exceptions as span events to log records is a breaking change
367388
for any component following existing [exception guidance](/specification/trace/exceptions.md).
368389

369-
**Mitigation:**
370-
- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log-based events conversion,
371-
but that's not enough - instrumentations will have to change their behavior to report errors
372-
as logs with appropriate severity.
373-
- We should provide opt-in mechanism for existing instrumentations to switch to logs.
374-
375390
2. Recording exceptions as log-based events would result in UX degradation for users
376391
leveraging trace-only backends such as Jaeger.
377392

378-
**Mitigation:**
379-
- OpenTelemetry API and/or SDK may provide span events -> log events conversion.
380-
See also [Event vision OTEP](./0265-event-vision.md#relationship-to-span-events).
393+
**Mitigation:**
394+
395+
In addition to the plan outlined in the [Migration](#migrating-instrumentations) section, we
396+
should provide opt-in [log <-> span events conversion](https://github.com/open-telemetry/opentelemetry-specification/issues/4393)
397+
following the [Event vision OTEP](./0265-event-vision.md#relationship-to-span-events).
381398

382399
## Prior art and alternatives
383400

384401
Alternatives:
385402

386403
1. Deduplicate exception info by marking exception instances as logged.
387-
This can potentially mitigate the problem for existing application when it logs exceptions extensively.
388-
We should still provide optimal guidance for the greenfield applications and libraries,
389-
covering wider problem of recording errors.
404+
This can potentially mitigate the problem for existing applications when they log exceptions extensively.
405+
We should still provide optimal guidance for greenfield applications and libraries,
406+
covering the wider problem of recording errors.
390407

391408
## Open questions
392409
