Commit da4ba16

author
Liudmila Molkova
committed
more feedback, define error/exception
1 parent 4eb90e3 commit da4ba16

2 files changed

+93
-70
lines changed

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -42,7 +42,7 @@ release.
4242
### OTEPs
4343

4444
- [OTEP-4333](https://github.com/open-telemetry/opentelemetry-specification/pull/4333)
45-
Recording exceptions on logs.
45+
Recording exceptions and errors on logs.
4646

4747
## v1.47.0 (2025-07-18)
4848

oteps/4333-recording-exceptions-on-logs.md

Lines changed: 92 additions & 69 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,8 @@
77
* [Details](#details)
88
- [API changes](#api-changes)
99
- [Examples](#examples)
10-
* [Logging exception from client library in a user application](#logging-exception-from-client-library-in-a-user-application)
11-
* [Logging error inside the natively instrumented Library](#logging-error-inside-the-natively-instrumented-library)
10+
* [Logging errors from client library in a user application](#logging-errors-from-client-library-in-a-user-application)
11+
* [Logging errors inside the natively instrumented Library](#logging-errors-inside-the-natively-instrumented-library)
1212
* [Logging errors in messaging processor](#logging-errors-in-messaging-processor)
1313
+ [Natively instrumented library](#natively-instrumented-library)
1414
+ [Instrumentation library](#instrumentation-library)
@@ -19,84 +19,108 @@
1919

2020
<!-- tocstop -->
2121

22-
This OTEP provides guidance on how to record exceptions using OpenTelemetry logs focusing on minimizing duplication and providing context to reduce the noise.
22+
This OTEP provides guidance on how to record errors using OpenTelemetry Logs
23+
focusing on minimizing duplication and providing context to reduce the noise.
24+
25+
In the long term, errors recorded on logs **will replace span events**
26+
(according to [Event vision OTEP](./0265-event-vision.md)).
27+
28+
> [!NOTE]
29+
> Throughout the OTEP *exception* and *error* are used in the following way:
30+
> - *Error* refers to a general concept describing any non-success condition,
31+
> which may manifest as an exception, non-successful status code, or an invalid
32+
> response.
33+
> - *Exception* specifically refers to runtime exceptions and their associated stack traces.
2334
2435
## Motivation
2536

26-
Today OTel supports recording exceptions using span events available through Trace API. Outside of OTel world, exceptions are usually recorded by user apps and libraries using logging libraries and may be recorded as OTel logs via logging bridge.
37+
Today OTel supports recording *exceptions* using span events available through Trace API. Outside of OTel world,
38+
*errors* are usually recorded by user apps and libraries using logging libraries
39+
and may be recorded as OTel logs via logging bridge.
2740

28-
Exceptions recorded on logs have the following advantages over span events:
41+
Errors recorded on logs have the following advantages over span events:
2942

3043
- they can be recorded for operations that don't have any tracing instrumentation
3144
- they can be sampled along with or separately from spans
32-
- they can have different severity levels to reflect how critical the exception is
45+
- they can have different severity levels to reflect how critical the error is
3346
- they are already reported natively by many frameworks and libraries
3447

35-
Recording exceptions is essential for troubleshooting, but regardless of how exceptions are recorded, they could be noisy:
48+
Recording errors is essential for troubleshooting, but regardless of how they are recorded, they could be noisy:
3649

37-
- distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading -
38-
individual occurrence of transient errors are not necessarily indicative of a problem.
50+
- distributed applications experience transient errors at the rate proportional to their scale and
51+
errors in logs could be misleading - individual occurrence of transient errors
52+
are not necessarily indicative of a problem.
3953
- exception stack traces can be huge. Corresponding attribute value can frequently reach several KBs resulting in high costs
40-
associated with ingesting and storing them. It's also common to log exceptions multiple times while they bubble up
41-
leading to duplication and aggravating the verbosity problem.
54+
associated with ingesting and storing them. It's also common to log errors multiple times
55+
as they bubble up leading to duplication and aggravating the verbosity problem.
56+
- severity depends on the context and, in the general case, is not known when the error
57+
occurs. Errors are frequently handled (suppressed, retried, ignored) by the caller.
58+
59+
In this OTEP, we'll provide guidance around recording errors that minimizes duplication,
60+
allows reducing noise with configuration, and allows capturing errors in the
61+
absence of a recorded span.
4262

43-
In this OTEP, we'll provide guidance around recording exceptions that minimizes duplication, allows reducing noise with configuration, and
44-
allows capturing exceptions in the absence of a recorded span.
63+
This guidance applies to general-purpose instrumentations including natively
64+
instrumented libraries.
4565

46-
This guidance applies to general-purpose instrumentations including native ones. Application developers should consider following it as a
47-
starting point, but they are encouraged to adjust it to their needs.
66+
Application developers should consider following it as a starting point, but
67+
they are encouraged to adjust it to their needs.
4868

4969
## Guidance
5070

5171
This guidance boils down to the following:
5272

53-
Instrumentations SHOULD record exception information (along with other context) as a log record with appropriate severity.
54-
Only unhandled exceptions SHOULD be recorded as `Error` or higher. Instrumentations SHOULD do the best effort to report
55-
each exception once.
73+
Instrumentations SHOULD record error information along with relevant context as
74+
a log record with appropriate severity.
5675

57-
Instrumentations SHOULD provide the whole exception instance to the OTel SDK so it can
58-
record it fully or partially based on provided configuration. The default SDK behavior SHOULD
59-
be to record exception stack traces when logging exceptions at `Error` or higher severity.
76+
Instrumentations SHOULD set severity to `Error` or higher only when the log describes a
77+
problem affecting application functionality, availability, performance, security or
78+
another aspect important for this type of application.
6079

61-
In the long term, exceptions recorded on logs will replace span events (according to [Event vision OTEP](./0265-event-vision.md)).
80+
When an instrumentation records an exception, it SHOULD provide
81+
the whole exception instance to the OTel SDK so the SDK can record it fully or
82+
partially based on provided configuration. The default SDK behavior SHOULD
83+
be to record exception stack traces when logging exceptions at `Error` or higher severity.
6284

6385
### Details
6486

65-
1. Exceptions SHOULD be recorded as [logs](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
66-
or [log-based events](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/events.md)
87+
1. Errors SHOULD be recorded on [logs](https://github.com/open-telemetry/semantic-conventions/blob/v1.29.0/docs/exceptions/exceptions-logs.md)
88+
or as [log-based events](https://github.com/open-telemetry/semantic-conventions/blob/v1.29.0/docs/general/events.md)
6789

68-
2. Instrumentations for incoming requests, message processing, background job execution, or others that wrap user code and usually
69-
create local root spans, SHOULD record logs for unhandled exceptions with `Error` severity.
90+
2. Instrumentations for incoming requests, message processing, background job execution, or others that wrap application code and usually
91+
create local root spans, SHOULD record logs for unhandled errors with `Error` severity.
7092

7193
Some runtimes provide global exception handler that can be used to log exceptions.
7294
Priority should be given to the instrumentation point where the operation context is available.
7395
Language SIGs are encouraged to give runtime-specific guidance. For example, here is the
7496
[.NET guidance](https://github.com/open-telemetry/opentelemetry-dotnet/blob/610045298873397e55e0df6cd777d4901ace1f63/docs/trace/reporting-exceptions/README.md#unhandled-exception)
7597
for recording exceptions on traces.
7698

77-
3. Natively instrumented libraries SHOULD record a log describing an exception and the context it happened in
78-
as soon as the exception is detected (or where the most context is available).
99+
3. Natively instrumented libraries SHOULD record a log describing an error and the context it happened in
100+
as soon as the error is detected (or where the most context is available).
79101

80-
4. It's NOT RECOMMENDED to record the same exception as it propagates through the stack frames, or
102+
4. It's NOT RECOMMENDED to record the same error as it propagates through the call stack, or
81103
to attach the same instance of an exception to multiple log records.
82104

83-
5. An exception (or error) SHOULD be logged with appropriate severity depending on the available context.
105+
5. An error SHOULD be logged with appropriate severity depending on the available context.
84106

85-
- Exceptions or errors that don't indicate actual issues SHOULD be recorded with
107+
- Errors that don't indicate actual issues SHOULD be recorded with
86108
severity not higher than `Info`.
87109

88-
Such exceptions can be used to control application logic and have a minor impact, if any,
89-
on application functionality, availability, or performance.
110+
Such errors can be used to control application logic and have a minor impact, if any,
111+
on application functionality, availability, or performance (beyond the performance hit introduced
112+
when an exception is used to control application logic).
90113

91114
Examples:
92115

93-
- exception is thrown when checking optional dependency or resource existence.
94-
- exception thrown when client disconnects before reading full response from the server
116+
- error is returned when checking optional dependency or resource existence.
117+
- exception is thrown on the server when client disconnects before reading
118+
full response from the server
95119

96-
- Exceptions or errors that are expected to be retried or handled by the caller or another
97-
layer of the component SHOULD be recorded with severity not higher than `Warning`.
120+
- Errors that are expected to be retried or handled by the caller or another
121+
layer of the component SHOULD be recorded with severity not higher than `Warn`.
98122

99-
Such exceptions represent transient failures that are common and expected in
123+
Such errors represent transient failures that are common and expected in
100124
distributed applications. They typically increase the latency of individual
101125
operations and have a minor impact on overall application availability.
102126

@@ -108,40 +132,40 @@ In the long term, exceptions recorded on logs will replace span events (accordin
108132
- remote dependency returned 503 "Service Unavailable" response for 5 times in a row,
109133
retry attempts are exhausted and the corresponding operation has failed.
110134

111-
- Unhandled (by the user code) exceptions that don't result in application shutdown SHOULD
112-
be recorded with severity `Error`
135+
- Unhandled (by the application code) errors that don't result in application
136+
shutdown SHOULD be recorded with severity `Error`
113137

114-
These exceptions are not expected and may indicate a bug in the application logic
138+
These errors are not expected and may indicate a bug in the application logic
115139
that this application instance was not able to recover from or a gap in the error
116140
handling logic.
117141

118142
Examples:
119143

120144
- Background job terminates with an exception
121-
- HTTP framework error handler catches exception thrown by the user code.
145+
- HTTP framework error handler catches exception thrown by the application code.
122146

123147
Note: some frameworks use exceptions as a communication mechanism when request fails. For example,
124148
Spring users can throw [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html)
125149
exception to return unsuccessful status code. Such exceptions represent errors already handled by the application code.
126-
Application code, in this case, is expected to logs error at appropriate severity and
127-
general-purpose instrumentation SHOULD NOT record such exceptions.
150+
Application code, in this case, is expected to log this at appropriate severity.
151+
General-purpose instrumentation MAY record such errors, but at severity not higher than `Warn`.
128152

129-
- Exceptions or errors that result in application shutdown SHOULD be recorded with severity `Fatal`.
153+
- Errors that result in application shutdown SHOULD be recorded with severity `Fatal`.
130154

131155
- The application detects an invalid configuration at startup and shuts down.
132156
- The application encounters a (presumably) terminal error, such as an out-of-memory condition.
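
The severity tiers in item 5 can be read as a mapping from the context in which an error is observed to the highest severity an instrumentation should use. The following is a minimal sketch of that reading, not part of the OTEP: `ErrorContext` and `SeverityPolicy` are hypothetical names, and `Severity` is a local stand-in for the OTel log severity levels so the snippet compiles on its own.

```java
// Illustrative sketch only: ErrorContext and SeverityPolicy are hypothetical names,
// and Severity is a local stand-in for the OTel log severity levels.
enum Severity { INFO, WARN, ERROR, FATAL }

enum ErrorContext {
    NOT_AN_ISSUE,            // e.g. an optional resource does not exist
    EXPECTED_TO_BE_HANDLED,  // transient failure the caller is expected to retry or handle
    UNHANDLED,               // escaped application code, the process keeps running
    CAUSES_SHUTDOWN          // the application shuts down as a result
}

final class SeverityPolicy {
    private SeverityPolicy() {}

    // Returns the severity suggested by the guidance. For the first two contexts
    // the guidance says "not higher than", so the value is an upper bound.
    static Severity forContext(ErrorContext context) {
        switch (context) {
            case NOT_AN_ISSUE:
                return Severity.INFO;
            case EXPECTED_TO_BE_HANDLED:
                return Severity.WARN;
            case UNHANDLED:
                return Severity.ERROR;
            case CAUSES_SHUTDOWN:
                return Severity.FATAL;
            default:
                throw new IllegalArgumentException("Unknown context: " + context);
        }
    }
}
```

In practice the context is rarely available as a ready-made value. A library usually only knows that its caller may still handle the error, which is why the examples in this OTEP log at `WARN` or below inside client libraries.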
133157

134-
1. When recording exception on logs, user applications and instrumentations are encouraged to add additional attributes
158+
6. When recording exceptions on logs, applications and instrumentations are encouraged to add additional attributes
135159
to describe the context that the exception was thrown in.
136160
They are also encouraged to define their own error events and enrich them with exception details.
137161

138-
2. OTel SDK SHOULD record stack traces on exceptions with severity `Error` or higher and SHOULD allow users to
162+
7. OTel SDK SHOULD record stack traces on exceptions with severity `Error` or higher and SHOULD allow users to
139163
change the threshold.
140164

141165
See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that
142166
records stack trace conditionally.
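
One way an SDK could apply the configurable threshold from item 7 is sketched below. This is an assumption about a possible shape, not the SDK's actual implementation: `ExceptionAttributes` and `RecordSeverity` are hypothetical names. The attribute keys `exception.type`, `exception.message`, and `exception.stacktrace` follow the existing exception semantic conventions.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for log severity levels, ordered from least to most severe.
enum RecordSeverity { INFO, WARN, ERROR, FATAL }

// Hypothetical helper: always records the exception type and message,
// and records the stack trace only at or above the configured threshold.
final class ExceptionAttributes {
    private final RecordSeverity stackTraceThreshold;

    ExceptionAttributes(RecordSeverity stackTraceThreshold) {
        this.stackTraceThreshold = stackTraceThreshold;
    }

    Map<String, String> from(Throwable ex, RecordSeverity severity) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("exception.type", ex.getClass().getName());
        if (ex.getMessage() != null) {
            attributes.put("exception.message", ex.getMessage());
        }
        // Enum ordering doubles as severity ordering in this sketch.
        if (severity.compareTo(stackTraceThreshold) >= 0) {
            StringWriter buffer = new StringWriter();
            ex.printStackTrace(new PrintWriter(buffer, true));
            attributes.put("exception.stacktrace", buffer.toString());
        }
        return attributes;
    }
}
```

With the threshold set to `ERROR`, a record logged at `WARN` keeps the type and message but omits the stack trace, which is typically the largest attribute.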
143167

144-
3. Instrumentation libraries that record exceptions using span events SHOULD gracefully migrate
168+
8. Instrumentation libraries that record exceptions using span events SHOULD gracefully migrate
145169
to log-based exceptions offering it as an opt-in feature first and then switching to log-based exceptions
146170
in the next major version update.
147171

@@ -163,15 +187,15 @@ OTel Logs API SHOULD provide methods that enrich log record with exception detai
163187
OTel SDK, based on the log severity and configuration, SHOULD record exception details fully or partially.
164188

165189
The signature of the method is to be determined by each language
166-
and can be overloaded as appropriate including ability to collect and customize stack trace
190+
and can be overloaded as appropriate including ability to customize stack trace
167191
collection.
168192

169-
It MUST be possible to efficiently set exception information on a log record without
170-
using the `setException` method.
193+
It MUST be possible to efficiently set exception information on a log record based on configuration
194+
and without using the `setException` method.
171195

172196
## Examples
173197

174-
### Logging exception from client library in a user application
198+
### Logging errors from client library in a user application
175199

176200
```java
177201
StorageClient client = createClient(endpoint, credential);
@@ -206,9 +230,9 @@ try {
206230
}
207231
```
208232

209-
### Logging error inside the natively instrumented Library
233+
### Logging errors inside the natively instrumented Library
210234

211-
It's a common practice to record exceptions using logging libraries. Client libraries that are natively instrumented with OpenTelemetry should
235+
It's a common practice to record errors using logging libraries. Client libraries that are natively instrumented with OpenTelemetry should
212236
leverage OTel Events/Logs API for their exception logging purposes.
213237

214238
```java
@@ -223,13 +247,13 @@ public class StorageClient {
223247
}
224248

225249
logger.logRecordBuilder()
226-
// In general we don't know if it's certainly an error - we expect caller
227-
// to handle the exception and decide. So this is warning (at most).
228-
// If it remains unhandled, it'd be logged by the global handler.
250+
// In general we don't know if it's an error - we expect caller
251+
// to handle it and decide. So this is warning (at most).
252+
// If exception thrown below remains unhandled, it'd be logged by the global handler.
229253
.setSeverity(Severity.WARN)
230254
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
231255
.addAttribute(AttributeKey.stringKey("http.response.status_code"), response.statusCode())
232-
.addException(ex)
256+
.setBody("Unexpected HTTP response")
233257
.emit();
234258

235259
if (response.statusCode() == 404) {
@@ -258,6 +282,7 @@ public class NetworkClient {
258282
.setSeverity(Severity.INFO)
259283
.addAttribute("connection.id", this.getId())
260284
.addException(ex)
285+
.setBody("Failed to send content")
261286
.emit();
262287

263288
throw ex;
@@ -284,7 +309,7 @@ MessagingProcessorClient processorClient = new MessagingClientBuilder()
284309
processorClient.start();
285310
```
286311

287-
The `MessagingProcessorClient` implementation should catch exceptions thrown by the `processMessage` callback and log them similarly to
312+
The `MessagingProcessorClient` implementation should catch exceptions thrown by the `processMessage` callback and log them similarly to
288313
this example:
289314

290315
```java
@@ -298,7 +323,7 @@ try {
298323
.addKeyValuePair("messaging.message.id", context.getMessageId())
299324
...
300325
.setException(t)
301-
.log();
326+
.log("Message processing failed");
302327
// error handling logic ...
303328
}
304329
```
@@ -324,6 +349,7 @@ final class InstrumentedRecordInterceptor<K, V> implements RecordInterceptor<K,
324349
.addAttribute("messaging.message.id", record.getId())
325350
...
326351
.addException(ex)
352+
.setBody("Consumer error")
327353
.emit();
328354
// ..
329355
}
@@ -334,12 +360,13 @@ See [corresponding Java (tracing) instrumentation](https://github.com/open-telem
334360

335361
## Trade-offs and mitigations
336362

337-
1. Breaking change for any component following existing [exception guidance](/specification/trace/exceptions.md) which recommends recording exceptions as span events in every instrumentation that detects them.
363+
1. Switching from recording exceptions as span events to log records is a breaking change
364+
for any component following existing [exception guidance](/specification/trace/exceptions.md).
338365

339366
**Mitigation:**
340367
- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log-based events conversion,
341-
but that's not enough - instrumentations will have to change their behavior to report exception logs
342-
with appropriate severity (or stop reporting them).
368+
but that's not enough - instrumentations will have to change their behavior to report errors
369+
as logs with appropriate severity.
343370
- We should provide opt-in mechanism for existing instrumentations to switch to logs.
344371

345372
2. Recording exceptions as log-based events would result in UX degradation for users
@@ -355,12 +382,8 @@ Alternatives:
355382

356383
1. Deduplicate exception info by marking exception instances as logged.
357384
This can potentially mitigate the problem for existing application when it logs exceptions extensively.
358-
We should still provide optimal guidance for the greenfield applications and libraries.
359-
360-
2. Log full exception info only when exception is thrown for the first time.
361-
This results in at-most-once logging, but even this is known to be problematic since absolute
362-
majority of exceptions are handled.
363-
It also relies on the assumption that most libraries will follow this guidance.
385+
We should still provide optimal guidance for the greenfield applications and libraries,
386+
covering wider problem of recording errors.
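
For illustration, the "mark exception instances as logged" alternative could look roughly like the sketch below. It is not part of the proposal, and `LoggedExceptions` is a hypothetical name. It assumes exception classes do not override `equals`/`hashCode`, so membership is effectively by instance.

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

// Hypothetical helper that remembers which exception instances were already logged.
// Weak keys let exceptions be garbage collected once nothing else references them.
final class LoggedExceptions {
    private final Set<Throwable> logged = Collections.synchronizedSet(
        Collections.newSetFromMap(new WeakHashMap<Throwable, Boolean>()));

    // Returns true the first time an instance is seen, false on later calls.
    // Assumes the exception class keeps the default identity-based equals/hashCode.
    boolean markLogged(Throwable ex) {
        return logged.add(ex);
    }
}
```

A caller would attach exception details only when `markLogged(ex)` returns `true`. As the alternative itself notes, this only suppresses duplicates and does not help choose the right severity.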
364387

365388
## Open questions
366389
