Correctly log exceptions that are thrown during cache prewarming #74419

tlrx · 2021-06-22T12:12:24Z

Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted.

This pull request adds log traces to warn when an exception occurred during the prewarming of a part of a file. In the issue we
encountered it would have logged something like this:

[2021-06-22T13:42:10,510][WARN ][o.e.x.s.s.SearchableSnapshotDirectory] [pop-os] [test][0] prewarming failed for file [_a.cfe] org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution	at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80) ~[elasticsearch-7.13.3-SNAPSHOT.jar:7.13.3-SNAPSHOT]	... Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262) ~[elasticsearch-7.13.3-SNAPSHOT.jar:7.13.3-SNAPSHOT]	... 21 more Caused by: java.io.IOException: Failed to prefetch file part in cache	at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:216) ~[?:?]	... 3 more Caused by: java.nio.file.NoSuchFileException: /tmp/cluster/repo/indices/vLuVOFjKRAexJYJ7h4dVXw/0/__iWOCoAtpR2-WDCZ06eOUew	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]	...

elasticmachine · 2021-06-22T12:12:27Z

Pinging @elastic/es-distributed (Team:Distributed)

henningandersen

LGTM.

henningandersen · 2021-06-23T06:49:05Z

...main/java/org/elasticsearch/xpack/searchablesnapshots/store/SearchableSnapshotDirectory.java

 } catch (IOException e) {
 logger.warn(() -> new ParameterizedMessage("{} unable to prewarm file [{}]", shardId, file.physicalName()), e);
+ if (submitted == false) {
+ completionListener.onFailure(e);


I think this attempts to ensure that the recovery state moves to DONE? I guess that is more correct than FINALIZE, but I think we should add a test for it.

It attempts to complete the grouped listener but has not other side effect besides logging a warning message if the shard is a relocation target:

elasticsearch/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/xpack/searchablesnapshots/allocation/SearchableSnapshotIndexEventListener.java

Lines 78 to 90 in a5d2251

if (success && shardRouting.isRelocationTarget()) {

final Runnable preWarmCondition = indexShard.addCleanFilesDependency();

preWarmListener.whenComplete(v -> preWarmCondition.run(), e -> {

logger.warn(

new ParameterizedMessage(

"pre-warm operation failed for [{}] while it was the target of primary relocation [{}]",

shardRouting.shardId(),

shardRouting

),

e

);

preWarmCondition.run();

});

I'm OK with leaving the recovery state in FINALIZE as this is what allowed to detect a non-fully prewarmed shard at the first place. I think that adding more logs would have helped to troubleshoot some issues when accessing repository files.

tlrx · 2021-06-28T08:19:16Z

Thanks @henningandersen - I've merged it like this, see #74419 (comment)

…stic#74419) Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted. This commit adds log traces to warn when an exception occurred during the prewarming of a part of a file.

#74606 Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted. This commit adds log traces to warn when an exception occurred during the prewarming of a part of a file. Backport of #74419

…#74604 Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted. This commit adds log traces to warn when an exception occurred during the prewarming of a part of a file. Backport of #74419

Correctly log exceptions that are thrown during cache prewarming

7cc64d2

tlrx added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.14.0 v7.13.3 labels Jun 22, 2021

elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jun 22, 2021

tlrx requested a review from original-brownbear June 22, 2021 12:55

henningandersen approved these changes Jun 23, 2021

View reviewed changes

tlrx merged commit 01d0865 into elastic:master Jun 28, 2021

tlrx deleted the log-prewarming-exceptions branch June 28, 2021 08:18

tlrx mentioned this pull request Jun 28, 2021

[7.x] Correctly log exceptions that are thrown during cache prewarming #74604

Merged

tlrx mentioned this pull request Jun 28, 2021

[7.13] Correctly log exceptions that are thrown during cache prewarming #74606

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Correctly log exceptions that are thrown during cache prewarming #74419

Correctly log exceptions that are thrown during cache prewarming #74419

Uh oh!

tlrx commented Jun 22, 2021

elasticmachine commented Jun 22, 2021

henningandersen left a comment

henningandersen Jun 23, 2021

tlrx Jun 24, 2021

tlrx commented Jun 28, 2021

Labels

4 participants

	if (success && shardRouting.isRelocationTarget()) {
	final Runnable preWarmCondition = indexShard.addCleanFilesDependency();
	preWarmListener.whenComplete(v -> preWarmCondition.run(), e -> {
	logger.warn(
	new ParameterizedMessage(
	"pre-warm operation failed for [{}] while it was the target of primary relocation [{}]",
	shardRouting.shardId(),
	shardRouting
	),
	e
	);
	preWarmCondition.run();
	});

Correctly log exceptions that are thrown during cache prewarming #74419

Correctly log exceptions that are thrown during cache prewarming #74419

Uh oh!

Conversation

tlrx commented Jun 22, 2021

elasticmachine commented Jun 22, 2021

henningandersen left a comment

Choose a reason for hiding this comment

henningandersen Jun 23, 2021

Choose a reason for hiding this comment

tlrx Jun 24, 2021

Choose a reason for hiding this comment

tlrx commented Jun 28, 2021

Labels

4 participants