Skip to content

Conversation

@tlrx
Copy link
Member

@tlrx tlrx commented Jun 22, 2021

Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted.

This pull request adds log traces to warn when an exception occurred during the prewarming of a part of a file. In the issue we
encountered it would have logged something like this:

[2021-06-22T13:42:10,510][WARN ][o.e.x.s.s.SearchableSnapshotDirectory] [pop-os] [test][0] prewarming failed for file [_a.cfe] org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution	at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80) ~[elasticsearch-7.13.3-SNAPSHOT.jar:7.13.3-SNAPSHOT]	... Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262) ~[elasticsearch-7.13.3-SNAPSHOT.jar:7.13.3-SNAPSHOT]	... 21 more Caused by: java.io.IOException: Failed to prefetch file part in cache	at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:216) ~[?:?]	... 3 more Caused by: java.nio.file.NoSuchFileException: /tmp/cluster/repo/indices/vLuVOFjKRAexJYJ7h4dVXw/0/__iWOCoAtpR2-WDCZ06eOUew	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]	... 
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jun 22, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@tlrx tlrx requested a review from original-brownbear June 22, 2021 12:55
Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

} catch (IOException e) {
logger.warn(() -> new ParameterizedMessage("{} unable to prewarm file [{}]", shardId, file.physicalName()), e);
if (submitted == false) {
completionListener.onFailure(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this attempts to ensure that the recovery state moves to DONE? I guess that is more correct than FINALIZE, but I think we should add a test for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It attempts to complete the grouped listener but has not other side effect besides logging a warning message if the shard is a relocation target:

if (success && shardRouting.isRelocationTarget()) {
final Runnable preWarmCondition = indexShard.addCleanFilesDependency();
preWarmListener.whenComplete(v -> preWarmCondition.run(), e -> {
logger.warn(
new ParameterizedMessage(
"pre-warm operation failed for [{}] while it was the target of primary relocation [{}]",
shardRouting.shardId(),
shardRouting
),
e
);
preWarmCondition.run();
});

I'm OK with leaving the recovery state in FINALIZE as this is what allowed to detect a non-fully prewarmed shard at the first place. I think that adding more logs would have helped to troubleshoot some issues when accessing repository files.

@tlrx tlrx merged commit 01d0865 into elastic:master Jun 28, 2021
@tlrx tlrx deleted the log-prewarming-exceptions branch June 28, 2021 08:18
@tlrx
Copy link
Member Author

tlrx commented Jun 28, 2021

Thanks @henningandersen - I've merged it like this, see #74419 (comment)

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jun 28, 2021
…stic#74419) Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted. This commit adds log traces to warn when an exception occurred during the prewarming of a part of a file.
tlrx added a commit to tlrx/elasticsearch that referenced this pull request Jun 28, 2021
…stic#74419) Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted. This commit adds log traces to warn when an exception occurred during the prewarming of a part of a file.
tlrx added a commit that referenced this pull request Jun 28, 2021
#74606 Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted. This commit adds log traces to warn when an exception occurred during the prewarming of a part of a file. Backport of #74419
tlrx added a commit that referenced this pull request Jun 28, 2021
…#74604 Today if an exception is thrown during the prewarming of (parts of) the files it is simply swallowed. This is confusing because the shard recovery state is left in a FINALIZE state but nothing in the logs can help troubleshooting why the recovery has been interrupted. This commit adds log traces to warn when an exception occurred during the prewarming of a part of a file. Backport of #74419
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.13.3 v7.14.0 v8.0.0-alpha1

4 participants