Transport worker thread blocked while closing shard #84602

@DaveCTurner

Description

Elasticsearch Version

7.16.3 (but the same issue appears to exist in master too)

Installed Plugins

N/A

Java Version

bundled

OS Version

N/A

Problem Description

A transport_worker thread might call IndexShard#failShard, which can block whilst waiting for other IO to complete (in the trace below, waiting for in-flight merges to abort during an IndexWriter rollback). While it is blocked, that network thread cannot process any other traffic.
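A minimal sketch of the general mitigation, assuming a hypothetical Shard type standing in for IndexShard and a plain ExecutorService standing in for a background pool (in Elasticsearch the GENERIC thread pool would be the natural candidate): fork the potentially blocking engine-failing work off the network thread instead of running it inline. This is an illustration of the idea, not the project's actual fix.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical names only: 'Shard' and 'failEngine' stand in for
// IndexShard#failShard and the Lucene rollback it triggers.
public class ForkFailShardSketch {

    interface Shard {
        // May block for a long time, e.g. waiting for in-flight merges to abort.
        void failEngine(String reason, Exception cause);
    }

    // Background pool standing in for a generic/background thread pool.
    private static final ExecutorService BACKGROUND = Executors.newCachedThreadPool();

    // Problematic pattern from this issue: invoked directly on a transport_worker
    // thread, so the network thread stalls until the engine close finishes.
    static void failShardInline(Shard shard, Exception cause) {
        shard.failEngine("primary shard failed", cause); // blocks the transport thread
    }

    // Mitigation sketch: hand the blocking work to a background executor so the
    // transport_worker can keep servicing network traffic.
    static void failShardForked(Shard shard, Exception cause) {
        BACKGROUND.execute(() -> shard.failEngine("primary shard failed", cause));
    }
}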

Steps to Reproduce

Unclear, but from the stack trace it looks like an attempt to mark a replica as failed itself failed, at which point the primary failed its own engine inline on the transport_worker thread that handled the error response.

Logs (if relevant)

   0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'elasticsearch[data-edcr-es1-03-es-iz4-hot-7][transport_worker][T#9]'
     10/10 snapshots sharing following 65 elements
       java.base@17.0.1/java.lang.Object.wait(Native Method)
       app//org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983)
       app//org.apache.lucene.index.IndexWriter.abortMerges(IndexWriter.java:2668)
       app//org.apache.lucene.index.IndexWriter.rollbackInternalNoCommit(IndexWriter.java:2425)
       app//org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2409)
       app//org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2402)
       app//org.elasticsearch.index.engine.InternalEngine.closeNoLock(InternalEngine.java:2474)
       app//org.elasticsearch.index.engine.Engine.failEngine(Engine.java:1179)
       app//org.elasticsearch.index.shard.IndexShard.failShard(IndexShard.java:1581)
       app//org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.failShard(TransportReplicationAction.java:1137)
       app//org.elasticsearch.action.support.replication.ReplicationOperation.onNoLongerPrimary(ReplicationOperation.java:337)
       app//org.elasticsearch.action.support.replication.ReplicationOperation.access$1100(ReplicationOperation.java:46)
       app//org.elasticsearch.action.support.replication.ReplicationOperation$2.lambda$onFailure$2(ReplicationOperation.java:261)
       app//org.elasticsearch.action.support.replication.ReplicationOperation$2$$Lambda$8313/0x0000000801c36800.accept(Unknown Source)
       app//org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144)
       app//org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:313)
       app//org.elasticsearch.action.ResultDeduplicator$CompositeListener.onFailure(ResultDeduplicator.java:105)
       app//org.elasticsearch.cluster.action.shard.ShardStateAction$1.handleException(ShardStateAction.java:195)
       app//org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481)
       app//org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1481)
       app//org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:368)
       app//org.elasticsearch.transport.InboundHandler$$Lambda$8185/0x0000000801dba530.run(Unknown Source)
       app//org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:285)
       app//org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:366)
       app//org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:358)
       app//org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:132)
       app//org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:88)
       app//org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:743)
       org.elasticsearch.transport.netty4.Netty4MessageChannelHandler$$Lambda$5965/0x00000008019cc630.accept(Unknown Source)
       app//org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:147)
       app//org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:119)
       app//org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:84)
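For illustration only, a hedged sketch of how a blocking call could be flagged when it runs on a network thread, keying off the "[transport_worker]" segment visible in the thread name in the hot_threads output above. This is not the project's actual assertion helper, just the idea behind such a debug-time guard.

// Hypothetical guard: Elasticsearch network threads carry "[transport_worker]"
// in their name (see the thread name in the trace above), so a debug assertion
// can detect blocking work scheduled on them.
public class TransportThreadGuard {

    static boolean onTransportWorkerThread() {
        return Thread.currentThread().getName().contains("[transport_worker]");
    }

    static void assertNotTransportWorkerThread(String reason) {
        assert onTransportWorkerThread() == false
            : "potentially blocking operation [" + reason + "] on a transport worker thread";
    }

    public static void main(String[] args) {
        // Run with -ea to enable assertions; on a non-transport thread this passes.
        assertNotTransportWorkerThread("failing engine");
        System.out.println("not on a transport_worker thread");
    }
}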
