Skip to content

Quorum queues can enter a state it cannot recover from due to a timeout #13828

@michaelklishin

Description

@michaelklishin

Moving back to an issue because @kjnilsson has identified something to address regardless of the quorum queue churn (a workload they were never designed for).

Originally filed by @matthew-s-walker.

Discussed in #13827

Originally posted by matthew-s-walker April 29, 2025

Describe the bug

Hi,

Firstly, I want to thank you for your work on RabbitMQ. It has been a rock solid core component of our system for many years.

We migrated all of our queues to the quorum queue type recently but have unfortunately encountered stability problems in our production environment.
Our system creates temporary queues, often up to 50 across 1 second or so, and totalling roughly 20000 per day.

After migrating, we found that within a few hours some queues (typically several created at similar times) will go into a state where:

  • Publishing to the queue fails
  • Consumers from the queue think that they are consuming but are not shown as consuming in the management UI, or receive internal errors attempting to consume.
    The issue either occurs immediately after/during creation or within 2-3 minutes of creation.

We can reproduce the behaviour on the following versions of RabbitMQ, but the errors logged by the servers are different in at least 4.1.0:

  • 3.12.14
  • 3.13.7
  • 4.0.1
  • 4.1.0

On 4.0.1 and below, we receive various "badmatch" "timeout" errors, which I can provide if wanted.

Our cluster setup is:

  • Kubernetes (EKS) with the cluster and topology operators
  • 3 nodes, 2 CPUs allocated, 4Gi memory
  • 10Gi EBS volumes (GP2)

Typical cluster load is < 1000 total queues, < 500 total messages per second. The vast majority of messages are < 4KiB.

The issue reproduces with:

  • Both continuous_membership_reconciliation enabled and disabled
  • channel_operation_timeout set to 60000 and set to the default
  • 3 node and single node clusters

Here is an example of a queue going into a bad state with 4.1.0 (I am happy to provide logs from earlier versions as well):

2025-04-23 10:18:48.261339+00:00 [debug] <0.5779956.0> Will start up to 3 replicas for quorum queue queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido' with leader on node 'rabbit@rabbitmq-server-0.rabbitmq-nodes.xxxxx', initial machine version 5 2025-04-23 10:18:48.628726+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20 2025-04-23 10:18:48.629252+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20 2025-04-23 10:18:48.629809+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** Generic server <0.5779956.0> terminating 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** Last message in was {'$gen_cast', 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {method, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {'queue.declare',0, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> <<"91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9">>, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> false,true,false,false,false, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [{<<"x-expires">>,signedint,7200000}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {<<"x-queue-type">>,longstr, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> <<"quorum">>}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> none,noflow}} 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** When Server state == {ch, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {conf,running,rabbit_framing_amqp_0_9_1,1, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> <0.5779924.0>,<0.5779953.0>,<0.5779924.0>, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> <<"10.47.104.51:3215 -> 10.47.88.201:5672">>, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {user,<<"xxxxxxx">>,[], 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [{rabbit_auth_backend_oauth2, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> #Fun<rabbit_auth_backend_oauth2.5.36627292>}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> <<"lido">>,<<>>,<0.5779934.0>, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [{<<"connection.blocked">>,bool,false}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {<<"consumer_cancel_notify">>,bool,true}], 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> none,0,1800000,#{},infinity,1000000000}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {lstate,<0.5779955.0>,false}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> none,1, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {0,[],[]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> undefined,#{},#{}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {state,fine,5000, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> #Ref<0.1001428202.3157000194.108089>}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> true,1, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {rabbit_confirms,undefined,#{}}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [],[],none,flow,[], 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {rabbit_queue_type,#{}}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> #Ref<0.1001428202.3157000194.108085>,false} 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** Reason for termination == 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** {{exception,partition_parallel_timeout}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [{erpc,call,5,[{file,"erpc.erl"},{line,264}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {rabbit_queue_type_util,erpc_call,5, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [{file,"rabbit_queue_type_util.erl"},{line,81}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {rabbit_quorum_queue,start_cluster,1, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [{file,"rabbit_quorum_queue.erl"},{line,283}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {rabbit_channel,handle_method,6, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [{file,"rabbit_channel.erl"},{line,2437}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {rabbit_channel,handle_method,3, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> [{file,"rabbit_channel.erl"},{line,1564}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {rabbit_channel,handle_cast,2,[{file,"rabbit_channel.erl"},{line,615}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1056}]}, 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]} 2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> crasher: 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> initial call: rabbit_channel:init/1 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> pid: <0.5779956.0> 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> registered_name: [] 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> exception exit: {{exception,partition_parallel_timeout}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{erpc,call,5,[{file,"erpc.erl"},{line,264}]}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {rabbit_queue_type_util,erpc_call,5, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{file,"rabbit_queue_type_util.erl"},{line,81}]}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {rabbit_quorum_queue,start_cluster,1, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{file,"rabbit_quorum_queue.erl"},{line,283}]}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {rabbit_channel,handle_method,6, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{file,"rabbit_channel.erl"},{line,2437}]}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {rabbit_channel,handle_method,3, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{file,"rabbit_channel.erl"},{line,1564}]}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {rabbit_channel,handle_cast,2, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{file,"rabbit_channel.erl"},{line,615}]}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {gen_server2,handle_msg,2, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{file,"gen_server2.erl"},{line,1056}]}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {proc_lib,init_p_do_apply,3, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{file,"proc_lib.erl"},{line,329}]}]} 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> in function gen_server2:terminate/3 (gen_server2.erl, line 1172) 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> ancestors: [<0.5779951.0>,<0.5779948.0>,<0.5779909.0>,<0.5779925.0>, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> <0.874.0>,<0.873.0>,<0.872.0>,<0.870.0>,<0.869.0>, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> rabbit_sup,<0.216.0>] 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> message_queue_len: 5 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> messages: [{<0.5780268.0>,true}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {'DOWN',#Ref<0.1001428202.3157000194.108819>,process, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> <0.5780268.0>,normal}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> emit_stats,emit_gen_server2_stats,tick] 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> links: [<0.5779951.0>] 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> dictionary: [{rand_seed, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {#{max => 288230376151711743,type => exsplus, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> next => #Fun<rand.5.40079776>, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> jump => #Fun<rand.3.40079776>}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [161353397928511601|287143983225069820]}}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {permission_cache, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> [{{resource,<<"lido">>,queue, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> <<"91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9">>}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> #{},configure}]}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {channel_operation_timeout,60000}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {permission_cache_can_expire,true}, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {process_name, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {rabbit_channel, 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> {<<"10.47.104.51:3215 -> 10.47.88.201:5672">>,1}}}] 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> trap_exit: true 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> status: running 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> heap_size: 10958 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> stack_size: 29 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> reductions: 54575 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> neighbours: 2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> supervisor: {<0.5779951.0>,rabbit_channel_sup} 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> errorContext: child_terminated 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> reason: {{exception,partition_parallel_timeout}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{erpc,call,5,[{file,"erpc.erl"},{line,264}]}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {rabbit_queue_type_util,erpc_call,5, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{file,"rabbit_queue_type_util.erl"}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {line,81}]}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {rabbit_quorum_queue,start_cluster,1, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{file,"rabbit_quorum_queue.erl"}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {line,283}]}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {rabbit_channel,handle_method,6, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{file,"rabbit_channel.erl"},{line,2437}]}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {rabbit_channel,handle_method,3, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{file,"rabbit_channel.erl"},{line,1564}]}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {rabbit_channel,handle_cast,2, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{file,"rabbit_channel.erl"},{line,615}]}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {gen_server2,handle_msg,2, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{file,"gen_server2.erl"},{line,1056}]}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {proc_lib,init_p_do_apply,3, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{file,"proc_lib.erl"},{line,329}]}]} 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> offender: [{pid,<0.5779956.0>}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {id,channel}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {mfargs, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {rabbit_channel,start_link, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [1,<0.5779924.0>,<0.5779953.0>,<0.5779924.0>, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> <<"10.47.104.51:3215 -> 10.47.88.201:5672">>, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> rabbit_framing_amqp_0_9_1, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {user,<<"xxxxxxx">>,[], 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{rabbit_auth_backend_oauth2, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> #Fun<rabbit_auth_backend_oauth2.5.36627292>}]}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> <<"lido">>, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> [{<<"connection.blocked">>,bool,false}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {<<"consumer_cancel_notify">>,bool,true}], 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> <0.5779934.0>,<0.5779955.0>]}}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {restart_type,transient}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {significant,true}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {shutdown,70000}, 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> {child_type,worker}] 2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0> 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> supervisor: {<0.5779951.0>,rabbit_channel_sup} 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> errorContext: shutdown 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> reason: reached_max_restart_intensity 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> offender: [{pid,<0.5779956.0>}, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {id,channel}, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {mfargs, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {rabbit_channel,start_link, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> [1,<0.5779924.0>,<0.5779953.0>,<0.5779924.0>, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> <<"10.47.104.51:3215 -> 10.47.88.201:5672">>, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> rabbit_framing_amqp_0_9_1, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {user,<<"xxxxxxx">>,[], 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> [{rabbit_auth_backend_oauth2, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> #Fun<rabbit_auth_backend_oauth2.5.36627292>}]}, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> <<"lido">>, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> [{<<"connection.blocked">>,bool,false}, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {<<"consumer_cancel_notify">>,bool,true}], 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> <0.5779934.0>,<0.5779955.0>]}}, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {restart_type,transient}, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {significant,true}, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {shutdown,70000}, 2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0> {child_type,worker}] 

server 0:

574895:2025-04-23 10:18:48.261339+00:00 [debug] <0.5779956.0> Will start up to 3 replicas for quorum queue queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido' with leader on node 'rabbit@rabbitmq-server-0.rabbitmq-nodes.xxxxx', initial machine version 5 574919:2025-04-23 10:18:48.628726+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20 574920:2025-04-23 10:18:48.629252+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20 574921:2025-04-23 10:18:48.629809+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20 574939:2025-04-23 10:18:49.378959+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': ra_log:init recovered last_index_term {0,0} snapshot_index_term {-1,-1}, last_written_index_term {0,0} 574959:2025-04-23 10:18:49.726729+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': post_init -> recover in term: 0 machine version: 5, last applied 0 574960:2025-04-23 10:18:49.726869+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovering state machine version 5:5 from index 0 to 0 574961:2025-04-23 10:18:49.727319+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovery of state machine version 5:5 from index 0 to 0 took 0ms 574962:2025-04-23 10:18:49.727468+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': scanning for cluster changes 1:0 574963:2025-04-23 10:18:49.727641+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recover -> recovered in term: 0 machine version: 5, last applied 0 574964:2025-04-23 10:18:49.727818+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovered -> follower in term: 0 machine version: 5, last applied 0 592793:2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> <<"91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9">>, 592873:2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0> <<"91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9">>}, 595996:2025-04-23 10:19:54.884806+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known. 669352:2025-04-23 10:23:14.944921+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known. 678457:2025-04-23 10:24:49.023934+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': follower received unhandled msg: {ra_log_event,{segments,[{...}],[...]}} 713621:2025-04-23 10:26:29.950377+00:00 [error] <0.5914313.0> operation basic.consume caused a connection exception internal_error: "timed out consuming from quorum queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': {'lido_91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9',\n 'rabbit@rabbitmq-server-1.rabbitmq-nodes.xxxxx'}" 714189:2025-04-23 10:26:34.982961+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known. 

server 1:

628440:2025-04-23 10:18:49.202017+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': ra_log:init recovered last_index_term {0,0} snapshot_index_term {-1,-1}, last_written_index_term {0,0} 628461:2025-04-23 10:18:49.262588+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': post_init -> recover in term: 0 machine version: 5, last applied 0 628462:2025-04-23 10:18:49.266395+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovering state machine version 5:5 from index 0 to 0 628463:2025-04-23 10:18:49.266568+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovery of state machine version 5:5 from index 0 to 0 took 0ms 628464:2025-04-23 10:18:49.267539+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': scanning for cluster changes 1:0 628465:2025-04-23 10:18:49.267793+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recover -> recovered in term: 0 machine version: 5, last applied 0 628466:2025-04-23 10:18:49.268106+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovered -> follower in term: 0 machine version: 5, last applied 0 677717:2025-04-23 10:22:04.886201+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known. 754750:2025-04-23 10:25:20.161779+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': follower received unhandled msg: {ra_log_event,{segments,[{...}],[...]}} 754983:2025-04-23 10:25:24.949112+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known. 

server 2:

648735:2025-04-23 10:19:50.911784+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': ra_log:init recovered last_index_term {0,0} snapshot_index_term {-1,-1}, last_written_index_term {0,0} 648803:2025-04-23 10:19:51.159007+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': post_init -> recover in term: 0 machine version: 5, last applied 0 648815:2025-04-23 10:19:51.167261+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovering state machine version 5:5 from index 0 to 0 648840:2025-04-23 10:19:51.329633+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovery of state machine version 5:5 from index 0 to 0 took 0ms 648938:2025-04-23 10:19:51.367773+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': scanning for cluster changes 1:0 648946:2025-04-23 10:19:51.397766+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recover -> recovered in term: 0 machine version: 5, last applied 0 648955:2025-04-23 10:19:51.398296+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovered -> follower in term: 0 machine version: 5, last applied 0 662995:2025-04-23 10:20:59.884392+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known. 704973:2025-04-23 10:23:09.886627+00:00 [error] <0.5830650.0> operation basic.consume caused a connection exception internal_error: "timed out consuming from quorum queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': {'lido_91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9',\n 'rabbit@rabbitmq-server-1.rabbitmq-nodes.xxxxx'}" 

I have attempted to create a reproducer program, but unfortunately I'm currently struggling to trigger the issue with non-proprietary code.
The issue also does not reproduce by just creating huge numbers of queues, it seems very timing dependent.

Reproduction steps

The script that I'm unfortunately unable to release at the moment attempts to simulate our system's behaviour:

  • In parallel (16x concurrency):
    • Create 5-50 quorum queues (random), each using separate connections, wait for all to complete
    • Consume from all the queues, also on separate connections
    • Using a single connection, publish a message to all of the queues over a 5 second period
    • Consumers disconnect after receiving a message
    • If the consumer does not receive a message within 2 minutes, then we have encountered the broken state
    • Delete all the queues, also using separate connections
    • Sleep for from 0 to 5 seconds, repeat

Please note the above is significantly higher load than our production system is subjected to.

With this script I am usually able to get queues into this state within a few hours.
I was also unable to reproduce it under a local Kind cluster, so it may be necessary to simulate network and disk latency.

Expected behavior

Queues eventually recover from this state or the client receives an error/disconnect and can try again later.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions