Describe the bug
Td-agent is deployed in a Kubernetes pod and configured to receive logs from Fluent Bit and forward them to Loki and Graylog. Some time after startup, often after a target such as the Loki distributors becomes unreachable, the CPU utilization of the td-agent Ruby process rises to 100% and stays there indefinitely. No more logs from Fluent Bit are received and none are forwarded. This only happens with multiple forward targets, in our case Graylog and Loki; so far the issue has never occurred with only one target.
Apparently the flush_thread_0 thread is stuck at 100% CPU utilization:
top -H
top - 14:20:36 up 19 days, 22:26, 0 users, load average: 1.42, 1.48, 1.70
Threads: 17 total, 2 running, 15 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.1 us, 0.9 sy, 0.0 ni, 83.8 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 31657.9 total, 3179.4 free, 15111.0 used, 13367.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 16346.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
41476 root 20 0 916380 148856 9632 R 99.7 0.5 9032:07 flush_thread_0
41478 root 20 0 916380 148856 9632 S 0.3 0.5 22:28.29 event_loop
1 root 20 0 308200 53128 9328 S 0.0 0.2 1:50.29 td-agent
12 root 20 0 308200 53128 9328 S 0.0 0.2 0:31.32 signal_thread.*
13 root 20 0 308200 53128 9328 S 0.0 0.2 0:00.00 socket_manager*
41473 root 20 0 916380 148856 9632 S 0.0 0.5 0:02.16 ruby
41474 root 20 0 916380 148856 9632 S 0.0 0.5 7:39.57 flush_thread_0
41475 root 20 0 916380 148856 9632 S 0.0 0.5 0:01.66 enqueue_thread
41477 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.22 enqueue_thread
41479 root 20 0 916380 148856 9632 S 0.0 0.5 0:03.69 event_loop
41480 root 20 0 916380 148856 9632 S 0.0 0.5 0:02.40 event_loop
41481 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.01 ruby
41482 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.81 event_loop
41483 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.01 ruby
41484 root 20 0 916380 148856 9632 S 0.0 0.5 0:00.00 fluent_log_eve*
43545 root 20 0 4244 3540 2976 S 0.0 0.0 0:00.01 bash
43557 root 20 0 6136 3232 2720 R 0.0 0.0 0:00.00 top
Attaching a debugger shows that the thread is stuck in the libyajl.so library.
After rebuilding libyajl.so with debug symbols, we could see the exact location:
gdb -p 41476
(gdb) bt
#0 yajl_buf_ensure_available (want=1, buf=0x7fb0b441ec50) at ../../../../ext/yajl/yajl_buf.c:64
#1 yajl_buf_append (buf=0x7fb0b441ec50, data=0x7fb0bd2a517e, len=1) at ../../../../ext/yajl/yajl_buf.c:89
#2 0x00007fb0bd2a104e in yajl_gen_string (g=0x7fb0b4420210, str=0x56277d137758 "facility", len=8) at ../../../../ext/yajl/yajl_gen.c:248
#3 0x00007fb0bd29eff4 in yajl_encode_part (wrapper=wrapper@entry=0x7fb0b44201e0, obj=, io=io@entry=8) at ../../../../ext/yajl/yajl_ext.c:249
#4 0x00007fb0bd29f1a8 in yajl_encode_part (wrapper=wrapper@entry=0x7fb0b44201e0, obj=obj@entry=94727611734280, io=io@entry=8)
at ../../../../ext/yajl/yajl_ext.c:197
#5 0x00007fb0bd29f288 in rb_yajl_encoder_encode (argc=argc@entry=1, argv=argv@entry=0x7fb0bc5c0398, self=)
at ../../../../ext/yajl/yajl_ext.c:1115
#6 0x00007fb0bd29feda in rb_yajl_json_ext_hash_to_json (argc=0, argv=0x7fb0bc5c0fe8, self=) at ../../../../ext/yajl/yajl_ext.c:1179
...
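The backtrace ends in the buffer-growth code of yajl_buf.c. To illustrate why such a doubling loop can spin forever, below is a small, self-contained toy model written for this report. It is not the vendored yajl source; the struct, the field names, the grow function, and the "corrupted" scenario are hypothetical and only reflect our assumption that the encoder's buffer bookkeeping ends up in an inconsistent state, for example when one Yajl encoder is shared by multiple flush threads without locking (see the yajl-ruby issue linked below).

```c
/*
 * Hypothetical toy model of the buffer-growth loop in
 * ext/yajl/yajl_buf.c (yajl_buf_ensure_available), where the backtrace
 * above is stuck. This is NOT the vendored yajl source; struct and
 * field names are invented for illustration, and the "corrupted" case
 * is an assumed inconsistent state, not a confirmed root cause.
 */
#include <stdio.h>

struct toy_buf {
    unsigned int len;   /* allocated size in bytes */
    unsigned int used;  /* bytes already written   */
};

/*
 * Mimics a growth loop of the form
 *     while (want >= (need - buf->used)) need <<= 1;
 * Returns the new size, or 0 if the loop would never terminate
 * (detected here with an iteration cap so the demo can exit).
 */
static unsigned int grow(const struct toy_buf *buf, unsigned int want)
{
    unsigned int need = buf->len;
    int iterations = 0;

    while (want >= (need - buf->used)) {
        need <<= 1;
        if (++iterations > 64)
            return 0; /* need is stuck (e.g. at 0); the real loop would spin forever */
    }
    return need;
}

int main(void)
{
    /* Normal case: buffer nearly full, doubles once and terminates. */
    struct toy_buf healthy = { .len = 2048, .used = 2040 };

    /* Inconsistent case: len and used both 0 although the buffer is
     * considered allocated; need stays 0, the loop condition stays
     * true, and the thread burns 100% CPU without ever returning. */
    struct toy_buf corrupted = { .len = 0, .used = 0 };

    printf("healthy buffer  : grows to %u bytes\n", grow(&healthy, 16));
    printf("corrupted buffer: %s\n",
           grow(&corrupted, 16) ? "terminates"
                                : "never terminates (100% CPU)");
    return 0;
}
```

Compiled with gcc, the corrupted case hits the iteration cap immediately, which matches the symptom of flush_thread_0 spinning at 100% CPU while never producing output.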
To provide more details, we opened a separate issue with yajl-ruby:
brianmario/yajl-ruby#205 (comment)
To Reproduce
- Deploy td-agent as in the scenario above with the configuration below.
- Make sure some log volume is forwarded by Fluent Bit to Fluentd (a few hundred logs/sec).
- Scale down the Loki distributors to zero to force a backpressure situation.
- Wait until the td-agent pod consumes 100% CPU (as observed against its Kubernetes resources and limits). This can take a few hours.
Expected behavior
Td-agent should not get stuck in an infinite loop consuming 100% CPU when trying to flush records. Instead, when running into backpressure, it should buffer or discard records until the target is available again.
Your Environment
- Fluentd version: 1.13.3
- TD Agent version: 4.2.0
- Operating system: Ubuntu 20.04.3 LTS
- Kernel version: 5.4.0-6-cloud-amd64 #1 SMP Debian 5.4.93-1 (2021-02-09)
Your Configuration
<system>
  log_level info
</system>

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

<source>
  @type prometheus
  port 24222
</source>

<source>
  @type prometheus_monitor
  <labels>
    hostname ${hostname}
  </labels>
</source>

<source>
  @type prometheus_output_monitor
  <labels>
    hostname ${hostname}
  </labels>
</source>

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<filter **>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>

<filter **>
  @type record_transformer
  <record>
    ingestor "#{ENV['HOSTNAME']}"
  </record>
</filter>

<match **>
  @type copy
  <store>
    @type gelf
    host graylog-input
    port 12222
    protocol tcp
    tls false
    <buffer>
      @type file
      path /var/xxx/store/fluentd/Graylog
      total_limit_size 10GB
      chunk_limit_size 10240k
      flush_at_shutdown true
      overflow_action drop_oldest_chunk
      retry_max_times 120
      retry_type periodic
      retry_wait 10s
    </buffer>
  </store>
  <store>
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
  <store>
    @type loki
    url "http://#{ENV['LOKI_HOST']}:#{ENV['LOKI_PORT']}"
    username xxx
    password xxx
    <label>
      ingestor
      abap_system
      app
      az
      landscape
      namespace
      node
      pod
      Labels
      MessageID
      MessageSource
      RFC3339_Timestamp
      SAP_EPP
      Severity
      ServiceName
      ServiceComponent
      UserName
      bosh_deployment
      bosh_director
      bosh_group
      bosh_id
      bosh_az
    </label>
    <buffer time>
      @type file
      path /var/xxx/store/fluentd/Loki
      total_limit_size 10GB
      chunk_limit_size 10240k
      flush_at_shutdown true
      overflow_action drop_oldest_chunk
      retry_max_times 120
      retry_type periodic
      retry_wait 10s
      timekey 10
      timekey_wait 60
    </buffer>
  </store>
</match>
Your Error Log
As td-agent is stuck in an infinite loop, no further info/warning/error messages are logged.
Additional context
No response