
I have a system (CentOS 7.8) with Kubernetes (k8s) installed on top.

After a few days of normal operation (system load is ~30% and disk activity is around 60 IOPS, not saturated), the system goes into an unstable state where nothing is committed to disk anymore. As can be seen from iostat -x 5 below, avgqu-sz freezes and no more reads or writes are accepted.

The question is: where should I look next to identify the root cause?

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,36    0,00    1,71   86,94    0,00    0,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,41    0,00    1,58   87,01    0,00    0,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00
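For the next occurrence, my plan (a rough sketch; the sysrq interface must be enabled and the device names are specific to my box) is to capture the in-flight request counters and the stacks of blocked tasks while the stall is happening:

# number of read/write requests still queued against the device
cat /sys/block/sda/inflight

# whether SCSI error handling has taken the device offline
cat /sys/block/sda/device/state

# dump stack traces of all blocked (D-state) tasks into the kernel log
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 200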

The last log lines from /var/log/messages before a forced system reset were:

kernel: ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20130517/exfield-389)
kernel: ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMM] (Node ffff99c2ba2513c0), AE_AML_BUFFER_LIMIT (20130517/psparse-536)
kernel: ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20130517/power_meter-339)

although, according to this Red Hat thread, it should not be an issue: https://access.redhat.com/discussions/3871951

LE 1: Occasionally I get similar freezes for short periods of time (less than a minute), and then the system recovers. In the dmesg output I have:

[Mon Aug 17 21:04:07 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Mon Aug 17 21:04:15 2020] hpsa 0000:06:00.0: device is ready.
[Mon Aug 17 21:04:15 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
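To catch these short resets as they happen and correlate them with the iostat stalls, I keep a live kernel log stream open from another machine over SSH (standard tooling, nothing specific to this setup):

# follow the kernel ring buffer live; hpsa logs the reset start and completion here
journalctl -k -f | grep --line-buffered -i hpsa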

LE 2: I managed to save a dmesg output from a case where the disk doesn't recover anymore and a reset is required.

[Mon Aug 24 13:00:18 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Mon Aug 24 13:03:20 2020] INFO: task scsi_eh_0:332 blocked for more than 120 seconds.
[Mon Aug 24 13:03:20 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Aug 24 13:03:20 2020] scsi_eh_0       D ffff8c603fc9acc0     0   332      2 0x00000000
[Mon Aug 24 13:03:20 2020] Call Trace:

So hpsa enters this logical volume reset procedure, which never completes.
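Since the last kernel messages are usually lost on a forced reset (journald on CentOS 7 keeps its journal in memory by default), I also made the logs survive reboots; a minimal sketch, where the remote syslog address is just a placeholder to adapt:

# keep the systemd journal on disk instead of the volatile /run tmpfs
mkdir -p /var/log/journal
systemctl restart systemd-journald

# optionally ship everything to a remote syslog host as well
echo '*.* @192.0.2.10:514' >> /etc/rsyslog.d/remote.conf
systemctl restart rsyslog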

  • Check dmesg when the problem occurs. Commented Aug 17, 2020 at 8:46
  • Ok, I'll try to keep a connection open and get the last logs from dmesg before resetting. Otherwise they don't get persisted. In the meantime I've updated the question with an observation that could correlate. Commented Aug 17, 2020 at 18:14
  • @MichaelHampton, I've updated the post with dmesg output when the problem occurs. Commented Aug 24, 2020 at 10:12

1 Answer


The last time I saw this kind of symptom, with disk I/O stopping or pausing, it was a bad disk. The disk's on-board controller was probably starting to malfunction, but the platters were fine.

I would check to be sure you have a good backup, and since it's a RAID system, check whether the SCSI controller firmware is up to date, as it hasn't flagged the disk as bad yet.
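If you want to look deeper than the green status, you can also query the physical drives behind the Smart Array controller directly; a rough sketch, assuming smartmontools and HP's ssacli (or the older hpssacli) utility are installed, with drive indexes that will differ on your system:

# SMART data for the physical drives behind the logical volume (cciss passthrough)
smartctl -a -d cciss,0 /dev/sda
smartctl -a -d cciss,1 /dev/sda

# controller, cache and drive status as reported by the HP tool
ssacli ctrl all show config detail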

  • The controller is an HP H240. My firmware version is 6.3. I see the latest is 7.0, so I'll try to update to that. Other than that, the controller and the drives are marked as green. Commented Aug 17, 2020 at 19:16
  • @LaurentiuSoica Perfect, let me know if that changes anything. In my case the disks were green too, but corruption happened in the RAID because the faulty disk didn't get flagged by the controller. That's why I asked you to make sure you have a backup. Commented Aug 17, 2020 at 19:25
  • I did upgrade the firmware to the latest available. After another ~3 days, the disk froze again and the only option left was a forced reset. The disks and controller are still reported as green. Commented Aug 24, 2020 at 9:46
  • So, a few days later, as expected, one of the disks went down. What's still unclear to me is what should be done during this window to troubleshoot and act faster. Commented Aug 28, 2020 at 18:43
