
I have a system (CentOS 7.8) with Kubernetes (k8s) installed on top.

After a few days of normal operation (system load is ~30% and disk activity is around 60 IOPS, not saturated), the system goes into an unstable state where nothing is committed to disk anymore. As can be seen from iostat -x 5 below, avgqu-sz freezes and no more reads or writes are accepted.

The question is: where should I look next to identify the root cause?

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,36    0,00    1,71   86,94    0,00    0,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,41    0,00    1,58   87,01    0,00    0,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00
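For the next occurrence, my plan (a rough sketch; the sysrq interface must be enabled and the device names are specific to my box) is to capture the in-flight request counters and the stacks of blocked tasks while the stall is happening:

# number of read/write requests still queued against the device
cat /sys/block/sda/inflight

# whether SCSI error handling has taken the device offline
cat /sys/block/sda/device/state

# dump stack traces of all blocked (D-state) tasks into the kernel log
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 200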

The last log lines from /var/log/messages before a forced system reset were:

kernel: ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20130517/exfield-389)
kernel: ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMM] (Node ffff99c2ba2513c0), AE_AML_BUFFER_LIMIT (20130517/psparse-536)
kernel: ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20130517/power_meter-339)

although, according to this Red Hat thread, it should not be an issue: https://access.redhat.com/discussions/3871951

LE 1: Occasionally I get similar freezes for short periods of time (less than a minute), and then the system recovers. In the dmesg output I have:

[Mon Aug 17 21:04:07 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Mon Aug 17 21:04:15 2020] hpsa 0000:06:00.0: device is ready.
[Mon Aug 17 21:04:15 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
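To catch these short resets as they happen and correlate them with the iostat stalls, I keep a live kernel log stream open from another machine over SSH (standard tooling, nothing specific to this setup):

# follow the kernel ring buffer live; hpsa logs the reset start and completion here
journalctl -k -f | grep --line-buffered -i hpsa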

LE 2: I managed to save a dmesg output from a case where the disk doesn't recover anymore and a reset is required.

[Mon Aug 24 13:00:18 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Mon Aug 24 13:03:20 2020] INFO: task scsi_eh_0:332 blocked for more than 120 seconds.
[Mon Aug 24 13:03:20 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Aug 24 13:03:20 2020] scsi_eh_0       D ffff8c603fc9acc0     0   332      2 0x00000000
[Mon Aug 24 13:03:20 2020] Call Trace:

So hpsa enters this logical volume reset procedure, which never completes.
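Since the last kernel messages are usually lost on a forced reset (journald on CentOS 7 keeps its journal in memory by default), I also made the logs survive reboots; a minimal sketch, where the remote syslog address is just a placeholder to adapt:

# keep the systemd journal on disk instead of the volatile /run tmpfs
mkdir -p /var/log/journal
systemctl restart systemd-journald

# optionally ship everything to a remote syslog host as well
echo '*.* @192.0.2.10:514' >> /etc/rsyslog.d/remote.conf
systemctl restart rsyslog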

  • Check dmesg when the problem occurs. Commented Aug 17, 2020 at 8:46
  • Ok, I'll try to keep a connection open and get the last logs from dmesg before resetting. Otherwise they don't get persisted. In the meantime I've updated the question with an observation that could correlate. Commented Aug 17, 2020 at 18:14
  • @MichaelHampton, I've updated the post with dmesg output when the problem occurs. Commented Aug 24, 2020 at 10:12

1 Answer


The last time I saw this kind of symptom, with disk I/O stopping or pausing, it was a bad disk. The disk's on-board controller was probably starting to malfunction, but the platters were fine.

I would check to be sure you have a good backup, and since it's a RAID system, check whether the SCSI controller firmware is up to date, as it hasn't flagged the disk as bad yet.
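If you want to look deeper than the green status, you can also query the physical drives behind the Smart Array controller directly; a rough sketch, assuming smartmontools and HP's ssacli (or the older hpssacli) utility are installed, with drive indexes that will differ on your system:

# SMART data for the physical drives behind the logical volume (cciss passthrough)
smartctl -a -d cciss,0 /dev/sda
smartctl -a -d cciss,1 /dev/sda

# controller, cache and drive status as reported by the HP tool
ssacli ctrl all show config detail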

  • The controller is an HP H240. My firmware version is 6.3. I see the latest is 7.0, so I'll try to update to that. Other than that, the controller and the drives are marked as green. Commented Aug 17, 2020 at 19:16
  • @LaurentiuSoica Perfect, let me know if that changes anything. In my case the disks were green too, but corruption happened in the RAID because the faulty disk didn't get flagged by the controller. That's why I asked you to make sure you have a backup. Commented Aug 17, 2020 at 19:25
  • I did upgrade the firmware to the latest available. After another ~3 days, the disk froze again and the only option left was a forced reset. The disks and controller are still reported as green. Commented Aug 24, 2020 at 9:46
  • So, a few days later, as expected, one of the disks went down. What's still unclear to me is what should be done during this window to troubleshoot and act faster. Commented Aug 28, 2020 at 18:43
