How to test Linux server for hardware errors?

Question

I have a Debian 10 server that is randomly rebooting, though no errors were written to journald. The server has rebooted 20 times in the last 3 days.

$ journalctl --list-boots
-22 bdb1799f0c9a4e81af6d41b0bd6c5cd9 Tue 2023-01-17 12:42:00 UTC—Sat 2023-01-21 22:01:24 UTC
...
 -2 e306cc0481784a0cad5e7138b0fcfcdb Mon 2023-01-23 13:18:52 UTC—Mon 2023-01-23 13:28:54 UTC
 -1 e4ca2701610640cfb11c39c38d05c091 Mon 2023-01-23 13:32:02 UTC—Mon 2023-01-23 13:34:27 UTC
  0 d5c51684dc6e4538a241216f400d9ca7 Tue 2023-01-24 10:23:51 UTC—Tue 2023-01-24 13:10:04 UTC

Usually I run memtester, which takes a couple of hours (depending on RAM size), and it's quite unlikely to actually reproduce the issue (if it really is memory).

$ apt install memtester
$ memtester 245GB 4 > memtester.log 2>&1
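Since memtester locks the memory it tests, the size argument has to leave some room for the running system. A minimal sketch for deriving that size from /proc/meminfo, assuming the kernel reports a MemAvailable line; the helper name memtester_size_mb and the 4096 MB headroom are illustrative assumptions, not part of memtester:

```shell
# Hypothetical helper: print a memtester size in MB, computed from
# /proc/meminfo-style text on stdin minus a headroom in MB given as $1.
memtester_size_mb() {
    awk -v headroom="$1" '
        /^MemAvailable:/ {
            mb = int($2 / 1024) - headroom   # meminfo reports the value in kB
            if (mb < 0) mb = 0
            print mb
        }'
}

# Example (as root; the 4096 MB headroom is an assumed value, tune as needed):
# memtester "$(memtester_size_mb 4096 < /proc/meminfo)M" 4 > memtester.log 2>&1
```

The function only does the arithmetic; how much headroom is safe depends on what else is running on the machine.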

My server has 256GB RAM, in 16 RAM modules:

$ dmidecode -t memory | grep Size | wc -l
16

$ free -h
                    total   used   free   shared   buffers   cached
Mem:                 251G    32G   218G     113M        0B     135M
-/+ buffers/cache:           32G   219G
Swap:                  0B     0B     0B

DDR3 modules:

Handle 0x002D, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x002B
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 16384 MB
    Form Factor: DIMM
    Set: None
    Locator: P1-DIMMA1
    Bank Locator: P0_Node0_Channel0_Dimm0
    Type: DDR3
    Type Detail: Registered (Buffered)
    Speed: 1600 MHz
    Manufacturer: Hynix Semiconducto
    Serial Number: 093C2E1C
    Asset Tag: Dimm0_AssetTag
    Part Number: HMT42GR7AFR4C-RD
    Rank: 2
    Configured Clock Speed: 1600 MHz

UPDATE: The system should have ECC memory modules (seems to be detected in dmidecode -t memory)

Handle 0x002B, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 512 GB
    Error Information Handle: Not Provided
    Number Of Devices: 8

After replacing all memory modules, the system shows EDAC MC0 errors (I haven't seen those before):

Jan 24 14:47:07 kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Jan 24 15:00:13 kernel: perf: interrupt took too long (3174 > 3158), lowering kernel.perf_event_max_sample_rate to 63000
Jan 24 15:19:20 kernel: perf: interrupt took too long (3984 > 3967), lowering kernel.perf_event_max_sample_rate to 50000
Jan 24 16:01:03 kernel: perf: interrupt took too long (4983 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
Jan 24 17:43:25 kernel: perf: interrupt took too long (6233 > 6228), lowering kernel.perf_event_max_sample_rate to 32000
Jan 24 19:02:54 kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 19:02:54 kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 24 19:02:54 kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004f000800c1
Jan 24 19:02:54 kernel: EDAC sbridge MC0: TSC 2fe1a1819026
Jan 24 19:02:54 kernel: EDAC sbridge MC0: ADDR 1ff0136000
Jan 24 19:02:54 kernel: EDAC sbridge MC0: MISC 908400400041e8c
Jan 24 19:02:54 kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1674586974 SOCKET 0 APIC 0
Jan 24 19:02:54 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff0136 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
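To see whether such events cluster on one DIMM, the kernel log can be tallied by DIMM label. A minimal sketch, assuming the log lines follow the "EDAC MC0: 1 CE ... on <label> (" wording shown above; the function name count_ce_by_dimm is made up for illustration, and other EDAC drivers may format the line differently:

```shell
# Summarize corrected-error (CE) events per DIMM label from kernel log
# text on stdin. The pattern is modelled on the sb_edac line format in
# the log above and is an assumption for any other driver.
count_ce_by_dimm() {
    grep -E 'EDAC MC[0-9]+: [0-9]+ CE ' \
        | sed -E 's/.* on ([^ ]+) \(.*/\1/' \
        | sort | uniq -c
}

# Example: journalctl -k --no-pager | count_ce_by_dimm
# With the EDAC driver loaded, running totals are also exposed in sysfs:
# grep . /sys/devices/system/edac/mc/mc*/ce_count
```

Each output line is a count followed by the DIMM label, so a single slot that keeps reappearing stands out quickly.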

UPDATE 2: I've tried disabling the EDAC kernel module, as suggested by Red Hat/SUSE, in order to rule out the possibility that the module conflicts with hardware error correction on the motherboard.

echo "blacklist sb_edac" >> /etc/modprobe.d/50-blacklist.conf
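As an aside, the append above adds a duplicate line each time it is run. A minimal sketch of an idempotent variant; the function name blacklist_module is an illustrative assumption, and the file path is the one used above:

```shell
# Append "blacklist <module>" to a modprobe.d file only if that exact
# line is not already present, so repeated runs create no duplicates.
blacklist_module() {
    module=$1
    conf=$2
    grep -qxF "blacklist $module" "$conf" 2>/dev/null \
        || echo "blacklist $module" >> "$conf"
}

# Example (as root):
# blacklist_module sb_edac /etc/modprobe.d/50-blacklist.conf
# lsmod | grep sb_edac    # after a reboot: no output means it is not loaded
```

If the module is also packed into the initramfs, the image may need regenerating (update-initramfs -u on Debian) before the blacklist takes effect at boot.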

Blacklisting the module seems to prevent the reboots, but memory allocation now fails under workload. All memtests are still passing.

Hardware name: Supermicro X9DRFR/X9DRFR, BIOS 3.2 01/16/2015
Call Trace:
 dump_stack+0x66/0x81
 dump_header+0x6b/0x283
 ? ___ratelimit+0xa1/0x100
 oom_kill_process.cold.30+0xb/0x1cf
 out_of_memory+0x1a5/0x450
 mem_cgroup_out_of_memory+0xbe/0xd0
 try_charge+0x707/0x780
 mem_cgroup_try_charge+0x86/0x190
 __add_to_page_cache_locked+0x64/0x240
 add_to_page_cache_lru+0x4a/0xe0
 filemap_fault+0x34c/0x780
 ? filemap_map_pages+0x1ed/0x3a0
 ext4_filemap_fault+0x2c/0x40 [ext4]
 __do_fault+0x36/0x170
 __handle_mm_fault+0xdb6/0x11b0
 handle_mm_fault+0xd6/0x200
 __do_page_fault+0x249/0x4f0
 ? page_fault+0x8/0x30
 page_fault+0x1e/0x30
RIP: 0033:0x7f1e1d58ff9d
Code: Bad RIP value.
RSP: 002b:00007fff6a4fd3d8 EFLAGS: 00010202
RAX: 00007f1e183501e0 RBX: 00007f10cbf0a638 RCX: 0000000000000040
RDX: 0000000000000006 RSI: 00007f1e183501e6 RDI: 00007f10cbf0a626
RBP: 00007f10cbf0b3e8 R08: 0000000000000006 R09: 0000000000000007
R10: c2bdb975b17afafd R11: 00007f1e1d5b6060 R12: 00007f1e183501b0
R13: 0000000000000005 R14: 00007f10cbf093c0 R15: 00007f10cbf0b3c8
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 101eeb22ce3e ADDR 1ff19b6000 MISC 908400400041e8c
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674617922 SOCKET 0 APIC 0 microcode 428
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 19a7daf91fd4 ADDR 1ff19b6000 MISC 908400400041e8c
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674621954 SOCKET 0 APIC 0 microcode 428
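The "Bank 10" status word in these MCE lines can be decoded by hand to tell corrected from uncorrected errors. A minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (bit 63 VAL, bit 61 UC, bit 58 ADDRV, bit 57 PCC) and a full 16-digit hex value without a 0x prefix; the function name decode_mci_status is hypothetical:

```shell
# Hypothetical helper: decode a few architectural flag bits of a
# 16-digit hex IA32_MCi_STATUS value, e.g. the "Bank 10" value above.
decode_mci_status() {
    # Use only the upper 32 bits (status bits 63..32), so the arithmetic
    # stays well inside the shell's signed integer range.
    hi=$(printf '%s' "$1" | cut -c1-8)
    val=$(( (0x$hi >> 31) & 1 ))    # bit 63: register contents are valid
    uc=$(( (0x$hi >> 29) & 1 ))     # bit 61: error was NOT corrected
    addrv=$(( (0x$hi >> 26) & 1 ))  # bit 58: ADDR field is valid
    pcc=$(( (0x$hi >> 25) & 1 ))    # bit 57: processor context corrupt
    echo "VAL=$val UC=$uc ADDRV=$addrv PCC=$pcc"
}

decode_mci_status 8c00004f000800c1
```

For the value logged here this prints VAL=1 UC=0 ADDRV=1 PCC=0, i.e. a valid, corrected error with a usable address, which agrees with the "1 CE" that EDAC reported earlier.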

Can you be more specific about the rebooting? Is it crashing and restarting, or powering off and on? Could it be a power supply issue (a UPS fault, perhaps)?
– SEWTGIYWTKHNTDS, Commented Jan 24, 2023 at 14:19
I'm trying to rule out all possibilities. Technicians have checked the power supply and it looks OK. The only suspicious messages are of the form: kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
– Tombart, Commented Jan 24, 2023 at 15:57
Is it old? I had a system reboot because the thermal paste on the CPU cooler had dried out and the CPU was overheating. Another server didn't like the UPS self-test; a firmware update sorted that one, but your reboot frequency seems too high for that. I see "interrupt took too long" on lots of systems, so it's probably not significant. Malicious user? Hope you sort it soon.
– SEWTGIYWTKHNTDS, Commented Jan 24, 2023 at 16:30
I installed the system 2 weeks ago; cooling seems to be working fine. The motherboard is a Supermicro X9DRFR.
– Tombart, Commented Jan 24, 2023 at 20:20
Supermicro servers have an IPMI BMC with its own network connection (sometimes a dedicated port, sometimes shared with NIC 1), and it has its own hardware error log. What's in that log? You can also read it from the OS using the ipmitool or ipmiutil package (Debian has them both); try the sel command. Prefer ipmiutil (I've seen cases where it decoded messages far better).
– Nikita Kipriyanov, Commented Feb 13, 2023 at 15:34
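The BMC log mentioned in the comment above can be read from a shell on the server. A minimal sketch, assuming either ipmiutil or ipmitool is installed and that the kernel IPMI drivers are loaded; the wrapper name show_bmc_event_log is made up for illustration, and it prefers ipmiutil as the comment suggests:

```shell
# Print the BMC System Event Log with whichever IPMI tool is available.
# Usually needs root and a loaded IPMI device driver (/dev/ipmi0).
show_bmc_event_log() {
    if command -v ipmiutil >/dev/null 2>&1; then
        ipmiutil sel            # ipmiutil's SEL viewer
    elif command -v ipmitool >/dev/null 2>&1; then
        ipmitool sel elist      # extended SEL listing
    else
        echo "neither ipmiutil nor ipmitool is installed" >&2
        return 127
    fi
}

# Example (as root): show_bmc_event_log | less
```

Memory, power, or thermal events in that log would be recorded independently of journald, so they can survive a reset that leaves nothing in the OS logs.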

Chopper3 · Accepted Answer · 2023-01-24 13:36:38Z

1

Have you tried booting from https://www.memtest86.com/? It's always been great for me.


Not yet; I have ssh access to a booted OS. Unfortunately, booting a custom image is not possible in this case. Is the memtest86 algorithm very different from memtester?
– Tombart, Commented Jan 24, 2023 at 15:52
It boots from the tester ISO, so you've no OS in the way.
– Chopper3, Commented Jan 24, 2023 at 19:10
Yes, I know. I can only install/compile packages in the provided rescue system. I don't have physical access to the server. AFAIK it's not possible to install memtest86 as a package.
– Tombart, Commented Jan 24, 2023 at 19:54
If you have no control over the hardware and suspect a hardware problem, this is not your problem. Hand it over to the person who is in charge of the hardware.
– Nikita Kipriyanov, Commented Feb 13, 2023 at 15:45
