I have a Debian 10 server that is randomly rebooting, though no errors were written to journald. The server has rebooted 20 times in the last 3 days.

    $ journalctl --list-boots
    -22 bdb1799f0c9a4e81af6d41b0bd6c5cd9 Tue 2023-01-17 12:42:00 UTC—Sat 2023-01-21 22:01:24 UTC
    ...
     -2 e306cc0481784a0cad5e7138b0fcfcdb Mon 2023-01-23 13:18:52 UTC—Mon 2023-01-23 13:28:54 UTC
     -1 e4ca2701610640cfb11c39c38d05c091 Mon 2023-01-23 13:32:02 UTC—Mon 2023-01-23 13:34:27 UTC
      0 d5c51684dc6e4538a241216f400d9ca7 Tue 2023-01-24 10:23:51 UTC—Tue 2023-01-24 13:10:04 UTC

Usually I run memtester, which takes a couple of hours (depending on RAM size) and is quite unlikely to actually reproduce the issue (if it really is memory).

    $ apt install memtester
    $ memtester 245GB 4 > memtester.log 2>&1

My server has 256GB RAM, in 16 RAM modules:

    $ dmidecode -t memory | grep Size | wc -l
    16
    $ free -h
                 total       used       free     shared    buffers     cached
    Mem:          251G        32G       218G       113M         0B       135M
    -/+ buffers/cache:        32G       219G
    Swap:           0B         0B         0B

DDR3 modules:

    Handle 0x002D, DMI type 17, 34 bytes
    Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1600 MHz
        Manufacturer: Hynix Semiconducto
        Serial Number: 093C2E1C
        Asset Tag: Dimm0_AssetTag
        Part Number: HMT42GR7AFR4C-RD
        Rank: 2
        Configured Clock Speed: 1600 MHz

UPDATE: The system should have ECC memory modules (seems to be detected in dmidecode -t memory):

    Handle 0x002B, DMI type 16, 23 bytes
    Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 512 GB
        Error Information Handle: Not Provided
        Number Of Devices: 8

After replacing all memory modules the system shows EDAC MC0 errors (I haven't seen those before):

    Jan 24 14:47:07 kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
    Jan 24 15:00:13 kernel: perf: interrupt took too long (3174 > 3158), lowering kernel.perf_event_max_sample_rate to 63000
    Jan 24 15:19:20 kernel: perf: interrupt took too long (3984 > 3967), lowering kernel.perf_event_max_sample_rate to 50000
    Jan 24 16:01:03 kernel: perf: interrupt took too long (4983 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
    Jan 24 17:43:25 kernel: perf: interrupt took too long (6233 > 6228), lowering kernel.perf_event_max_sample_rate to 32000
    Jan 24 19:02:54 kernel: mce: [Hardware Error]: Machine check events logged
    Jan 24 19:02:54 kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
    Jan 24 19:02:54 kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004f000800c1
    Jan 24 19:02:54 kernel: EDAC sbridge MC0: TSC 2fe1a1819026
    Jan 24 19:02:54 kernel: EDAC sbridge MC0: ADDR 1ff0136000
    Jan 24 19:02:54 kernel: EDAC sbridge MC0: MISC 908400400041e8c
    Jan 24 19:02:54 kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1674586974 SOCKET 0 APIC 0
    Jan 24 19:02:54 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff0136 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)

UPDATE 2: I've tried disabling the EDAC kernel module, as suggested by Red Hat/SUSE, in order to rule out the possibility that the module conflicts with the hardware error correction on the motherboard:

    echo "blacklist sb_edac" >> /etc/modprobe.d/50-blacklist.conf

This seems to prevent reboots, but memory allocation is failing (under workload). All memtests are still passing.

    Hardware name: Supermicro X9DRFR/X9DRFR, BIOS 3.2 01/16/2015
    Call Trace:
     dump_stack+0x66/0x81
     dump_header+0x6b/0x283
     ? ___ratelimit+0xa1/0x100
     oom_kill_process.cold.30+0xb/0x1cf
     out_of_memory+0x1a5/0x450
     mem_cgroup_out_of_memory+0xbe/0xd0
     try_charge+0x707/0x780
     mem_cgroup_try_charge+0x86/0x190
     __add_to_page_cache_locked+0x64/0x240
     add_to_page_cache_lru+0x4a/0xe0
     filemap_fault+0x34c/0x780
     ? filemap_map_pages+0x1ed/0x3a0
     ext4_filemap_fault+0x2c/0x40 [ext4]
     __do_fault+0x36/0x170
     __handle_mm_fault+0xdb6/0x11b0
     handle_mm_fault+0xd6/0x200
     __do_page_fault+0x249/0x4f0
     ? page_fault+0x8/0x30
     page_fault+0x1e/0x30
    RIP: 0033:0x7f1e1d58ff9d
    Code: Bad RIP value.
    RSP: 002b:00007fff6a4fd3d8 EFLAGS: 00010202
    RAX: 00007f1e183501e0 RBX: 00007f10cbf0a638 RCX: 0000000000000040
    RDX: 0000000000000006 RSI: 00007f1e183501e6 RDI: 00007f10cbf0a626
    RBP: 00007f10cbf0b3e8 R08: 0000000000000006 R09: 0000000000000007
    R10: c2bdb975b17afafd R11: 00007f1e1d5b6060 R12: 00007f1e183501b0
    R13: 0000000000000005 R14: 00007f10cbf093c0 R15: 00007f10cbf0b3c8
    mce: [Hardware Error]: Machine check events logged
    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
    mce: [Hardware Error]: TSC 101eeb22ce3e ADDR 1ff19b6000 MISC 908400400041e8c
    mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674617922 SOCKET 0 APIC 0 microcode 428
    mce: [Hardware Error]: Machine check events logged
    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
    mce: [Hardware Error]: TSC 19a7daf91fd4 ADDR 1ff19b6000 MISC 908400400041e8c
    mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674621954 SOCKET 0 APIC 0 microcode 428
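
Both machine checks logged after blacklisting sb_edac carry the same ADDR (1ff19b6000), and the earlier EDAC report pairs ADDR 1ff0136000 with page:0x1ff0136. A rough sketch for tallying addresses from the kernel log and converting an address to the page number EDAC prints is below. The function names `mce_addr_tally` and `addr_to_page` are made up for illustration, and the 4 KiB page size is an assumption that matches the log above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any package.

# Read kernel log text on stdin and count how often each "ADDR <hex>"
# value appears, most frequent first.
mce_addr_tally() {
    grep -oE 'ADDR [0-9a-fA-F]+' \
        | awk '{ count[$2]++ } END { for (a in count) print count[a], a }' \
        | sort -rn
}

# Convert a physical address (hex digits without the 0x prefix) into a
# page number, assuming 4 KiB pages (shift right by 12 bits).
addr_to_page() {
    printf '0x%x\n' "$(( 0x$1 >> 12 ))"
}

# Usage on the affected host (assumes the kernel log is in the journal):
#   journalctl -k | mce_addr_tally
#   addr_to_page 1ff0136000
```

If the same address or page keeps coming back after the modules were swapped, that would hint at the slot, channel, or memory controller rather than a particular DIMM, though this is only a heuristic.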
kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000

X9DRFR.

ipmitool or ipmiutil package (Debian has them both), try the sel command. Better to use ipmiutil (I've seen cases where it decoded messages much better).
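
As a minimal sketch of the suggestion above, the following wrapper dumps the BMC's System Event Log with whichever tool is installed, preferring ipmiutil as recommended. The function name `show_sel` is hypothetical. It assumes the IPMI kernel drivers are loaded and that it runs as root on the affected machine.

```shell
#!/bin/sh
# Hypothetical wrapper: print the IPMI System Event Log (SEL).
show_sel() {
    if command -v ipmiutil >/dev/null 2>&1; then
        # ipmiutil decodes SEL records itself
        ipmiutil sel
    elif command -v ipmitool >/dev/null 2>&1; then
        # "elist" prints the SEL with extended, decoded event descriptions
        ipmitool sel elist
    else
        echo "neither ipmiutil nor ipmitool found (apt install ipmiutil ipmitool)" >&2
        return 1
    fi
}

# Usage:
#   show_sel | less
```

Entries timestamped around the reboots are the ones to look at, since the BMC can record hardware events that never reach journald.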