Understanding the reason for VM crash (OOM) from a crash dump

Hi, we’re trying to understand why the VM crashed with an OOM error by analysing the crash dump. Here’s the output of erl_crashdump_analyzer.sh:

$ ./erl_crashdump_analyzer.sh prod-2crash-2.dump
analyzing prod-2crash-2.dump, generated on: Wed Mar 16 07:10:32 2022

Slogan: eheap_alloc: Cannot allocate 762886488 bytes of memory (of type "old_heap").

Memory:
===
processes: 694 Mb
processes_used: 694 Mb
system: 2230 Mb
atom: 1 Mb
atom_used: 1 Mb
binary: 2164 Mb
code: 0 Mb
ets: 6 Mb
---
total: 2924 Mb

Different message queue lengths (5 largest different):
===
990 0

Error logger queue length:
===

File descriptors open:
===
UDP: 0
TCP: 109
Files: 1
---
Total: 110

Number of processes:
===
990

Processes Heap+Stack memory sizes (words) used in the VM (5 largest different):
===
1 66222786
1 999631
1 514838
1 318187
2 121536

Processes OldHeap memory sizes (words) used in the VM (5 largest different):
===
1 999631
3 833026
13 514838
1 318187
4 196650

Process States when crashing (sum):
===
1 CONNECTED
1 CONNECTED|BINARY_IO
104 CONNECTED|BINARY_IO|PORT_LOCK
1 CONNECTED|BINARY_IO|SOFT_EOF|PORT_LOCK
8 CONNECTED|PORT_LOCK
1 CONNECTED|SOFT_EOF
1 Current Process Garbing
1 Current Process Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
1 Garbing
1 Internal ACT_PRIO_HIGH | USR_PRIO_HIGH | PRQ_PRIO_HIGH
3 Internal ACT_PRIO_HIGH | USR_PRIO_HIGH | PRQ_PRIO_HIGH | OFF_HEAP_MSGQ
3 Internal ACT_PRIO_LOW | USR_PRIO_LOW | PRQ_PRIO_LOW
4 Internal ACT_PRIO_MAX | USR_PRIO_MAX | PRQ_PRIO_MAX
1 Internal ACT_PRIO_MAX | USR_PRIO_MAX | PRQ_PRIO_MAX | OFF_HEAP_MSGQ
973 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL
1 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
4 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | OFF_HEAP_MSGQ
989 Waiting

The crash dump is over 1.6 GB in size and the tool spews out this error:

 ./erl_crashdump_analyzer.sh: line 30: let: m=Current Process Program counter: 0x00007f6b23b608b8 ('Elixir.Jason.Decoder':string/7 + 24)/(1024*1024): syntax error in expression (error token is "Process Program counter: 0x00007f6b23b608b8 ('Elixir.Jason.Decoder':string/7 + 24)/(1024*1024)") 

Here are some relevant parts of the crash dump:

=dirty_cpu_scheduler:3
Scheduler Sleep Info Flags:
Scheduler Sleep Info Aux Work:
Current Process: <0.24269.0>
Current Process State: Garbing
Current Process Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
Current Process Program counter: 0x00007f6b23b608b8 ('Elixir.Jason.Decoder':string/7 + 24)
Current Process Limited Stack Trace:
0x00007f6a3e9bc490:SReturn addr 0x23B60130 ('Elixir.Jason.Decoder':parse/2 + 288)
0x00007f6a3e9bc4b8:SReturn addr 0x232685D0 ('Elixir.Tesla.Middleware.JSON':process/3 + 72)
0x00007f6a3e9bc4d0:SReturn addr 0x23267DF0 ('Elixir.Tesla.Middleware.JSON':decode/2 + 280)
0x00007f6a3e9bc4e8:SReturn addr 0x23C900D0 ('Elixir.Tesla.Middleware.Retry':retry/3 + 216)
...
0x00007f6a3e9bc578:SReturn addr 0x23E5AE98 ('Elixir.Oban.Queue.Executor':perform/1 + 296)
0x00007f6a3e9bc5a0:SReturn addr 0x23E597B0 ('Elixir.Oban.Queue.Executor':call/1 + 136)
0x00007f6a3e9bc5b0:SReturn addr 0x60B011E0 ('Elixir.Task.Supervised':invoke_mfa/2 + 128)
0x00007f6a3e9bc5f8:SReturn addr 0x60B01920 ('Elixir.Task.Supervised':reply/4 + 320)
0x00007f6a3e9bc610:SReturn addr 0x6C5CB950 (proc_lib:init_p_do_apply/3 + 64)
0x00007f6a3e9bc630:SReturn addr 0x9A0F88 (<terminate process normally>)
...
=dirty_io_run_queue
Run Queue Max Length: 0
Run Queue High Length: 0
Run Queue Normal Length: 0
Run Queue Low Length: 0
Run Queue Port Length: 0
Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK
=memory
total: 3066553160
processes: 728034440
processes_used: 728028416
system: 2338518720
atom: 1917297
atom_used: 1915015
binary: 2269481192
code: 42046517
ets: 6942480
...
=proc:<0.24269.0>
State: Garbing
Spawned as: proc_lib:init_p/5
Last scheduled in for: 'Elixir.Jason.Decoder':string/7
Spawned by: <0.3530.0>
Message queue length: 0
Number of heap fragments: 6552
Heap fragment data: 3373654
Link list: [<0.3530.0>, {from,<0.3531.0>,#Ref<0.2180618153.564920321.79435>}, {from,<0.24270.0>,#Ref<0.2180618153.564920321.79437>}]
Reductions: 1513280838
Stack+heap: 66222786
OldHeap: 0
Heap unused: 3372710
OldHeap unused: 0
BinVHeap: 275173304
OldBinVHeap: 0
BinVHeap unused: 87907486
OldBinVHeap unused: 28711198
Memory: 556772760
New heap start: 7F6A1F07F028
New heap top: 7F6A3D001108
Stack top: 7F6A3E9BC490
Stack end: 7F6A3E9BC638
Old heap start: 0
Old heap top: 0
Old heap end: 0
Program counter: 0x00007f6b23b608b8 ('Elixir.Jason.Decoder':string/7 + 24)
Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS

Things that we managed to figure out so far:

  • The VM went down trying to allocate an extra ~760 MB of memory while already consuming ~3 GB,
  • The culprit is the process with pid <0.24269.0>: an Oban job that was paging through the results of an external API and accumulating the JSON responses in memory (probably a mistake; a sketch of this pattern follows the list). It looks like that particular endpoint returned an unusually large amount of data; running the job would bring the whole VM down again and again after the app was brought back up, and removing the job solved the problem,
  • The VM crashed when the offending process was doing garbage collection,
  • The system was not showing any signs of overload in terms of CPU or memory prior to going down,
  • It’s unlikely that the data returned by the endpoint would be huge, definitely not in the ballpark of hundreds of megabytes,
  • We’re on Erlang 23.3.4.11 and Elixir 1.13.3,
  • We’re using the default Logger config (handle_otp_reports: true, handle_sasl_reports: false).
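
To illustrate the accumulation pattern mentioned above, the general shape of such a job is roughly the sketch below. The module, API client and pagination field names are made up for illustration, not our real code; the point is that every page is fetched, JSON-decoded by the Tesla middleware and kept in an accumulator until the last page arrives, so peak memory grows with the endpoint’s total data rather than with a single page:

    defmodule MyApp.SyncWorker do
      # Hypothetical reconstruction of the pattern described above; the real
      # module, API client and pagination fields are not shown in this thread.
      use Oban.Worker, queue: :sync

      @impl Oban.Worker
      def perform(%Oban.Job{}) do
        pages = fetch_all_pages(1, [])
        MyApp.Importer.import(pages) # hypothetical downstream processing
        :ok
      end

      defp fetch_all_pages(page, acc) do
        {:ok, %Tesla.Env{body: body}} = MyApp.ApiClient.list(page: page)

        case body do
          %{"next_page" => next} when not is_nil(next) -> fetch_all_pages(next, [body | acc])
          _last_page -> Enum.reverse([body | acc])
        end
      end
    end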

I’m unable to inspect the offending process in crashdump_viewer: the tool cannot handle it; reading the process heap times out and the UI disappears.

What really happened here?

My theory is that the offending process went over some per-process memory limit and crashed, and the VM went down while logging that crash. Can this somehow be confirmed? I was not able to reproduce it.

Is there a way to prevent this from happening again?

Which HTTP client are you using? You might want to look at Finch, as it reduces the amount of data copying. This can really help when dealing with large responses.
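
For reference, switching the adapter would look roughly like this (a sketch assuming Tesla 1.4+ with the Finch adapter; MyApp.Finch and MyApp.ApiClient are illustrative names, not from your code):

    # In the application's supervision tree: start a Finch pool.
    children = [
      {Finch, name: MyApp.Finch}
    ]

    # In the Tesla client module: point the adapter at that pool.
    defmodule MyApp.ApiClient do
      use Tesla

      adapter Tesla.Adapter.Finch, name: MyApp.Finch

      plug Tesla.Middleware.JSON
      plug Tesla.Middleware.Retry
    end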

Thanks, I’ll look into that! But the priority now is to pinpoint the actual reason for the crash and find measures to prevent this from happening again.

The 556 MB of memory on that single process does indeed make it look like the culprit, and it must have gotten all that data from somewhere. If the crash dump is 1.6 GB and this process alone is already at ~500 MB, you could try some manual inspection of the crash dump, since roughly a third of the whole file belongs to this one process. Assuming it’s JSON, you might be quick to figure out what kind of payload could have become this large.
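
One way to do that manual inspection without crashdump_viewer is to slice the heap section of that process out of the dump and look at it separately. A rough sketch (the =proc_heap section tag and the pid come from your excerpts above; paths and output file name are just examples):

    dump = "prod-2crash-2.dump"
    tag = "=proc_heap:<0.24269.0>"

    # Copy everything from the process's heap section up to the next "=..."
    # section into a smaller file for grepping / eyeballing.
    dump
    |> File.stream!()
    |> Stream.drop_while(&(not String.starts_with?(&1, tag)))
    |> Stream.take_while(&(String.starts_with?(&1, tag) or not String.starts_with?(&1, "=")))
    |> Stream.into(File.stream!("proc_24269_heap.txt"))
    |> Stream.run()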

In terms of prevention, you could set a maximum memory limit either only for these worker processes, using process_flag(max_heap_size, ...), or for all processes in the system with the +hmax emulator flag. E.g. set up a 300 MB per-process memory limit, if that is more than you expect. That will ensure that individual “poison” data sets won’t kill your VM.
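
For the per-process variant, a minimal sketch (the 300 MB figure is just the example above; note that max_heap_size is counted in words, i.e. 8 bytes each on a 64-bit VM):

    # Inside the worker process (e.g. at the top of the job's perform/1):
    # cap this process's heap at roughly 300 MB and kill it if it grows past
    # that, instead of letting it take the whole VM down with it.
    max_heap_words = div(300 * 1024 * 1024, 8)

    Process.flag(:max_heap_size, %{
      size: max_heap_words,
      kill: true,          # terminate the process when the limit is exceeded
      error_logger: true   # emit an error report so the kill shows up in the logs
    })

The +hmax flag takes the same unit (words), so the system-wide equivalent of ~300 MB would be +hmax 39321600 in your vm.args.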