Our service runs on AWS on m5.12xlarge nodes (48 cores, 192 GB RAM) running Ubuntu 16.04, with Java 8. We allocate about 150 GB as the max heap size, and the nodes have no swap. The nature of our service is that it allocates a lot of large, short-lived objects. In addition, through a third-party library we depend on, we create a lot of short-lived processes that communicate with our process via pipes and are reaped after serving a handful of requests.
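For context, the pattern looks roughly like the sketch below. This is only an illustration, not the actual library: the "helper-binary" name, and the assumption that the helper echoes one response line per request line, are placeholders.

// Hypothetical sketch of the spawn-and-pipe pattern described above: start a short-lived
// worker, exchange a handful of requests over its stdin/stdout pipes, then reap it.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class PipedWorkerSketch {
    public static void main(String[] args) throws Exception {
        // "helper-binary" stands in for whatever the third-party library actually execs.
        Process worker = new ProcessBuilder("helper-binary")
                .redirectErrorStream(true)
                .start();

        try (BufferedWriter toWorker = new BufferedWriter(
                     new OutputStreamWriter(worker.getOutputStream(), StandardCharsets.UTF_8));
             BufferedReader fromWorker = new BufferedReader(
                     new InputStreamReader(worker.getInputStream(), StandardCharsets.UTF_8))) {
            for (int i = 0; i < 5; i++) {           // "a handful of requests"
                toWorker.write("request " + i + "\n");
                toWorker.flush();
                System.out.println(fromWorker.readLine()); // assumes one reply line per request
            }
        }
        worker.waitFor();                            // reap the child process
    }
}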
We noticed that some time after the process starts, once its RES (in top) reaches about 70 GB, CPU interrupts increase significantly and the JVM's GC logs show sys time shooting up to tens of seconds (sometimes 70 seconds). Load averages, which start out below 1, end up at almost 10 on these 48-core nodes in this state.
sar output indicates that when a node is in this state, minor page faults increase significantly. Broadly, a high number of CPU interrupts correlates with this state.
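To line the fault spikes up with GC pauses, something like the rough sketch below can run alongside the service. It reads minflt/majflt from /proc/self/stat (fields 10 and 12 per proc(5)); this is a per-process approximation of what sar -B reports system-wide, and whether it fully accounts for all threads is an assumption here.

// Sketch: sample the JVM's own minor/major fault counters once per second so the
// deltas can be correlated with GC log timestamps.
import java.nio.file.Files;
import java.nio.file.Paths;

public class FaultSampler {
    public static void main(String[] args) throws Exception {
        long prevMin = 0, prevMaj = 0;
        while (true) {
            String stat = new String(Files.readAllBytes(Paths.get("/proc/self/stat")));
            // Parse after the ')' closing the comm field so names with spaces don't shift fields.
            String[] f = stat.substring(stat.lastIndexOf(')') + 2).split(" ");
            // The remainder starts at field 3 (state), so minflt/majflt sit at offsets 7 and 9.
            long minflt = Long.parseLong(f[7]);
            long majflt = Long.parseLong(f[9]);
            System.out.printf("%d minflt/s=%d majflt/s=%d%n",
                    System.currentTimeMillis() / 1000, minflt - prevMin, majflt - prevMaj);
            prevMin = minflt;
            prevMaj = majflt;
            Thread.sleep(1000);
        }
    }
}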
Restarting our service provides only a temporary respite. Load averages slowly but surely spike up and GC sys times go through the roof again.
We run our service on a cluster of about 10 nodes with load distributed (almost) equally. We see some nodes get into this state more often and more quickly than others, which behave normally.
We have tried various GC options, as well as settings such as large pages and THP, with no luck.
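Since THP is one of the knobs we touched, we now record which THP mode is actually in effect when a node degrades; the bracketed entry in each sysfs file below is the active setting. A minimal sketch, assuming the standard sysfs paths on Ubuntu 16.04:

// Print the active THP "enabled" and "defrag" modes, e.g. "always [madvise] never".
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ThpStatus {
    public static void main(String[] args) throws Exception {
        for (String file : new String[] {
                "/sys/kernel/mm/transparent_hugepage/enabled",
                "/sys/kernel/mm/transparent_hugepage/defrag"}) {
            List<String> lines = Files.readAllLines(Paths.get(file));
            System.out.println(file + ": " + lines.get(0));
        }
    }
}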
Here are snapshots of /proc/meminfo from a node in the bad state and from one that is behaving normally:
/proc/meminfo on a node with high load average:

MemTotal:        193834132 kB
MemFree:          21391860 kB
MemAvailable:     52217676 kB
Buffers:            221760 kB
Cached:            9983452 kB
SwapCached:              0 kB
Active:          144240208 kB
Inactive:          4235732 kB
Active(anon):    138274336 kB
Inactive(anon):      24772 kB
Active(file):      5965872 kB
Inactive(file):    4210960 kB
Unevictable:          3652 kB
Mlocked:              3652 kB
SwapTotal:               0 kB
SwapFree:                0 kB
Dirty:               89140 kB
Writeback:               4 kB
AnonPages:       138292556 kB
Mapped:             185656 kB
Shmem:               25480 kB
Slab:             22590684 kB
SReclaimable:     21680388 kB
SUnreclaim:         910296 kB
KernelStack:         56832 kB
PageTables:         611304 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:      96917064 kB
Committed_AS:    436086620 kB
VmallocTotal:  34359738367 kB
VmallocUsed:             0 kB
VmallocChunk:            0 kB
HardwareCorrupted:       0 kB
AnonHugePages:    85121024 kB
CmaTotal:                0 kB
CmaFree:                 0 kB
HugePages_Total:         0
HugePages_Free:          0
HugePages_Rsvd:          0
HugePages_Surp:          0
Hugepagesize:         2048 kB
DirectMap4k:        212960 kB
DirectMap2M:      33210368 kB
DirectMap1G:     163577856 kB

/proc/meminfo on a node that is behaving OK:

MemTotal:        193834132 kB
MemFree:          22509496 kB
MemAvailable:     45958676 kB
Buffers:            179576 kB
Cached:            6958204 kB
SwapCached:              0 kB
Active:          150349632 kB
Inactive:          2268852 kB
Active(anon):    145485744 kB
Inactive(anon):       8384 kB
Active(file):      4863888 kB
Inactive(file):    2260468 kB
Unevictable:          3652 kB
Mlocked:              3652 kB
SwapTotal:               0 kB
SwapFree:                0 kB
Dirty:             1519448 kB
Writeback:               0 kB
AnonPages:       145564840 kB
Mapped:             172080 kB
Shmem:                9056 kB
Slab:             17642908 kB
SReclaimable:     17356228 kB
SUnreclaim:         286680 kB
KernelStack:         52944 kB
PageTables:         302344 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:      96917064 kB
Committed_AS:    148479160 kB
VmallocTotal:  34359738367 kB
VmallocUsed:             0 kB
VmallocChunk:            0 kB
HardwareCorrupted:       0 kB
AnonHugePages:   142260224 kB
CmaTotal:                0 kB
CmaFree:                 0 kB
HugePages_Total:         0
HugePages_Free:          0
HugePages_Rsvd:          0
HugePages_Surp:          0
Hugepagesize:         2048 kB
DirectMap4k:        149472 kB
DirectMap2M:      20690944 kB
DirectMap1G:     176160768 kB

The most significant chunk of the flame graph is:
https://i.sstatic.net/yXmOM.png
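Comparing the two dumps: Committed_AS is roughly three times higher on the degraded node (436086620 kB vs 148479160 kB, against a CommitLimit of 96917064 kB, with no swap), PageTables is about double, SUnreclaim is about three times larger, and AnonHugePages is actually lower. We now sample these fields over time; a minimal sketch:

// Sketch: log the /proc/meminfo fields that differ most between the two dumps above
// (Committed_AS, PageTables, SUnreclaim, AnonHugePages) once per minute, so their
// growth can be tracked from service start until a node degrades.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class MeminfoWatch {
    private static final List<String> FIELDS =
            Arrays.asList("Committed_AS", "PageTables", "SUnreclaim", "AnonHugePages");

    public static void main(String[] args) throws IOException, InterruptedException {
        while (true) {
            StringBuilder line = new StringBuilder(Long.toString(System.currentTimeMillis() / 1000));
            for (String row : Files.readAllLines(Paths.get("/proc/meminfo"))) {
                String name = row.substring(0, row.indexOf(':'));
                if (FIELDS.contains(name)) {
                    line.append("  ").append(row.replaceAll("\\s+", " "));
                }
            }
            System.out.println(line);
            Thread.sleep(60_000);
        }
    }
}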
By chance we rebooted one node and noticed that it then ran very stably for about two weeks with no other changes. Since then we've resorted to rebooting nodes that hit this state to get some breathing room. We later read elsewhere that these symptoms could be related to the page tables getting wedged, which can only be mitigated by a reboot. It is not clear whether that is correct, or whether it is the cause of our situation.
Is there a way to resolve this issue permanently?
Check /proc/meminfo in the problem state. Also try visualizing what is on CPU to see where the problem is. Consider collecting both Java and kernel profiling data, then making mixed-mode flame graphs: medium.com/netflix-techblog/java-in-flames-e763b3d32166