Our service runs on AWS on m5.12xlarge nodes (48 cores, 192 GB RAM) running Ubuntu 16.04, with Java 8. We allocate about 150 GB as the max heap size, and the nodes have no swap. The nature of our service is that it allocates a lot of large, short-lived objects. In addition, through a third-party library we depend on, we create a lot of short-lived processes that communicate with our process via pipes and are reaped after serving a handful of requests.
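For context, the pattern looks roughly like the sketch below. This is only an illustration, not the actual library: the "helper-binary" name, and the assumption that the helper echoes one response line per request line, are placeholders.

// Hypothetical sketch of the spawn-and-pipe pattern described above: start a short-lived
// worker, exchange a handful of requests over its stdin/stdout pipes, then reap it.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class PipedWorkerSketch {
    public static void main(String[] args) throws Exception {
        // "helper-binary" stands in for whatever the third-party library actually execs.
        Process worker = new ProcessBuilder("helper-binary")
                .redirectErrorStream(true)
                .start();

        try (BufferedWriter toWorker = new BufferedWriter(
                     new OutputStreamWriter(worker.getOutputStream(), StandardCharsets.UTF_8));
             BufferedReader fromWorker = new BufferedReader(
                     new InputStreamReader(worker.getInputStream(), StandardCharsets.UTF_8))) {
            for (int i = 0; i < 5; i++) {           // "a handful of requests"
                toWorker.write("request " + i + "\n");
                toWorker.flush();
                System.out.println(fromWorker.readLine()); // assumes one reply line per request
            }
        }
        worker.waitFor();                            // reap the child process
    }
}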
We noticed that some time after the process starts, once its RES (in top) reaches about 70 GB, CPU interrupts increase significantly and the JVM's GC logs show sys time shooting up to tens of seconds (sometimes 70 seconds). Load averages, which start out below 1, end up at almost 10 on these 48-core nodes in this state.
sar output indicates that when a node is in this state, minor page faults increase significantly. Broadly, a high number of CPU interrupts correlates with this state.
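To line the fault spikes up with GC pauses, something like the rough sketch below can run alongside the service. It reads minflt/majflt from /proc/self/stat (fields 10 and 12 per proc(5)); this is a per-process approximation of what sar -B reports system-wide, and whether it fully accounts for all threads is an assumption here.

// Sketch: sample the JVM's own minor/major fault counters once per second so the
// deltas can be correlated with GC log timestamps.
import java.nio.file.Files;
import java.nio.file.Paths;

public class FaultSampler {
    public static void main(String[] args) throws Exception {
        long prevMin = 0, prevMaj = 0;
        while (true) {
            String stat = new String(Files.readAllBytes(Paths.get("/proc/self/stat")));
            // Parse after the ')' closing the comm field so names with spaces don't shift fields.
            String[] f = stat.substring(stat.lastIndexOf(')') + 2).split(" ");
            // The remainder starts at field 3 (state), so minflt/majflt sit at offsets 7 and 9.
            long minflt = Long.parseLong(f[7]);
            long majflt = Long.parseLong(f[9]);
            System.out.printf("%d minflt/s=%d majflt/s=%d%n",
                    System.currentTimeMillis() / 1000, minflt - prevMin, majflt - prevMaj);
            prevMin = minflt;
            prevMaj = majflt;
            Thread.sleep(1000);
        }
    }
}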
Restarting our service provides only a temporary respite. Load averages slowly but surely spike up and GC sys times go through the roof again.
We run our service on a cluster of about 10 nodes with load distributed (almost) equally. We see some nodes get into this state more often and more quickly than others, which behave normally.
We have tried various GC options, as well as settings such as large pages and THP, with no luck.
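Since THP is one of the knobs we touched, we now record which THP mode is actually in effect when a node degrades; the bracketed entry in each sysfs file below is the active setting. A minimal sketch, assuming the standard sysfs paths on Ubuntu 16.04:

// Print the active THP "enabled" and "defrag" modes, e.g. "always [madvise] never".
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ThpStatus {
    public static void main(String[] args) throws Exception {
        for (String file : new String[] {
                "/sys/kernel/mm/transparent_hugepage/enabled",
                "/sys/kernel/mm/transparent_hugepage/defrag"}) {
            List<String> lines = Files.readAllLines(Paths.get(file));
            System.out.println(file + ": " + lines.get(0));
        }
    }
}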
Here are snapshots of /proc/meminfo from a node in the bad state and from one that is behaving normally:
/proc/meminfo on a node with high load average:

MemTotal:        193834132 kB
MemFree:          21391860 kB
MemAvailable:     52217676 kB
Buffers:            221760 kB
Cached:            9983452 kB
SwapCached:              0 kB
Active:          144240208 kB
Inactive:          4235732 kB
Active(anon):    138274336 kB
Inactive(anon):      24772 kB
Active(file):      5965872 kB
Inactive(file):    4210960 kB
Unevictable:          3652 kB
Mlocked:              3652 kB
SwapTotal:               0 kB
SwapFree:                0 kB
Dirty:               89140 kB
Writeback:               4 kB
AnonPages:       138292556 kB
Mapped:             185656 kB
Shmem:               25480 kB
Slab:             22590684 kB
SReclaimable:     21680388 kB
SUnreclaim:         910296 kB
KernelStack:         56832 kB
PageTables:         611304 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:      96917064 kB
Committed_AS:    436086620 kB
VmallocTotal:  34359738367 kB
VmallocUsed:             0 kB
VmallocChunk:            0 kB
HardwareCorrupted:       0 kB
AnonHugePages:    85121024 kB
CmaTotal:                0 kB
CmaFree:                 0 kB
HugePages_Total:         0
HugePages_Free:          0
HugePages_Rsvd:          0
HugePages_Surp:          0
Hugepagesize:         2048 kB
DirectMap4k:        212960 kB
DirectMap2M:      33210368 kB
DirectMap1G:     163577856 kB

/proc/meminfo on a node that is behaving OK:

MemTotal:        193834132 kB
MemFree:          22509496 kB
MemAvailable:     45958676 kB
Buffers:            179576 kB
Cached:            6958204 kB
SwapCached:              0 kB
Active:          150349632 kB
Inactive:          2268852 kB
Active(anon):    145485744 kB
Inactive(anon):       8384 kB
Active(file):      4863888 kB
Inactive(file):    2260468 kB
Unevictable:          3652 kB
Mlocked:              3652 kB
SwapTotal:               0 kB
SwapFree:                0 kB
Dirty:             1519448 kB
Writeback:               0 kB
AnonPages:       145564840 kB
Mapped:             172080 kB
Shmem:                9056 kB
Slab:             17642908 kB
SReclaimable:     17356228 kB
SUnreclaim:         286680 kB
KernelStack:         52944 kB
PageTables:         302344 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:      96917064 kB
Committed_AS:    148479160 kB
VmallocTotal:  34359738367 kB
VmallocUsed:             0 kB
VmallocChunk:            0 kB
HardwareCorrupted:       0 kB
AnonHugePages:   142260224 kB
CmaTotal:                0 kB
CmaFree:                 0 kB
HugePages_Total:         0
HugePages_Free:          0
HugePages_Rsvd:          0
HugePages_Surp:          0
Hugepagesize:         2048 kB
DirectMap4k:        149472 kB
DirectMap2M:      20690944 kB
DirectMap1G:     176160768 kB

The most significant chunk of the flame graph is:
https://i.sstatic.net/yXmOM.png
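Comparing the two dumps: Committed_AS is roughly three times higher on the degraded node (436086620 kB vs 148479160 kB, against a CommitLimit of 96917064 kB, with no swap), PageTables is about double, SUnreclaim is about three times larger, and AnonHugePages is actually lower. We now sample these fields over time; a minimal sketch:

// Sketch: log the /proc/meminfo fields that differ most between the two dumps above
// (Committed_AS, PageTables, SUnreclaim, AnonHugePages) once per minute, so their
// growth can be tracked from service start until a node degrades.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class MeminfoWatch {
    private static final List<String> FIELDS =
            Arrays.asList("Committed_AS", "PageTables", "SUnreclaim", "AnonHugePages");

    public static void main(String[] args) throws IOException, InterruptedException {
        while (true) {
            StringBuilder line = new StringBuilder(Long.toString(System.currentTimeMillis() / 1000));
            for (String row : Files.readAllLines(Paths.get("/proc/meminfo"))) {
                String name = row.substring(0, row.indexOf(':'));
                if (FIELDS.contains(name)) {
                    line.append("  ").append(row.replaceAll("\\s+", " "));
                }
            }
            System.out.println(line);
            Thread.sleep(60_000);
        }
    }
}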
By chance we rebooted one node and noticed that it then ran very stably for about two weeks with no other changes. Since then we've resorted to rebooting nodes that hit this state to get some breathing room. We later read elsewhere that these symptoms could be related to the page tables getting wedged, which can only be mitigated by a reboot. It is not clear whether that is correct, or whether it is the cause of our situation.
Is there a way to resolve this issue permanently?
Check /proc/meminfo in the problem state. Also try visualizing what is on CPU to see where the problem is. Consider collecting both Java and kernel profiling data, then making mixed-mode flame graphs: medium.com/netflix-techblog/java-in-flames-e763b3d32166