
We have an InfluxDB VM whose swap is constantly at 100% usage. Even if we restart the VM, swap usage reaches 100% again in about 20 minutes, yet memory usage is only about 50%. (The VM has 32 CPU cores and 128 GB of memory.)

Running free -h:

              total        used        free      shared  buff/cache   available
Mem:          123Gi        70Gi       567Mi       551Mi        52Gi        59Gi
Swap:           9Gi         9Gi          0B

This shows that we still have at least 59 GiB of available memory, yet 100% of the swap is used.

If we run atop, we see that the disk is 100% busy (both the swap and disk lines are red):

SWP | tot 10.0G | | free 0.0M | swcac 505.9M
DSK | nvme2n1 | busy 100% | read 33115 | write 527 | discrd 0 | KiB/r 19 | KiB/w 173 | | KiB/d 0 | MBr/s 63.3 | MBw/s 8.9 | avq 88.19 | avio 0.30 ms

My guess is that this is caused by the constant inflow of data events... (but then why are the reads so high?)

Memory and I/O pressure from PSI:

cat /proc/pressure/memory
some avg10=32.65 avg60=32.74 avg300=31.25 total=35534063966
full avg10=32.25 avg60=32.34 avg300=30.87 total=35182532561

cat /proc/pressure/io
some avg10=84.83 avg60=78.83 avg300=78.96 total=70337558807
full avg10=84.38 avg60=78.05 avg300=78.08 total=69619870053

Memory pressure doesn't seem that high, but I/O pressure is.
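
To see whether this pressure is sustained rather than a spike, PSI can be sampled repeatedly instead of once; a minimal sketch using only standard tools:

watch -n 5 'grep -H . /proc/pressure/memory /proc/pressure/io'   # print both PSI files with their names every 5 seconds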

Running iotop, it is clear that the disk activity comes from influxdb:

 4272 be/3 root     0.00 B/s   94.47 K/s ?unavailable? [jbd2/nvme2n1p1-8]
36921 be/2 vcap  1169.95 K/s    0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
36927 be/2 vcap   323.37 K/s    0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
36928 be/2 vcap  2038.33 K/s    0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
36941 be/2 vcap  1936.59 K/s    0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37020 be/2 vcap   385.14 K/s    0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
. . . (lots of influxd threads)
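
To confirm over time that influxd (and not jbd2 or something else) is driving the reads, per-process I/O can also be sampled; a sketch assuming the sysstat package is installed:

pidstat -d 10 6                    # kB_rd/s and kB_wr/s per process, 6 samples of 10 seconds
cat /proc/$(pidof -s influxd)/io   # cumulative read_bytes / write_bytes of one influxd process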

SAR output

sar -d 10 6
Linux 6.2.0-39-generic (ac2f95dd-14d9-4eed-8e2f-060615e24dce)  03/24/2024  _x86_64_  (32 CPU)

06:45:57 AM       DEV      tps     rkB/s     wkB/s   dkB/s  areq-sz   aqu-sz   await   %util
06:46:07 AM   nvme1n1     0.30     12.80      1.60    0.00    48.00     0.00    1.33    0.12
06:46:07 AM   nvme0n1     0.30      0.00      3.20    0.00    10.67     0.00    1.00    0.12
06:46:07 AM   nvme2n1  3420.80  67438.40   3687.20    0.00    20.79   106.47   31.13  100.00

06:46:07 AM       DEV      tps     rkB/s     wkB/s   dkB/s  areq-sz   aqu-sz   await   %util
06:46:17 AM   nvme1n1     1.00      0.00      9.20    0.00     9.20     0.00    0.90    0.16
06:46:17 AM   nvme0n1     0.90     16.00      9.60    0.00    28.44     0.00    0.67    0.20
06:46:17 AM   nvme2n1  3404.80  68434.40   7868.00    0.00    22.41   102.23   30.03  100.00

06:46:17 AM       DEV      tps     rkB/s     wkB/s   dkB/s  areq-sz   aqu-sz   await   %util
06:46:27 AM   nvme1n1     9.70     26.40     20.40    0.00     4.82     0.02    1.69    1.24
06:46:27 AM   nvme0n1     0.30      0.00      4.40    0.00    14.67     0.00    0.67    0.08
06:46:27 AM   nvme2n1  3215.40  46037.20  12006.40    0.00    18.05    66.12   20.56  100.00
^C
Average:          DEV      tps     rkB/s     wkB/s   dkB/s  areq-sz   aqu-sz   await   %util
Average:      nvme1n1     3.67     13.07     10.40    0.00     6.40     0.01    1.61    0.51
Average:      nvme0n1     0.50      5.33      5.73    0.00    22.13     0.00    0.73    0.13
Average:      nvme2n1  3347.00  60636.67   7853.87    0.00    20.46    91.61   27.37  100.00

Queries running in InfluxDB:

It seems like this swap issue occurs even when no queries are running:

> show queries
qid query        database duration status
--- -----        -------- -------- ------
265 SHOW QUERIES metrics  53µs     running
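
If the reads are not query-driven, InfluxDB 1.x also exposes internal counters (compactions, cache, TSM/TSI activity) that might help attribute them; a rough sketch with the influx CLI, since the exact statistic names vary by version:

influx -execute 'SHOW STATS'
influx -execute 'SHOW DIAGNOSTICS'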

vmstat output:

vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b     swpd    free   buff     cache  si  so     bi   bo   in   cs us sy id wa st
 0 32 10485756  541300   8784 108148928  11 140   3563  194   76  217  1  1 58 40  0
 0 32 10485756  638500   8764 108060800   0   0 128216   60 5181 3351  0  1 59 40  0
 1 31 10485756  505964   8780 108189872   0   0 128252  256 5077 3769  0  1 54 45  0
 0 32 10485756  663736   8744 108035424   0   0 128332    0 5047 3327  0  1 50 50  0
 0 32 10485756  536476   8752 108164376   0   0 127776   24 4087 3335  0  0 53 46  0

/proc/meminfo is

MemTotal:       129202084 kB
MemFree:           486060 kB
MemAvailable:    71279440 kB
Buffers:            24116 kB
Cached:          59442056 kB
SwapCached:        489676 kB
Active:          51318648 kB
Inactive:        75364416 kB
Active(anon):    27646572 kB
Inactive(anon):  28055976 kB
Active(file):    23672076 kB
Inactive(file):  47308440 kB
Unevictable:           24 kB
Mlocked:               24 kB
SwapTotal:       10485756 kB
SwapFree:               4 kB
Zswap:                  0 kB
Zswapped:               0 kB
Dirty:             102236 kB
Writeback:           6156 kB
AnonPages:       66728116 kB
Mapped:          43055064 kB
Shmem:             127816 kB
KReclaimable:      855024 kB
Slab:              971400 kB
SReclaimable:      855024 kB
SUnreclaim:        116376 kB
KernelStack:        10976 kB
PageTables:        747920 kB
SecPageTables:          0 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:     75086796 kB
Committed_AS:    95698296 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       151392 kB
VmallocChunk:           0 kB
Percpu:             17920 kB
HardwareCorrupted:      0 kB
AnonHugePages:    7997440 kB
ShmemHugePages:         0 kB
ShmemPmdMapped:         0 kB
FileHugePages:          0 kB
FilePmdMapped:          0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:                0 kB
DirectMap4k:       202656 kB
DirectMap2M:      6404096 kB
DirectMap1G:    124780544 kB
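
For this investigation the interesting fields are how much of Cached is file-backed mappings versus anonymous memory; they can be pulled out in one go:

grep -E 'MemAvailable|^Cached|SwapCached|Mapped|AnonPages|Active|Inactive' /proc/meminfo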

I am also adding some excerpts from the pmap -x output:

Address           Kbytes     RSS   Dirty Mode  Mapping
0000000000400000   15232    3684       0 r-x-- influxd
00000000012e0000   31428    6552       0 r---- influxd
0000000003191000    4668    4380     396 rw--- influxd
0000000003620000     180      92      92 rw--- [ anon ]
0000000004436000     132       0       0 rw--- [ anon ]
000000c000000000   16384    9864    9864 rw--- [ anon ]
000000c001000000   47104   28172   28172 rw--- [ anon ]
000000c003e00000    6144    5016    5016 rw--- [ anon ]
000000c004400000    2048    1616    1616 rw--- [ anon ]
000000c004600000    2048    1620    1620 rw--- [ anon ]
. . .
000000c033a00000  155648  120028  120028 rw--- [ anon ]
000000c03d200000    8192    8192    8192 rw--- [ anon ]
000000c03da00000  114688   92768   92768 rw--- [ anon ]
. . .
000000c07d000000  270336  234948  234948 rw--- [ anon ]
. . .
000000cecc000000  176128  174080  174080 rw--- [ anon ]
. . .
000000ced8e00000    2048    2048    2048 rw--- [ anon ]
000000ced9000000  137216  135168  135168 rw--- [ anon ]
. . . (towards the lower addresses)
00007fa61fdef000    2116    2044    2044 rw--- [ anon ]
00007fa620000000    9664       0       0 r--s- L3-00000023.tsi
00007fa620a00000   40048       0       0 r--s- L5-00000032.tsi
00007fa623200000   40212       0       0 r--s- L5-00000032.tsi
. . .
00007fa6a2c00000    9772       0       0 r--s- L3-00000023.tsi
00007fa6a3600000 2098160       0       0 r--s- 000024596-000000002.tsm
00007fa723800000    9920       0       0 r--s- L3-00000023.tsi
00007fa724200000  615764       0       0 r--s- 000024596-000000005.tsm
00007fa749c00000 2100756       0       0 r--s- 000024596-000000004.tsm
00007fa7ca000000    9768       0       0 r--s- L3-00000023.tsi
. . .
00007fce82403000   28660    5412    5412 rw--- [ anon ]
00007fce84000000 4194308 2575504       0 r--s- index
00007fcf84001000       4       0       0 r--s- L0-00000001.tsl
00007fcf84002000       4       0       0 r--s- L0-00000001.tsl
00007fcf84003000       4       0       0 r--s- L0-00000001.tsl
. . .
00007fcfc48f7000    1060       0       0 r--s- L0-00000002.tsl
00007fcfc4a00000  262144   35444       0 r--s- 0046
00007fcfd4a00000    2048    1988    1988 rw--- [ anon ]
00007fcfd4c00000  262144   35948       0 r--s- 0045
. . .
00007fd055a00000       4       0       0 r--s- L0-00000001.tsl
00007fd055a01000       4       0       0 r--s- L0-00000001.tsl
00007fd055a02000       4       0       0 r--s- L0-00000001.tsl
. . .
00007fd065c0f000     960     924     924 rw--- [ anon ]
00007fd065cff000    1028       0       0 r--s- L0-00000005.tsl
00007fd065e00000  262144   31952       0 r--s- 003c
. . .
00007fda27fee000    8192       8       8 rw--- [ anon ]
00007fda287ee000       4       0       0 ----- [ anon ]
00007fda287ef000   43076    1164    1164 rw--- [ anon ]
00007fda2b200000     160     160       0 r---- libc.so.6
00007fda2b228000    1620     780       0 r-x-- libc.so.6
00007fda2b3bd000     352      64       0 r---- libc.so.6
00007fda2b415000      16       0       0 r---- libc.so.6
00007fda2b419000       8       0       0 rw--- libc.so.6
00007fda2b41b000      52       0       0 rw--- [ anon ]
00007fda2b428000       4       0       0 r--s- L0-00000001.tsl
00007fda2b429000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42a000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42b000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42c000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42d000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42e000     452     452     452 rw--- [ anon ]
00007fda2b49f000      16       0       0 r--s- L0-00000018.tsl
00007fda2b4af000     268     112     112 rw--- [ anon ]
00007fda2b4f2000       4       0       0 r---- libpthread.so.0
00007fda2b4f3000       4       0       0 r-x-- libpthread.so.0
00007fda2b4f4000       4       0       0 r---- libpthread.so.0
00007fda2b4f5000       4       0       0 r---- libpthread.so.0
00007fda2b4f6000       4       0       0 rw--- libpthread.so.0
00007fda2b4f7000       4       0       0 r--s- L0-00000001.tsl
00007fda2b4f8000       8       0       0 r--s- L0-00000001.tsl
00007fda2b4fa000       4       0       0 r--s- L0-00000001.tsl
00007fda2b4fb000       8       0       0 rw--- [ anon ]
00007fda2b4fd000       8       8       0 r---- ld-linux-x86-64.so.2
00007fda2b4ff000     168     168       0 r-x-- ld-linux-x86-64.so.2
00007fda2b529000      44      40       0 r---- ld-linux-x86-64.so.2
00007fda2b534000       4       0       0 r--s- L0-00000001.tsl
00007fda2b535000       8       0       0 r---- ld-linux-x86-64.so.2
00007fda2b537000       8       0       0 rw--- ld-linux-x86-64.so.2
00007fff74913000     132      12      12 rw--- [ stack ]
00007fff7499b000      16       0       0 r---- [ anon ]
00007fff7499f000       8       4       0 r-x-- [ anon ]
ffffffffff600000       4       0       0 --x-- [ anon ]
----------------  ------- ------- -------
total kB         534464172 112696540 74590512
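
To see how much of the influxd address space is memory-mapped TSM/TSI data versus anonymous (heap) memory, the pmap output can be summed; a rough sketch assuming a single influxd process (the file-name patterns are just the ones visible above):

pmap -x $(pidof -s influxd) | awk '/\.tsm|\.tsi|\.tsl|r--s-/ {mapped += $2} /rw---/ {anon += $2} END {printf "mapped: %.1f GB, writable/anon: %.1f GB\n", mapped/1048576, anon/1048576}'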

The series cardinality is 252390866. (So is the VM size inadequate?)
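
For reference, the cardinality can be (re-)checked with the InfluxQL cardinality statements in 1.8; the estimated variant is much cheaper than the exact one ('metrics' is the database shown above):

influx -execute 'SHOW SERIES CARDINALITY ON metrics'
influx -execute 'SHOW SERIES EXACT CARDINALITY ON metrics'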

VM details:

  • InfluxDB: 1.8.10
  • CPU count: 32
  • Memory: 128 GB
  • Disk: 1 TB (only 50% used)
  • AWS instance type: m6a.8xlarge (32 vCPU, 128 GB memory); EBS bandwidth is 10 Gbps per https://aws.amazon.com/ec2/instance-types/m6a/
  • Linux version: Linux 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The swappiness of the VM is 60 (the default). (What does this mean? Initially I thought it was a percentage, but apparently it's an absolute number?)
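
For completeness, this is how the value is read and changed at runtime (persisting it would go through /etc/sysctl.conf or a sysctl.d drop-in); the value 10 below is only an example, not a recommendation:

sysctl vm.swappiness            # read the current value
sudo sysctl -w vm.swappiness=10 # change it for the running system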

How do we debug this disk usage and determine whether we have hit the IOPS limit? And what is causing so many reads rather than writes?
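
One way to keep watching device saturation is extended iostat output (sysstat assumed installed); sustained %util near 100 with a large aqu-sz and rising await, as in the sar output above, points at the device or the EBS volume limits, and the volume's CloudWatch metrics should show whether the provisioned IOPS/throughput are being hit:

iostat -dxm 10 nvme2n1   # extended per-device stats in MB/s, every 10 seconds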

Update: the VM was resized to 2x the memory.

Observations

meminfo:

MemFree:          9436328 kB
MemAvailable:   246346788 kB
Buffers:           829708 kB
Cached:         171495864 kB
SwapCached:        124960 kB
Active:          78087852 kB
Inactive:       167324320 kB
Active(anon):     6396424 kB
Inactive(anon):   2389588 kB
Active(file):    71691428 kB
Inactive(file): 164934732 kB

vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd     free    buff     cache  si  so  bi  bo in cs us sy id wa st
 3  0 2379520 10251664  835112 172756112   1   2 196 596  7  4  2  0 93  5  0

Disk busy in atop has dropped significantly, to 20%:

DSK | nvme2n1 | busy 20% | read 51 | write 2103 | discrd 0 | KiB/r 18 | KiB/w 165 | | KiB/d 0 | MBr/s 0.1 | MBw/s 34.0 | avq 13.95 | avio 0.94 ms 
  • Show vmstat output. Commented Mar 24, 2024 at 8:39
  • You have a huge amount of memory, a small swap, and a fairly aggressive (use-more-swap) swappiness? IMHO this is what you can expect. Commented Mar 24, 2024 at 9:03
  • adding vmstat output Commented Mar 24, 2024 at 9:32
  • @RomeoNinov swappiness=60 isn't aggressive. It gives more than double the weight to discarding file pages than to swapping anon pages. The formula is ((200-60)/60) = 140/60. Commented Mar 24, 2024 at 9:41
  • vm.swappiness = 60 is the default value for Debian installations. I would suggest, as I always do, vm.swappiness = 1 so that swapping starts only when ~99% of memory is used; 0 would mostly disable swapping. Commented Mar 24, 2024 at 10:07

1 Answer


You don't have enough memory.

The OS has swapped out whatever unused memory pages it could, and there is zero swap activity (the si/so columns in vmstat), yet memory and I/O pressure are still high.

You can't rely on the free output in your case: InfluxDB memory-maps its data, and memory-mapped pages are counted as Cached/Available, not as Used. Under memory pressure these memory-mapped pages are discarded, and InfluxDB has to read them back from disk when it needs them again.

Since your data set is 409 GB but only 52 GB is available for memory-mapped files, it is possible that your active data set is larger than those 52 GB. InfluxDB then gets into a cycle similar to swap thrashing: it needs to access a memory-mapped page that isn't in memory, so it reads it back from disk and at the same time discards another page because there is no room for the current one, and this keeps read I/O high. It doesn't, however, explain high read I/O when no queries are running - you need to check whether you actually see high read I/O in that case.

If my guess is correct, you should see a large value of Mapped in /proc/meminfo and large total values in the pmap output for the InfluxDB processes.
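
A quick way to check both at once (assuming a single influxd process):

grep -E 'Mapped|^Cached' /proc/meminfo
pmap -x $(pidof -s influxd) | tail -n 2   # the last line is the per-process total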

Possible mitigations:

  • tune InfluxDB to reduce its memory usage if possible
  • add memory
  • add swap and increase vm.swappiness (up to 200) to avoid discarding memory-mapped pages, but watch the si/so columns in vmstat and keep them at zero (a rough sketch of this follows below).
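
A rough sketch of the last option, assuming a plain swap file on the root filesystem is acceptable (path and size are only examples):

sudo fallocate -l 5G /swapfile2   # or: dd if=/dev/zero of=/swapfile2 bs=1M count=5120
sudo chmod 600 /swapfile2
sudo mkswap /swapfile2
sudo swapon /swapfile2
sudo sysctl -w vm.swappiness=100  # raise gradually (100, 150, 200) while watching si/so
vmstat 10                         # si and so should stay near zero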

Note about vm.swappiness. It is a common misconception that vm.swappiness is the percentage of used memory at which swapping starts. Per the documentation, it is "the rough relative IO cost of swapping and filesystem paging, as a value between 0 and 200". With the default value of 60, this means that if the kernel needs to free 200 pages, it will discard 140 file pages from the page-cache pool (Cached in free) and swap out 60 pages from the anon-pages pool (Used in free). With a value of 100 it would discard/swap equally from both pools. These proportions are ignored if one pool does not have enough pages or if free memory is too low.
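
As a quick illustration of how the file:anon reclaim weighting shifts with the setting (plain shell arithmetic, nothing InfluxDB-specific):

for s in 1 60 100 200; do echo "swappiness=$s -> file:anon weight = $((200 - s)):$s"; done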

P.S. I don't know anything about InfluxDB, so it is treated as a black box here. It could be something internal to InfluxDB that forces it to read all the data. You may find better answers on the InfluxDB support forums, but the fact that you are low on memory in the current configuration still stands.

UPDATE: The additional info from /proc/meminfo shows what I expected - 43 GB of Mapped memory out of 59 GB Cached. At the same time, it shows a lot of Inactive memory.

Inactive:        75364416 kB
Active(anon):    27646572 kB
Inactive(anon):  28055976 kB
Active(file):    23672076 kB
Inactive(file):  47308440 kB

28 GB of Inactive(anon) is potentially swappable. I would add 5 GB of swap and check whether it fills up to 100%. If it does and there is no significant swap activity (si/so), add another 5 GB of swap. If it doesn't fill up to 100%, increase vm.swappiness to 100, 150, 200 while checking si/so. As long as si/so stays close to zero, a swap increase should be a safe performance improvement, since it frees memory for a more useful page cache.

On the other hand, 47 GB of Inactive(file) doesn't look good. It means that 2/3 of the page cache is mostly missed and the queries are scattered across the whole 400 GB data set. Saving 10-20 GB by increasing the swap probably won't reduce the I/O load significantly, but it is still worth a try.

  • Hi @AlexD. First of all, thank you for your quick help and such a detailed answer. I have been googling and posting questions on the Influx forum about this for quite a while, and it is the first time somebody has taken the time to explain the issue so thoroughly. I'm happy to read this answer as there is a lot of learning for me here personally, so thank you for that as well :) The fact that for memory-mapped processes we look at Cached is new to me and something I will look into more closely. I'll also look into pmap. I will try to optimize InfluxDB and increase swappiness. Commented Mar 24, 2024 at 15:41
  • Added meminfo & some parts of the pmap as well Commented Mar 24, 2024 at 16:05
  • This is an excellent answer. I believe that in a fully pressured environment, anonymous pages in the split LRU are typically 50/50 active/inactive. This actually indicates high-pressure memory activity (per PSI) as pages get pushed down the LRU. I assume the anonymous mappings relate to some 'shared buffer' thing in InfluxDB. I'd also look at what queries are being run. It's likely some query scans the entire history of a table, which forces all the evictions/reactivations of old pages. Commented Mar 24, 2024 at 18:57
  • One thing I did today was double the VM's memory. (I will try the swap experiment as well.) After increasing the VM size we see the following: Mem: 246Gi total, 71Gi used, 9.8Gi free, 285Mi shared, 165Gi buff/cache, 234Gi available; Swap: 9Gi total, 2.3Gi used, 7.7Gi free. Almost 165 GiB is used for buff/cache, and si/so is ~1-2 in vmstat. Commented Mar 25, 2024 at 14:06
  • I'll post my findings on changing the swap values. Commented Mar 25, 2024 at 14:09
