
I'm trying to set up ZFS on a single disk because of its great compression and snapshotting capabilities. My workload is a postgres server. The usual guides suggest the following settings:

atime=off
compression=lz4
primarycache=metadata
recordsize=16k
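
(For reference, applying those settings to a dataset looks roughly like this - the pool/dataset names here are just placeholders, not my actual layout:)

zfs create -o atime=off -o compression=lz4 \
    -o primarycache=metadata -o recordsize=16k \
    tank/postgres
# or on an existing dataset:
zfs set recordsize=16k tank/postgres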

But with those settings I see some odd behaviour in read speed - that's what I'm looking at right now.

For reference, here's my test drive (Intel P4800X) with XFS; it's a simple direct-I/O read test with dd:

[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=4K iflag=direct
910046+0 records in
910046+0 records out
3727548416 bytes (3.7 GB) copied, 10.9987 s, 339 MB/s
[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=8K iflag=direct
455023+0 records in
455023+0 records out
3727548416 bytes (3.7 GB) copied, 6.05091 s, 616 MB/s
[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=16K iflag=direct
227511+1 records in
227511+1 records out
3727548416 bytes (3.7 GB) copied, 3.8243 s, 975 MB/s
[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=32K iflag=direct
113755+1 records in
113755+1 records out
3727548416 bytes (3.7 GB) copied, 2.78787 s, 1.3 GB/s
[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=64K iflag=direct
56877+1 records in
56877+1 records out
3727548416 bytes (3.7 GB) copied, 2.18482 s, 1.7 GB/s
[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=128K iflag=direct
28438+1 records in
28438+1 records out
3727548416 bytes (3.7 GB) copied, 1.83346 s, 2.0 GB/s
[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=256K iflag=direct
14219+1 records in
14219+1 records out
3727548416 bytes (3.7 GB) copied, 1.69168 s, 2.2 GB/s
[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=512K iflag=direct
7109+1 records in
7109+1 records out
3727548416 bytes (3.7 GB) copied, 1.54205 s, 2.4 GB/s
[root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=1M iflag=direct
3554+1 records in
3554+1 records out
3727548416 bytes (3.7 GB) copied, 1.51988 s, 2.5 GB/s
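
(The whole sweep is just the same dd call with increasing block sizes; roughly:)

for bs in 4K 8K 16K 32K 64K 128K 256K 512K 1M; do
    echo "bs=$bs"
    dd if=large_file.bin of=/dev/zero bs=$bs iflag=direct
done
# note: ZFS does not support O_DIRECT, so iflag=direct presumably has to be
# dropped for the ZFS runs below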

As you can see, the drive manages about 80k IOPS at 4K reads and roughly the same at 8K, i.e. throughput scales linearly with block size. (According to the spec it can reach 550k IOPS at QD16, but I'm testing single-threaded sequential reads here, so this is all as expected.)

Kernel module parameters for ZFS:

options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
options zfs zfs_vdev_sync_write_min_active=64
options zfs zfs_vdev_sync_write_max_active=128
options zfs zfs_vdev_sync_read_min_active=64
options zfs zfs_vdev_sync_read_max_active=128
options zfs zfs_vdev_async_read_min_active=64
options zfs zfs_vdev_async_read_max_active=128
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=64
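
(These "options zfs ..." lines go into a modprobe config file such as /etc/modprobe.d/zfs.conf; most of the parameters can also be checked or changed at runtime through sysfs, for example:)

cat /sys/module/zfs/parameters/zfs_txg_timeout         # inspect current value
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout   # change at runtime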

Now the same test with ZFS and a recordsize of 16K:

bs=4K:
910046+0 records in
910046+0 records out
3727548416 bytes (3.7 GB) copied, 39.6985 s, 93.9 MB/s
bs=8K:
455023+0 records in
455023+0 records out
3727548416 bytes (3.7 GB) copied, 20.2442 s, 184 MB/s
bs=16K:
227511+1 records in
227511+1 records out
3727548416 bytes (3.7 GB) copied, 10.5837 s, 352 MB/s
bs=32K:
113755+1 records in
113755+1 records out
3727548416 bytes (3.7 GB) copied, 6.64908 s, 561 MB/s
bs=64K:
56877+1 records in
56877+1 records out
3727548416 bytes (3.7 GB) copied, 4.85928 s, 767 MB/s
bs=128K:
28438+1 records in
28438+1 records out
3727548416 bytes (3.7 GB) copied, 3.91185 s, 953 MB/s
bs=256K:
14219+1 records in
14219+1 records out
3727548416 bytes (3.7 GB) copied, 3.41855 s, 1.1 GB/s
bs=512K:
7109+1 records in
7109+1 records out
3727548416 bytes (3.7 GB) copied, 3.17058 s, 1.2 GB/s
bs=1M:
3554+1 records in
3554+1 records out
3727548416 bytes (3.7 GB) copied, 2.97989 s, 1.3 GB/s

As you can see, the 4K read test already maxes out at 94 MB/s, the 8K read at 184 MB/s, and 16K reaches 352 MB/s. Based on the XFS results I would definitely expect faster reads: even if every 4K or 8K read has to fetch a full 16K record, that should still give roughly a quarter resp. half of the 975 MB/s the drive delivers for 16K reads - i.e. about 244 MB/s at 4K, 488 MB/s at 8K and 975 MB/s at 16K. Additionally, I've read that recordsize has no impact on read performance - but clearly it does.

For comparison, 128K recordsize:

bs=4K:
910046+0 records in
910046+0 records out
3727548416 bytes (3.7 GB) copied, 107.661 s, 34.6 MB/s
bs=8K:
455023+0 records in
455023+0 records out
3727548416 bytes (3.7 GB) copied, 55.6932 s, 66.9 MB/s
bs=16K:
227511+1 records in
227511+1 records out
3727548416 bytes (3.7 GB) copied, 27.3412 s, 136 MB/s
bs=32K:
113755+1 records in
113755+1 records out
3727548416 bytes (3.7 GB) copied, 14.1506 s, 263 MB/s
bs=64K:
56877+1 records in
56877+1 records out
3727548416 bytes (3.7 GB) copied, 7.4061 s, 503 MB/s
bs=128K:
28438+1 records in
28438+1 records out
3727548416 bytes (3.7 GB) copied, 4.1867 s, 890 MB/s
bs=256K:
14219+1 records in
14219+1 records out
3727548416 bytes (3.7 GB) copied, 2.6765 s, 1.4 GB/s
bs=512K:
7109+1 records in
7109+1 records out
3727548416 bytes (3.7 GB) copied, 1.87574 s, 2.0 GB/s
bs=1M:
3554+1 records in
3554+1 records out
3727548416 bytes (3.7 GB) copied, 1.40653 s, 2.7 GB/s

What I can also clearly see with iostat is that the disk's average request size matches the corresponding recordsize, but the IOPS are far lower than with XFS.
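
(For reference, the iostat numbers come from something like the following - the device name is a placeholder, and avgrq-sz is reported in 512-byte sectors, so ~32 corresponds to 16K requests:)

iostat -x nvme0n1 1
# avgrq-sz: ~32 sectors with recordsize=16k, ~256 with recordsize=128k
# r/s: effective read IOPS reaching the device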

Is that how it should behave? Where is that behaviour documented? I need good performance for my postgres server (sequential + random), but I also want good performance for my backups, copies, etc. (sequential) - so it seems I either get good sequential speed with big records, or good random speed with small records.

Edit: I also tested with primarycache=all; there's more weirdness there, because reads max out at 1.3 GB/s regardless of the recordsize.

Server details:

64 GB DDR4 RAM
Intel Xeon E5-2620v4
Intel P4800X

  • I would have left the recordsize at 128K. You should also set xattr=sa on the pool and/or filesystem. Commented Sep 26, 2017 at 13:24
  • actually I'm testing all of the different recordsizes up to 128K, and each one has its problems - with 128K I max out at 30 MB/s for 4K reads, but 1M reads reach ~2.8 GB/s. Commented Sep 26, 2017 at 14:17

1 Answer


The observed behavior is due to how ZFS does end-to-end checksumming, which is based on the recordsize concept.

Basically, each object is decomposed into an appropriate number of record-sized chunks, which are checksummed separately. This means that smaller-than-recordsize reads really need to transfer and re-checksum the whole record-sized chunk, leading to "wasted" storage bandwidth (e.g. a 4K read from a 128K record transfers 32 times the requested data).

This means that a large-recordsize ZFS dataset performs poorly with small reads and, conversely, well with large reads. On the contrary, a small-recordsize dataset performs well with small reads and sub-par with large reads.

Please note that compression and snapshots also work at recordsize granularity: a dataset with a 4K or 8K recordsize will have a much lower compression ratio than, say, a 32K dataset.

In short, there is no "bullet-proof" value for ZFS recordsize; rather, you need to tune it to your specific application's requirements. This also implies that dd is a poor choice for benchmarking (although, being quick & dirty, I use it extensively too!); you should instead use fio (tuned to behave like your application) or the application itself.
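
A quick sketch of what I mean - the file path and sizes are only placeholders, 8K matches postgres' default block size, and the test file should be comfortably larger than your ARC:

fio --name=pg-randread --filename=/tank/postgres/fio_test.bin \
    --rw=randread --bs=8k --size=200g \
    --ioengine=psync --numjobs=4 \
    --runtime=60 --time_based --group_reporting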

You can read here for further information.

For general-purpose use, I would leave it at the default value (128K), while for databases and virtual machines I would use a much smaller 32K value.
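
Since recordsize is a per-dataset property, one pool can serve both workloads; a sketch with hypothetical dataset names (remember that a changed recordsize only affects newly written files):

zfs set recordsize=32k tank/postgres    # database: decent random reads, still compresses reasonably
zfs set recordsize=128k tank/backups    # backups/copies: large sequential streams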

Finally, pay attention to ZFS read-ahead/prefetch tuning, which can considerably increase read speed.
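
On ZFS on Linux the relevant knobs are module parameters; a couple I would inspect (defaults and availability depend on your ZFS version, so treat this only as a pointer):

cat /sys/module/zfs/parameters/zfs_prefetch_disable   # 0 = prefetcher enabled
cat /sys/module/zfs/parameters/zfetch_max_streams     # max prefetch streams per file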

  • okay, this sounds reasonable and also matches what my performance testing (also done with fio) shows about gross vs. net bandwidth (reading 4K with a 32K recordsize uses 8 times the gross bandwidth). What it does not really explain is why the ZFS-formatted drive doesn't use the IOPS the device can provide (4K recordsize, 4K reads to disk, 4K bs in dd) - with XFS we get 80k IOPS, with ZFS 25k IOPS - any idea how to figure that out? Commented Sep 26, 2017 at 21:02
  • ZFS is a CoW filesystem, which means data locality will be much lower than on a classical filesystem, so lower read performance than XFS should be taken as a given. As said, you can tune the prefetch options to recover some speed, but XFS-like performance is hard to achieve. Moreover, ZFS does not support O_DIRECT; when reading big files, this means you are constantly thrashing your ARC (with associated non-trivial CPU overhead). Commented Sep 26, 2017 at 21:08
  • as far as I can see, with primarycache=metadata the cache is never populated, but with primarycache=all it is used all the time (since the ARC is stored compressed as of 0.7, it is limited to 1.3 GB/s; without compression, i.e. zfs_compressed_arc_enabled=0, it reaches 2.5 GB/s). Why should lower read performance be taken as a given? It's a 3D XPoint device (550k IOPS @ 4K per spec) - there especially the performance should be the same, regardless of the ARC. Commented Sep 26, 2017 at 21:16
  • In my experience, ZFS read performance with small reads will be lower due to a) more complex/bigger metadata, b) lower efficiency of small-write coalescing (compared to Linux's own I/O scheduler), c) higher fragmentation (though this mainly affects HDDs) and d) read amplification. In general, you cannot expect a complex CoW filesystem to perform like an extremely well-tuned classical filesystem such as XFS on the same hardware. Anyway, if you can, try changing your primarycache to "all": does it change anything? Be sure that your test file is at least 4x the total ARC size, though. Commented Sep 26, 2017 at 21:34
  • Yes, the ARC does change things, but for obvious reasons only from the second read onward. It also has a tremendous effect when the recordsize != bs. I run into CPU limits with multi-threaded reads/writes - there I hit 128k IOPS for 4K randread and 252k IOPS for seqread. Single-threaded I get ~80k IOPS on XFS and ~25k IOPS on ZFS, as I already said, which makes no sense given only ~30% CPU load on that one thread. Commented Sep 26, 2017 at 21:44
