KernelTLV Filesystem Supersession
Part I – File Systems: why, how and where (another slide deck) – Philip.Derbeko@gmail.com
Part II – Emerging PM HW & SW stack implications – Amit.Golander@netapp.com
Part III – ZUFS: PM-based file systems in user space – Shachar.Sharon@netapp.com, Boaz.Harrosh@netapp.com
© 2017 NetApp, Inc. All rights reserved. KernelTLV Meetup, Nov. 14th 2017
Part II – Persistent Memory (PM): HW & SW implications
Emerging PM/NVDIMM devices, the value they bring to applications, and how they revolutionize the storage stack
About
• NetApp: ~12,000 employees, 50+ countries; the only top-5 vendor that is rapidly growing; celebrating 25 years (founded in 1992)
• Plexistor: PM software pioneer since 2014, with ground-breaking latencies; acquired by NetApp in June 2017; TLV-area based; recruiting FS developers (+1)
Storage Media Generations
HDD → FLASH → NVDIMM / PM
PM marries the best of both worlds: storage persistency + memory speed – IOPS (even if random…) and latency (even under load…)
Definitions (rounded latency numbers, under typical load)
• SCM (Storage Class Memory): byte-addressable media @ near-memory speed (<1 µs)
• PM (Persistent Memory): byte-addressable device @ near-memory speed, on the memory bus
PM-based Storage – Question: Memory vs. Storage?
Traditional assumptions: byte-addressable media behind a block-addressable wrapper, SW layers, network, SW caching, block abstraction
PM-based Software Approaches (trading SW reuse for performance)
• Block wrapper: Application → Block-based FS → Page Cache → bio
• PM-based FS: Application → DAX-enabled FS
• Re-written application: directly on NPM
(Stack layers: App / SW infrastructure / HW)
Linux Kernel Enablers
• Linux 4.1+ subsystems added support for NVDIMMs; mostly stable from 4.4
• NFIT core: built-in kernel driver core.ko (source: drivers/nvdimm/core.c)
• PMEM (direct access): built-in kernel driver nd_pmem.ko (source: drivers/nvdimm/pmem.c)
• BTT (block, atomic): built-in kernel driver nd_btt.ko (source: drivers/nvdimm/btt.c)
• DAX-enabled FS: mount with "-o dax"
• NVDIMM modules presented as device links: /dev/pmem0, /dev/pmem1
• QEMU support
Can also refer to the kTLV Meetup from 2016 – https://www.youtube.com/watch?v=FVrgt9JtcwQ
PM-based Software Approaches – storage vs. memory semantics
• Storage semantics (read/write, fsync): Applications → Block wrapper (Block-based FS → Page Cache → bio) or DAX-enabled / PM-based FS → NVDIMM driver
• Memory semantics (mmap, ld/st, msync): NPM
PM-based Software Approaches – Examples
• DAX-enabled FS: NTFS-DAX, ReFS-DAX (Windows Server 2016); XFS-DAX, EXT4-DAX (Linux 4.4 and above: Ubuntu 16.04 LTS, RHEL 7.3, Fedora 23, SLES 12 SP2)
• PM-based FS: NOVA, LUMFS, SIMFS, HINFS, Plexistor M1FS (*) huge variance in features and stability
• NPM: NVML 1.3 (*) good portability
Part III – ZUFS: Zero-copy User-mode FS
A new style of user-mode filesystems that require:
• Extremely low latency
• Synchronous operation & DAX
From VFS to ZUFS
Why Userspace?
• Resiliency
• Ease of development
• Externals (e.g. compression, encryption)
• Licensing
• Market requirements (avoid kernel modules)
ZUFS and FUSE are complementary tools
Motivation
FUSE is great for HDDs and OK(ish) for SSDs, but not suitable for PMEM.
(Chart: latency vs. $/GB across HDD, Flash, SCM and Memory; FUSE covers the slow media, TCP/RDMA sit in between, and a "?" marks the SCM point.)
Design assumptions – FUSE vs. ZUFS:
• Typical media: FUSE – built for HDDs and extended to Flash; ZUFS – built for PM/NVDIMMs and DRAM
• SW perf. goals: FUSE – secondary (high-latency media), async I/O throughput; ZUFS – SW is the bottleneck, latency is everything
• SW caching: FUSE – slow media → rely on the OS page cache; ZUFS – near-memory-speed media → bypass the OS page cache
• Access method: FUSE – I/O only; ZUFS – I/O and mmap (DAX)
• Cost of a redundant copy / context switch: FUSE – negligible; ZUFS – the bottleneck → avoid copies and queues, remain on core
• Latency penalty under load: FUSE – 100s of µs; ZUFS – 3-4 µs
ZUFS Overview (diagram: Core 1, Core 2, Core 3, Core 4)
Kernel to Userspace
ZUFS – Kernel Zoom-in
Preliminary Results – FUSE vs. ZUFS (PM media)
• Measured on a dual-socket Intel Xeon 2650v4 (48 HW threads) with a DRAM-backed PMEM type
• Random 4KB DirectIO write(ish) access
Architecture
(Diagram: the APP's pages are mapped into the server VM through a per-CPU zt-vma and unmapped on return; ZUF, the Zu Feeder, sits in the kernel, ZUS, the Zu Server, in userspace, with one zt per CPU.)
Architecture
• ZT – a ZUFS thread per CPU, with affinity to a single CPU (thread_fifo/rr)
• A special ZUFS communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
• ZT-vma – a 4M mmap'd vma per ZT, used as a zero-copy communication area
• IOCTL_ZU_WAIT_OPT – threads sleep in the kernel waiting for an operation
• On app I/O, the current CPU's ZT is selected and the app pages are mapped into its ZT-vma; the server thread is released with an operation
• After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT), the app is released, and the server waits for a new operation
• On exit (or a server crash) the file is closed and the kernel cleans up all resources
• Async operation is also supported: the server can return EAGAIN and later complete the operation asynchronously, at which point the app is woken up
• Application mmap (DAX) works in the opposite direction: ZUS exposes pages (opt_get_data_block) into the app VM
Perf. Optimizations – 1: MMAP_LOCAL_CPU
• mm patch to allow a single-core TLB invalidate (in the common case)
(Chart: latency [µs] vs. IOPS for ZUFS with vs. without the mm patch: ZUFS_unpatched_mm, ZUFS_patched_mm.)
Perf. Optimizations – 2
• Scheduler patch to allow an efficient context switch on the same core (Relay Object)
• Unimplemented – no perf. results
Questions
