KernelTLV Filesystem Supersession
Part I – File Systems: why, how and where (another slide deck) – Philip.Derbeko@gmail.com
Part II – Emerging PM HW & SW stack implications – Amit.Golander@netapp.com
Part III – ZUFS: PM-based file systems in user space – Shachar.Sharon@netapp.com, Boaz.Harrosh@netapp.com
© 2017 NetApp, Inc. All rights reserved. KernelTLV Meetup, Nov. 14th 2017
Part II – Persistent Memory (PM): HW & SW implications
Emerging PM/NVDIMM devices, the value they bring to applications, and how they revolutionize the storage stack
About
• NetApp: ~12,000 employees, 50+ countries; the only top-5 vendor that is rapidly growing; celebrating 25 years (founded in 1992)
• Plexistor: PM software pioneer since 2014, with ground-breaking latencies; acquired by NetApp in June 2017; TLV-area based; recruiting FS developers (+1)
Storage Media Generations
HDD → FLASH → NVDIMM / PM
PM marries the best of both worlds: storage persistency + memory speed – IOPS (even if random…) and latency (even under load…)
Definitions (rounded latency numbers, under typical load)
• SCM (Storage Class Memory): byte-addressable media @ near-memory speed (<1 µs)
• PM (Persistent Memory): byte-addressable device @ near-memory speed, on the memory bus
PM-based Storage – Question: Memory vs. Storage?
Traditional assumptions: byte-addressable media behind a block-addressable wrapper, SW layers, network, SW caching, block abstraction
PM-based Software Approaches (trading SW reuse for performance)
• Block wrapper: Application → Block-based FS → Page Cache → bio
• PM-based FS: Application → DAX-enabled FS
• Re-written application: directly on NPM
(Stack layers: App / SW infrastructure / HW)
Linux Kernel Enablers
• Linux 4.1+ subsystems added support for NVDIMMs; mostly stable from 4.4
• NFIT core: built-in kernel driver core.ko (source: drivers/nvdimm/core.c)
• PMEM (direct access): built-in kernel driver nd_pmem.ko (source: drivers/nvdimm/pmem.c)
• BTT (block, atomic): built-in kernel driver nd_btt.ko (source: drivers/nvdimm/btt.c)
• DAX-enabled FS: mount with "-o dax"
• NVDIMM modules presented as device links: /dev/pmem0, /dev/pmem1
• QEMU support
Can also refer to the kTLV Meetup from 2016 – https://www.youtube.com/watch?v=FVrgt9JtcwQ
PM-based Software Approaches – storage vs. memory semantics
• Storage semantics (read/write, fsync): Applications → Block wrapper (Block-based FS → Page Cache → bio) or DAX-enabled / PM-based FS → NVDIMM driver
• Memory semantics (mmap, ld/st, msync): NPM
PM-based Software Approaches – Examples
• DAX-enabled FS: NTFS-DAX, ReFS-DAX (Windows Server 2016); XFS-DAX, EXT4-DAX (Linux 4.4 and above: Ubuntu 16.04 LTS, RHEL 7.3, Fedora 23, SLES 12 SP2)
• PM-based FS: NOVA, LUMFS, SIMFS, HINFS, Plexistor M1FS (*) huge variance in features and stability
• NPM: NVML 1.3 (*) good portability
Part III – ZUFS: Zero-copy User-mode FS
A new style of user-mode filesystems that require:
• Extremely low latency
• Synchronous operation & DAX
From VFS to ZUFS
Why Userspace?
• Resiliency
• Ease of development
• Externals (e.g. compression, encryption)
• Licensing
• Market requirements (avoid kernel modules)
ZUFS and FUSE are complementary tools
Motivation
FUSE is great for HDDs and OK(ish) for SSDs, but not suitable for PMEM.
(Chart: latency vs. $/GB across HDD, Flash, SCM and Memory; FUSE covers the slow media, TCP/RDMA sit in between, and a "?" marks the SCM point.)
Design assumptions – FUSE vs. ZUFS:
• Typical media: FUSE – built for HDDs and extended to Flash; ZUFS – built for PM/NVDIMMs and DRAM
• SW perf. goals: FUSE – secondary (high-latency media), async I/O throughput; ZUFS – SW is the bottleneck, latency is everything
• SW caching: FUSE – slow media → rely on the OS page cache; ZUFS – near-memory-speed media → bypass the OS page cache
• Access method: FUSE – I/O only; ZUFS – I/O and mmap (DAX)
• Cost of a redundant copy / context switch: FUSE – negligible; ZUFS – the bottleneck → avoid copies and queues, remain on core
• Latency penalty under load: FUSE – 100s of µs; ZUFS – 3-4 µs
ZUFS Overview (diagram: Core 1, Core 2, Core 3, Core 4)
Kernel to Userspace
ZUFS – Kernel Zoom-in
Preliminary Results – FUSE vs. ZUFS (PM media)
• Measured on a dual-socket Intel Xeon 2650v4 (48 HW threads) with a DRAM-backed PMEM type
• Random 4KB DirectIO write(ish) access
Architecture
(Diagram: the APP's pages are mapped into the server VM through a per-CPU zt-vma and unmapped on return; ZUF, the Zu Feeder, sits in the kernel, ZUS, the Zu Server, in userspace, with one zt per CPU.)
Architecture
• ZT – a ZUFS thread per CPU, with affinity to a single CPU (thread_fifo/rr)
• A special ZUFS communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
• ZT-vma – a 4M mmap'd vma per ZT, used as a zero-copy communication area
• IOCTL_ZU_WAIT_OPT – threads sleep in the kernel waiting for an operation
• On app I/O, the current CPU's ZT is selected and the app pages are mapped into its ZT-vma; the server thread is released with an operation
• After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT), the app is released, and the server waits for a new operation
• On exit (or a server crash) the file is closed and the kernel cleans up all resources
• Async operation is also supported: the server can return EAGAIN and later complete the operation asynchronously, at which point the app is woken up
• Application mmap (DAX) works in the opposite direction: ZUS exposes pages (opt_get_data_block) into the app VM
Perf. Optimizations – 1: MMAP_LOCAL_CPU
• mm patch to allow a single-core TLB invalidate (in the common case)
(Chart: latency [µs] vs. IOPS for ZUFS with vs. without the mm patch: ZUFS_unpatched_mm, ZUFS_patched_mm.)
Perf. Optimizations – 2
• Scheduler patch to allow an efficient context switch on the same core (Relay Object)
• Unimplemented – no perf. results
Questions
