Vam: A Locality-Improving Dynamic Memory Allocator
Yi Feng & Emery Berger
University of Massachusetts Amherst
motivation
- Memory performance: a bottleneck for many applications
- Heap data often dominates
- Dynamic allocators dictate the spatial locality of heap objects
related work
Previous work on dynamic allocation:
- Reducing fragmentation [survey: Wilson et al., Wilson & Johnstone]
- Improving locality
  - Search inside the allocator [Grunwald et al.]
  - Programmer-assisted [Chilimbi et al., Truong et al.]
  - Profile-based [Barrett & Zorn, Seidl & Zorn]
this work
Replacement allocator called Vam:
- Reduces fragmentation
- Improves allocator & application locality, at both the cache and page level
- Automatic and transparent
outline
- Introduction
- Designing Vam
- Experimental Evaluation
  - Space Efficiency
  - Run Time
  - Cache Performance
  - Virtual Memory Performance
Vam design
Builds on previous allocator designs, combining their best features:
- DLmalloc (Doug Lea), the default allocator in Linux/GNU libc
- PHKmalloc (Poul-Henning Kamp), the default allocator in FreeBSD
- Reap [Berger et al. 2002]
DLmalloc
Goal: reduce fragmentation
Design:
- Best-fit
- Small objects: fine-grained, cached
- Large objects: coarse-grained, coalesced, sorted by size, searched
- Object headers ease deallocation and coalescing
PHKmalloc
Goal: improve page-level locality
Design:
- Page-oriented design
- Coarse size classes: 2^x bytes or n * page size
- Each page is divided into equal-size chunks, with a bitmap tracking allocation (see the sketch below)
- Objects share headers at the page start (BIBOP: "big bag of pages")
- Discards free pages via madvise
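To make the bitmap idea concrete, here is a minimal C++ sketch of per-page bitmap allocation in this style; the names (PageInfo, page_alloc, page_free) and the 32-chunks-per-page limit are illustrative assumptions, not PHKmalloc's actual code.

    // Illustrative sketch only: one bitmap bit per equal-size chunk in a page.
    // Assumes at most 32 chunks per page for brevity; not PHKmalloc's real code.
    #include <cstddef>
    #include <cstdint>

    struct PageInfo {
        char*    base;        // start of the page
        size_t   chunk_size;  // every chunk on this page has the same size
        uint32_t bitmap;      // 1 bit per chunk: 1 = free, 0 = allocated
    };

    // Hand out one chunk from the page, or nullptr if the page is full.
    void* page_alloc(PageInfo& p) {
        if (p.bitmap == 0) return nullptr;    // no free chunks left
        int idx = __builtin_ctz(p.bitmap);    // lowest set bit = first free chunk (GCC/Clang builtin)
        p.bitmap &= p.bitmap - 1;             // clear that bit (now allocated)
        return p.base + idx * p.chunk_size;
    }

    // Freeing recovers the chunk index from the pointer's offset in the page.
    void page_free(PageInfo& p, void* ptr) {
        size_t idx = (static_cast<char*>(ptr) - p.base) / p.chunk_size;
        p.bitmap |= (1u << idx);              // mark the chunk free again
    }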
Reap
Goal: capture the speed and locality advantages of region allocation while still supporting individual frees
Design:
- Pointer-bumping allocation
- Reclaims freed objects on the associated heap
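A minimal sketch of the pointer-bumping-with-reclamation idea, assuming for brevity that all objects on a given heap share one size; this is illustrative only, not the Reap implementation.

    // Illustrative only: bump-pointer allocation with a per-heap freelist of
    // reclaimed objects. Assumes one object size per heap; not the Reap code.
    #include <cstddef>

    struct MiniReap {
        char*  bump     = nullptr;  // next free byte in the current region chunk
        char*  limit    = nullptr;  // end of the current region chunk
        void** freelist = nullptr;  // freed objects, linked through their first word

        void* allocate(size_t sz) {
            if (freelist) {                    // reuse a reclaimed object first
                void* obj = freelist;
                freelist = static_cast<void**>(*freelist);
                return obj;
            }
            if (bump + sz <= limit) {          // otherwise just bump the pointer
                void* obj = bump;
                bump += sz;
                return obj;
            }
            return nullptr;                    // a real heap would grab a new chunk here
        }

        void deallocate(void* obj) {           // push onto this heap's freelist
            *static_cast<void**>(obj) = freelist;
            freelist = static_cast<void**>(obj);
        }
    };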
Vam overview
Goal: improve application performance across a wide range of available RAM
Highlights:
- Page-based design
- Fine-grained size classes
- No headers for small objects
- Implemented in Heap Layers using C++ templates [Berger et al. 2001]
page-based heap
Virtual address space is divided into pages. Page-level management:
- maps pages from the kernel
- records page status
- discards freed pages
page-based heap
(figure: heap space alongside a page descriptor table; freed pages are marked free and discarded)
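A hypothetical sketch of what a page descriptor table like the one in the figure could look like; the field names, the status values, and the use of madvise(MADV_DONTNEED) for discarding are assumptions, not Vam's actual data structures.

    // Hypothetical page descriptor table for a page-based heap (not Vam's code).
    #include <sys/mman.h>   // madvise, MADV_DONTNEED
    #include <cstddef>
    #include <cstdint>

    enum class PageStatus : uint8_t { Unmapped, InUse, Free, Discarded };

    struct PageDescriptor {
        PageStatus status;       // what the page is currently doing
        uint16_t   size_class;   // size class served by this page when InUse
    };

    constexpr size_t kPageSize = 4096;

    // Return a freed page's physical frame to the kernel without unmapping the
    // virtual range, so the address can be reused cheaply later.
    inline void discard_page(void* page_addr, PageDescriptor& desc) {
        madvise(page_addr, kPageSize, MADV_DONTNEED);
        desc.status = PageStatus::Discarded;
    }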
fine-grained size classes
Small (8-128 bytes) and medium (136-496 bytes) sizes:
- spaced 8 bytes apart, exact-fit
- dedicated per-size page blocks (groups of pages): 1 page for small sizes, 4 pages for medium sizes
- each block is either available or full
- reap-like allocation inside a block
(a size-class sketch follows below)
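A small C++ sketch of the 8-byte-apart mapping these bullets imply; the thresholds come from the slide, while the function names and exact indexing scheme are illustrative assumptions.

    // Illustrative mapping from request size to an 8-byte-apart size class.
    // Thresholds (128, 496) are from the slides; the rest is an assumption.
    #include <cstddef>

    constexpr size_t kAlign     = 8;
    constexpr size_t kSmallMax  = 128;   // small sizes: 8-128 bytes, 1-page blocks
    constexpr size_t kMediumMax = 496;   // medium sizes: 136-496 bytes, 4-page blocks
    static_assert(kMediumMax % kAlign == 0, "classes are spaced 8 bytes apart");

    // Round the request up to a multiple of 8 and compute its class index.
    inline size_t size_class_index(size_t sz) {
        return (sz + kAlign - 1) / kAlign;   // 8 -> 1, 16 -> 2, ..., 496 -> 62
    }

    // Block granularity: 1 page per block for small classes, 4 for medium ones.
    inline size_t pages_per_block(size_t sz) {
        return (sz <= kSmallMax) ? 1 : 4;    // only meaningful for sz <= kMediumMax
    }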
fine-grained size classes
Large sizes (504 bytes - 32KB):
- also spaced 8 bytes apart, best-fit via a free-list table indexed by size (504, 512, 520, 528, ...)
- collocated in contiguous pages
- aggressive coalescing: adjacent free chunks merge, and fully empty pages can be released
Extremely large sizes (above 32KB) use mmap/munmap directly.
(a dispatch sketch follows below)
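Putting the size ranges together, here is a hedged sketch of the size-based dispatch the last two slides describe; the helper functions are hypothetical placeholders, and only the very largest requests go straight to mmap.

    // Hedged sketch of dispatch by request size; not Vam's actual allocation path.
    #include <sys/mman.h>
    #include <cstddef>

    constexpr size_t kMediumMax = 496;        // exact-fit, per-size page blocks
    constexpr size_t kLargeMax  = 32 * 1024;  // best-fit in contiguous pages

    // Hypothetical placeholders for the paths described on the slides.
    void* alloc_from_block(size_t /*sz*/) { return nullptr; }  // exact-fit path (not shown)
    void* alloc_best_fit(size_t /*sz*/)   { return nullptr; }  // free-list-table path (not shown)

    void* vam_like_malloc(size_t sz) {
        if (sz <= kMediumMax)
            return alloc_from_block(sz);      // small/medium: reap-like block allocation
        if (sz <= kLargeMax)
            return alloc_best_fit(sz);        // large: best-fit with aggressive coalescing
        // extremely large: hand the request straight to the OS
        // (note: mmap reports failure with MAP_FAILED, not nullptr)
        return mmap(nullptr, sz, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }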
header elimination
Object headers simplify deallocation & coalescing, but they add:
- space overhead
- cache pollution
Vam eliminates headers for small objects; their metadata lives in per-page structures instead.
header elimination
Need to distinguish "headered" from "headerless" objects in free().
Solution: heap address space partitioning
- the address space is divided into 16MB areas, each holding homogeneous objects
- a partition table records which kind of object each area holds
(a lookup sketch follows below)
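An illustrative sketch of how such a partition-table lookup could work inside free(); the 16MB area size comes from the slide, but the table layout, the 32-bit address-space assumption, and the names are all hypothetical.

    // Illustrative only: a 16MB-area partition table consulted by free().
    // Assumes a 32-bit address space (256 areas); not Vam's actual layout.
    #include <cstddef>
    #include <cstdint>

    constexpr size_t kAreaShift = 24;                          // 2^24 bytes = 16MB per area
    constexpr size_t kNumAreas  = size_t(1) << (32 - kAreaShift);
    static uint8_t partition_table[kNumAreas];                 // 1 = headerless small objects

    inline bool is_headerless(void* ptr) {
        size_t area = reinterpret_cast<uintptr_t>(ptr) >> kAreaShift;
        return partition_table[area] != 0;
    }

    void vam_like_free(void* ptr) {
        if (is_headerless(ptr)) {
            // small object: size and free-list metadata are found through the
            // per-page descriptor for the page containing ptr (lookup not shown)
        } else {
            // headered object: size is read from the header stored just before
            // ptr, as in DLmalloc-style allocators (not shown)
        }
    }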
outline
- Introduction
- Designing Vam
- Experimental Evaluation
  - Space efficiency
  - Run time
  - Cache performance
  - Virtual memory performance
experimental setup
- Dell Optiplex 270: Intel Pentium 4 3.0GHz, 8KB L1 data cache, 512KB L2 cache, 64-byte cache lines, 1GB RAM, 40GB 5400RPM hard disk
- Linux 2.4.24
- perfctr kernel patch and the perfex tool to program the Intel performance counters (instructions, caches, TLB)
benchmarks
Memory-intensive SPEC CPU2000 benchmarks (custom allocators removed in gcc and parser):

                                176.gcc      197.parser   253.perlbmk  255.vortex
  Execution Time                24 sec       275 sec      43 sec       62 sec
  Instructions                  40 billion   424 billion  114 billion  102 billion
  VM Size                       130MB        15MB         120MB        65MB
  Max Live Size                 110MB        10MB         90MB         45MB
  Total Allocations             9M           788M         5.4M         1.5M
  Alloc Rate (#/sec)            373K         2813K        129K         30K
  Alloc Interval (# of inst)    4.4K         0.5K         21K          68K
  Average Object Size           52 bytes     21 bytes     285 bytes    471 bytes
space efficiency
Fragmentation = (maximum physical memory in use) / (maximum live data of the application)
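The same metric written out, with a hypothetical worked example (the numbers below are illustrative, not measurements from the paper):

    \[
      \text{fragmentation} \;=\; \frac{\max_t \, \text{physical memory in use}(t)}{\max_t \, \text{live application data}(t)}
    \]
    % Hypothetical example: a peak footprint of 120 MB against peak live data of
    % 100 MB gives fragmentation = 120 / 100 = 1.2.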
total execution time
total instructions
cache performance
L2 cache misses are closely correlated with run-time performance.
VM performance
- Application performance degrades as available RAM is reduced
- Better page-level locality yields better paging performance and smoother degradation
 
Vam summary
- Outperforms other allocators, both with ample RAM and under memory pressure
- Improves application locality at the cache level and at the page level (VM)
- See the paper for more analysis
the end
Heap Layers is publicly available at http://www.heaplayers.org
Vam will be included soon.
backup slides
TLB performance
average fragmentation
Fragmentation = average over time of (memory in use / live data of the application)
