Project

General

Profile

Actions

Feature #15602

closed

Eliminate recording full-width hash value for small Hash

Feature #15602: Eliminate recording full-width hash value for small Hash

Added by ko1 (Koichi Sasada) almost 7 years ago. Updated over 6 years ago.

Status:
Closed
Target version:
-
[ruby-core:91530]

Description

Abstract

Let's shape up small hash value (1 to 8 entries) from 192B to 128B on 64bit ptr environments.

Data structure proposal

(step 1) Record only key and value pairs.

Now Ruby 2.6, 1 to 8 entry Hash objects allocate 192 byte (8B * 3 (key, value and hash value triple) * 8 entry = 192B) with ar_table (instead of st_table).
Eliminating to record hash value will reduce this allocation from 192B to 128B (8 * 2 * 8).

(step 2) 1 byte hash value

For 1 to 8 entries, full-width Hash value (8 bytes) may be too long to lookup the entry.
1 byte hash value can be generated from 8 byte hash value.
(hash_value & 0xff is most simple way to get it, but not sure it is enough)

Name 1 byte hash value as "hash hint" on my patch.

(step 3) Embed hash hint into RHash

RHash::iter_lev is used to recognize nesting level of a hash (h.each{ "h's iter_lev is 1 here" }).
However, there are only a few cases that this value is bigger than 1.
So we can put this value into flags (if it exceeds the limit, we can store this value into hidden attribute).
8 hash hints becomese 8B == sizeof(VALUE), so we can embed this value into RHash.

Discussion

  • Pros.
    • We can reduce allocation size of small Hash.
    • Increase cache locality on hash lookup
      • We don't need to touch ar_table (allocate memory) if hash hints doesn't match.
      • We can access correct ar_table entry directly.
  • Cons.
    • hash hints can conflict more than full-width hash value => may increase eql? call.
      • performance down
      • incompatibility

Evaluation

I tested this patch and it slightly increase performance (not so big, on my micro-benchmark).
Memory consumption is reduced theoretically.

Patch

https://github.com/ko1/ruby/tree/hash_small_ar

Updated by Eregon (Benoit Daloze) over 6 years ago Actions #1 [ruby-core:91535]

IIRC, not storing #hash breaks specs, does it pass test-spec?
Maybe the 8-bit #hash is enough to avoid problems?

Updated by ko1 (Koichi Sasada) over 6 years ago Actions #2 [ruby-core:91544]

  • Description updated (diff)

Eregon (Benoit Daloze) wrote:

IIRC, not storing #hash breaks specs, does it pass test-spec?

Fortunately, no problem.

Maybe the 8-bit #hash is enough to avoid problems?

I'm not sure what "the 8-bit #hash" is.

Updated by Eregon (Benoit Daloze) over 6 years ago Actions #3 [ruby-core:91568]

ko1 (Koichi Sasada) wrote:

I'm not sure what "the 8-bit #hash" is.

The same as "1 byte hash value".
i.e. after step 1 I would expect tests/specs to fail, but probably the "1 byte hash value" is enough to fix them.

Updated by mame (Yusuke Endoh) over 6 years ago Actions #4 [ruby-core:91910]

Eregon (Benoit Daloze) wrote:

The same as "1 byte hash value".
i.e. after step 1 I would expect tests/specs to fail, but probably the "1 byte hash value" is enough to fix them.

I think so. I'm unsure how may people encounter this incompatibility.

class Foo def hash $hash end end obj = Foo.new h = {} $hash = 0 h[obj] = 42 $hash = 256 p h[obj] #=> nil in trunk, 42 in patched 

Updated by ko1 (Koichi Sasada) over 6 years ago Actions #5 [ruby-core:93962]

  • Status changed from Open to Assigned
  • Assignee set to ko1 (Koichi Sasada)

Patch is updated:
https://github.com/ko1/ruby/tree/hash_small_ar

NEWS:

  • Support 32bit CPU
  • Support > 127 Hash iteration level
  • Fix bugs

Evaluation

Discourse benchmark

overall benchmark

With discourse benchmark, there is no speed improvement.

# master ruby 2.7.0dev (2019-07-26T07:34:15Z master 51f22deadb) [x86_64-linux] categories: 50: 37 75: 38 90: 46 99: 76 home: 50: 38 75: 39 90: 47 99: 83 topic: 50: 38 75: 39 90: 41 99: 63 categories_admin: 50: 60 75: 64 90: 75 99: 113 home_admin: 50: 63 75: 64 90: 77 99: 111 topic_admin: 50: 64 75: 66 90: 73 99: 102 timings: load_rails: 3593 ruby-version: 2.7.0-p-1 rss_kb: 316540 pss_kb: 307648 # this patch ruby 2.7.0dev (2019-07-26T08:11:10Z :detached: 8f8d83fa3f) [x86_64-linux] categories: 50: 36 75: 37 90: 44 99: 70 home: 50: 37 75: 38 90: 47 99: 82 topic: 50: 37 75: 38 90: 45 99: 71 categories_admin: 50: 62 75: 65 90: 74 99: 130 home_admin: 50: 62 75: 64 90: 76 99: 125 topic_admin: 50: 63 75: 65 90: 75 99: 118 timings: load_rails: 3674 ruby-version: 2.7.0-p-1 rss_kb: 269296 pss_kb: 260475 

However, it achieves 50MB memory efficiency.

master: rss_kb: 316540 pss_kb: 307648 this patch: rss_kb: 269296 pss_kb: 260475 

conflict measurement

I added new debug counters:

  • hit: the count hint match and eql?() #=> true
  • miss: the count hint match but eql?() #=> false
  • notfound: the count that there are no hint match.

In otherwords, lookup count is "hit + notfound".

[RUBY_DEBUG_COUNTER] artable_hint_hit 81,394,995 [RUBY_DEBUG_COUNTER] artable_hint_miss 968,533 [RUBY_DEBUG_COUNTER] artable_hint_notfound 84,984,795 

With discourse benchmark, we can see 160M lookup and 1M miss.
Hint values will be conflict in 1/256 (because hint is 1B). So not strange result (*1).

(*1) 0.6M is ideal, so there is a room to improve. However, making hint algorithm more complicated introduce additional overhead.

1M times usless eql? can be a matter.

hint algorithm

To make 1B from hash value (8B), now I only use (unsigned char)hash_value, the lowest 8 bits.
There are several algorithm:

  • (1) lowest 8 bit
  • (2) xor with least 4B
  • (3) xor with least 2B
  • (4) using 15 to 8 bits ((unsigned char)hash_value >> 8)

However, (1) got high-score (1M misses. others > 2M misses).

Rdoc benchmark

(make gcbench-rdoc)

master {:count=>179, :heap_allocated_pages=>9008, :heap_sorted_length=>9008, :heap_allocatable_pages=>0, :heap_available_slots=>3671670, :heap_live_slots=>2609737, :heap_free_slots=>1061933, :heap_final_slots=>0, :heap_marked_slots=>2447202, :heap_eden_pages=>9008, :heap_tomb_pages=>0, :total_allocated_pages=>9008, :total_freed_pages=>0, :total_allocated_objects=>33443045, :total_freed_objects=>30833308, :malloc_increase_bytes=>222056, :malloc_increase_bytes_limit=>33554432, :minor_gc_count=>151, :object_id_collisions=>0, :major_gc_count=>28, :remembered_wb_unprotected_objects=>2490, :remembered_wb_unprotected_objects_limit=>4976, :old_objects=>2443098, :old_objects_limit=>4848924, :oldmalloc_increase_bytes=>8658088, :oldmalloc_increase_bytes_limit=>71306460} ruby 2.7.0dev (2019-07-26T07:34:15Z master 51f22deadb) [x86_64-linux] ["USE_RGENGC", "RGENGC_DEBUG", "RGENGC_ESTIMATE_OLDMALLOC", "GC_ENABLE_LAZY_SWEEP"] /home/ko1/ruby/v2/src/trunk/benchmark/gc/rdoc.rb user system total real 25.753310 0.308549 26.061859 ( 26.065280) GC total time (sec): 0 VmHWM: 467852 kB Summary of rdoc on 2.7.0dev 26.065279869362712 0 179 (real time in sec, GC time in sec, GC count) small_hash {:count=>175, :heap_allocated_pages=>10295, :heap_sorted_length=>10295, :heap_allocatable_pages=>0, :heap_available_slots=>4196252, :heap_live_slots=>2687349, :heap_free_slots=>1508903, :heap_final_slots=>0, :heap_marked_slots=>2445290, :heap_eden_pages=>10295, :heap_tomb_pages=>0, :total_allocated_pages=>10295, :total_freed_pages=>0, :total_allocated_objects=>33443475, :total_freed_objects=>30756126, :malloc_increase_bytes=>8655184, :malloc_increase_bytes_limit=>33554432, :minor_gc_count=>146, :object_id_collisions=>0, :major_gc_count=>29, :remembered_wb_unprotected_objects=>2490, :remembered_wb_unprotected_objects_limit=>4976, :old_objects=>2441968, :old_objects_limit=>4848944, :oldmalloc_increase_bytes=>16372800, :oldmalloc_increase_bytes_limit=>89024695} ruby 2.7.0dev (2019-07-26T08:11:10Z :detached: 8f8d83fa3f) [x86_64-linux] ["USE_RGENGC", "RGENGC_DEBUG", "RGENGC_ESTIMATE_OLDMALLOC", "GC_ENABLE_LAZY_SWEEP"] ../../src/hash_small_ar/benchmark/gc/rdoc.rb user system total real 25.876454 0.392925 26.269379 ( 26.273089) GC total time (sec): 0 VmHWM: 495984 kB Summary of rdoc on 2.7.0dev 26.273089297115803 0 175 (real time in sec, GC time in sec, GC count) 

VmHWM is corrupted :(

Maybe because GC count is not so high because of low xmalloc() consumption.

Updated by Anonymous over 6 years ago Actions #6

  • Status changed from Assigned to Closed

Applied in changeset git|72825c35b0d8b9d566663de961fddbf4f010fff7.


Use 1 byte hint for ar_table [Feature #15602]

On ar_table, Do not keep a full-length hash value (FLHV, 8 bytes)
but keep a 1 byte hint from a FLHV (lowest byte of FLHV).
An ar_table only contains at least 8 entries, so hints consumes
8 bytes at most. We can store hints in RHash::ar_hint.

On 32bit CPU, we use 4 entries ar_table.

The advantages:

  • We don't need to keep FLHV so ar_table only consumes
    16 bytes (VALUEs of key and value) * 8 entries = 128 bytes.
  • We don't need to scan ar_table, but only need to check hints
    in many cases. Especially we don't need to access ar_table
    if there is no match entries (in many cases).
    It will increase memory cache locality.

The disadvantages:

  • This technique can increase #eql? time because hints can
    conflicts (in theory, it conflicts once in 256 times).
    It can introduce incompatibility if there is a object x where
    x.eql? returns true even if hash values are different.
    I believe we don't need to care such irregular case.
  • We need to re-calculate FLHV if we need to switch from ar_table
    to st_table (e.g. exceeds 8 entries).
    It also can introduce incompatibility, on mutating key objects.
    I believe we don't need to care such irregular case too.

Add new debug counters to measure the performance:

  • artable_hint_hit - hint is matched and eql?#=>true
  • artable_hint_miss - hint is not matched but eql?#=>false
  • artable_hint_notfound - lookup counts
Actions

Also available in: PDF Atom