Helper Utility for Generating Command Line Syntax
This project is a library for creating perfect hash tables from 32-bit key sets. It is based on the acyclic random 2-part hypergraph algorithm. Its primary goal is finding the fastest possible runtime solution.
It is geared toward offline table generation: a command line application is used to generate a small C library that implements an Index() routine, which, given an input key, will return the order-preserved index of that key within the original key set, e.g.:
uint32_t ix; uint32_t key = 0x20190903; ix = Index(key);This allows the efficient implementation of key-value tables, e.g.:
extern uint32_t Table[]; uint32_t Lookup(uint32_t Key) { return Table[Index(Key)] }; void Insert(uint32_t Key, uint32_t Value) { Table[Index(Key)] = Value; } void Delete(uint32_t Key) { Table[Index(Key)] = 0; }The fastest Index() routine is the MultiplyShiftRX routine, which clocks in at about 6 cycles in practice, and boils down to something like this:
extern uint32_t Assigned[]; uint32_t Index(uint32_t Key) { uint32_t Vertex1; uint32_t Vertex2; Vertex1 = ((Key * Seed1) >> Shift); Vertex2 = ((Key * Seed2) >> Shift); return ((Assigned[Vertex1] + Assigned[Vertex2]) & IndexMask); }N.B. Seed1, Seed2, Shift, and IndexMask will all be literal constants in the final source code, not variables.
This compiles down to:
lea r8, ptr [rip+0x23fc0] imul edx, ecx, 0xe8d9cdf9 shr rdx, 0x10 movzx eax, word ptr [r8+rdx*2] imul edx, ecx, 0xc2e3c0b7 shr rdx, 0x10 movzx ecx, word ptr [r8+rdx*2] add eax, ecx retThe IACA profile reports 8 uops:
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24 Analyzed File - .\x64\Release\HologramWorld_31016_Chm01_MultiplyShiftRX_And.dll Binary Format - 64Bit Architecture - SKL Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 10.00 Cycles Throughput Bottleneck: Backend Loop Count: 26 Port Binding In Cycles Per Iteration: ---------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ---------------------------------------------------------------------------- | Cycles | 1.3 0.0 | 2.0 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.3 | 1.3 | 0.0 | ---------------------------------------------------------------------------- | # Of | Ports pressure in cycles | | Uops |0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | --------------------------------------------------------------- | 1 | | | | | | 1.0 | | | lea r8, ptr [rip+0x23fc0] | 1 | | 1.0 | | | | | | | imul edx, ecx, 0xe8d9cdf9 | 1 | 0.3 | | | | | | 0.7 | | shr rdx, 0x10 | 1 | | | 1.0 1.0 | | | | | | movzx eax, word ptr [r8+rdx*2] | 1 | | 1.0 | | | | | | | imul edx, ecx, 0xc2e3c0b7 | 1 | 0.7 | | | | | | 0.3 | | shr rdx, 0x10 | 1 | | | | 1.0 1.0 | | | | | movzx ecx, word ptr [r8+rdx*2] | 1 | 0.3 | | | | | 0.3 | 0.3 | | add eax, ecx Total Num of Uops: 8 The "cost" behind the perfect hash table is the Assigned array. The size of this array will be the number of keys, rounded up to a power of two, and then doubled. E.g. HologramWorld-31016.keys has 31,016 keys. Rounded up to a power of two is 32,768, then doubled: 65,336.
The data type used by the Assigned array is the smallest C data type that can hold the number of keys rounded up to a power of two. Thus, a 16-bit unsigned short int can be used for the HologramWorld-31016.keys array:
unsigned short int Assigned[65336] = { ... };Thus, sizeof(Assigned) will be 131,072, or 128KB.
The Index() routine will perform two memory lookups into this array per call. No pointer chasing or indirection is required. The most frequent keys will have both locations in L1 cache; the worst-case scenario is two memory lookups for both locations for cold or infrequent keys.
Initially, all development on this project was done exclusively on Windows, with Visual Studio 2022. Linux support has recently been added, although it is still quite rough around the edges. For some context: about 1,700 hours have been spent on the Windows portion. The Linux support was hacked together in about two weeks of evening and weekend work.
The generated compiled perfect hash tables are cross-platform, and will work on Windows, Mac, Linux, x86, x64, and ARM64.
mkdir c:\src cd src git clone https://github.com/tpn/perfecthash git clone https://github.com/tpn/perfecthash-keys cd perfecthash/src The PerfectHash.sln file lives in perfecthash/src. You can either build this directly via Visual Studio, use one of the build-*.bat files, or just use msbuild from a Visual Studio 2022 command prompt:
msbuild /nologo /m /t:Rebuild /p:Configuration=Release;Platform=x64 You can also download the latest binaries from the Releases page. The PGO zip files refer to profile-guided optimization builds, and are generally faster than the Release builds by up to 30-40%.
Once built or downloaded, there are two main command line executables: PerfectHashCreate.exe, and PerfectHashBulkCreate.exe. The former is for creating a single table, and it takes a single input key file. The latter can be pointed at a directory of keys, and it will create tables for all of them.
Prerequisites: C compiler (GCC 10 tested), CMake. Optional: Ninja.
mkdir -p ~/src && cd ~/src git clone https://github.com/tpn/perfecthash git clone https://github.com/tpn/perfecthash-keys cd perfecthash/src mkdir build && cd build cmake -S .. -B . -G"Ninja Multi-Config" ninja -F build-Release.ninja ninja -F build-Debug.ninja For normal Makefile support:
cd perfecthash/src mkdir build && cd build cmake -S .. -B . -G"Unix Makefiles" make I haven't tried to build it on Mac yet. It'll require a bit of hacking, I assume.
The usage options are almost identical for both programs. If you run either one without arguments, it will print detailed usage instructions, also available here.
The main usage follows:
PerfectHashBulkCreate.exe Usage: <KeysDirectory> <OutputDirectory> <Algorithm> <HashFunction> <MaskFunction> <MaximumConcurrency> [BulkCreateFlags] [KeysLoadFlags] [TableCreateFlags] [TableCompileFlags] [TableCreateParameters] PerfectHashCreate.exe Usage: <KeysPath> <OutputDirectory> <Algorithm> <HashFunction> <MaskFunction> <MaximumConcurrency> [CreateFlags] [KeysLoadFlags] [TableCreateFlags] [TableCompileFlags] [TableCreateParameters] Assuming you have built a Release version of the library, from a Visual Studio 2022 command prompt (i.e. so the compiler is available in your PATH):
mkdir c:\Temp\ph.out cd c:\src\perfecthash\src ..\bin\timemem.exe x64\Release\PerfectHashCreate.exe c:\src\perfecthash\keys\HologramWorld-31016.keys c:\Temp\ph.out Chm01 MultiplyShiftR And 0 --Compile On Linux this would look like:
mkdir -p ~/tmp/ph.out cd ~/src/perfecthash/src time ../x64/Release/PerfectHashCreateExe $HOME/src/perfecthash-keys/sys32/HologramWorld-31016.keys ~/tmp/ph.out Chm01 MultiplyShiftR And 0 --DisableCsvOutputFile This should result in some output that looks like this:
c:\src\perfecthash\src>..\bin\timemem x64\Release\PerfectHashCreate.exe c:\src\perfecthash\keys\HologramWorld-31016.keys c:\Temp\ph.out Chm01 MultiplyShiftR And 0 --Compile Keys File Name: HologramWorld-31016.keys Number of Keys: 31016 Number of Table Resize Events: 0 Keys to Edges Ratio: 0.946533203125 Duration: 0 hours, 0 mins, 0 secs Duration Since Last Best Graph: Attempts: 8633 Attempts Per Second: 53956.250000000 Current Attempts: 8633 Current Attempts Per Second: 53956.250000000 Successful Attempts: 1 Failed Attempts: 8625 First Attempt Solved: 0 Most Recent Attempt Solved: 8562 Predicted Attempts to Solve: 8634 Predicted Attempts Remaining until next Solve: 8563 Estimated Seconds until next Solve: 0.15870265261206998 New Best Graph Count: 0 Equal Best Graph Count: 0 Solutions Found Ratio: 0.00011583458820803892 Vertex Collision Failures: 4294 Cyclic Graph Failures: 4331 Vertex Collision to Cyclic Graph Failure Ratio: 0.99145693835142 Highest Deleted Edges Count: 31008 [r] Refresh [f] Finish [e] Resize [c] Toggle Callback [?] More Help . Exit code : 0 Elapsed time : 2.89 Kernel time : 0.00 (0.0%) User time : 2.97 (102.7%) page fault # : 6504 Working set : 24164 KB Paged pool : 167 KB Non-paged pool : 52 KB Page file size : 32280 KB N.B. Console output isn't supported on Linux yet.
N.B. If you get an error like this, it means msbuild couldn't be found on your path; make sure to launch a Visual Studio 2022 command prompt:
C:\src\perfecthash\src\PerfectHash\PerfectHashTableCompile.c: 217: CreateProcessW failed with error: 2 (0x2). The system cannot find the file specified. C:\src\perfecthash\src\PerfectHash\PerfectHashContextTableCreate.c: 492: PerfectHashTableCompile failed with error: 3758359076 (0xe0040224). System call failed. If you look in C:\Temp\ph.out, you'll see something along the following lines. Intermediate build files have been omitted for brevity.
C:\Temp\ph.out>tree /f Folder PATH listing for volume Windows Volume serial number is 2490-4F03 C:. │ CompiledPerfectHash.h │ CompiledPerfectHash.props │ CompiledPerfectHashMacroGlue.h │ no_sal2.h │ PerfectHashTableCreate_10A0ED40.csv │ ├───HologramWorld_31016_Chm01_MultiplyShiftR_And │ HologramWorld_31016_Chm01_MultiplyShiftR_And.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And.def │ HologramWorld_31016_Chm01_MultiplyShiftR_And.h │ HologramWorld_31016_Chm01_MultiplyShiftR_And.pht1 │ HologramWorld_31016_Chm01_MultiplyShiftR_And.sln │ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkFull.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkFull.mk │ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkFullExe.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkFullExe.vcxproj │ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkIndex.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkIndex.mk │ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkIndexExe.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_BenchmarkIndexExe.vcxproj │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Build.bat │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Dll.vcxproj │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Keys.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Lib.mk │ HologramWorld_31016_Chm01_MultiplyShiftR_And_So.mk │ HologramWorld_31016_Chm01_MultiplyShiftR_And_StdAfx.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_StdAfx.h │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Support.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Support.h │ HologramWorld_31016_Chm01_MultiplyShiftR_And_TableData.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_TableValues.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Test.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Test.mk │ HologramWorld_31016_Chm01_MultiplyShiftR_And_TestExe.c │ HologramWorld_31016_Chm01_MultiplyShiftR_And_TestExe.vcxproj │ HologramWorld_31016_Chm01_MultiplyShiftR_And_Types.h │ main.mk │ Makefile │ └───x64 └───Release BenchmarkFull_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe BenchmarkFull_HologramWorld_31016_Chm01_MultiplyShiftR_And.pdb BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.pdb HologramWorld_31016_Chm01_MultiplyShiftR_And.dll HologramWorld_31016_Chm01_MultiplyShiftR_And.lib HologramWorld_31016_Chm01_MultiplyShiftR_And.pdb Test_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe Test_HologramWorld_31016_Chm01_MultiplyShiftR_And.pdb The main library implementing the perfect hash table is the .dll file. There are three helper .exe utilities also compiled for benchmarking and testing. You can assess the performance of just the Index() routine via: BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe:
cd /d c:\Temp\ph.out\x64\Release c:\src\perfecthash\bin\timemem.exe BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe On Windows, the output will look like this:
C:\Temp\ph.out\x64\Release>c:\src\perfecthash\bin\timemem.exe BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe Exit code : 7106 Elapsed time : 0.01 Kernel time : 0.00 (0.0%) User time : 0.00 (0.0%) page fault # : 849 Working set : 3220 KB Paged pool : 22 KB Non-paged pool : 5 KB Page file size : 520 KB The exit code, 7106 in this case, is the minimum number of cycles it took, out of 1000 attempts, to do 1000 calls to the Index() routine. So, you can divide by 1000 to get the approximate number of cycles per call, in this case, about 7.
The BenchmarkFull executable returns the minimum number of cycles it took, out of 100 attempts, to do 10 iterations of the following:
- For each key, call
Insert(Key, RotateLeft(Key, 15)). - For each key, call
Value = Lookup(Key). - For each key, call
Previous = Delete(Key).
C:\Temp\ph.out\x64\Release>c:\src\perfecthash\bin\timemem.exe BenchmarkFull_HologramWorld_31016_Chm01_MultiplyShiftR_And.exe Exit code : 7321346 Elapsed time : 0.28 Kernel time : 0.00 (0.0%) User time : 0.16 (55.7%) page fault # : 1036 Working set : 3832 KB Paged pool : 31 KB Non-paged pool : 5 KB Page file size : 564 KB The .dll files are compiled with special IACA versions of each Index() routine (i.e. IndexIaca()), so you can call iaca.exe on them to get an analysis of the generated code, e.g.:
C:\Temp\ph.out\x64\Release>c:\src\perfecthash\bin\iaca.exe HologramWorld_31016_Chm01_MultiplyShiftR_And.dll Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24 Analyzed File - HologramWorld_31016_Chm01_MultiplyShiftR_And.dll Binary Format - 64Bit Architecture - SKL Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 8.93 Cycles Throughput Bottleneck: Backend Loop Count: 30 Port Binding In Cycles Per Iteration: -------------------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | -------------------------------------------------------------------------------------------------- | Cycles | 1.0 0.0 | 1.0 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | -------------------------------------------------------------------------------------------------- DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3) F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion occurred # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected X - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------------------------------- | 1 | | 1.0 | | | | | | | imul ecx, ecx, 0xff8d672d | 1 | 1.0 | | | | | | | | shr rax, 0x9 | 1* | | | | | | | | | movzx edx, ax | 1 | | | | | | | 1.0 | | shr rcx, 0xd | 1 | | | 1.0 1.0 | | | | | | movzx eax, word ptr [r8+rdx*2] | 1* | | | | | | | | | movzx edx, cx | 1 | | | | 1.0 1.0 | | | | | movzx ecx, word ptr [r8+rdx*2] | 1 | | | | | | 1.0 | | | add eax, ecx Total Num Of Uops: 8 Analysis Notes: Backend allocation was stalled due to unavailable allocation resources. Although note that this isn't an exact science, sometimes the compiler reorders the IACA_VC_START() and IACA_VC_END() markers such that you end up missing a couple of instructions in the analysis. In the example above, the actual assembly for the Index() routine is as follows:
C:\Temp\ph.out\x64\Release>dumpbin /disasm HologramWorld_31016_Chm01_MultiplyShiftR_And.dll Microsoft (R) COFF/PE Dumper Version 14.34.31935.0 Copyright (C) Microsoft Corporation. All rights reserved. Dump of file HologramWorld_31016_Chm01_MultiplyShiftR_And.dll File Type: DLL CompiledPerfectHash_HologramWorld_31016_Chm01_MultiplyShiftR_And_Index: 0000000180001000: 69 C1 DF 0E AD FF imul eax,ecx,0FFAD0EDFh 0000000180001006: 4C 8D 05 F3 3F 02 lea r8,[HologramWorld_31016_Chm01_MultiplyShiftR_And_TableData] 00 000000018000100D: 69 C9 2D 67 8D FF imul ecx,ecx,0FF8D672Dh 0000000180001013: 48 C1 E8 09 shr rax,9 0000000180001017: 0F B7 D0 movzx edx,ax 000000018000101A: 48 C1 E9 0D shr rcx,0Dh 000000018000101E: 41 0F B7 04 50 movzx eax,word ptr [r8+rdx*2] 0000000180001023: 0F B7 D1 movzx edx,cx 0000000180001026: 41 0F B7 0C 50 movzx ecx,word ptr [r8+rdx*2] 000000018000102B: 03 C1 add eax,ecx 000000018000102D: 25 FF 7F 00 00 and eax,7FFFh 0000000180001032: C3 ret To compile the hash table on Linux (using WSL1 and GCC 9 as an example):
% cd /mnt/c/Temp/ph.out/HologramWorld_31016_Chm01_MultiplyShiftR_And % make % export LD_LIBRARY_PATH=. % ./BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And 8094 With clang (version 10):
% make clean % CC=clang make % ./BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And 7068 Identical to Linux, except you don't need export LD_LIBRARY_PATH=.:
% make % ./BenchmarkIndex_HologramWorld_31016_Chm01_MultiplyShiftR_And