@user202729
Contributor

@user202729 user202729 commented Dec 13, 2025

Minor speedup.
Before (using https://github.com/fmtlib/format-benchmark):

--------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------
fmt_format_to_compile     13785856 ns     13748336 ns           50 items_per_second=72.7361M/s
fmt_format_int            13560583 ns     13522664 ns           51 items_per_second=73.9499M/s

After:

--------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------
fmt_format_to_compile     13448662 ns     13411962 ns           52 items_per_second=74.5603M/s
fmt_format_int            13132046 ns     13090029 ns           53 items_per_second=76.394M/s

The idea is to avoid the modulo (which gets compiled to an imul and a sub) by looking at the 7 bits after the binary point of value / 100, where the quotient is computed in fixed point.
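To make the idea concrete, here is a minimal sketch of the index computation, using the multiplier and shift amounts quoted in the correctness argument of this description. The names `kMul`, `frac_index` and `mod100_baseline` are made up for illustration and are not taken from the patch.

```cpp
#include <cstdint>

// Fixed-point reciprocal of 100 with 39 fractional bits: floor(2^39 / 100) + 1.
constexpr std::uint64_t kMul = (1ull << 39) / 100 + 1;

// Top seven fractional bits of value / 100 (hypothetical helper name).
// The 64-bit product may wrap for large inputs, which does not affect
// bits 32..38 that are extracted here.
constexpr unsigned frac_index(std::uint32_t value) {
  return static_cast<unsigned>(((value * kMul) >> (39 - 7)) & ((1u << 7) - 1));
}

// Baseline for comparison: remainder via quotient, multiply and subtract.
constexpr unsigned mod100_baseline(std::uint32_t value) {
  return value - (value / 100) * 100;
}

// 0.34 * 128 = 43.52, so every small value ending in 34 lands on index 43.
static_assert(frac_index(34) == 43, "index for remainder 34");
static_assert(frac_index(1234) == frac_index(34), "same remainder, same index");
static_assert(frac_index(99) == 126, "index 127 stays unused");
static_assert(mod100_baseline(1234) == 34, "baseline remainder");
```

The index is then used to fetch the digit pair from a 128-entry table instead of computing the remainder first.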

This adds 256+1 bytes of lookup table (unfortunately, the old lookup table still needs to stay). Technically, the null terminator and the last 2 spaces are unused.

Correctness is shown by an exhaustive search over the whole range of 32-bit integers, checking that for all i, (i * ((1ull<<39)/100+1)) >> (39 - 7) & ((1<<7) - 1) uniquely determines the value of i % 100 and ((i * ((1ull<<39)/100+1)) >> 39) + ((i>=(100u<<25))<<25) is exactly equal to i / 100.
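The quotient formula can be sketched as a standalone function. The helper name `div100` is hypothetical. The comments give one reading of the `(i>=(100u<<25))<<25` term: since 100 * ((1ull<<39)/100+1) equals 2^39 + 12, the 64-bit product first wraps exactly at i = 100 << 25, and the term restores the lost 2^64 after the shift.

```cpp
#include <cstdint>

// Fixed-point reciprocal of 100 with 39 fractional bits (value 5497558139).
constexpr std::uint64_t kMul = (1ull << 39) / 100 + 1;

// i / 100 via multiply and shift (hypothetical helper name, needs C++14).
constexpr std::uint32_t div100(std::uint32_t i) {
  // Unsigned 64-bit multiplication wraps modulo 2^64. It wraps at most once,
  // and only for i >= 100 << 25, where the product is 2^64 + 12 * 2^25.
  std::uint64_t prod = i * kMul;
  // A wrapped product is short by 2^64, which is 2^25 after shifting by 39.
  std::uint32_t fix = static_cast<std::uint32_t>(i >= (100u << 25)) << 25;
  return static_cast<std::uint32_t>(prod >> 39) + fix;
}

// Spot checks around the wrap boundary and at the ends of the range.
static_assert(div100(0) == 0, "zero");
static_assert(div100(12345) == 123, "small value");
static_assert(div100((100u << 25) - 1) == ((100u << 25) - 1) / 100, "just below the wrap");
static_assert(div100(100u << 25) == (1u << 25), "first wrapped product");
static_assert(div100(4294967295u) == 42949672u, "largest 32-bit value");
```

These spot checks only illustrate the boundary cases; the exhaustive check over all 2^32 inputs is what establishes correctness.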

The check sizeof(UInt) == 4 implicitly assumes CHAR_BIT == 8 (is it worth being spelled out?)
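One way to spell the assumption out would be to test the number of value bits instead of the byte count. This is only a sketch: `is_uint32_like` is a hypothetical helper and not something the patch or {fmt} defines.

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// True when UInt is an unsigned integer type with exactly 32 value bits,
// regardless of how many bits a byte has (hypothetical helper).
template <typename UInt>
constexpr bool is_uint32_like() {
  return std::numeric_limits<UInt>::is_integer &&
         !std::numeric_limits<UInt>::is_signed &&
         std::numeric_limits<UInt>::digits == 32;
}

// Alternative: keep sizeof(UInt) == 4 and state the byte-width assumption once.
static_assert(CHAR_BIT == 8, "the sizeof-based width check assumes 8-bit bytes");

static_assert(is_uint32_like<std::uint32_t>(), "exactly 32 value bits");
static_assert(!is_uint32_like<std::uint64_t>(), "64-bit types are excluded");
```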

Source code of brute force checker
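// Brute-force checker over all 32-bit values (uses the GCC/Clang __builtin_printf).
using ull = unsigned long long;

int main() {
  const ull mul = (1ull << 39) / 100 + 1;

  // lookup[k] holds the remainder i % 100 seen for 7-bit index k, or -1 if unused.
  int lookup[1 << 7];
  for (int i = 1 << 7; i-- > 0;) {
    lookup[i] = -1;
  }

  for (unsigned i = 0;;) {
    auto& l = lookup[(i * mul) >> (39 - 7) & ((1 << 7) - 1)];
    if (l < 0) l = i % 100;
    // Report any i whose index was already claimed by a different remainder.
    if (l != int(i % 100)) __builtin_printf("%u\n", i);
    // Report any i whose corrected quotient differs from i / 100.
    if (((i * mul) >> 39) + ((i >= (100u << 25)) << 25) != i / 100)
      __builtin_printf(">%u %u %u\n", i, i / 100, unsigned((i * mul) >> 39));
    if (++i == 0) break;  // stop once i wraps around past 2^32 - 1
  }

  // Print the table as string literals, 16 entries per line.
  for (unsigned i = 0; i < sizeof(lookup) / sizeof(lookup[0]); ++i) {
    if (i % 16 == 0) __builtin_printf("\"");
    if (lookup[i] < 0)
      __builtin_printf("  ");  // two spaces keep unused entries two characters wide
    else
      __builtin_printf("%02d", lookup[i]);
    if ((i + 1) % 16 == 0) __builtin_printf("\"\n");
  }
}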

Future work:

  • adapt to write_significand
  • generalize algorithm to work with 64-bit input (will need __int128).

Notes:

  • digits2_i is not constexpr (before C++20)
  • write2digits_i is not constexpr either, so there's no need for the std::is_constant_evaluated check
  • I don't understand why we don't want memcpy when FMT_OPTIMIZE_SIZE is true, but write2digits does that, so I followed it.
  • apparently gcc cannot merge two char loads/stores into one short load/store (even when both chars are loaded into temporaries before being stored back, so aliasing is not a concern here).
  • the benchmark has a very large proportion of values with at most 4 digits, which is why parallel multiplication such as in hofman_fun will always be slower.
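The two ways of copying a digit pair that the notes compare can be sketched as follows. Both function names are hypothetical, and whether the fixed-size memcpy becomes a single 16-bit load and store depends on the compiler, target and flags, so this shows the source-level difference rather than a measured result.

```cpp
#include <cstring>

// Copies one digit pair with two separate char loads and stores.
inline void copy2_bytes(char* out, const char* pair) {
  char a = pair[0];
  char b = pair[1];  // both loads finish before any store
  out[0] = a;
  out[1] = b;
}

// Copies one digit pair with a fixed-size memcpy, which optimizing compilers
// commonly lower to one 2-byte load and one 2-byte store.
inline void copy2_memcpy(char* out, const char* pair) {
  std::memcpy(out, pair, 2);
}
```

Both functions write the same two bytes; only the generated code is expected to differ.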
@user202729
Contributor Author

Sorry for the CI failures. That said, I recommend adding to CONTRIBUTING.md the commands to verify the lint/compiler warnings etc.

@vitaut
Contributor

vitaut commented Dec 16, 2025

Thanks for the PR! Could you check how it performs on itoa-benchmark (https://github.com/fmtlib/format-benchmark/tree/master/src/itoa-benchmark)?

@user202729
Contributor Author

I made a pull request, fmtlib/format-benchmark#31, that adds fmt as an option in itoa_benchmark. Let me know if that accurately benchmarks the {fmt} library's performance.

@vitaut
Contributor

vitaut commented Dec 19, 2025

Thanks for adding fmt to itoa-benchmark. Have you checked the results of your change there and could you post them here?

