I suspect either of the following things could be the cause:
Directly calling vm_call_cfunc requires more optimization effort in GCC, resulting in 30ms-ish compilation time increase for such methods and decreasing the number of methods compiled in a benchmarked period.
Code size increase => icache miss hit
These hypotheses could be verified by some methodologies. However, I'd like to introduce this regardless of the result because this blocks inlining C method's definition.
I may revert this commit when I give up to implement inlining C method definition, which requires this change.
Microbenchmark-wise, this gives slight performance improvement:
$ benchmark-driver -v --rbenv 'before --jit;after --jit' benchmark/mjit_send_cfunc.yml --repeat-count=4 before --jit: ruby 2.8.0dev (2020-04-13T16:25:13Z master fb40495cd9) +JIT [x86_64-linux] after --jit: ruby 2.8.0dev (2020-04-13T23:23:11Z mjit-inline-c bdcd06d159) +JIT [x86_64-linux] Calculating ------------------------------------- before --jit after --jit mjit_send_cfunc 41.961M 56.489M i/s - 100.000M times in 2.383143s 1.770244s Comparison: mjit_send_cfunc after --jit: 56489372.5 i/s before --jit: 41961388.1 i/s - 1.35x slower
Unwrap vm_call_cfunc indirection on JIT
for VM_METHOD_TYPE_CFUNC.
This has been known to decrease optcarrot fps:
and I deleted this capability in an early stage of YARV-MJIT development:
https://github.com/k0kubun/yarv-mjit/commit/0ab130feeefc2b9078a1077e4fec93b3f5e45d07
I suspect either of the following things could be the cause:
Directly calling vm_call_cfunc requires more optimization effort in GCC,
resulting in 30ms-ish compilation time increase for such methods and
decreasing the number of methods compiled in a benchmarked period.
Code size increase => icache miss hit
These hypotheses could be verified by some methodologies. However, I'd
like to introduce this regardless of the result because this blocks
inlining C method's definition.
I may revert this commit when I give up to implement inlining C method
definition, which requires this change.
Microbenchmark-wise, this gives slight performance improvement: