Cost analysis: Remove "Unacceptable" hack #6782
Would considering something like https://github.com/dfinity/ic/blob/master/rs/execution_environment/benches/wasm_instructions/WASM_BENCHMARKS.md be more informative than some made-up numbers?
Interesting, thanks. But separately it might make sense to update all our basic math costs at some point, maybe using their numbers in part. Another source of info could be the benchmark framework begun in this PR - the cost of 5 for casts is from there.
The dfinity people did warn us to take their numbers with a grain of salt. In particular, their runtime does NaN canonicalization, so their floating point operations are much more expensive than we would expect. Creating our own benchmarks to get better informed numbers makes sense to me.
Btw, here is the output of the script here:
Details of what those are can be found in the benchmark file, but overall the first is just computing the length of a linked list (a baseline to see the overhead of memory and the export call), and then pairs of a non-if and an if, that is, patterns of either executing a cast unconditionally or only behind an if. Overall these justify not executing casts unconditionally. (Numbers are on V8, but the pattern is similar in SpiderMonkey too.)
scripts/benchmarking/bench.js Outdated
// We'll call the benchmark functions in random orders.
function makeOrders(prefix) {
Is it standard benchmarking practice to run the benchmarks in random orders? Is this meant to defeat unwanted optimizations? Generating every possible order up front is not going to scale to a larger number of benchmarks. Is there something simpler and more scalable we can do?
> Is it standard benchmarking practice to run the benchmarks in random orders?
There are other ways to deal with benchmark interactions, like running all the tests for A first, then all the tests for B, etc., rather than interleaving. But interleaving actually makes it more realistic since real-world code is mixed in with other stuff, and it's simple enough to handle here, so it seems appropriate to me.
> Is this meant to defeat unwanted optimizations?
Mainly to avoid order being an issue. Imagine that running A, B, C happens to have A warm up the cache for B, or B reset the branch predictor for C. Random orders avoid that.
> Generating every possible order up front is not going to scale to a larger number of benchmarks. Is there something simpler and more scalable we can do?
Yeah, past some point it can't work, but so long as we don't hit that limit it is faster to do it this way. The other way would be to generate an unbiased random order on the fly each time, which is not hard, but just takes more work.
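For illustration, generating an unbiased random order on the fly is just a Fisher-Yates shuffle. Below is a minimal sketch of that technique in C++ (the function name and types are made up for this example; the real script is JavaScript, so this only shows the algorithm, not bench.js code):

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Return a uniformly random ordering of the benchmark indices 0..n-1,
// generated on the fly rather than enumerating every permutation up front.
std::vector<size_t> makeRandomOrder(size_t n, std::mt19937_64& rng) {
  std::vector<size_t> order(n);
  std::iota(order.begin(), order.end(), 0);
  // Fisher-Yates: swap each position with a uniformly chosen earlier-or-equal one.
  for (size_t i = n; i > 1; i--) {
    std::uniform_int_distribution<size_t> dist(0, i - 1);
    std::swap(order[i - 1], order[dist(rng)]);
  }
  return order;
}
```

In practice `std::shuffle` does the same thing; in the JS script a hand-written loop over `Math.random()` would serve the same purpose.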
  )
)

(func $makeC (export "makeC") (param $next (ref null $A)) (result anyref)
This is never called from the benchmarking script. Intentional?
Yeah, having `$C` prevents the optimizer from thinking `$B` could be final. And so far there isn't a benchmark that uses `$C`. I'll add a comment.
// The cost of throwing a wasm exception. This does not include the cost of
// catching it (which might be in another function than the one we are
// considering).
static const CostType ThrowCost = 10;
Even when pulling numbers out of nowhere, how do you divide the total cost between the catch and the throw? Neither executes without the other (assuming throws are caught at all).
Yeah, I'm not sure about Throw. My intuition is this cost would be the total cost (including the catch), since we don't add any cost in Try.
static_assert(TooCostlyToRunUnconditionally < CostAnalyzer::Unacceptable,
              "We never run code unconditionally if it has unacceptable cost");
There are still instructions we know should not be run unconditionally if it can be avoided; is there something else we can replace this assertion with?
Good idea, done.
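For reference, a narrower guard could assert on the specific costs the optimizer relies on rather than on a catch-all constant. This is a hypothetical sketch only; the names are taken from the snippets above, but whether it matches the check that actually landed is an assumption:

```cpp
// Hypothetical sketch: instead of asserting against a catch-all "Unacceptable"
// constant, assert per-cost that instructions we never want to hoist stay at
// or above the hoisting threshold. ThrowCost is the constant shown earlier;
// this exact check is an assumption, not necessarily what the PR added.
static_assert(CostAnalyzer::ThrowCost >= TooCostlyToRunUnconditionally,
              "throws should not be moved to unconditional execution");
```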
All comments should be addressed - @tlively did you have anything else?
Nope, LGTM
We marked various expressions as having cost "Unacceptable", fixed at 100, to
ensure we never moved them out from an If arm, etc. Giving them such a high
cost avoids that problem - the cost is higher than the limit we have for moving
code from conditional to unconditional execution - but it also means the total
cost is unrealistic. For example, a function with one such instruction + an add
(cost 1) would end up with cost 101, and removing the add would look
insignificant, which causes issues for things that want to compare costs
(like Monomorphization).
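To make that arithmetic concrete, here is a small hedged sketch; the constants and the comparison are illustrative only, not Binaryen's actual CostAnalyzer API:

```cpp
#include <cstdint>
#include <iostream>

using CostType = uint32_t;

// Hypothetical costs, for illustration only.
constexpr CostType Unacceptable = 100; // the old hack: an artificially huge cost
constexpr CostType AddCost = 1;

int main() {
  // A function body containing one "unacceptable" instruction plus an add.
  CostType before = Unacceptable + AddCost; // 101
  CostType after = Unacceptable;            // 100, after optimizing out the add
  // A pass that compares costs (like Monomorphization) sees almost no benefit,
  // because the inflated constant dominates the total.
  std::cout << "relative improvement: "
            << double(before - after) / before << "\n"; // ~0.0099
}
```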
To fix this, adjust some costs. The main change here is to give casts a cost of 5.
I measured this in depth, see the attached benchmark scripts, and it looks
clear that in both V8 and SpiderMonkey the cost of a cast is high enough to
make it not worth turning an if with a `ref.test` arm into a select (which would
always execute the test).
Other costs adjusted here matter a lot less, because they are on operations
that have side effects and so the optimizer will anyhow not move them from
conditional to unconditional execution, but I tried to make them a bit more
realistic while I was removing "Unacceptable":

- Atomic operations now cost more than normal loads and stores. Perhaps wait
  and notify should be slower, however, but it seems like assuming fast
  switching might be more relevant.
- Exception-handling numbers are entirely made up as I am not even sure how to
  measure them in a useful way (but, again, this should not matter much as
  they have side effects).
I verified that building a large Java program with this PR causes 0 changes
to code, so this should not risk regressions, though in principle this is not NFC
(and it unlocks Monomorphization work because of that).