Evaluating Benchmark Quality: a Mutation-Testing-Based Methodology
Federico Lochbaum, Guillermo Polito (EVREF Team)
IWST 2025

Test Cases
Let's suppose we have a broken house...
"Hi! I'm a house!"

Test Cases
And we want to repair it...
"Hi! I am the repair program! I am a test case!"

How well did we do?

How well did we do?
Passed!

Benchmarks
- Measure execution time (wall-clock time)
- Often built by averaging benchmark results to reduce contextual variance (see the sketch below)
- Some work proposes other kinds of metrics, such as energy consumption or memory usage
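As a minimal sketch in Pharo (the language of the case study later in the talk), assuming a hypothetical block `workload` standing in for the code under measurement:

    | workload times |
    "Hypothetical workload standing in for the benchmark's target"
    workload := [ (1 to: 100000) inject: 0 into: [ :sum :each | sum + each ] ].
    "Wall-clock time of each run, averaged over 30 iterations to reduce variance"
    times := (1 to: 30) collect: [ :i | Time millisecondsToRun: workload ].
    (times sum / times size) asFloat   "average execution time in milliseconds"
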
Benchmarks
"I am the benchmark!"

How fast did we do it?

How fast did we do it?
450 ms!

Test Cases vs Benchmarks

Test Cases
- Execute a series of steps to validate the program's behavior
- Check correctness (pass/fail)
- Self-validating
- One execution is enough
- Results are architecture-independent

Benchmarks
- Stress the program to assess performance
- Measure performance metrics (elapsed time / CPU)
- Not self-validating
- Require multiple runs to cope with noise
- Results are architecture-dependent
- Expensive to run

A concrete contrast is sketched below.
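A hedged Pharo sketch of the contrast (class and method names are illustrative, not from the paper): the test yields a verdict after one run, while the benchmark yields a number that only makes sense over repeated runs.

    "A test case: self-validating, one execution is enough"
    TestCase subclass: #HouseRepairTest
        instanceVariableNames: ''
        classVariableNames: ''
        package: 'Illustration'

    HouseRepairTest >> testRepairFixesTheHouse
        self assert: House new repair isFixed

    "A benchmark: repeated runs, producing a measurement instead of a verdict"
    (1 to: 30) collect: [ :i | Time millisecondsToRun: [ House new repair ] ]
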
How fast did we do it?
450 ms!
Is this measurement good enough?

How do we know that a benchmark is "good"?
450 ms! 312 ms!

Problem: Assessing Benchmark Quality
There is a lack of systematic methodologies to assess benchmark effectiveness:
- What does benchmark quality mean?
- How do we measure a benchmark's effectiveness at detecting performance bugs?
- How do we introduce performance issues into a target program so they can be detected?

Mutation Testing Benchmark Methodology: Proposal
Assumption: a benchmark is good if it detects performance bugs.

Mutation Testing?
- Mutation testing measures test quality by its capability to detect bugs
- It introduces simulated bugs (mutants) and assesses whether the tests catch them (see the sketch below)
- If a test fails → the mutant was killed (detected)
- If not → the mutant survived (undetected)
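In code, the classic loop is roughly the following hedged sketch; `mutants` and `suite` are assumed inputs, `install`/`uninstall` on mutants are hypothetical helpers, while `suite run` and its `TestResult` are standard SUnit:

    | killed |
    killed := mutants count: [ :mutant |
        | result |
        mutant install.
        "Re-run the tests against the mutated program"
        result := [ suite run ] ensure: [ mutant uninstall ].
        result hasFailures or: [ result hasErrors ] ].  "test broke -> mutant killed"
    killed / mutants size   "mutation score"
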
Adapting Mutation Testing for Performance
- It introduces performance bugs (mutants) and assesses whether the benchmark catches them
- An oracle determines whether the benchmark kills the mutant or not

Mutation Strategy
Introduce a controlled performance bug (mutant) into the original program.
[Diagram: a mutant operator transforms the Program into the Mutated Program, which is the benchmark's target]

Adapting Mutation Testing for Performance
- Introduce performance perturbations (RQ1)
- A performance oracle determines whether the bug had a performance impact (RQ2)

RQ1: What is a Performance Bug?

RQ1: Performance Bugs
- Perturbations of the program's execution (e.g., latency, locality issues)
- Excessive consumption of time or space by design (e.g., long iterations, bad implementation decisions)
- Non-optimal data structures for the problem (e.g., using an array instead of a dictionary to index a dataset; see the sketch below)
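The last category can be made concrete with a small Pharo example: every lookup in an unindexed Array is a linear scan, whereas building a Dictionary once gives near-constant-time access.

    | pairs dict |
    pairs := (1 to: 10000) collect: [ :i | i -> i squared ].
    "Non-optimal: a linear scan over the Array on every lookup"
    (pairs detect: [ :assoc | assoc key = 9999 ]) value.
    "Better: index the dataset once in a Dictionary"
    dict := Dictionary newFrom: pairs.
    dict at: 9999
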
RQ2: How do we Assess a Benchmark?

RQ2: The Benchmark Oracle
Let's define benchmark quality as "how sensitive the benchmark is to detecting a mutant".
- What is a benchmark's sensitivity?
- Where is the threshold, and what do we compare it against?

Experimental Mutants: Sleep Statements
- Why? They represent latency
- How? Three mutant operators: 10, 100, and 500 milliseconds (see the sketch below)
- Where? At the beginning of every statement block
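A hedged sketch of what the 100 ms operator does to a method body (the class, the method, and its original logic are illustrative; the 10 ms and 500 ms operators are analogous):

    ExampleClass >> doWork
        "Mutated version: the operator prepends a sleep to the statement block"
        (Delay forMilliseconds: 100) wait.   "injected latency mutant"
        ^ self originalDoWork                "stands in for the original statements"
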
Experimental Oracle
- Baseline: average + stdev of 30 iterations, to reduce external noise
- Metric: execution time
- Mutant detection: a mutant is killed if its execution time > baseline average + stdev (sketched below)
[Plot: mutant runtimes against the baseline; runs beyond the stdev threshold are killed, the rest survive]
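A minimal sketch of this oracle in Pharo, assuming a zero-argument block `benchmark` as the benchmark body and the mutant already installed for the final run:

    | times mean stdev threshold mutatedTime |
    "Baseline: 30 iterations of the unmutated benchmark to reduce external noise"
    times := (1 to: 30) collect: [ :i | Time millisecondsToRun: benchmark ].
    mean := times sum / times size.
    stdev := ((times collect: [ :t | (t - mean) squared ]) sum / times size) sqrt.
    threshold := mean + stdev.
    "Single run against the mutated program"
    mutatedTime := Time millisecondsToRun: benchmark.
    mutatedTime > threshold
        ifTrue: [ #killed ]
        ifFalse: [ #survivor ]
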
Case Study: Regular Expressions in Pharo
- Why regexes? → They are well known
- 100 regexes are generated via grammar-based fuzzing (MCTS)
- We benchmark the regex matches: method with the generated regexes (see the sketch below), e.g.:
  'b+a(b)*c+(b+c*b|b?(b+b)c?(b)*ab+a?aa)?a*' matches: 'bac'
  '(bb(((b)b+)))+b+|b+' matches: 'bbbbb'
- We assess the quality of these benchmarks at finding performance bugs in the matches: method
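One such generated benchmark, sketched with Pharo's Regex library (assuming `asRegex` and `RxMatcher>>matches:` are the calls behind the slide's shorthand):

    | matcher |
    matcher := 'b+a(b)*c+(b+c*b|b?(b+b)c?(b)*ab+a?aa)?a*' asRegex.
    Time millisecondsToRun: [ matcher matches: 'bac' ]
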
Results
- We introduce 62 mutants per mutant operator (3 operators, 186 mutants in total)
- We execute every benchmark once per mutant (186 executions)

Average Benchmark Behavior
On average, the mutation score per benchmark is 51.48%.
[Bar chart: mutation score (%) per benchmark]

High-Score Benchmarks
11% of the benchmarks have a mutation score > 60%.
[Bar chart: mutation score (%) per benchmark]

Performance Perturbation Sensitivity
Some benchmarks are more sensitive to perturbations than others.
[Bar chart: mutation score (%) per benchmark]

Baseline Characterization
The average relative stdev is 42.48%.

Baseline Characterization
13% of the baselines have high variance: they cannot detect small perturbations.

Future Work
- Improve benchmark selection by filtering out benchmarks with high variance
- Study techniques to minimize external noise
- Experiment with different oracle thresholds
- Experiment with different performance mutant operators
- Study alternative metrics to reduce the number of executions needed for a stable measurement

Conclusion
- A systematic methodology to evaluate benchmarks' effectiveness
- Artificial performance bugs introduced by extending mutation testing
- An instance of the framework in a real setting, with results

Federico Lochbaum, Guillermo Polito
IWST 2025
