guangy10 (Contributor) commented May 19, 2025

Summary

Adds a custom script for assessing the stability of ExecuTorch (ET) benchmark results.

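Install the dependencies:

pip install openpyxl tabulate matplotlib

Then run the script on the primary (private-device) dataset, passing the public-device dataset as the reference for comparison:

python .ci/scripts/analyze_benchmark_stability.py \
  Benchmark\ Dataset\ with\ Private\ AWS\ Devices.xlsx \
  --reference_file Benchmark\ Dataset\ with\ Public\ AWS\ Devices.xlsx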
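For orientation, here is a minimal sketch of how the dispersion and inter-run jitter figures that appear in the generated analysis (CV, IQR, max/min ratio, P99/P50 ratio, mean rolling std) could be computed from one dataset's latencies. It is not the code in analyze_benchmark_stability.py: the function names, the use of the sample standard deviation, and the linear-interpolation percentile are all assumptions, and the overall stability score is deliberately not reproduced because its formula is not stated here.

```python
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of pre-sorted data."""
    if not sorted_values:
        raise ValueError("percentile() requires at least one value")
    rank = (len(sorted_values) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def rolling_std_mean(values, window=5):
    """Mean of the sample standard deviations over each sliding window."""
    if len(values) < window:
        return float("nan")
    stds = [
        statistics.stdev(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]
    return statistics.fmean(stds)


def stability_metrics(latencies_ms, window=5):
    """Hypothetical helper: summarize run-to-run latency variation.

    Assumes at least two positive latencies, given in chronological order.
    """
    ordered = sorted(latencies_ms)
    mean = statistics.fmean(ordered)
    std = statistics.stdev(ordered)  # assumption: sample (n - 1) std
    p50 = percentile(ordered, 50)
    p99 = percentile(ordered, 99)
    return {
        "mean": mean,
        "std": std,
        "cv_percent": 100.0 * std / mean,
        "iqr": percentile(ordered, 75) - percentile(ordered, 25),
        "p50": p50,
        "p99": p99,
        "max_min_ratio": ordered[-1] / ordered[0],
        "p99_p50_ratio": p99 / p50,
        # Rolling std uses the original (chronological) order, not the sorted one.
        "mean_rolling_std": rolling_std_mean(list(latencies_ms), window),
    }


if __name__ == "__main__":
    # Made-up latencies in milliseconds, for illustration only.
    sample = [100, 102, 98, 101, 99, 103, 97, 100]
    for name, value in stability_metrics(sample).items():
        print(f"{name}: {value:.4f}")
```

A low CV together with max/min and P99/P50 ratios close to 1 corresponds to the runs the report rates as stable, while ratios well above 1 point to outlier runs or tail-latency spikes.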

Datasets:

The generated analysis:

Analyzing latency stability from primary file: /Users/guangyang/Desktop/Benchmark Dataset with Private AWS Devices.xlsx
Using reference file for comparison: /Users/guangyang/Desktop/Benchmark Dataset with Public AWS Devices.xlsx

====================================================================================================
===== LOADING PRIMARY DATASETS (Private) ==========================================================
====================================================================================================
Loading dataset: llama3_qlora+s22_android13
Loading dataset: llama3_spinq+s22_android13
Loading dataset: mv3_qnn+s22_android13
Loading dataset: mv3_xnnq8+s22_android13
Loading dataset: llama3_qlora+s22ultra_android14
Loading dataset: llama3_spinq+s22ultra_android14
Loading dataset: mv3_qnn+s22ultra_android14
Loading dataset: mv3_xnnq8+s22ultra_android14
Loading dataset: mv3_xnnq8+pixel3_rooted_android
Loading dataset: llama3_qlora+iphone15max_ios17
Loading dataset: llama3_spinq+iphone15max_ios17
Loading dataset: mv3_xnnq8+iphone15max_ios17
Loading dataset: mv3_coreml+iphone15max_ios17
Loading dataset: mv3_mps+iphone15max_ios17
Loading dataset: llama3_qlora+iphone15_ios18
Loading dataset: llama3_spinq+iphone15_ios18
Loading dataset: mv3_xnnq8+iphone15_ios18
Loading dataset: mv3_coreml+iphone15_ios18
Loading dataset: mv3_mps+iphone15_ios18

====================================================================================================
===== LOADING REFERENCE DATASETS (Public) =========================================================
====================================================================================================
Loading reference dataset: llama3_qlora+s22_android13
Loading reference dataset: llama3_spinq+s22_android13
Loading reference dataset: mv3_qnn+s22_android13
Loading reference dataset: mv3_xnnq8+s22_android13
Loading reference dataset: llama3_spinq+s22_android12
Loading reference dataset: llama3_qlora+s22Ultra5G_android
Loading reference dataset: llama3_spinq+s22ultra_android12
Loading reference dataset: mv3_xnnq8+s22ultra_android12
Loading reference dataset: mv3_qnn+s22ultra_android12
Loading reference dataset: llama3_qlora+iphone15max_ios17
Loading reference dataset: llama3_spinq+iphone15max_ios17
Loading reference dataset: mv3_xnnq8+iphone15max_ios17
Loading reference dataset: mv3_coreml+iphone15max_ios17
Loading reference dataset: mv3_mps+iphone15max_ios17
Loading reference dataset: llama3_qlora+iphone15_ios18
Loading reference dataset: llama3_spinq+iphone15_ios18
Loading reference dataset: mv3_xnnq8+iphone15_ios18
Loading reference dataset: mv3_coreml+iphone15_ios18
Loading reference dataset: mv3_mps+iphone15_ios18

====================================================================================================
===== ANALYZING PRIMARY DATASETS ==================================================================
====================================================================================================

Latency Stability Analysis: llama3_qlora+s22_android13 (Primary)
================================================================================
Model: llama3_qlora
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 22502.10 ms
  - Median latency (P50): 22447.56 ms
  - Mean trimmed latency: 22388.87 ms
  - Median trimmed latency: 22343.47 ms

Dispersion Metrics:
  - Standard deviation: 595.01 ms
  - Coefficient of variation (CV): 2.64%
  - Interquartile range (IQR): 858.26 ms
  - Trimmed standard deviation: 596.25 ms
  - Trimmed coefficient of variation: 2.66%

Percentile Metrics:
  - P50 (median): 22447.56 ms
  - P90: 23231.99 ms
  - P95: 23518.35 ms
  - P99: 23910.11 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1423
  - P99/P50 ratio: 1.0652
  - Mean rolling std (window=5): 539.36 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.50%
  - Max trimming effect ratio: 0.81%

Throughput Metrics:
  - Mean TPS: 33.07
  - TPS coefficient of variation: 6.92%

Stability Assessment:
  - Overall stability score: 83.4/100
  - Overall stability rating: Good

Interpretation:
The benchmark shows good stability (score: 83.4/100) with low variation between runs (CV: 2.64%). Performance is consistent and predictable for most use cases.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_primary_time_series.png

Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
================================================================================
Model: llama3_spinq
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 21771.59 ms
  - Median latency (P50): 21668.24 ms
  - Mean trimmed latency: 21662.53 ms
  - Median trimmed latency: 21559.89 ms

Dispersion Metrics:
  - Standard deviation: 514.89 ms
  - Coefficient of variation (CV): 2.36%
  - Interquartile range (IQR): 602.75 ms
  - Trimmed standard deviation: 515.03 ms
  - Trimmed coefficient of variation: 2.38%

Percentile Metrics:
  - P50 (median): 21668.24 ms
  - P90: 22438.74 ms
  - P95: 22542.42 ms
  - P99: 23104.76 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1452
  - P99/P50 ratio: 1.0663
  - Mean rolling std (window=5): 449.10 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.50%
  - Max trimming effect ratio: 0.89%

Throughput Metrics:
  - Mean TPS: 33.76
  - TPS coefficient of variation: 4.70%

Stability Assessment:
  - Overall stability score: 84.7/100
  - Overall stability rating: Good

Interpretation:
The benchmark shows good stability (score: 84.7/100) with low variation between runs (CV: 2.36%). Performance is consistent and predictable for most use cases.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_primary_time_series.png

Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
================================================================================
Model: mv3_qnn
Device: s22_android13

Dataset Overview:
  - Number of samples: 100
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00

Central Tendency Metrics:
  - Mean latency: 1.01 ms
  - Median latency (P50): 1.00 ms
  - Mean trimmed latency: 1.00 ms
  - Median trimmed latency: 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.02 ms
  - Coefficient of variation (CV): 2.34%
  - Interquartile range (IQR): 0.01 ms
  - Trimmed standard deviation: 0.02 ms
  - Trimmed coefficient of variation: 2.27%

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.01 ms
  - P95: 1.01 ms
  - P99: 1.14 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1919
  - P99/P50 ratio: 1.1404
  - Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.19%
  - Max trimming effect ratio: 1.00%

Stability Assessment:
  - Overall stability score: 82.4/100
  - Overall stability rating: Good

Interpretation:
The benchmark shows good stability (score: 82.4/100) with low variation between runs (CV: 2.34%). Performance is consistent and predictable for most use cases. The P99/P50 ratio of 1.14 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.73 ms
  - Median latency (P50): 2.65 ms
  - Mean trimmed latency: 2.22 ms
  - Median trimmed latency: 2.10 ms

Dispersion Metrics:
  - Standard deviation: 0.63 ms
  - Coefficient of variation (CV): 23.03%
  - Interquartile range (IQR): 0.95 ms
  - Trimmed standard deviation: 0.36 ms
  - Trimmed coefficient of variation: 15.98%

Percentile Metrics:
  - P50 (median): 2.65 ms
  - P90: 3.59 ms
  - P95: 3.74 ms
  - P99: 4.46 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.4427
  - P99/P50 ratio: 1.6812
  - Mean rolling std (window=5): 0.60 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 16.52%
  - Max trimming effect ratio: 36.96%

Stability Assessment:
  - Overall stability score: 14.9/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 14.9/100) with significant variation between runs (CV: 23.03%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (16.5%) with occasional outliers within benchmark runs. The max/min ratio of 2.44 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.68 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_primary_time_series.png

Latency Stability Analysis: llama3_qlora+s22ultra_android14 (Primary)
================================================================================
Model: llama3_qlora
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 25022.84 ms
  - Median latency (P50): 25427.33 ms
  - Mean trimmed latency: 24748.06 ms
  - Median trimmed latency: 25062.01 ms

Dispersion Metrics:
  - Standard deviation: 1545.62 ms
  - Coefficient of variation (CV): 6.18%
  - Interquartile range (IQR): 2844.11 ms
  - Trimmed standard deviation: 1467.60 ms
  - Trimmed coefficient of variation: 5.93%

Percentile Metrics:
  - P50 (median): 25427.33 ms
  - P90: 26581.31 ms
  - P95: 27184.07 ms
  - P99: 28668.97 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.2710
  - P99/P50 ratio: 1.1275
  - Mean rolling std (window=5): 1560.71 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 1.08%
  - Max trimming effect ratio: 4.80%

Throughput Metrics:
  - Mean TPS: 28.35
  - TPS coefficient of variation: 7.88%

Stability Assessment:
  - Overall stability score: 62.5/100
  - Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 62.5/100) with noticeable variation between runs (CV: 6.18%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.27 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.13 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: llama3_spinq+s22ultra_android14 (Primary)
================================================================================
Model: llama3_spinq
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 24761.78 ms
  - Median latency (P50): 25043.89 ms
  - Mean trimmed latency: 24466.21 ms
  - Median trimmed latency: 24731.04 ms

Dispersion Metrics:
  - Standard deviation: 1552.25 ms
  - Coefficient of variation (CV): 6.27%
  - Interquartile range (IQR): 1931.42 ms
  - Trimmed standard deviation: 1466.19 ms
  - Trimmed coefficient of variation: 5.99%

Percentile Metrics:
  - P50 (median): 25043.89 ms
  - P90: 26163.60 ms
  - P95: 26948.68 ms
  - P99: 28868.51 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3648
  - P99/P50 ratio: 1.1527
  - Mean rolling std (window=5): 1451.05 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 1.17%
  - Max trimming effect ratio: 4.90%

Throughput Metrics:
  - Mean TPS: 29.85
  - TPS coefficient of variation: 8.24%

Stability Assessment:
  - Overall stability score: 60.3/100
  - Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 60.3/100) with noticeable variation between runs (CV: 6.27%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.36 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.15 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_qnn+s22ultra_android14 (Primary)
================================================================================
Model: mv3_qnn
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 100
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00

Central Tendency Metrics:
  - Mean latency: 1.01 ms
  - Median latency (P50): 1.01 ms
  - Mean trimmed latency: 1.01 ms
  - Median trimmed latency: 1.01 ms

Dispersion Metrics:
  - Standard deviation: 0.01 ms
  - Coefficient of variation (CV): 0.91%
  - Interquartile range (IQR): 0.01 ms
  - Trimmed standard deviation: 0.01 ms
  - Trimmed coefficient of variation: 0.70%

Percentile Metrics:
  - P50 (median): 1.01 ms
  - P90: 1.02 ms
  - P95: 1.02 ms
  - P99: 1.03 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0900
  - P99/P50 ratio: 1.0204
  - Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.19%
  - Max trimming effect ratio: 1.94%

Stability Assessment:
  - Overall stability score: 93.8/100
  - Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 93.8/100) with very low variation between runs (CV: 0.91%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22ultra_android14 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.91 ms
  - Median latency (P50): 2.54 ms
  - Mean trimmed latency: 2.41 ms
  - Median trimmed latency: 2.15 ms

Dispersion Metrics:
  - Standard deviation: 1.14 ms
  - Coefficient of variation (CV): 39.08%
  - Interquartile range (IQR): 0.82 ms
  - Trimmed standard deviation: 0.76 ms
  - Trimmed coefficient of variation: 31.60%

Percentile Metrics:
  - P50 (median): 2.54 ms
  - P90: 3.88 ms
  - P95: 4.60 ms
  - P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 5.6103
  - P99/P50 ratio: 2.3319
  - Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 15.37%
  - Max trimming effect ratio: 38.83%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 39.08%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs. The max/min ratio of 5.61 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 2.33 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+pixel3_rooted_android (Primary)
================================================================================
Model: mv3_xnnq8
Device: pixel3_rooted_android

Dataset Overview:
  - Number of samples: 148
  - Date range: 2025-04-16 02:47:21+00:00 to 2025-04-29 01:17:49+00:00

Central Tendency Metrics:
  - Mean latency: 5.93 ms
  - Median latency (P50): 5.87 ms
  - Mean trimmed latency: 5.51 ms
  - Median trimmed latency: 5.45 ms

Dispersion Metrics:
  - Standard deviation: 0.46 ms
  - Coefficient of variation (CV): 7.68%
  - Interquartile range (IQR): 0.56 ms
  - Trimmed standard deviation: 0.27 ms
  - Trimmed coefficient of variation: 4.84%

Percentile Metrics:
  - P50 (median): 5.87 ms
  - P90: 6.44 ms
  - P95: 6.57 ms
  - P99: 7.26 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.6964
  - P99/P50 ratio: 1.2386
  - Mean rolling std (window=5): 0.41 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 6.66%
  - Max trimming effect ratio: 26.67%

Stability Assessment:
  - Overall stability score: 46.9/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 46.9/100) with significant variation between runs (CV: 7.68%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (6.7%) with occasional outliers within benchmark runs. The max/min ratio of 1.70 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.24 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+pixel3_rooted_android_primary_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 54
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 12972.80 ms
  - Median latency (P50): 12774.50 ms

Dispersion Metrics:
  - Standard deviation: 483.26 ms
  - Coefficient of variation (CV): 3.73%
  - Interquartile range (IQR): 624.00 ms

Percentile Metrics:
  - P50 (median): 12774.50 ms
  - P90: 13389.70 ms
  - P95: 13736.05 ms
  - P99: 14730.49 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1916
  - P99/P50 ratio: 1.1531
  - Mean rolling std (window=5): 431.32 ms

Throughput Metrics:
  - Mean TPS: 10.18
  - TPS coefficient of variation: 11.47%

Stability Assessment:
  - Overall stability score: 75.2/100
  - Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 75.2/100) with noticeable variation between runs (CV: 3.73%). While average performance is acceptable, occasional latency spikes may occur. The P99/P50 ratio of 1.15 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 54
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 12195.41 ms
  - Median latency (P50): 12104.50 ms

Dispersion Metrics:
  - Standard deviation: 461.27 ms
  - Coefficient of variation (CV): 3.78%
  - Interquartile range (IQR): 154.25 ms

Percentile Metrics:
  - P50 (median): 12104.50 ms
  - P90: 12567.20 ms
  - P95: 12760.05 ms
  - P99: 14052.31 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3331
  - P99/P50 ratio: 1.1609
  - Mean rolling std (window=5): 365.79 ms

Throughput Metrics:
  - Mean TPS: 13.89
  - TPS coefficient of variation: 16.58%

Stability Assessment:
  - Overall stability score: 72.9/100
  - Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 72.9/100) with noticeable variation between runs (CV: 3.78%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.33 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.16 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 54
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 13.98 ms
  - Median latency (P50): 14.00 ms

Dispersion Metrics:
  - Standard deviation: 3.44 ms
  - Coefficient of variation (CV): 24.60%
  - Interquartile range (IQR): 4.00 ms

Percentile Metrics:
  - P50 (median): 14.00 ms
  - P90: 18.00 ms
  - P95: 20.00 ms
  - P99: 21.94 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 3.2857
  - P99/P50 ratio: 1.5671
  - Mean rolling std (window=5): 3.40 ms

Stability Assessment:
  - Overall stability score: 10.8/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 10.8/100) with significant variation between runs (CV: 24.60%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 3.29 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.57 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_coreml
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 50
  - Date range: 2025-04-30 05:23:09+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 1.00 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.00 ms
  - Coefficient of variation (CV): 0.00%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.00 ms

Stability Assessment:
  - Overall stability score: 100.0/100
  - Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low variation between runs (CV: 0.00%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_mps
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 51
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 1.25 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.44 ms
  - Coefficient of variation (CV): 35.07%
  - Interquartile range (IQR): 0.50 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 2.00 ms
  - P95: 2.00 ms
  - P99: 2.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.0000
  - P99/P50 ratio: 2.0000
  - Mean rolling std (window=5): 0.39 ms

Stability Assessment:
  - Overall stability score: 12.5/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 12.5/100) with significant variation between runs (CV: 35.07%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.00 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 2.00 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15_ios18 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 121
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 23169.07 ms
  - Median latency (P50): 21328.00 ms

Dispersion Metrics:
  - Standard deviation: 5889.20 ms
  - Coefficient of variation (CV): 25.42%
  - Interquartile range (IQR): 8558.00 ms

Percentile Metrics:
  - P50 (median): 21328.00 ms
  - P90: 31324.00 ms
  - P95: 33057.00 ms
  - P99: 40256.40 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 3.0072
  - P99/P50 ratio: 1.8875
  - Mean rolling std (window=5): 4851.03 ms

Throughput Metrics:
  - Mean TPS: 3.32
  - TPS coefficient of variation: 34.24%

Stability Assessment:
  - Overall stability score: 2.8/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 2.8/100) with significant variation between runs (CV: 25.42%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 3.01 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.89 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15_ios18 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 116
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 22076.03 ms
  - Median latency (P50): 20174.00 ms

Dispersion Metrics:
  - Standard deviation: 6076.94 ms
  - Coefficient of variation (CV): 27.53%
  - Interquartile range (IQR): 7826.00 ms

Percentile Metrics:
  - P50 (median): 20174.00 ms
  - P90: 32507.00 ms
  - P95: 34673.00 ms
  - P99: 37690.75 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.7320
  - P99/P50 ratio: 1.8683
  - Mean rolling std (window=5): 4837.19 ms

Throughput Metrics:
  - Mean TPS: 4.90
  - TPS coefficient of variation: 35.91%

Stability Assessment:
  - Overall stability score: 6.6/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 6.6/100) with significant variation between runs (CV: 27.53%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.73 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.87 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15_ios18 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 121
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 48.23 ms
  - Median latency (P50): 47.00 ms

Dispersion Metrics:
  - Standard deviation: 6.19 ms
  - Coefficient of variation (CV): 12.84%
  - Interquartile range (IQR): 6.00 ms

Percentile Metrics:
  - P50 (median): 47.00 ms
  - P90: 55.00 ms
  - P95: 57.00 ms
  - P99: 64.40 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.2973
  - P99/P50 ratio: 1.3702
  - Mean rolling std (window=5): 5.53 ms

Stability Assessment:
  - Overall stability score: 24.5/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 24.5/100) with significant variation between runs (CV: 12.84%). Performance is unpredictable and may lead to inconsistent user experience. The max/min ratio of 2.30 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.37 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15_ios18 (Primary)
================================================================================
Model: mv3_coreml
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 114
  - Date range: 2025-04-30 05:23:09+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 1.00 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.00 ms
  - Coefficient of variation (CV): 0.00%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.00 ms

Stability Assessment:
  - Overall stability score: 100.0/100
  - Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low variation between runs (CV: 0.00%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15_ios18_primary_time_series.png

Latency Stability Analysis: mv3_mps+iphone15_ios18 (Primary)
================================================================================
Model: mv3_mps
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 118
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-22 22:41:19+00:00

Central Tendency Metrics:
  - Mean latency: 4.01 ms
  - Median latency (P50): 4.00 ms

Dispersion Metrics:
  - Standard deviation: 0.16 ms
  - Coefficient of variation (CV): 3.99%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 4.00 ms
  - P90: 4.00 ms
  - P95: 4.00 ms
  - P99: 4.83 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.6667
  - P99/P50 ratio: 1.2075
  - Mean rolling std (window=5): 0.06 ms

Stability Assessment:
  - Overall stability score: 66.5/100
  - Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 66.5/100) with noticeable variation between runs (CV: 3.99%). While average performance is acceptable, occasional latency spikes may occur. The max/min ratio of 1.67 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 1.21 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15_ios18_primary_time_series.png

====================================================================================================
===== ANALYZING REFERENCE DATASETS ================================================================
====================================================================================================

Latency Stability Analysis: llama3_qlora+s22_android13 (Reference)
================================================================================
Model: llama3_qlora
Device: s22_android13

Dataset Overview:
  - Number of samples: 48
  - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00

Central Tendency Metrics:
  - Mean latency: 23841.98 ms
  - Median latency (P50): 23381.83 ms
  - Mean trimmed latency: 23727.32 ms
  - Median trimmed latency: 23286.98 ms

Dispersion Metrics:
  - Standard deviation: 2079.97 ms
  - Coefficient of variation (CV): 8.72%
  - Interquartile range (IQR): 3183.16 ms
  - Trimmed standard deviation: 2068.95 ms
  - Trimmed coefficient of variation: 8.72%

Percentile Metrics:
  - P50 (median): 23381.83 ms
  - P90: 26530.88 ms
  - P95: 27370.45 ms
  - P99: 28001.62 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.4300
  - P99/P50 ratio: 1.1976
  - Mean rolling std (window=5): 1967.20 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.48%
  - Max trimming effect ratio: 1.00%

Throughput Metrics:
  - Mean TPS: 32.18
  - TPS coefficient of variation: 7.85%

Stability Assessment:
  - Overall stability score: 46.1/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 46.1/100) with significant variation between runs (CV: 8.72%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.43 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.20 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_reference_time_series.png

Latency Stability Analysis: llama3_spinq+s22_android13 (Reference)
================================================================================
Model: llama3_spinq
Device: s22_android13

Dataset Overview:
  - Number of samples: 48
  - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00

Central Tendency Metrics:
  - Mean latency: 22774.60 ms
  - Median latency (P50): 22491.89 ms
  - Mean trimmed latency: 22648.15 ms
  - Median trimmed latency: 22393.30 ms

Dispersion Metrics:
  - Standard deviation: 1947.04 ms
  - Coefficient of variation (CV): 8.55%
  - Interquartile range (IQR): 3455.61 ms
  - Trimmed standard deviation: 1930.79 ms
  - Trimmed coefficient of variation: 8.53%

Percentile Metrics:
  - P50 (median): 22491.89 ms
  - P90: 25323.67 ms
  - P95: 25925.82 ms
  - P99: 26148.53 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3483
  - P99/P50 ratio: 1.1626
  - Mean rolling std (window=5): 1745.98 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.55%
  - Max trimming effect ratio: 2.26%

Throughput Metrics:
  - Mean TPS: 32.96
  - TPS coefficient of variation: 8.16%

Stability Assessment:
  - Overall stability score: 48.8/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 48.8/100) with significant variation between runs (CV: 8.55%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.35 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.16 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_reference_time_series.png

Latency Stability Analysis: mv3_qnn+s22_android13 (Reference)
================================================================================
Model: mv3_qnn
Device: s22_android13

Dataset Overview:
  - Number of samples: 175
  - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00

Central Tendency Metrics:
  - Mean latency: 1.44 ms
  - Median latency (P50): 1.00 ms
  - Mean trimmed latency: 1.35 ms
  - Median trimmed latency: 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.83 ms
  - Coefficient of variation (CV): 57.29%
  - Interquartile range (IQR): 0.06 ms
  - Trimmed standard deviation: 0.65 ms
  - Trimmed coefficient of variation: 48.32%

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 2.71 ms
  - P95: 3.25 ms
  - P99: 3.95 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 4.5354
  - P99/P50 ratio: 3.9482
  - Mean rolling std (window=5): 0.70 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 3.01%
  - Max trimming effect ratio: 32.04%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 57.29%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 4.54 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 3.95 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_reference_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22_android13 (Reference)
================================================================================
Model: mv3_xnnq8
Device: s22_android13

Dataset Overview:
  - Number of samples: 175
  - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00

Central Tendency Metrics:
  - Mean latency: 1.92 ms
  - Median latency (P50): 1.06 ms
  - Mean trimmed latency: 1.74 ms
  - Median trimmed latency: 1.06 ms

Dispersion Metrics:
  - Standard deviation: 1.06 ms
  - Coefficient of variation (CV): 55.09%
  - Interquartile range (IQR): 1.63 ms
  - Trimmed standard deviation: 0.85 ms
  - Trimmed coefficient of variation: 48.75%

Percentile Metrics:
  - P50 (median): 1.06 ms
  - P90: 3.45 ms
  - P95: 3.85 ms
  - P99: 4.63 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 6.1313
  - P99/P50 ratio: 4.3683
  - Mean rolling std (window=5): 1.08 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 5.85%
  - Max trimming effect ratio: 32.08%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 55.09%). Performance is unpredictable and may lead to inconsistent user experience.
The significant difference between raw and trimmed means suggests considerable intra-run jitter (5.8%) with occasional outliers within benchmark runs.
The max/min ratio of 6.13 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 4.37 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_reference_time_series.png

Latency Stability Analysis: llama3_spinq+s22_android12 (Reference)
================================================================================
Model: llama3_spinq
Device: s22_android12

Dataset Overview:
  - Number of samples: 48
  - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 01:48:22+00:00

Central Tendency Metrics:
  - Mean latency: 23902.04 ms
  - Median latency (P50): 22762.35 ms
  - Mean trimmed latency: 23743.12 ms
  - Median trimmed latency: 22590.46 ms

Dispersion Metrics:
  - Standard deviation: 2609.94 ms
  - Coefficient of variation (CV): 10.92%
  - Interquartile range (IQR): 4958.35 ms
  - Trimmed standard deviation: 2588.36 ms
  - Trimmed coefficient of variation: 10.90%

Percentile Metrics:
  - P50 (median): 22762.35 ms
  - P90: 27325.35 ms
  - P95: 27425.17 ms
  - P99: 27527.28 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3689
  - P99/P50 ratio: 1.2093
  - Mean rolling std (window=5): 2739.23 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.66%
  - Max trimming effect ratio: 1.58%

Throughput Metrics:
  - Mean TPS: 30.86
  - TPS coefficient of variation: 10.84%

Stability Assessment:
  - Overall stability score: 40.2/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 40.2/100) with significant variation between runs (CV: 10.92%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.37 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.21 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android12_reference_time_series.png

Latency Stability Analysis: llama3_qlora+s22Ultra5G_android (Reference)
================================================================================
Model: llama3_qlora
Device: s22Ultra5G_android

Dataset Overview:
  - Number of samples: 50
  - Date range: 2025-04-29 09:14:21+00:00 to 2025-05-16 17:28:34+00:00

Central Tendency Metrics:
  - Mean latency: 24685.50 ms
  - Median latency (P50): 23145.09 ms
  - Mean trimmed latency: 24531.08 ms
  - Median trimmed latency: 22945.87 ms

Dispersion Metrics:
  - Standard deviation: 2677.07 ms
  - Coefficient of variation (CV): 10.84%
  - Interquartile range (IQR): 5112.26 ms
  - Trimmed standard deviation: 2657.25 ms
  - Trimmed coefficient of variation: 10.83%

Percentile Metrics:
  - P50 (median): 23145.09 ms
  - P90: 28096.67 ms
  - P95: 28195.43 ms
  - P99: 29486.39 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.4421
  - P99/P50 ratio: 1.2740
  - Mean rolling std (window=5): 2527.53 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.62%
  - Max trimming effect ratio: 1.43%

Throughput Metrics:
  - Mean TPS: 30.61
  - TPS coefficient of variation: 10.01%

Stability Assessment:
  - Overall stability score: 37.6/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 37.6/100) with significant variation between runs (CV: 10.84%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.44 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.27 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22Ultra5G_android_reference_time_series.png

Latency Stability Analysis: llama3_spinq+s22ultra_android12 (Reference)
================================================================================
Model: llama3_spinq
Device: s22ultra_android12

Dataset Overview:
  - Number of samples: 41
  - Date range: 2025-04-30 01:33:50+00:00 to 2025-05-13 17:16:32+00:00

Central Tendency Metrics:
  - Mean latency: 24769.21 ms
  - Median latency (P50): 23249.93 ms
  - Mean trimmed latency: 24611.41 ms
  - Median trimmed latency: 22998.15 ms

Dispersion Metrics:
  - Standard deviation: 2714.46 ms
  - Coefficient of variation (CV): 10.96%
  - Interquartile range (IQR): 5002.67 ms
  - Trimmed standard deviation: 2691.09 ms
  - Trimmed coefficient of variation: 10.93%

Percentile Metrics:
  - P50 (median): 23249.93 ms
  - P90: 28126.42 ms
  - P95: 28225.43 ms
  - P99: 29591.36 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.4421
  - P99/P50 ratio: 1.2728
  - Mean rolling std (window=5): 2490.40 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.63%
  - Max trimming effect ratio: 1.43%

Throughput Metrics:
  - Mean TPS: 30.58
  - TPS coefficient of variation: 10.08%

Stability Assessment:
  - Overall stability score: 37.7/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 37.7/100) with significant variation between runs (CV: 10.96%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 1.44 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.27 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android12_reference_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22ultra_android12 (Reference)
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android12

Dataset Overview:
  - Number of samples: 87
  - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00

Central Tendency Metrics:
  - Mean latency: 3.63 ms
  - Median latency (P50): 3.62 ms
  - Mean trimmed latency: 2.94 ms
  - Median trimmed latency: 2.87 ms

Dispersion Metrics:
  - Standard deviation: 0.81 ms
  - Coefficient of variation (CV): 22.35%
  - Interquartile range (IQR): 0.94 ms
  - Trimmed standard deviation: 0.60 ms
  - Trimmed coefficient of variation: 20.24%

Percentile Metrics:
  - P50 (median): 3.62 ms
  - P90: 4.87 ms
  - P95: 5.15 ms
  - P99: 5.50 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.7228
  - P99/P50 ratio: 1.5193
  - Mean rolling std (window=5): 0.77 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 17.69%
  - Max trimming effect ratio: 45.14%

Stability Assessment:
  - Overall stability score: 15.5/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 15.5/100) with significant variation between runs (CV: 22.35%). Performance is unpredictable and may lead to inconsistent user experience.
The significant difference between raw and trimmed means suggests considerable intra-run jitter (17.7%) with occasional outliers within benchmark runs.
The max/min ratio of 2.72 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.52 suggests occasional latency spikes that could affect tail latency sensitive applications.
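The intra-jitter figures compare raw and trimmed latency within each run. One plausible reading, assumed here and not confirmed by the log, is that the trimming effect ratio is computed per run as the relative gap between the raw and trimmed latency, and that the report then shows the mean and the maximum of those per-run values. The helper name and the sample numbers below are hypothetical.

```python
import statistics


def trimming_effect_ratios(raw_ms, trimmed_ms):
    """Per-run relative gap between raw and trimmed latency, in percent.

    Assumed definition: |raw - trimmed| / raw * 100 for each run.
    """
    if len(raw_ms) != len(trimmed_ms):
        raise ValueError("raw and trimmed series must have the same length")
    return [abs(r - t) / r * 100.0 for r, t in zip(raw_ms, trimmed_ms)]


# Hypothetical per-run latencies (ms), not taken from the datasets in this PR.
ratios = trimming_effect_ratios([4.0, 5.0], [3.0, 4.5])
mean_ratio = statistics.mean(ratios)
max_ratio = max(ratios)
print(f"mean: {mean_ratio:.2f}%, max: {max_ratio:.2f}%")
```

Under this reading, a mean of per-run ratios need not equal the ratio of the overall means, which would explain why 17.69% differs from the gap between 3.63 ms and 2.94 ms.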
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android12_reference_time_series.png

Latency Stability Analysis: mv3_qnn+s22ultra_android12 (Reference)
================================================================================
Model: mv3_qnn
Device: s22ultra_android12

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-16 01:35:32+00:00 to 2025-05-15 17:15:03+00:00

Central Tendency Metrics:
  - Mean latency: 1.02 ms
  - Median latency (P50): 1.01 ms
  - Mean trimmed latency: 1.01 ms
  - Median trimmed latency: 1.01 ms

Dispersion Metrics:
  - Standard deviation: 0.01 ms
  - Coefficient of variation (CV): 1.35%
  - Interquartile range (IQR): 0.01 ms
  - Trimmed standard deviation: 0.01 ms
  - Trimmed coefficient of variation: 1.15%

Percentile Metrics:
  - P50 (median): 1.01 ms
  - P90: 1.02 ms
  - P95: 1.03 ms
  - P99: 1.08 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0990
  - P99/P50 ratio: 1.0646
  - Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.16%
  - Max trimming effect ratio: 1.94%

Stability Assessment:
  - Overall stability score: 90.4/100
  - Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 90.4/100) with very low variation between runs (CV: 1.35%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android12_reference_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Reference)
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 74
  - Date range: 2025-02-21 03:12:32+00:00 to 2025-05-15 02:43:34+00:00

Central Tendency Metrics:
  - Mean latency: 14133.01 ms
  - Median latency (P50): 13132.50 ms

Dispersion Metrics:
  - Standard deviation: 3019.85 ms
  - Coefficient of variation (CV): 21.37%
  - Interquartile range (IQR): 527.50 ms

Percentile Metrics:
  - P50 (median): 13132.50 ms
  - P90: 17308.70 ms
  - P95: 21197.30 ms
  - P99: 25167.92 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.3216
  - P99/P50 ratio: 1.9165
  - Mean rolling std (window=5): 1535.43 ms

Throughput Metrics:
  - Mean TPS: 8.81
  - TPS coefficient of variation: 27.97%

Stability Assessment:
  - Overall stability score: 10.6/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 10.6/100) with significant variation between runs (CV: 21.37%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.32 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.92 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Reference)
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 72
  - Date range: 2025-02-21 03:12:32+00:00 to 2025-05-15 02:43:34+00:00

Central Tendency Metrics:
  - Mean latency: 13118.40 ms
  - Median latency (P50): 12382.50 ms

Dispersion Metrics:
  - Standard deviation: 2853.94 ms
  - Coefficient of variation (CV): 21.76%
  - Interquartile range (IQR): 680.50 ms

Percentile Metrics:
  - P50 (median): 12382.50 ms
  - P90: 14481.00 ms
  - P95: 15865.05 ms
  - P99: 26265.08 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.7878
  - P99/P50 ratio: 2.1211
  - Mean rolling std (window=5): 1464.57 ms

Throughput Metrics:
  - Mean TPS: 12.30
  - TPS coefficient of variation: 21.24%

Stability Assessment:
  - Overall stability score: 2.7/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 2.7/100) with significant variation between runs (CV: 21.76%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.79 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.12 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_xnnq8
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 73
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 13.97 ms
  - Median latency (P50): 13.00 ms

Dispersion Metrics:
  - Standard deviation: 4.74 ms
  - Coefficient of variation (CV): 33.93%
  - Interquartile range (IQR): 7.00 ms

Percentile Metrics:
  - P50 (median): 13.00 ms
  - P90: 21.80 ms
  - P95: 22.00 ms
  - P99: 25.40 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 4.1429
  - P99/P50 ratio: 1.9538
  - Mean rolling std (window=5): 4.51 ms

Stability Assessment:
  - Overall stability score: 1.2/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 1.2/100) with significant variation between runs (CV: 33.93%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 4.14 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.95 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_coreml
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 21
  - Date range: 2025-05-01 03:29:21+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 1.00 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.00 ms
  - Coefficient of variation (CV): 0.00%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.00 ms

Stability Assessment:
  - Overall stability score: 100.0/100
  - Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low variation between runs (CV: 0.00%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: mv3_mps+iphone15max_ios17 (Reference)
================================================================================
Model: mv3_mps
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 72
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 1.03 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.17 ms
  - Coefficient of variation (CV): 16.10%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 2.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.0000
  - P99/P50 ratio: 2.0000
  - Mean rolling std (window=5): 0.07 ms

Stability Assessment:
  - Overall stability score: 12.5/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 12.5/100) with significant variation between runs (CV: 16.10%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.00 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.00 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15max_ios17_reference_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15_ios18 (Reference)
================================================================================
Model: llama3_qlora
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 70
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 14429.20 ms
  - Median latency (P50): 14401.00 ms

Dispersion Metrics:
  - Standard deviation: 593.06 ms
  - Coefficient of variation (CV): 4.11%
  - Interquartile range (IQR): 637.25 ms

Percentile Metrics:
  - P50 (median): 14401.00 ms
  - P90: 14970.00 ms
  - P95: 15441.85 ms
  - P99: 16444.58 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.2195
  - P99/P50 ratio: 1.1419
  - Mean rolling std (window=5): 540.47 ms

Throughput Metrics:
  - Mean TPS: 5.47
  - TPS coefficient of variation: 13.24%

Stability Assessment:
  - Overall stability score: 73.2/100
  - Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 73.2/100) with noticeable variation between runs (CV: 4.11%). While average performance is acceptable, occasional latency spikes may occur.
The max/min ratio of 1.22 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.14 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15_ios18_reference_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15_ios18 (Reference)
================================================================================
Model: llama3_spinq
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 74
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 13820.34 ms
  - Median latency (P50): 13724.00 ms

Dispersion Metrics:
  - Standard deviation: 662.49 ms
  - Coefficient of variation (CV): 4.79%
  - Interquartile range (IQR): 683.50 ms

Percentile Metrics:
  - P50 (median): 13724.00 ms
  - P90: 14527.80 ms
  - P95: 14992.20 ms
  - P99: 15822.16 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3302
  - P99/P50 ratio: 1.1529
  - Mean rolling std (window=5): 542.03 ms

Throughput Metrics:
  - Mean TPS: 7.96
  - TPS coefficient of variation: 14.45%

Stability Assessment:
  - Overall stability score: 68.1/100
  - Overall stability rating: Moderate

Interpretation:
The benchmark shows moderate stability (score: 68.1/100) with noticeable variation between runs (CV: 4.79%). While average performance is acceptable, occasional latency spikes may occur.
The max/min ratio of 1.33 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 1.15 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15_ios18_reference_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15_ios18 (Reference)
================================================================================
Model: mv3_xnnq8
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 73
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 49.85 ms
  - Median latency (P50): 44.00 ms

Dispersion Metrics:
  - Standard deviation: 20.47 ms
  - Coefficient of variation (CV): 41.06%
  - Interquartile range (IQR): 12.00 ms

Percentile Metrics:
  - P50 (median): 44.00 ms
  - P90: 82.00 ms
  - P95: 100.20 ms
  - P99: 121.28 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 3.9355
  - P99/P50 ratio: 2.7564
  - Mean rolling std (window=5): 16.45 ms

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 41.06%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 3.94 indicates substantial performance differences between the best and worst runs.
The P99/P50 ratio of 2.76 suggests occasional latency spikes that could affect tail latency sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15_ios18_reference_time_series.png

Latency Stability Analysis: mv3_coreml+iphone15_ios18 (Reference)
================================================================================
Model: mv3_coreml
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 21
  - Date range: 2025-05-01 03:29:21+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 1.00 ms
  - Median latency (P50): 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.00 ms
  - Coefficient of variation (CV): 0.00%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.00 ms
  - P95: 1.00 ms
  - P99: 1.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.00 ms

Stability Assessment:
  - Overall stability score: 100.0/100
  - Overall stability rating: Excellent

Interpretation:
The benchmark shows excellent stability (score: 100.0/100) with very low variation between runs (CV: 0.00%). This indicates highly consistent performance suitable for latency-sensitive applications.
================================================================================
Generated time series plot: stability_analysis_results/mv3_coreml+iphone15_ios18_reference_time_series.png

Latency Stability Analysis: mv3_mps+iphone15_ios18 (Reference)
================================================================================
Model: mv3_mps
Device: iphone15_ios18

Dataset Overview:
  - Number of samples: 72
  - Date range: 2025-02-22 03:11:03+00:00 to 2025-05-22 02:53:58+00:00

Central Tendency Metrics:
  - Mean latency: 3.75 ms
  - Median latency (P50): 4.00 ms

Dispersion Metrics:
  - Standard deviation: 0.67 ms
  - Coefficient of variation (CV): 17.76%
  - Interquartile range (IQR): 0.00 ms

Percentile Metrics:
  - P50 (median): 4.00 ms
  - P90: 4.00 ms
  - P95: 4.00 ms
  - P99: 4.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.0000
  - P99/P50 ratio: 1.0000
  - Mean rolling std (window=5): 0.44 ms

Stability Assessment:
  - Overall stability score: 37.5/100
  - Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 37.5/100) with significant variation between runs (CV: 17.76%). Performance is unpredictable and may lead to inconsistent user experience.
The max/min ratio of 2.00 indicates substantial performance differences between the best and worst runs.
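Each report lists a "Mean rolling std (window=5)" alongside the overall standard deviation. A rough sketch of one way to compute such a value is shown below: take the standard deviation of every window of five consecutive runs, then average those. The function name, the use of the sample standard deviation, and the handling of short series are assumptions rather than the script's confirmed behavior.

```python
import statistics


def mean_rolling_std(values, window=5):
    """Average of the sample std over each full sliding window (assumed)."""
    if len(values) < window:
        return float("nan")  # not enough runs for a single full window
    stds = [
        statistics.stdev(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]
    return statistics.mean(stds)


# Hypothetical latencies (ms): five identical runs followed by one slower run.
rolling = mean_rolling_std([1.0, 1.0, 1.0, 1.0, 1.0, 2.0])
print(f"{rolling:.4f} ms")
```

Because each window only sees nearby runs, this value stays small when latency shifts slowly over time, while the overall standard deviation grows, so comparing the two can hint at drift versus run-to-run noise.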
================================================================================
Generated time series plot: stability_analysis_results/mv3_mps+iphone15_ios18_reference_time_series.png

====================================================================================================
===== PRIVATE VS PUBLIC STABILITY COMPARISON ======================================================
====================================================================================================

Matched: llama3_qlora+s22_android13 (Private) with llama3_qlora+s22_android13 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_qlora+s22_android13
Public Dataset: llama3_qlora+s22_android13
Model: llama3_qlora
Private Device: s22_android13
Public Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 22502.10 ms         | 23841.98 ms          | -1339.88 ms  | -5.6%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 22447.56 ms         | 23381.83 ms          | -934.27 ms   | -4.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 595.01 ms           | 2079.97 ms           | -1484.97 ms  | -71.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 2.64%               | 8.72%                | -6.08%       | -69.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 858.26 ms           | 3183.16 ms           | -2324.90 ms  | -73.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 23910.11 ms         | 28001.62 ms          | -4091.51 ms  | -14.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1423              | 1.4300               | -0.2877      | -20.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0652              | 1.1976               | -0.1324      | -11.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 83.4/100            | 46.1/100             | 37.3         | 81.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Good                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability with a 81.0% higher stability score.
(Private: 83.4/100 vs Public: 46.1/100)
Private environment has 69.7% lower coefficient of variation, indicating more consistent performance.
Private environment has 5.6% lower mean latency, indicating better performance.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
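For readers who want to sanity-check the rows in these tables, here is a minimal sketch of how the per-dataset metrics could be derived from raw latency samples. It is not the code in `analyze_benchmark_stability.py`: the function names, the sample standard deviation, and the linear-interpolation percentile are assumptions made for illustration. The relationships it encodes (CV = std / mean, P99/P50 = P99 / median) do agree with the numbers printed above.

```python
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("no samples")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def latency_metrics(samples_ms):
    """Hypothetical helper: summary statistics for one dataset, in ms."""
    vals = sorted(samples_ms)
    mean = statistics.fmean(vals)
    # Assumption: sample (n-1) standard deviation; the script may differ.
    std = statistics.stdev(vals) if len(vals) > 1 else 0.0
    p50 = percentile(vals, 50)
    p99 = percentile(vals, 99)
    return {
        "mean": mean,
        "median": p50,
        "std": std,
        "cv_pct": 100.0 * std / mean if mean else float("nan"),
        "iqr": percentile(vals, 75) - percentile(vals, 25),
        "p99": p99,
        "max_min_ratio": vals[-1] / vals[0] if vals[0] else float("inf"),
        "p99_p50_ratio": p99 / p50 if p50 else float("inf"),
    }


if __name__ == "__main__":
    print(latency_metrics([10, 12, 11, 13, 14]))
```

The stability score and rating are left out on purpose, since their weighting and thresholds are not stated in this output.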
================================================================================

Matched: llama3_spinq+s22_android13 (Private) with llama3_spinq+s22_android13 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_spinq+s22_android13
Public Dataset: llama3_spinq+s22_android13
Model: llama3_spinq
Private Device: s22_android13
Public Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 21771.59 ms         | 22774.60 ms          | -1003.01 ms  | -4.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 21668.24 ms         | 22491.89 ms          | -823.65 ms   | -3.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 514.89 ms           | 1947.04 ms           | -1432.15 ms  | -73.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 2.36%               | 8.55%                | -6.18%       | -72.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 602.75 ms           | 3455.61 ms           | -2852.87 ms  | -82.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 23104.76 ms         | 26148.53 ms          | -3043.77 ms  | -11.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1452              | 1.3483               | -0.2031      | -15.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0663              | 1.1626               | -0.0963      | -8.3%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 84.7/100            | 48.8/100             | 35.9         | 73.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Good                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability with a 73.4% higher stability score.
(Private: 84.7/100 vs Public: 48.8/100)
Private environment has 72.3% lower coefficient of variation, indicating more consistent performance.
Private environment has 4.4% lower mean latency, indicating better performance.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
================================================================================

Matched: mv3_qnn+s22_android13 (Private) with mv3_qnn+s22_android13 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_qnn+s22_android13
Public Dataset: mv3_qnn+s22_android13
Model: mv3_qnn
Private Device: s22_android13
Public Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.01 ms             | 1.44 ms              | -0.44 ms     | -30.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.02 ms             | 0.83 ms              | -0.80 ms     | -97.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 2.34%               | 57.29%               | -54.95%      | -95.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.01 ms             | 0.06 ms              | -0.05 ms     | -83.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.14 ms             | 3.95 ms              | -2.81 ms     | -71.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1919              | 4.5354               | -3.3434      | -73.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1404              | 3.9482               | -2.8078      | -71.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 82.4/100            | 0.0/100              | 82.4         | Infinity   |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Good                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability.
(Private: 82.4/100 vs Public: 0.0/100)
Private environment has 95.9% lower coefficient of variation, indicating more consistent performance.
Private environment has 30.3% lower mean latency, indicating better performance.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
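The `Infinity` in the `% Change` column above appears because the public (reference) stability score is 0.0, so the relative change has a zero denominator. Below is a minimal sketch of one way to compute and print that column with the zero-reference case handled explicitly. The names `pct_change` and `format_pct` are hypothetical, and the convention of reporting `n/a` is an assumption, not what the script currently does.

```python
import math


def pct_change(primary, reference):
    """Relative change of primary vs. reference, in percent.

    Assumed convention: 0 -> 0 is reported as no change, and a zero
    reference with a non-zero primary yields a signed infinity.
    """
    if reference == 0:
        if primary == 0:
            return 0.0
        return math.copysign(math.inf, primary)
    return (primary - reference) / abs(reference) * 100.0


def format_pct(value):
    """Render a percent change, avoiding a bare 'Infinity' in the table."""
    return "n/a" if math.isinf(value) else f"{value:.1f}%"


if __name__ == "__main__":
    # Mean latency row of llama3_qlora+s22_android13: private vs. public.
    print(format_pct(pct_change(22502.10, 23841.98)))
    # Stability score row of mv3_qnn+s22_android13: reference score is 0.0.
    print(format_pct(pct_change(82.4, 0.0)))
```

With this convention, a row where both sides are zero would print `0.0%`.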
================================================================================

Matched: mv3_xnnq8+s22_android13 (Private) with mv3_xnnq8+s22_android13 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_xnnq8+s22_android13
Public Dataset: mv3_xnnq8+s22_android13
Model: mv3_xnnq8
Private Device: s22_android13
Public Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 2.73 ms             | 1.92 ms              | 0.81 ms      | 42.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 2.65 ms             | 1.06 ms              | 1.59 ms      | 150.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.63 ms             | 1.06 ms              | -0.43 ms     | -40.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 23.03%              | 55.09%               | -32.06%      | -58.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.95 ms             | 1.63 ms              | -0.68 ms     | -41.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 4.46 ms             | 4.63 ms              | -0.18 ms     | -3.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.4427              | 6.1313               | -3.6886      | -60.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.6812              | 4.3683               | -2.6871      | -61.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 14.9/100            | 0.0/100              | 14.9         | Infinity   |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability.
(Private: 14.9/100 vs Public: 0.0/100)
Private environment has 58.2% lower coefficient of variation, indicating more consistent performance.
Public environment has 42.1% lower mean latency, indicating better performance.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
================================================================================

Warning: No matching reference dataset for llama3_qlora+s22ultra_android14

Matched: llama3_spinq+s22ultra_android14 (Private) with llama3_spinq+s22ultra_android12 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_spinq+s22ultra_android14
Public Dataset: llama3_spinq+s22ultra_android12
Model: llama3_spinq
Private Device: s22ultra_android14
Public Device: s22ultra_android12

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 24761.78 ms         | 24769.21 ms          | -7.43 ms     | -0.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 25043.89 ms         | 23249.93 ms          | 1793.96 ms   | 7.7%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 1552.25 ms          | 2714.46 ms           | -1162.21 ms  | -42.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 6.27%               | 10.96%               | -4.69%       | -42.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 1931.42 ms          | 5002.67 ms           | -3071.25 ms  | -61.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 28868.51 ms         | 29591.36 ms          | -722.85 ms   | -2.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.3648              | 1.4421               | -0.0773      | -5.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1527              | 1.2728               | -0.1200      | -9.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 60.3/100            | 37.7/100             | 22.6         | 60.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Moderate            | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability with a 60.1% higher stability score.
(Private: 60.3/100 vs Public: 37.7/100)
Private environment has 42.8% lower coefficient of variation, indicating more consistent performance.
Private environment has 0.0% lower mean latency, indicating better performance.

Note: This comparison is between s22ultra with _android14 (Private) and s22ultra with _android12 (Public).
OS version differences may contribute to observed stability variations.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
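The comparison above pairs `s22ultra_android14` (private) with `s22ultra_android12` (public), so datasets are evidently matched on model and device even when the OS version differs. A minimal sketch of such a matcher follows, assuming dataset names follow the `model+device_os` pattern seen in this output. `split_name` and `find_reference` are hypothetical helpers, not functions from the script.

```python
def split_name(name):
    """Split 'model+device_os' into (model, device, os_version)."""
    model, _, device_os = name.partition("+")
    device, _, os_version = device_os.partition("_")
    return model, device, os_version


def find_reference(primary, reference_names):
    """Return (reference_name, exact_match) or (None, False).

    Prefers an identical dataset name; otherwise falls back to the first
    reference with the same model and device but a different OS version.
    """
    if primary in reference_names:
        return primary, True
    model, device, _ = split_name(primary)
    for candidate in reference_names:
        c_model, c_device, _ = split_name(candidate)
        if (c_model, c_device) == (model, device):
            return candidate, False
    return None, False


if __name__ == "__main__":
    public = ["llama3_spinq+s22ultra_android12", "mv3_qnn+s22_android13"]
    for private in ("llama3_spinq+s22ultra_android14",
                    "llama3_qlora+s22ultra_android14"):
        match, exact = find_reference(private, public)
        if match is None:
            print(f"Warning: No matching reference dataset for {private}")
        else:
            note = "" if exact else " (OS version differs)"
            print(f"Matched: {private} with {match}{note}")
```

Returning the `exact` flag lets the caller attach the OS-version note only to fallback matches.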
================================================================================

Matched: mv3_qnn+s22ultra_android14 (Private) with mv3_qnn+s22ultra_android12 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_qnn+s22ultra_android14
Public Dataset: mv3_qnn+s22ultra_android12
Model: mv3_qnn
Private Device: s22ultra_android14
Public Device: s22ultra_android12

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.01 ms             | 1.02 ms              | -0.00 ms     | -0.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.01 ms             | 1.01 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.01 ms             | 0.01 ms              | -0.00 ms     | -32.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 0.91%               | 1.35%                | -0.44%       | -32.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.01 ms             | 0.01 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.03 ms             | 1.08 ms              | -0.04 ms     | -4.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.0900              | 1.0990               | -0.0090      | -0.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0204              | 1.0646               | -0.0442      | -4.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 93.8/100            | 90.4/100             | 3.4          | 3.8%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Excellent           | Excellent            | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability with a 3.8% higher stability score.
(Private: 93.8/100 vs Public: 90.4/100)
Private environment has 32.6% lower coefficient of variation, indicating more consistent performance.
Private environment has 0.1% lower mean latency, indicating better performance.

Note: This comparison is between s22ultra with _android14 (Private) and s22ultra with _android12 (Public).
OS version differences may contribute to observed stability variations.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
================================================================================

Matched: mv3_xnnq8+s22ultra_android14 (Private) with mv3_xnnq8+s22ultra_android12 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_xnnq8+s22ultra_android14
Public Dataset: mv3_xnnq8+s22ultra_android12
Model: mv3_xnnq8
Private Device: s22ultra_android14
Public Device: s22ultra_android12

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 2.91 ms             | 3.63 ms              | -0.72 ms     | -20.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 2.54 ms             | 3.62 ms              | -1.08 ms     | -30.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 1.14 ms             | 0.81 ms              | 0.32 ms      | 39.9%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 39.08%              | 22.35%               | 16.73%       | 74.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.82 ms             | 0.94 ms              | -0.12 ms     | -12.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 5.91 ms             | 5.50 ms              | 0.41 ms      | 7.5%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 5.6103              | 2.7228               | 2.8875       | 106.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 2.3319              | 1.5193               | 0.8126       | 53.5%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 0.0/100             | 15.5/100             | -15.5        | -100.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Public environment shows better stability.
(Private: 0.0/100 vs Public: 15.5/100)
Public environment has 74.8% lower coefficient of variation, indicating more consistent performance.
Private environment has 20.0% lower mean latency, indicating better performance.

Note: This comparison is between s22ultra with _android14 (Private) and s22ultra with _android12 (Public).
OS version differences may contribute to observed stability variations.

Recommendation:
The public environment provides better stability for this model+device combination.
Consider investigating factors affecting stability in the private environment.
================================================================================

Warning: No matching reference dataset for mv3_xnnq8+pixel3_rooted_android

Matched: llama3_qlora+iphone15max_ios17 (Private) with llama3_qlora+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_qlora+iphone15max_ios17
Public Dataset: llama3_qlora+iphone15max_ios17
Model: llama3_qlora
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 12972.80 ms         | 14133.01 ms          | -1160.22 ms  | -8.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 12774.50 ms         | 13132.50 ms          | -358.00 ms   | -2.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 483.26 ms           | 3019.85 ms           | -2536.58 ms  | -84.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 3.73%               | 21.37%               | -17.64%      | -82.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 624.00 ms           | 527.50 ms            | 96.50 ms     | 18.3%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 14730.49 ms         | 25167.92 ms          | -10437.43 ms | -41.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1916              | 2.3216               | -1.1300      | -48.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1531              | 1.9165               | -0.7633      | -39.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 75.2/100            | 10.6/100             | 64.6         | 611.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Moderate            | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability with a 611.1% higher stability score.
(Private: 75.2/100 vs Public: 10.6/100)
Private environment has 82.6% lower coefficient of variation, indicating more consistent performance.
Private environment has 8.2% lower mean latency, indicating better performance.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
================================================================================

Matched: llama3_spinq+iphone15max_ios17 (Private) with llama3_spinq+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: llama3_spinq+iphone15max_ios17
Public Dataset: llama3_spinq+iphone15max_ios17
Model: llama3_spinq
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 12195.41 ms         | 13118.40 ms          | -923.00 ms   | -7.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 12104.50 ms         | 12382.50 ms          | -278.00 ms   | -2.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 461.27 ms           | 2853.94 ms           | -2392.67 ms  | -83.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 3.78%               | 21.76%               | -17.97%      | -82.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 154.25 ms           | 680.50 ms            | -526.25 ms   | -77.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 14052.31 ms         | 26265.08 ms          | -12212.77 ms | -46.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.3331              | 2.7878               | -1.4546      | -52.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1609              | 2.1211               | -0.9602      | -45.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 72.9/100            | 2.7/100              | 70.2         | 2648.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Moderate            | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability with a 2648.0% higher stability score.
(Private: 72.9/100 vs Public: 2.7/100)
Private environment has 82.6% lower coefficient of variation, indicating more consistent performance.
Private environment has 7.0% lower mean latency, indicating better performance.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
================================================================================

Matched: mv3_xnnq8+iphone15max_ios17 (Private) with mv3_xnnq8+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_xnnq8+iphone15max_ios17
Public Dataset: mv3_xnnq8+iphone15max_ios17
Model: mv3_xnnq8
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 13.98 ms            | 13.97 ms             | 0.01 ms      | 0.1%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 14.00 ms            | 13.00 ms             | 1.00 ms      | 7.7%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 3.44 ms             | 4.74 ms              | -1.30 ms     | -27.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 24.60%              | 33.93%               | -9.33%       | -27.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 4.00 ms             | 7.00 ms              | -3.00 ms     | -42.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 21.94 ms            | 25.40 ms             | -3.46 ms     | -13.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 3.2857              | 4.1429               | -0.8571      | -20.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.5671              | 1.9538               | -0.3867      | -19.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 10.8/100            | 1.2/100              | 9.7          | 837.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability with a 837.9% higher stability score.
(Private: 10.8/100 vs Public: 1.2/100)
Private environment has 27.5% lower coefficient of variation, indicating more consistent performance.
Public environment has 0.1% lower mean latency, indicating better performance.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.
================================================================================

Matched: mv3_coreml+iphone15max_ios17 (Private) with mv3_coreml+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_coreml+iphone15max_ios17
Public Dataset: mv3_coreml+iphone15max_ios17
Model: mv3_coreml
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.00 ms             | 0.00 ms              | 0.00 ms      | Infinity%  |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 0.00%               | 0.00%                | 0.00%        | Infinity%  |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.00 ms             | 0.00 ms              | 0.00 ms      | Infinity%  |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.0000              | 1.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0000              | 1.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 100.0/100           | 100.0/100            | 0.0          | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Excellent           | Excellent            | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Both environments show identical stability scores.

Recommendation:
Both environments provide similar stability.
Other factors like cost or availability may be considered for choosing between them.
================================================================================

Matched: mv3_mps+iphone15max_ios17 (Private) with mv3_mps+iphone15max_ios17 (Public)

Private vs Public Stability Comparison
================================================================================
Private Dataset: mv3_mps+iphone15max_ios17
Public Dataset: mv3_mps+iphone15max_ios17
Model: mv3_mps
Private Device: iphone15max_ios17
Public Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.25 ms             | 1.03 ms              | 0.23 ms      | 22.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.44 ms             | 0.17 ms              | 0.27 ms      | 166.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 35.07%              | 16.10%               | 18.97%       | 117.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.50 ms             | 0.00 ms              | 0.50 ms      | Infinity%  |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 2.00 ms             | 2.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.0000              | 2.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 2.0000              | 2.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 12.5/100            | 12.5/100             | 0.0          | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Both environments show identical stability scores.
Public environment has 117.8% lower coefficient of variation, indicating more consistent performance.
Public environment has 22.1% lower mean latency, indicating better performance.

Recommendation:
Both environments provide similar stability.
Other factors like cost or availability may be considered for choosing between them.
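A note on reading the interpretation lines: the percentage is the `% Change` column, which is relative to the public value, so when the public side is the better one the figure can exceed 100% (for example "117.8% lower"), even though a quantity cannot be more than 100% lower than another. The sketch below shows one possible way to phrase such sentences, computing the reduction relative to the larger value. `describe_lower` is a hypothetical helper and not part of the script.

```python
def describe_lower(metric, private, public):
    """Describe which environment has the lower value of a metric.

    The reduction is computed relative to the higher of the two values,
    so the reported percentage always falls between 0 and 100.
    """
    if private == public:
        return f"Both environments have the same {metric}."
    if private < public:
        better, low, high = "Private", private, public
    else:
        better, low, high = "Public", public, private
    reduction = (high - low) / high * 100.0
    return f"{better} environment has {reduction:.1f}% lower {metric}."


if __name__ == "__main__":
    # CV row of mv3_mps+iphone15max_ios17: private 35.07%, public 16.10%.
    print(describe_lower("coefficient of variation", 35.07, 16.10))
    # CV row of llama3_qlora+s22_android13: private 2.64%, public 8.72%.
    print(describe_lower("coefficient of variation", 2.64, 8.72))
```

When the private side is the lower one, this agrees with the current output (69.7% for the second example), because the public value is already the denominator in that case.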
================================================================================
Matched: llama3_qlora+iphone15_ios18 (Private) with llama3_qlora+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================

Private Dataset: llama3_qlora+iphone15_ios18
Public Dataset: llama3_qlora+iphone15_ios18
Model: llama3_qlora
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 23169.07 ms         | 14429.20 ms          | 8739.87 ms   | 60.6%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 21328.00 ms         | 14401.00 ms          | 6927.00 ms   | 48.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 5889.20 ms          | 593.06 ms            | 5296.15 ms   | 893.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 25.42%              | 4.11%                | 21.31%       | 518.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 8558.00 ms          | 637.25 ms            | 7920.75 ms   | 1243.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 40256.40 ms         | 16444.58 ms          | 23811.82 ms  | 144.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 3.0072              | 1.2195               | 1.7877       | 146.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.8875              | 1.1419               | 0.7456       | 65.3%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 2.8/100             | 73.2/100             | -70.3        | -96.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Moderate             | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Public environment shows better stability with a 96.2% higher stability score. (Private: 2.8/100 vs Public: 73.2/100)
Public environment has 518.4% lower coefficient of variation, indicating more consistent performance.
Public environment has 60.6% lower mean latency, indicating better performance.

Recommendation:
The public environment provides better stability for this model+device combination.
Consider investigating factors affecting stability in the private environment.

================================================================================
Matched: llama3_spinq+iphone15_ios18 (Private) with llama3_spinq+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================

Private Dataset: llama3_spinq+iphone15_ios18
Public Dataset: llama3_spinq+iphone15_ios18
Model: llama3_spinq
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 22076.03 ms         | 13820.34 ms          | 8255.70 ms   | 59.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 20174.00 ms         | 13724.00 ms          | 6450.00 ms   | 47.0%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 6076.94 ms          | 662.49 ms            | 5414.45 ms   | 817.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 27.53%              | 4.79%                | 22.73%       | 474.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 7826.00 ms          | 683.50 ms            | 7142.50 ms   | 1045.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 37690.75 ms         | 15822.16 ms          | 21868.59 ms  | 138.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.7320              | 1.3302               | 1.4018       | 105.4%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.8683              | 1.1529               | 0.7154       | 62.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 6.6/100             | 68.1/100             | -61.4        | -90.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Moderate             | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Public environment shows better stability with a 90.2% higher stability score. (Private: 6.6/100 vs Public: 68.1/100)
Public environment has 474.3% lower coefficient of variation, indicating more consistent performance.
Public environment has 59.7% lower mean latency, indicating better performance.

Recommendation:
The public environment provides better stability for this model+device combination.
Consider investigating factors affecting stability in the private environment.
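For readers who want to reproduce the per-dataset rows, the derived metrics are consistent with simple ratios of the raw ones: in the llama3_qlora table, 5889.20 / 23169.07 ≈ 25.42% (CV) and 40256.40 / 21328.00 ≈ 1.8875 (P99/P50). The sketch below computes the same set of metrics from a list of latencies using only the standard library. It is an illustration under assumptions, not the script's code: `latency_metrics` and `percentile` are hypothetical names, the sample standard deviation and linear-interpolation percentiles are guesses at the conventions used, and the stability score is left out because its formula is not shown in this output.

```python
import statistics


def percentile(sorted_vals, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def latency_metrics(latencies_ms):
    """Summary metrics for one dataset of strictly positive latencies."""
    vals = sorted(latencies_ms)
    if not vals or vals[0] <= 0:
        raise ValueError("expected a non-empty list of positive latencies")
    mean = statistics.fmean(vals)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(vals) if len(vals) > 1 else 0.0
    p50 = percentile(vals, 50)
    p99 = percentile(vals, 99)
    return {
        "mean": mean,
        "median": p50,
        "std": std,
        "cv_pct": std / mean * 100.0,
        "iqr": percentile(vals, 75) - percentile(vals, 25),
        "p99": p99,
        "max_min_ratio": vals[-1] / vals[0],
        "p99_p50_ratio": p99 / p50,
    }


if __name__ == "__main__":
    # Toy data, not taken from the benchmark spreadsheets.
    for name, value in latency_metrics([1, 2, 3, 4, 5]).items():
        print(f"{name}: {value:.4f}")
```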
================================================================================
Matched: mv3_xnnq8+iphone15_ios18 (Private) with mv3_xnnq8+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================

Private Dataset: mv3_xnnq8+iphone15_ios18
Public Dataset: mv3_xnnq8+iphone15_ios18
Model: mv3_xnnq8
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 48.23 ms            | 49.85 ms             | -1.62 ms     | -3.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 47.00 ms            | 44.00 ms             | 3.00 ms      | 6.8%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 6.19 ms             | 20.47 ms             | -14.28 ms    | -69.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 12.84%              | 41.06%               | -28.22%      | -68.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 6.00 ms             | 12.00 ms             | -6.00 ms     | -50.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 64.40 ms            | 121.28 ms            | -56.88 ms    | -46.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.2973              | 3.9355               | -1.6382      | -41.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.3702              | 2.7564               | -1.3862      | -50.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 24.5/100            | 0.0/100              | 24.5         | Infinity   |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Private environment shows better stability. (Private: 24.5/100 vs Public: 0.0/100)
Private environment has 68.7% lower coefficient of variation, indicating more consistent performance.
Private environment has 3.2% lower mean latency, indicating better performance.

Recommendation:
The private environment provides better stability for this model+device combination.
It is recommended for applications where consistent performance is critical.

================================================================================
Matched: mv3_coreml+iphone15_ios18 (Private) with mv3_coreml+iphone15_ios18 (Public)

Private vs Public Stability Comparison
================================================================================

Private Dataset: mv3_coreml+iphone15_ios18
Public Dataset: mv3_coreml+iphone15_ios18
Model: mv3_coreml
Private Device: iphone15_ios18
Public Device: iphone15_ios18

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.00 ms             | 0.00 ms              | 0.00 ms      | Infinity%  |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 0.00%               | 0.00%                | 0.00%        | Infinity%  |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.00 ms             | 0.00 ms              | 0.00 ms      | Infinity%  |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.0000              | 1.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0000              | 1.0000               | 0.0000       | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 100.0/100           | 100.0/100            | 0.0          | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Excellent           | Excellent            | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
Both environments show identical stability scores.

Recommendation:
Both environments provide similar stability. Other factors like cost or availability may be considered for choosing between them.
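The dataset names in these blocks follow a `model+device_os` pattern: `mv3_coreml+iphone15_ios18`, for example, is reported with model `mv3_coreml` and device `iphone15_ios18`. A rough sketch of splitting such a name is below. It assumes the device part is further split at its first underscore into a base device and an OS suffix, which is a guess based on names like `iphone15` and `_ios18`. `parse_dataset_name` is a hypothetical helper, not taken from the script.

```python
def parse_dataset_name(name: str) -> tuple[str, str, str]:
    """Split 'model+device_os' into (model, base_device, os_suffix).

    The model may itself contain underscores (e.g. 'mv3_xnnq8'), so the
    split on '+' happens first. The OS suffix keeps its leading underscore.
    """
    model, _, device = name.partition("+")
    idx = device.find("_")
    if idx == -1:
        # No OS suffix present in the device part.
        return model, device, ""
    return model, device[:idx], device[idx:]


print(parse_dataset_name("mv3_coreml+iphone15_ios18"))
print(parse_dataset_name("mv3_xnnq8+pixel3_rooted_android"))
```

Keeping the underscore on the suffix means base device and suffix concatenate back to the original device string, which makes round-tripping names straightforward.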
================================================================================ Matched: mv3_mps+iphone15_ios18 (Private) with mv3_mps+iphone15_ios18 (Public) Private vs Public Stability Comparison ================================================================================ Private Dataset: mv3_mps+iphone15_ios18 Public Dataset: mv3_mps+iphone15_ios18 Model: mv3_mps Private Device: iphone15_ios18 Public Device: iphone15_ios18 Metric Comparison: +-------------------------+---------------------+----------------------+--------------+------------+ | Metric | Private (Primary) | Public (Reference) | Difference | % Change | +=========================+=====================+======================+==============+============+ | Mean Latency (ms) | 4.01 ms | 3.75 ms | 0.26 ms | 6.9% | +-------------------------+---------------------+----------------------+--------------+------------+ | Median Latency (ms) | 4.00 ms | 4.00 ms | 0.00 ms | 0.0% | +-------------------------+---------------------+----------------------+--------------+------------+ | Standard Deviation (ms) | 0.16 ms | 0.67 ms | -0.51 ms | -76.0% | +-------------------------+---------------------+----------------------+--------------+------------+ | CV (%) | 3.99% | 17.76% | -13.77% | -77.5% | +-------------------------+---------------------+----------------------+--------------+------------+ | IQR (ms) | 0.00 ms | 0.00 ms | 0.00 ms | Infinity% | +-------------------------+---------------------+----------------------+--------------+------------+ | P99 (ms) | 4.83 ms | 4.00 ms | 0.83 ms | 20.7% | +-------------------------+---------------------+----------------------+--------------+------------+ | Max/Min Ratio | 1.6667 | 2.0000 | -0.3333 | -16.7% | +-------------------------+---------------------+----------------------+--------------+------------+ | P99/P50 Ratio | 1.2075 | 1.0000 | 0.2075 | 20.7% | +-------------------------+---------------------+----------------------+--------------+------------+ | 
Stability Score | 66.5/100 | 37.5/100 | 29.0 | 77.4% | +-------------------------+---------------------+----------------------+--------------+------------+ | Stability Rating | Moderate | Poor | N/A | N/A | +-------------------------+---------------------+----------------------+--------------+------------+ Interpretation: Private environment shows better stability with a 77.4% higher stability score. (Private: 66.5/100 vs Public: 37.5/100) Private environment has 77.5% lower coefficient of variation, indicating more consistent performance. Public environment has 6.9% lower mean latency, indicating better performance. Recommendation: The private environment provides better stability for this model+device combination. It is recommended for applications where consistent performance is critical. ================================================================================ ==================================================================================================== ===== INTRA-PRIMARY STABILITY COMPARISON ========================================================== ==================================================================================================== Intra-Primary Stability Comparison ================================================================================ Overall Summary: +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | Sheet | Model | Device | Mean Latency (ms) | CV (%) | Stability Score | Stability Rating | Max/Min Ratio | P99/P50 Ratio | +=================================+==============+=======================+=====================+==========+===================+====================+=================+=================+ | mv3_coreml+iphone15_ios18 | mv3_coreml | iphone15_ios18 | 1.00 | 0.00 | 100.00 | Excellent | 1.00 | 1.00 | 
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_coreml+iphone15max_ios17 | mv3_coreml | iphone15max_ios17 | 1.00 | 0.00 | 100.00 | Excellent | 1.00 | 1.00 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_qnn+s22ultra_android14 | mv3_qnn | s22ultra_android14 | 1.01 | 0.91 | 93.81 | Excellent | 1.09 | 1.02 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | llama3_spinq+s22_android13 | llama3_spinq | s22_android13 | 21771.59 | 2.36 | 84.70 | Good | 1.15 | 1.07 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | llama3_qlora+s22_android13 | llama3_qlora | s22_android13 | 22502.10 | 2.64 | 83.37 | Good | 1.14 | 1.07 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_qnn+s22_android13 | mv3_qnn | s22_android13 | 1.01 | 2.34 | 82.41 | Good | 1.19 | 1.14 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | llama3_qlora+iphone15max_ios17 | llama3_qlora | iphone15max_ios17 | 12972.80 | 3.73 | 75.15 | Moderate | 1.19 | 1.15 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | llama3_spinq+iphone15max_ios17 | 
llama3_spinq | iphone15max_ios17 | 12195.41 | 3.78 | 72.90 | Moderate | 1.33 | 1.16 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_mps+iphone15_ios18 | mv3_mps | iphone15_ios18 | 4.01 | 3.99 | 66.53 | Moderate | 1.67 | 1.21 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | llama3_qlora+s22ultra_android14 | llama3_qlora | s22ultra_android14 | 25022.84 | 6.18 | 62.54 | Moderate | 1.27 | 1.13 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | llama3_spinq+s22ultra_android14 | llama3_spinq | s22ultra_android14 | 24761.78 | 6.27 | 60.28 | Moderate | 1.36 | 1.15 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_xnnq8+pixel3_rooted_android | mv3_xnnq8 | pixel3_rooted_android | 5.93 | 7.68 | 46.93 | Poor | 1.70 | 1.24 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_xnnq8+iphone15_ios18 | mv3_xnnq8 | iphone15_ios18 | 48.23 | 12.84 | 24.53 | Poor | 2.30 | 1.37 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_xnnq8+s22_android13 | mv3_xnnq8 | s22_android13 | 2.73 | 23.03 | 14.94 | Poor | 2.44 | 1.68 | 
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_mps+iphone15max_ios17 | mv3_mps | iphone15max_ios17 | 1.25 | 35.07 | 12.50 | Poor | 2.00 | 2.00 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_xnnq8+iphone15max_ios17 | mv3_xnnq8 | iphone15max_ios17 | 13.98 | 24.60 | 10.82 | Poor | 3.29 | 1.57 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | llama3_spinq+iphone15_ios18 | llama3_spinq | iphone15_ios18 | 22076.03 | 27.53 | 6.64 | Poor | 2.73 | 1.87 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | llama3_qlora+iphone15_ios18 | llama3_qlora | iphone15_ios18 | 23169.07 | 25.42 | 2.81 | Poor | 3.01 | 1.89 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ | mv3_xnnq8+s22ultra_android14 | mv3_xnnq8 | s22ultra_android14 | 2.91 | 39.08 | 0.00 | Poor | 5.61 | 2.33 | +---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+ Best and Worst Performers: Best stability: mv3_coreml+iphone15_ios18 (Score: 100.0/100) Worst stability: mv3_xnnq8+s22ultra_android14 (Score: 0.0/100) Model-based Comparison: 
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | Model | ('Stability Score', 'mean') | ('Stability Score', 'min') | ('Stability Score', 'max') | ('CV (%)', 'mean') | ('CV (%)', 'min') | ('CV (%)', 'max') | +==============+===============================+==============================+==============================+======================+=====================+=====================+ | mv3_coreml | 100.00 | 100.00 | 100.00 | 0.00 | 0.00 | 0.00 | +--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | mv3_qnn | 88.11 | 82.41 | 93.81 | 1.62 | 0.91 | 2.34 | +--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | llama3_spinq | 56.13 | 6.64 | 84.70 | 9.99 | 2.36 | 27.53 | +--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | llama3_qlora | 55.97 | 2.81 | 83.37 | 9.49 | 2.64 | 25.42 | +--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | mv3_mps | 39.52 | 12.50 | 66.53 | 19.53 | 3.99 | 35.07 | +--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | mv3_xnnq8 | 19.44 | 0.00 | 46.93 | 21.45 | 7.68 | 39.08 | +--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ Most stable model: mv3_coreml (Avg. 
Score: 100.0/100) Device-based Comparison (Grouped by Base Device): +---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | Device Base | ('Stability Score', 'mean') | ('Stability Score', 'min') | ('Stability Score', 'max') | ('CV (%)', 'mean') | ('CV (%)', 'min') | ('CV (%)', 'max') | +===============+===============================+==============================+==============================+======================+=====================+=====================+ | s22 | 66.36 | 14.94 | 84.70 | 7.59 | 2.34 | 23.03 | +---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | iphone15max | 54.27 | 10.82 | 100.00 | 13.44 | 0.00 | 35.07 | +---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | s22ultra | 54.16 | 0.00 | 93.81 | 13.11 | 0.91 | 39.08 | +---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | pixel3 | 46.93 | 46.93 | 46.93 | 7.68 | 7.68 | 7.68 | +---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | iphone15 | 40.10 | 2.81 | 100.00 | 13.95 | 0.00 | 27.53 | +---------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ Most stable device: s22 (Avg. 
Score: 66.4/100) OS Version Comparison: +-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | OS Version | ('Stability Score', 'mean') | ('Stability Score', 'min') | ('Stability Score', 'max') | ('CV (%)', 'mean') | ('CV (%)', 'min') | ('CV (%)', 'max') | +=================+===============================+==============================+==============================+======================+=====================+=====================+ | _android13 | 66.36 | 14.94 | 84.70 | 7.59 | 2.34 | 23.03 | +-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | _ios17 | 54.27 | 10.82 | 100.00 | 13.44 | 0.00 | 35.07 | +-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | _android14 | 54.16 | 0.00 | 93.81 | 13.11 | 0.91 | 39.08 | +-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | _rooted_android | 46.93 | 46.93 | 46.93 | 7.68 | 7.68 | 7.68 | +-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ | _ios18 | 40.10 | 2.81 | 100.00 | 13.95 | 0.00 | 27.53 | +-----------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+ Most stable OS version: _android13 (Avg. Score: 66.4/100) Insights and Recommendations: - mv3_coreml shows the most consistent performance across devices. - mv3_xnnq8 shows more variability and may need further optimization. 
- s22 provides the most stable environment for model execution. - iphone15 shows higher variability and may not be ideal for latency-sensitive applications. - _android13 provides better stability than _ios18 across tested devices. - For critical applications requiring consistent performance, prefer: * Model: mv3_coreml * Device: s22 * OS Version: _android13 ================================================================================ ==================================================================================================== ===== COMPREHENSIVE STABILITY SUMMARY ============================================================= ==================================================================================================== Comprehensive Latency Stability Analysis Summary ================================================================================ Primary (Private) Datasets Summary: +---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+ | Dataset | Model | Device | Mean Latency (ms) | CV (%) | Stability Score | Stability Rating | +=================================+==============+==========================+=====================+==========+===================+====================+ | mv3_coreml+iphone15_ios18 | mv3_coreml | iphone15 (_ios18) | 1.00 | 0.00 | 100.00 | Excellent | +---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+ | mv3_coreml+iphone15max_ios17 | mv3_coreml | iphone15max (_ios17) | 1.00 | 0.00 | 100.00 | Excellent | +---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+ | mv3_qnn+s22ultra_android14 | mv3_qnn | s22ultra (_android14) | 1.01 | 0.91 | 93.81 | Excellent | 
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+s22_android13      | llama3_spinq | s22 (_android13)         |            21771.59 |     2.36 |             84.70 | Good               |
| llama3_qlora+s22_android13      | llama3_qlora | s22 (_android13)         |            22502.10 |     2.64 |             83.37 | Good               |
| mv3_qnn+s22_android13           | mv3_qnn      | s22 (_android13)         |                1.01 |     2.34 |             82.41 | Good               |
| llama3_qlora+iphone15max_ios17  | llama3_qlora | iphone15max (_ios17)     |            12972.80 |     3.73 |             75.15 | Moderate           |
| llama3_spinq+iphone15max_ios17  | llama3_spinq | iphone15max (_ios17)     |            12195.41 |     3.78 |             72.90 | Moderate           |
| mv3_mps+iphone15_ios18          | mv3_mps      | iphone15 (_ios18)        |                4.01 |     3.99 |             66.53 | Moderate           |
| llama3_qlora+s22ultra_android14 | llama3_qlora | s22ultra (_android14)    |            25022.84 |     6.18 |             62.54 | Moderate           |
| llama3_spinq+s22ultra_android14 | llama3_spinq | s22ultra (_android14)    |            24761.78 |     6.27 |             60.28 | Moderate           |
| mv3_xnnq8+pixel3_rooted_android | mv3_xnnq8    | pixel3 (_rooted_android) |                5.93 |     7.68 |             46.93 | Poor               |
| mv3_xnnq8+iphone15_ios18        | mv3_xnnq8    | iphone15 (_ios18)        |               48.23 |    12.84 |             24.53 | Poor               |
| mv3_xnnq8+s22_android13         | mv3_xnnq8    | s22 (_android13)         |                2.73 |    23.03 |             14.94 | Poor               |
| mv3_mps+iphone15max_ios17       | mv3_mps      | iphone15max (_ios17)     |                1.25 |    35.07 |             12.50 | Poor               |
| mv3_xnnq8+iphone15max_ios17     | mv3_xnnq8    | iphone15max (_ios17)     |               13.98 |    24.60 |             10.82 | Poor               |
| llama3_spinq+iphone15_ios18     | llama3_spinq | iphone15 (_ios18)        |            22076.03 |    27.53 |              6.64 | Poor               |
| llama3_qlora+iphone15_ios18     | llama3_qlora | iphone15 (_ios18)        |            23169.07 |    25.42 |              2.81 | Poor               |
| mv3_xnnq8+s22ultra_android14    | mv3_xnnq8    | s22ultra (_android14)    |                2.91 |    39.08 |              0.00 | Poor               |
+---------------------------------+--------------+--------------------------+---------------------+----------+-------------------+--------------------+

Reference (Public) Datasets Summary:
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| Dataset                         | Model        | Device                |   Mean Latency (ms) |   CV (%) |   Stability Score | Stability Rating   |
+=================================+==============+=======================+=====================+==========+===================+====================+
| mv3_coreml+iphone15max_ios17    | mv3_coreml   | iphone15max (_ios17)  |                1.00 |     0.00 |            100.00 | Excellent          |
| mv3_coreml+iphone15_ios18       | mv3_coreml   | iphone15 (_ios18)     |                1.00 |     0.00 |            100.00 | Excellent          |
| mv3_qnn+s22ultra_android12      | mv3_qnn      | s22ultra (_android12) |                1.02 |     1.35 |             90.39 | Excellent          |
| llama3_qlora+iphone15_ios18     | llama3_qlora | iphone15 (_ios18)     |            14429.20 |     4.11 |             73.16 | Moderate           |
| llama3_spinq+iphone15_ios18     | llama3_spinq | iphone15 (_ios18)     |            13820.34 |     4.79 |             68.08 | Moderate           |
| llama3_spinq+s22_android13      | llama3_spinq | s22 (_android13)      |            22774.60 |     8.55 |             48.84 | Poor               |
| llama3_qlora+s22_android13      | llama3_qlora | s22 (_android13)      |            23841.98 |     8.72 |             46.07 | Poor               |
| llama3_spinq+s22_android12      | llama3_spinq | s22 (_android12)      |            23902.04 |    10.92 |             40.15 | Poor               |
| llama3_spinq+s22ultra_android12 | llama3_spinq | s22ultra (_android12) |            24769.21 |    10.96 |             37.66 | Poor               |
| llama3_qlora+s22Ultra5G_android | llama3_qlora | s22Ultra5G (_android) |            24685.50 |    10.84 |             37.62 | Poor               |
| mv3_mps+iphone15_ios18          | mv3_mps      | iphone15 (_ios18)     |                3.75 |    17.76 |             37.50 | Poor               |
| mv3_xnnq8+s22ultra_android12    | mv3_xnnq8    | s22ultra (_android12) |                3.63 |    22.35 |             15.48 | Poor               |
| mv3_mps+iphone15max_ios17       | mv3_mps      | iphone15max (_ios17)  |                1.03 |    16.10 |             12.50 | Poor               |
| llama3_qlora+iphone15max_ios17  | llama3_qlora | iphone15max (_ios17)  |            14133.01 |    21.37 |             10.57 | Poor               |
| llama3_spinq+iphone15max_ios17  | llama3_spinq | iphone15max (_ios17)  |            13118.40 |    21.76 |              2.65 | Poor               |
| mv3_xnnq8+iphone15max_ios17     | mv3_xnnq8    | iphone15max (_ios17)  |               13.97 |    33.93 |              1.15 | Poor               |
| mv3_xnnq8+s22_android13         | mv3_xnnq8    | s22 (_android13)      |                1.92 |    55.09 |              0.00 | Poor               |
| mv3_xnnq8+iphone15_ios18        | mv3_xnnq8    | iphone15 (_ios18)     |               49.85 |    41.06 |              0.00 | Poor               |
| mv3_qnn+s22_android13           | mv3_qnn      | s22 (_android13)      |                1.44 |    57.29 |              0.00 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+

Private vs Public Comparison:
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Dataset                     | Private Device        | Public Device         |   Private Score |   Public Score |   Score Diff |   Private CV (%) |   Public CV (%) |   CV Diff (%) |
+=============================+=======================+=======================+=================+================+==============+==================+=================+===============+
| mv3_qnn on s22              | s22 (_android13)      | s22 (_android13)      |           82.41 |           0.00 |        82.41 |             2.34 |           57.29 |        -54.95 |
| llama3_spinq on iphone15max | iphone15max (_ios17)  | iphone15max (_ios17)  |           72.90 |           2.65 |        70.25 |             3.78 |           21.76 |        -17.97 |
| llama3_qlora on iphone15max | iphone15max (_ios17)  | iphone15max (_ios17)  |           75.15 |          10.57 |        64.58 |             3.73 |           21.37 |        -17.64 |
| llama3_qlora on s22         | s22 (_android13)      | s22 (_android13)      |           83.37 |          46.07 |        37.31 |             2.64 |            8.72 |         -6.08 |
| llama3_spinq on s22         | s22 (_android13)      | s22 (_android13)      |           84.70 |          48.84 |        35.87 |             2.36 |            8.55 |         -6.18 |
| mv3_mps on iphone15         | iphone15 (_ios18)     | iphone15 (_ios18)     |           66.53 |          37.50 |        29.03 |             3.99 |           17.76 |        -13.77 |
| mv3_xnnq8 on iphone15       | iphone15 (_ios18)     | iphone15 (_ios18)     |           24.53 |           0.00 |        24.53 |            12.84 |           41.06 |        -28.22 |
| llama3_spinq on s22ultra    | s22ultra (_android14) | s22ultra (_android12) |           60.28 |          37.66 |        22.62 |             6.27 |           10.96 |         -4.69 |
| mv3_xnnq8 on s22            | s22 (_android13)      | s22 (_android13)      |           14.94 |           0.00 |        14.94 |            23.03 |           55.09 |        -32.06 |
| mv3_xnnq8 on iphone15max    | iphone15max (_ios17)  | iphone15max (_ios17)  |           10.82 |           1.15 |         9.67 |            24.60 |           33.93 |         -9.33 |
| mv3_qnn on s22ultra         | s22ultra (_android14) | s22ultra (_android12) |           93.81 |          90.39 |         3.42 |             0.91 |            1.35 |         -0.44 |
| mv3_coreml on iphone15max   | iphone15max (_ios17)  | iphone15max (_ios17)  |          100.00 |         100.00 |         0.00 |             0.00 |            0.00 |          0.00 |
| mv3_mps on iphone15max      | iphone15max (_ios17)  | iphone15max (_ios17)  |           12.50 |          12.50 |         0.00 |            35.07 |           16.10 |         18.97 |
| mv3_coreml on iphone15      | iphone15 (_ios18)     | iphone15 (_ios18)     |          100.00 |         100.00 |         0.00 |             0.00 |            0.00 |          0.00 |
| mv3_xnnq8 on s22ultra       | s22ultra (_android14) | s22ultra (_android12) |            0.00 |          15.48 |       -15.48 |            39.08 |           22.35 |         16.73 |
| llama3_spinq on iphone15    | iphone15 (_ios18)     | iphone15 (_ios18)     |            6.64 |          68.08 |       -61.44 |            27.53 |            4.79 |         22.73 |
| llama3_qlora on iphone15    | iphone15 (_ios18)     | iphone15 (_ios18)     |            2.81 |          73.16 |       -70.35 |            25.42 |            4.11 |         21.31 |
+-----------------------------+-----------------------+-----------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
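For readers who want to sanity-check the comparison columns, here is a minimal Python sketch. It is not taken from the script in this PR. It assumes that CV is the sample standard deviation divided by the mean, and that `Score Diff` and `CV Diff (%)` are private minus public, which is what the rows above suggest. The names `cv_percent` and `compare` are hypothetical, and the stability score formula is not reproduced because it is not stated here.

```python
from statistics import mean, stdev


def cv_percent(latencies):
    """Coefficient of variation: sample std dev as a percentage of the mean.

    Assumption: the real script may use a different std dev convention.
    """
    avg = mean(latencies)
    if avg == 0:
        raise ValueError("CV is undefined when the mean latency is zero")
    return 100.0 * stdev(latencies) / avg


def compare(private, public):
    """Return private-minus-public differences for score and CV (hypothetical helper)."""
    return {
        "score_diff": round(private["score"] - public["score"], 2),
        "cv_diff": round(private["cv"] - public["cv"], 2),
    }


# Values copied from the "mv3_qnn on s22" row of the comparison table.
row = compare({"score": 82.41, "cv": 2.34}, {"score": 0.00, "cv": 57.29})
print(row)
```

Recomputing from the rounded table values can differ from the printed diff by 0.01. For example, 3.78 - 21.76 gives -17.98 while the table shows -17.97, presumably because the script subtracts unrounded values.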
@pytorch-bot

pytorch-bot bot commented May 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10982

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7b6d907 with merge base 0c9a4f5 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 19, 2025
@guangy10 guangy10 requested review from huydhn and yangw-dev May 19, 2025 22:13

@yangw-dev yangw-dev left a comment


I recommend adding the pip dependencies to a requirements.txt next to analyze_latency_stability.py.

Maybe it would be good for the script to have its own folder.

@guangy10 guangy10 changed the title Script for benchmark satbility assessment [Not To Land] Script for benchmark satbility assessment May 19, 2025
@guangy10 guangy10 force-pushed the benchmark_assessment branch 2 times, most recently from cae229e to 3b8aa35 Compare May 27, 2025 20:27
@guangy10 guangy10 force-pushed the benchmark_assessment branch from 3b8aa35 to dd44b4e Compare June 4, 2025 17:45
@guangy10 guangy10 changed the title [Not To Land] Script for benchmark satbility assessment Script for benchmark stability assessment Jun 4, 2025
@guangy10 guangy10 force-pushed the benchmark_assessment branch from dd44b4e to cd676b8 Compare June 4, 2025 18:09
@guangy10

guangy10 commented Jun 4, 2025

Fixed linter

@guangy10 guangy10 added the release notes: none Do not include this in the release notes label Jun 4, 2025
@guangy10 guangy10 marked this pull request as ready for review June 4, 2025 18:17
@guangy10

guangy10 commented Jun 4, 2025

As discussed with @yangw-dev offline, to make the stability assessment part of the benchmark infra as suggested in this post, I will merge this script under .ci/scripts together with the other scripts used by CI and the benchmark infra. @yangw-dev will take over from there and rework the interface to:

  1. pipe the data directly from the DB instead of requiring a manual dump to .xlsx first
  2. support running the stability assessment on any combination of time frame, devices, models, backends, etc.
  3. run the stability assessment as scheduled (cron) jobs and visualize the results in the dashboard UI
@guangy10 guangy10 force-pushed the benchmark_assessment branch from cd676b8 to 7b6d907 Compare June 4, 2025 18:30
@guangy10 guangy10 requested a review from yangw-dev June 4, 2025 18:31
@guangy10 guangy10 merged commit 2269160 into main Jun 4, 2025
190 checks passed
@guangy10 guangy10 deleted the benchmark_assessment branch June 4, 2025 23:41
yangw-dev added a commit that referenced this pull request Jun 23, 2025
# Summary

Provide methods and a script to fetch all ExecuTorch benchmark data from the HUD API into two datasets, private and public. The script will:

- fetch all data from the HUD API for the input time range (in UTC)
- clean out records and tables that contain only FAILURE_REPORT due to job-level failures
- get all private table metrics, generate `table_name`, and find the intersecting public table metrics
- generate private and public table groups
- output the data

OutputType:

- run with excel-sheet export
- run with csv export
- run with dataframe format print
- run with json format print

See more guidance in README.md.

The data is similar to the Excel sheet generated manually in #10982.

The result should be the same as the HUD per-model datatable:

<img width="1480" alt="image" src="https://github.com/user-attachments/assets/7c6cc12e-50c5-4ce2-ac87-5cac650486e3" />

## helper methods: common.py

Provide a common.py helper method to convert csv and excel sheets back to the {"groupInfo":{}, "df":df.DataFrame} format.

# run with

```bash
python3 .ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py \
  --startTime "2025-04-29T09:48:57" \
  --endTime "2025-05-13T22:00:00" \
  --outputType "excel" \
  --models "mv3"

python3 .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py \
  --primary-file private.xlsx \
  --reference-file public.xlsx
```

Generated excel files: [private.xlsx](https://github.com/user-attachments/files/20844977/private.xlsx) [public.xlsx](https://github.com/user-attachments/files/20844978/public.xlsx)

For instance, you can find the result for mv3 xnnq_q8 S22 Ultra android 14:

```
Latency Stability Analysis: table10 (Primary)
================================================================================
Model: mv3(xnnpack_q8)
Device: Samsung Galaxy S22 Ultra 5G (private)(Android 14)

Dataset Overview:
- Number of samples: 88
- Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
- Mean latency: 2.91 ms
- Median latency (P50): 2.54 ms
- Mean trimmed latency: 2.41 ms
- Median trimmed latency: 2.15 ms

Dispersion Metrics:
- Standard deviation: 1.14 ms
- Coefficient of variation (CV): 39.08%
- Interquartile range (IQR): 0.82 ms
- Trimmed standard deviation: 0.76 ms
- Trimmed coefficient of variation: 31.60%

Percentile Metrics:
- P50 (median): 2.54 ms
- P90: 3.88 ms
- P95: 4.60 ms
- P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
- Max/Min ratio: 5.6103
- P99/P50 ratio: 2.3319
- Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
- Mean trimming effect ratio: 15.37%
- Max trimming effect ratio: 38.83%

Stability Assessment:
- Overall stability score: 0.0/100
- Overall stability rating: Poor

Interpretation:
The benchmark shows poor stability (score: 0.0/100) with significant variation between runs (CV: 39.08%). Performance is unpredictable and may lead to inconsistent user experience. The significant difference between raw and trimmed means suggests considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs. The max/min ratio of 5.61 indicates substantial performance differences between the best and worst runs. The P99/P50 ratio of 2.33 suggests occasional latency spikes that could affect tail latency sensitive applications.
```

---------

Signed-off-by: Yang Wang <elainewy@meta.com>
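As a rough illustration of the dispersion, percentile, and inter-jitter metrics listed in the report above, the following Python sketch computes them from a plain list of per-run latencies. It is an assumption-laden sketch, not the code of `analyze_benchmark_stability.py`. The function names (`percentile`, `rolling_std_mean`, `latency_metrics`) are hypothetical, the percentile uses linear interpolation, and the standard deviation is the sample one. The trimmed and intra-jitter metrics and the stability score are left out because their definitions are not given here.

```python
from statistics import mean, stdev


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) on pre-sorted data."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def rolling_std_mean(values, window=5):
    """Mean of the sample std dev over each sliding window, in run order."""
    stds = [stdev(values[i:i + window]) for i in range(len(values) - window + 1)]
    return mean(stds) if stds else float("nan")


def latency_metrics(latencies, window=5):
    """Summary metrics for a list of positive per-run latencies (ms)."""
    if len(latencies) < 2 or min(latencies) <= 0:
        raise ValueError("need at least two positive latency samples")
    s = sorted(latencies)
    avg, sd = mean(s), stdev(s)
    p50, p99 = percentile(s, 50), percentile(s, 99)
    return {
        "mean": avg,
        "median": p50,
        "std": sd,
        "cv_percent": 100.0 * sd / avg,
        "iqr": percentile(s, 75) - percentile(s, 25),
        "p90": percentile(s, 90),
        "p95": percentile(s, 95),
        "p99": p99,
        "max_min_ratio": s[-1] / s[0],
        "p99_p50_ratio": p99 / p50,
        # Uses the original (chronological) order, not the sorted copy.
        "mean_rolling_std": rolling_std_mean(latencies, window),
    }


metrics = latency_metrics([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The rounded figures in the report are consistent with these ratio definitions: 5.91 / 2.54 is about 2.33 for P99/P50, and 1.14 / 2.91 is about 39% for the CV.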
hinriksnaer pushed a commit to hinriksnaer/executorch that referenced this pull request Jun 26, 2025
