AI Inference

Inference can be deployed in many ways, depending on the use case. Offline data processing is best done at large batch sizes, which deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency, and generative AI and large language model (LLM) deployments need low latency to deliver a great experience. Developers and infrastructure managers therefore have to strike a balance between throughput and latency that delivers a responsive user experience and the best possible throughput while containing deployment costs.


When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token (TTFT) limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency section show the best throughput at a time limit of one second, which enables high throughput at low latency for most users while using compute resources efficiently.
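The tuning loop this implies is easy to sketch. Below is an illustrative Python sketch, not NVIDIA tooling: `stream_generate` stands in for whatever streaming client your serving stack exposes, and the candidate batch sizes are arbitrary. It sweeps batch size and keeps the highest-throughput setting whose measured time to first token stays within a one-second budget.

```python
import time

TTFT_BUDGET_S = 1.0  # the one-second time-to-first-token limit discussed above

def measure(batch_size, prompts, stream_generate):
    """Return (ttft_s, output_tokens_per_s) for one batch of requests."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _event in stream_generate(prompts[:batch_size]):  # yields one token event at a time
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1
    total = time.perf_counter() - start
    return first_token_at - start, tokens / total

def pick_batch_size(prompts, stream_generate, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Largest-throughput batch size whose TTFT stays within budget, else None."""
    best = None
    for bs in candidates:
        ttft, tps = measure(bs, prompts, stream_generate)
        if ttft <= TTFT_BUDGET_S and (best is None or tps > best[1]):
            best = (bs, tps)
    return best
```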


Click here to view other performance data.

MLPerf Inference v5.1 Performance Benchmarks

Offline Scenario, Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset
DeepSeek R1 | 420,659 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1
DeepSeek R1 | 289,712 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1
DeepSeek R1 | 33,379 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1
Llama3.1 405B | 16,104 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 14,774 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 1,660 tokens/sec | 8x B200 | Dell PowerEdge XE9685L | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 553 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B | 51,737 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024)
Llama2 70B | 102,909 tokens/sec | 8x B200 | ThinkSystem SR680a V3 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024)
Llama2 70B | 35,317 tokens/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024)
Llama3.1 8B | 146,960 tokens/sec | 8x B200 | ThinkSystem SR780a V3 | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048)
Llama3.1 8B | 66,037 tokens/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048)
Whisper | 22,273 samples/sec | 4x GB200 | BM.GPU.GB200.4 | NVIDIA GB200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech
Whisper | 45,333 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech
Whisper | 34,451 samples/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech
Stable Diffusion XL | 33 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
Stable Diffusion XL | 19 samples/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val
RGAT | 651,230 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP32 (72.86%) | IGBH
RetinaNet | 14,997 samples/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 (0.3755 mAP) | OpenImages (800x800)
DLRMv2 | 647,861 samples/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Multihot Criteo Dataset

Server Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset
DeepSeek R1 | 209,328 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1
DeepSeek R1 | 167,578 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1
DeepSeek R1 | 18,592 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1
Llama3.1 405B | 12,248 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 11,614 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 1,280 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B | 296 tokens/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B Interactive | 9,921 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B Interactive | 771 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B Interactive | 203 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B | 49,360 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024)
Llama2 70B | 101,611 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024)
Llama2 70B | 34,194 tokens/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024)
Llama2 70B Interactive | 29,746 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024)
Llama2 70B Interactive | 62,851 tokens/sec | 8x B200 | G894-SD1 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024)
Llama2 70B Interactive | 23,080 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024)
Llama3.1 8B | 128,794 tokens/sec | 8x B200 | Dell PowerEdge XE9685L | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/100 ms | OpenOrca (max_seq_len=1024)
Llama3.1 8B | 64,915 tokens/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/100 ms | OpenOrca (max_seq_len=1024)
Llama3.1 8B Interactive | 122,269 tokens/sec | 8x B200 | AS-4126GS-NBR-LCC | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048)
Llama3.1 8B Interactive | 54,118 tokens/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048)
Stable Diffusion XL | 29 queries/sec | 8x B200 | Supermicro SYS-422GA-NBRT-LCC | NVIDIA B200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
Stable Diffusion XL | 18 queries/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val
RetinaNet | 14,406 queries/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99% of FP32 (0.3755 mAP) | 100 ms | OpenImages (800x800)
DLRMv2 | 591,162 queries/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99% of FP32 (AUC=80.31%) | 60 ms | Synthetic Multihot Criteo Dataset

MLPerf™ v5.1 Inference Closed: DeepSeek R1 99% of FP16, Llama3.1 405B 99% of FP16, Llama2 70B Interactive 99.9% of FP32, Llama2 70B 99.9% of FP32, Stable Diffusion XL, Whisper, RetinaNet, RGAT, DLRM 99% of FP32 accuracy target: 5.1-0007, 5.1-0009, 5.1-0026, 5.1-0028, 5.1-0046, 5.1-0049, 5.1-0060, 5.1-0061, 5.1-0062, 5.1-0069, 5.1-0070, 5.1-0071, 5.1-0072, 5.1-0073, 5.1-0075, 5.1-0077, 5.1-0079, 5.1-0086. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Llama3.1 8B Max Sequence Length = 2,048
Llama2 70B Max Sequence Length = 1,024
For various MLPerf™ scenario data, click here
For MLPerf™ latency constraints, click here; a sketch of how such limits can be checked follows below
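As a rough illustration of how TTFT/TPOT limits like those in the server table above can be checked, the Python sketch below aggregates per-request measurements. It is not the official MLPerf harness, and the 99th-percentile aggregation is our assumption about how such limits are typically enforced.

```python
import statistics

def meets_constraints(requests, ttft_limit_ms, tpot_limit_ms, pct=99):
    """requests: iterable of (ttft_ms, total_latency_ms, output_tokens) tuples."""
    ttfts, tpots = [], []
    for ttft_ms, total_ms, out_tokens in requests:
        ttfts.append(ttft_ms)
        if out_tokens > 1:
            # TPOT: decode time spread over every token after the first
            tpots.append((total_ms - ttft_ms) / (out_tokens - 1))
    percentile = lambda xs: statistics.quantiles(xs, n=100)[pct - 1]  # needs >= 2 samples
    return percentile(ttfts) <= ttft_limit_ms and percentile(tpots) <= tpot_limit_ms

# e.g. against the DeepSeek R1 server limits above:
# meets_constraints(measured, ttft_limit_ms=2000, tpot_limit_ms=80)
```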

LLM Inference Performance of NVIDIA Data Center Products

GPT OSS 120B - Max Throughput

Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
GPT OSS 120B | TP4 | EP4 | 1,024 | 2,048 | 84,611 output tokens/sec | 4x GB200 | NVIDIA GB200 NVL72 | FP4 | TensorRT-LLM 0.21 | NVIDIA GB200

Attention: Tensor Parallelism = 4
MoE: Expert Parallelism = 4
Input tokens not included in TPS calculations
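Read literally, excluding input tokens means throughput is generated tokens divided by wall-clock time; a minimal sketch of that arithmetic, with purely illustrative numbers:

```python
def output_tokens_per_sec(total_output_tokens: int, total_latency_s: float) -> float:
    # Prompt (input) tokens are deliberately left out of the numerator.
    return total_output_tokens / total_latency_s

# Illustrative numbers only: 1,000 requests x 2,048 output tokens in 24.2 s
print(output_tokens_per_sec(1_000 * 2_048, 24.2))  # ~84,628 output tokens/sec
```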

DeepSeek R1 - Max Throughput

Model | Attention | MoE | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
DeepSeek R1 0528 | TP8 | EP8 | 1,024 | 2,048 | 43,146 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 0.20 | NVIDIA B200

Accuracy Evaluation:
Precision FP8 (AA Ref): MMLU Pro = 85 | GPQA Diamond = 81 | LiveCodeBench = 77 | SCICODE = 40 | MATH-500 = 98 | AIME 2024 = 89
Precision FP4: MMLU Pro = 84.2 | GPQA Diamond = 80 | LiveCodeBench = 76.3 | SCICODE = 40.1 | MATH-500 = 98.1 | AIME 2024 = 91.3
More details on the accuracy evaluation are available here
Attention: Tensor Parallelism = 8
MoE: Expert Parallelism = 8
Input tokens not included in TPS calculations

B200 Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Qwen3 235B A22B | 1 | 8 | 128 | 2048 | 66,057 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 235B A22B | 1 | 8 | 128 | 4096 | 39,496 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 235B A22B | 1 | 8 | 2048 | 128 | 7,329 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 235B A22B | 1 | 8 | 5000 | 500 | 8,190 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 235B A22B | 1 | 8 | 500 | 2000 | 57,117 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 235B A22B | 1 | 8 | 1000 | 1000 | 42,391 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 235B A22B | 1 | 8 | 1000 | 2000 | 34,105 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 235B A22B | 1 | 8 | 2048 | 2048 | 26,854 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 235B A22B | 1 | 8 | 20000 | 2000 | 4,453 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 128 | 2048 | 37,844 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 128 | 4096 | 24,953 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 2048 | 128 | 6,251 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 5000 | 500 | 6,142 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 500 | 2000 | 27,817 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 1000 | 1000 | 25,828 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 1000 | 2000 | 22,051 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 2048 | 2048 | 17,554 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Qwen3 30B A3B | 1 | 1 | 20000 | 2000 | 2,944 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Maverick | 1 | 8 | 128 | 2048 | 112,676 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Maverick | 1 | 8 | 128 | 4096 | 68,170 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Maverick | 1 | 8 | 2048 | 128 | 18,088 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Maverick | 1 | 8 | 1000 | 1000 | 79,617 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Maverick | 1 | 8 | 1000 | 2000 | 63,766 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Maverick | 1 | 8 | 2048 | 2048 | 52,195 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Maverick | 1 | 8 | 20000 | 2000 | 12,678 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 128 | 2048 | 4,481 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 128 | 4096 | 8,932 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 2048 | 128 | 3,137 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 5000 | 500 | 2,937 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 500 | 2000 | 11,977 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 1000 | 1000 | 10,591 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 1000 | 2000 | 9,356 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 2048 | 2048 | 7,152 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v4 Scout | 1 | 1 | 20000 | 2000 | 1,644 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
DeepSeek R1 | 1 | 8 | 128 | 2048 | 62,599 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
DeepSeek R1 | 1 | 8 | 128 | 4096 | 44,046 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
DeepSeek R1 | 1 | 8 | 1000 | 1000 | 37,634 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
DeepSeek R1 | 1 | 8 | 2048 | 2048 | 28,852 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 128 | 2048 | 9,922 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 128 | 4096 | 6,831 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 2048 | 128 | 1,339 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 5000 | 500 | 1,459 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 500 | 2000 | 7,762 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 7,007 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 6,737 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 4,783 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 665 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 128 | 2048 | 8,020 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 128 | 4096 | 6,345 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 2048 | 128 | 749 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 5000 | 500 | 1,048 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 500 | 2000 | 6,244 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 1000 | 1000 | 5,209 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 1000 | 2000 | 4,933 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 2048 | 2048 | 4,212 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200
Llama v3.1 405B | 1 | 4 | 20000 | 2000 | 672 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read Llama v3.1 405B Blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
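For readers reproducing these configurations, the PP and TP columns map onto parallelism options at engine build or launch time. A hedged sketch using the TensorRT-LLM high-level LLM API (argument names as we understand that API; verify against the release you run, and the checkpoint path is a placeholder):

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B",  # placeholder checkpoint
    tensor_parallel_size=8,             # TP=8: shard each layer across 8 GPUs
    pipeline_parallel_size=1,           # PP=1: no inter-layer pipelining
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```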

RTX PRO 6000 Blackwell Server Edition Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v4 Scout | 4 | 1 | 128 | 128 | 17,857 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout | 4 | 1 | 128 | 2048 | 9,491 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout | 2 | 2 | 128 | 4096 | 6,281 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout | 4 | 1 | 2048 | 128 | 3,391 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout | 4 | 1 | 5000 | 500 | 2,496 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout | 4 | 1 | 500 | 2000 | 9,253 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout | 4 | 1 | 1000 | 1000 | 8,121 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout | 4 | 1 | 1000 | 2000 | 6,980 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout | 4 | 1 | 2048 | 2048 | 4,939 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B | 2 | 1 | 128 | 2048 | 4,776 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B | 2 | 1 | 128 | 4096 | 2,960 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B | 2 | 1 | 500 | 2000 | 4,026 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B | 2 | 1 | 1000 | 1000 | 3,658 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B | 2 | 1 | 1000 | 2000 | 3,106 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B | 2 | 1 | 2048 | 2048 | 2,243 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B | 2 | 1 | 20000 | 2000 | 312 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B | 8 | 1 | 128 | 128 | 4,866 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B | 8 | 1 | 128 | 2048 | 3,132 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B | 8 | 1 | 2048 | 128 | 588 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B | 8 | 1 | 5000 | 500 | 616 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B | 8 | 1 | 500 | 2000 | 2,468 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B | 8 | 1 | 1000 | 1000 | 2,460 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B | 8 | 1 | 1000 | 2000 | 2,009 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B | 8 | 1 | 2048 | 2048 | 1,485 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B | 1 | 1 | 128 | 128 | 22,757 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 7,585 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 2,653 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 2,283 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 10,612 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 8,000 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 5,423 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 756 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read Llama v3.1 405B Blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)

H200 Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Qwen3 235B A22B | 1 | 8 | 128 | 2048 | 42,821 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Qwen3 235B A22B | 1 | 8 | 128 | 4096 | 26,852 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Qwen3 235B A22B | 1 | 8 | 2048 | 128 | 3,331 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Qwen3 235B A22B | 1 | 8 | 5000 | 500 | 3,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Qwen3 235B A22B | 1 | 8 | 500 | 2000 | 28,026 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Qwen3 235B A22B | 1 | 8 | 1000 | 1000 | 23,789 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Qwen3 235B A22B | 1 | 8 | 1000 | 2000 | 22,061 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Qwen3 235B A22B | 1 | 8 | 2048 | 2048 | 16,672 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Qwen3 235B A22B | 1 | 8 | 20000 | 2000 | 1,876 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 128 | 2048 | 40,572 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 128 | 4096 | 24,616 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 2048 | 128 | 7,307 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 5000 | 500 | 8,456 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 500 | 2000 | 37,835 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 1000 | 1000 | 31,782 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 1000 | 2000 | 34,734 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 2048 | 2048 | 20,957 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Maverick | 1 | 8 | 20000 | 2000 | 4,106 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 128 | 2048 | 34,316 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 128 | 4096 | 21,332 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 2048 | 128 | 3,699 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 5000 | 500 | 4,605 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 500 | 2000 | 24,630 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 1000 | 1000 | 21,636 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 1000 | 2000 | 18,499 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 2048 | 2048 | 14,949 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v4 Scout | 1 | 4 | 20000 | 2000 | 2,105 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 128 | 2048 | 4,336 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 128 | 4096 | 2,872 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 2048 | 128 | 442 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 5000 | 500 | 566 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 500 | 2000 | 3,666 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 2,909 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 2,994 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 2,003 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 283 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,661 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,167 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 2048 | 128 | 456 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 650 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 4,724 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,330 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 3,722 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,948 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 505 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 26,221 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 18,027 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,538 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,902 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 20,770 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,744 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 16,828 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 12,194 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,804 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read Llama v3.1 405B Blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)

H100 Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v3.3 70B | 1 | 2 | 128 | 2048 | 6,651 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.3 70B | 1 | 2 | 128 | 4096 | 4,199 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.3 70B | 1 | 2 | 2048 | 128 | 762 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.3 70B | 1 | 2 | 5000 | 500 | 898 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.3 70B | 1 | 2 | 500 | 2000 | 5,222 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.3 70B | 1 | 2 | 1000 | 1000 | 4,205 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.3 70B | 1 | 2 | 1000 | 2000 | 4,146 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.3 70B | 1 | 2 | 2048 | 2048 | 3,082 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.3 70B | 1 | 2 | 20000 | 2000 | 437 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 128 | 2048 | 4,340 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 128 | 4096 | 3,116 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 2048 | 128 | 453 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 5000 | 500 | 610 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 500 | 2000 | 3,994 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 2,919 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 2,895 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,296 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 345 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 128 | 2048 | 22,714 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,325 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,450 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,459 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,660 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 15,220 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 13,899 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 9,305 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,351 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism

L40S Inference Performance - Max Throughput

Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
Llama v4 Scout | 2 | 2 | 128 | 2048 | 1,105 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v4 Scout | 2 | 2 | 128 | 4096 | 707 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v4 Scout | 4 | 1 | 2048 | 128 | 561 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v4 Scout | 4 | 1 | 5000 | 500 | 307 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v4 Scout | 2 | 2 | 500 | 2000 | 1,093 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v4 Scout | 2 | 2 | 1000 | 1000 | 920 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v4 Scout | 2 | 2 | 1000 | 2000 | 884 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v4 Scout | 2 | 2 | 2048 | 2048 | 615 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.3 70B | 4 | 1 | 128 | 2048 | 1,694 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.3 70B | 2 | 2 | 128 | 4096 | 972 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.3 70B | 4 | 1 | 500 | 2000 | 1,413 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.3 70B | 4 | 1 | 1000 | 1000 | 1,498 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.3 70B | 4 | 1 | 1000 | 2000 | 1,084 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.3 70B | 4 | 1 | 2048 | 2048 | 773 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 128 | 128 | 8,471 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 128 | 4096 | 2,888 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,017 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 5000 | 500 | 863 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,032 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 3,134 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,148 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S
Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 280 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism

Inference Performance of NVIDIA Data Center Products

B200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 18,517 images/sec | 39 images/sec/watt | 0.43 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
ResNet-50v1.5 | 128 | 57,280 images/sec | 58 images/sec/watt | 2.23 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B0 | 8 | 10,861 images/sec | 30 images/sec/watt | 0.74 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B0 | 128 | 28,889 images/sec | 41 images/sec/watt | 4.43 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B4 | 8 | 2,634 images/sec | 5 images/sec/watt | 3.04 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
EfficientNet-B4 | 128 | 4,101 images/sec | 5 images/sec/watt | 31.21 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Base | 8 | 6,062 samples/sec | 14 samples/sec/watt | 1.32 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Base | 32 | 11,319 samples/sec | 19 samples/sec/watt | 2.83 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Large | 8 | 4,742 samples/sec | 10 samples/sec/watt | 1.69 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF Swin Large | 32 | 7,479 samples/sec | 11 samples/sec/watt | 4.28 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Base | 8 | 11,267 samples/sec | 22 samples/sec/watt | 0.71 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Base | 64 | 21,688 samples/sec | 29 samples/sec/watt | 2.95 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Large | 8 | 5,171 samples/sec | 8 samples/sec/watt | 1.55 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
HF ViT Large | 64 | 8,485 samples/sec | 10 samples/sec/watt | 7.54 | 1x B200 | DGX B200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA B200
QuartzNet | 8 | 7,787 samples/sec | 24 samples/sec/watt | 1.03 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
QuartzNet | 128 | 25,034 samples/sec | 47 samples/sec/watt | 5.11 | 1x B200 | DGX B200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA B200
RetinaNet-RN34 | 8 | 3,318 images/sec | 8 images/sec/watt | 2.41 | 1x B200 | DGX B200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA B200

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

H200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 21,253 images/sec | 67 images/sec/watt | 0.38 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
ResNet-50v1.5 | 128 | 65,328 images/sec | 107 images/sec/watt | 1.96 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B0 | 8 | 17,243 images/sec | 77 images/sec/watt | 0.46 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B0 | 128 | 57,387 images/sec | 122 images/sec/watt | 2.23 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B4 | 8 | 4,613 images/sec | 14 images/sec/watt | 1.73 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
EfficientNet-B4 | 128 | 9,018 images/sec | 15 images/sec/watt | 14.19 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Base | 8 | 5,040 samples/sec | 11 samples/sec/watt | 1.59 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Base | 32 | 8,175 samples/sec | 12 samples/sec/watt | 3.91 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Large | 8 | 3,387 samples/sec | 6 samples/sec/watt | 2.36 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
HF Swin Large | 32 | 4,720 samples/sec | 7 samples/sec/watt | 6.78 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Base | 8 | 8,847 samples/sec | 19 samples/sec/watt | 0.9 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Base | 64 | 15,611 samples/sec | 23 samples/sec/watt | 4.1 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Large | 8 | 3,667 samples/sec | 6 samples/sec/watt | 2.18 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
HF ViT Large | 64 | 5,459 samples/sec | 8 samples/sec/watt | 11.72 | 1x H200 | DGX H200 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA H200
QuartzNet | 8 | 7,012 samples/sec | 25 samples/sec/watt | 1.14 | 1x H200 | DGX H200 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA H200
QuartzNet | 128 | 34,359 samples/sec | 90 samples/sec/watt | 3.73 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200
RetinaNet-RN34 | 8 | 3,025 images/sec | 9 images/sec/watt | 2.64 | 1x H200 | DGX H200 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA H200

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

GH200 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 21,420 images/sec | 61 images/sec/watt | 0.37 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
ResNet-50v1.5 | 128 | 66,276 images/sec | 105 images/sec/watt | 1.93 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B0 | 8 | 17,198 images/sec | 68 images/sec/watt | 0.47 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B0 | 128 | 57,736 images/sec | 116 images/sec/watt | 2.22 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B4 | 8 | 4,622 images/sec | 13 images/sec/watt | 1.73 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
EfficientNet-B4 | 128 | 9,015 images/sec | 15 images/sec/watt | 14.2 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Base | 8 | 5,023 samples/sec | 11 samples/sec/watt | 1.59 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Base | 32 | 8,046 samples/sec | 12 samples/sec/watt | 3.98 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Large | 8 | 3,351 samples/sec | 6 samples/sec/watt | 2.39 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF Swin Large | 32 | 4,502 samples/sec | 7 samples/sec/watt | 7.11 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Base | 8 | 8,746 samples/sec | 18 samples/sec/watt | 0.91 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Base | 64 | 15,167 samples/sec | 23 samples/sec/watt | 4.22 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Large | 8 | 3,360 samples/sec | 6 samples/sec/watt | 2.38 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
HF ViT Large | 64 | 5,165 samples/sec | 8 samples/sec/watt | 12.39 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
QuartzNet | 8 | 7,038 samples/sec | 24 samples/sec/watt | 1.14 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
QuartzNet | 128 | 34,280 samples/sec | 82 samples/sec/watt | 3.73 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200
RetinaNet-RN34 | 8 | 2,955 images/sec | 5 images/sec/watt | 2.71 | 1x GH200 | NVIDIA P3880 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA GH200

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

H100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 21,912 images/sec | 65 images/sec/watt | 0.37 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
ResNet-50v1.5 | 128 | 56,829 images/sec | 119 images/sec/watt | 2.25 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B0 | 8 | 17,208 images/sec | 63 images/sec/watt | 0.46 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B0 | 128 | 52,455 images/sec | 191 images/sec/watt | 2.44 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B4 | 8 | 4,419 images/sec | 13 images/sec/watt | 1.81 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
EfficientNet-B4 | 128 | 8,701 images/sec | 14 images/sec/watt | 14.71 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Base | 8 | 5,124 samples/sec | 9 samples/sec/watt | 1.56 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Base | 32 | 7,348 samples/sec | 11 samples/sec/watt | 4.35 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Large | 8 | 3,147 samples/sec | 6 samples/sec/watt | 2.54 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF Swin Large | 32 | 4,392 samples/sec | 6 samples/sec/watt | 7.29 | 1x H100 | DGX H100 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Base | 8 | 8,494 samples/sec | 17 samples/sec/watt | 0.94 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Base | 64 | 14,968 samples/sec | 22 samples/sec/watt | 4.28 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Large | 8 | 3,399 samples/sec | 5 samples/sec/watt | 2.35 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
HF ViT Large | 64 | 5,195 samples/sec | 8 samples/sec/watt | 12.32 | 1x H100 | DGX H100 | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
QuartzNet | 8 | 7,002 samples/sec | 23 samples/sec/watt | 1.14 | 1x H100 | DGX H100 | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
QuartzNet | 128 | 34,881 samples/sec | 95 samples/sec/watt | 3.67 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB
RetinaNet-RN34 | 8 | 2,764 images/sec | 15 images/sec/watt | 2.89 | 1x H100 | DGX H100 | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | H100 SXM5-80GB

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

L40S Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 23,025 images/sec | 71 images/sec/watt | 0.35 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
ResNet-50v1.5 | 32 | 29,073 images/sec | 84 images/sec/watt | 4.4 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientDet-D0 | 8 | 4,640 images/sec | 16 images/sec/watt | 1.72 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B0 | 8 | 20,504 images/sec | 96 images/sec/watt | 0.39 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B0 | 32 | 42,553 images/sec | 127 images/sec/watt | 3.01 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B4 | 8 | 5,135 images/sec | 17 images/sec/watt | 1.56 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
EfficientNet-B4 | 16 | 4,066 images/sec | 12 images/sec/watt | 31.48 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Base | 8 | 3,812 samples/sec | 11 samples/sec/watt | 2.1 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Base | 16 | 4,236 samples/sec | 12 samples/sec/watt | 7.55 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Large | 8 | 1,939 samples/sec | 6 samples/sec/watt | 4.12 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF Swin Large | 16 | 2,027 samples/sec | 6 samples/sec/watt | 15.79 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF ViT Base | 8 | 6,247 samples/sec | 18 samples/sec/watt | 1.28 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
HF ViT Large | 8 | 1,979 samples/sec | 6 samples/sec/watt | 4.04 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | FP8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
QuartzNet | 8 | 7,570 samples/sec | 31 samples/sec/watt | 1.06 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | Mixed | Synthetic | TensorRT 10.9 | NVIDIA L40S
QuartzNet | 128 | 22,478 samples/sec | 65 samples/sec/watt | 5.69 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S
RetinaNet-RN34 | 8 | 1,477 images/sec | 6 images/sec/watt | 5.42 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.04-py3 | INT8 | Synthetic | TensorRT 10.9 | NVIDIA L40S

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256

View More Performance Data

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More