AI Inference

Inference can be deployed in many ways, depending on the use case. Offline data processing is best done at larger batch sizes, which can deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency. Generative AI and large language model (LLM) deployments aim to deliver a responsive user experience by lowering latency. Developers and infrastructure managers therefore need to balance throughput against latency to deliver both a good user experience and the best possible throughput while containing deployment costs.

When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token (TTFT) limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency section show the best throughput at a TTFT limit of one second, which enables high throughput at low latency for most users while optimizing compute resource use.
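The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the data on this page: the `Measurement` type, the `best_within_ttft` helper, and every number in `runs` are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Measurement:
    """One benchmark run at a fixed batch size (hypothetical values)."""
    batch_size: int
    ttft_s: float          # time to first token, in seconds
    tokens_per_sec: float  # aggregate output throughput


def best_within_ttft(runs: Sequence[Measurement],
                     ttft_limit_s: float) -> Optional[Measurement]:
    """Return the highest-throughput run whose TTFT is within the limit."""
    eligible = [r for r in runs if r.ttft_s <= ttft_limit_s]
    return max(eligible, key=lambda r: r.tokens_per_sec, default=None)


# Illustrative numbers only; these are not measured results.
runs = [
    Measurement(batch_size=8, ttft_s=0.15, tokens_per_sec=1200.0),
    Measurement(batch_size=32, ttft_s=0.45, tokens_per_sec=3900.0),
    Measurement(batch_size=64, ttft_s=0.90, tokens_per_sec=6100.0),
    Measurement(batch_size=128, ttft_s=1.80, tokens_per_sec=8200.0),
]

best = best_within_ttft(runs, ttft_limit_s=1.0)
print(best)
```

In this made-up sweep, the largest batch size has the highest raw throughput but exceeds the one-second limit, so the rule picks the next one down.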


MLPerf Inference v5.1 Performance Benchmarks

Offline Scenario, Closed Division

Network	Throughput	GPU	Server	GPU Version	Target Accuracy	Dataset
DeepSeek R1	420,659 tokens/sec	72x GB300	72x GB300-288GB_aarch64, TensorRT	NVIDIA GB300	99% of FP16 (exact match 81.9132%)	mlperf_deepseek_r1
	289,712 tokens/sec	72x GB200	72x GB200-186GB_aarch64, TensorRT	NVIDIA GB200	99% of FP16 (exact match 81.9132%)	mlperf_deepseek_r1
	33,379 tokens/sec	8x B200	NVIDIA DGX B200	NVIDIA B200	99% of FP16 (exact match 81.9132%)	mlperf_deepseek_r1
Llama3.1 405B	16,104 tokens/sec	72x GB300	72x GB300-288GB_aarch64, TensorRT	NVIDIA GB300	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	Subset of LongBench, LongDataCollections, Ruler, GovReport
	14,774 tokens/sec	72x GB200	72x GB200-186GB_aarch64, TensorRT	NVIDIA GB200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	Subset of LongBench, LongDataCollections, Ruler, GovReport
	1,660 tokens/sec	8x B200	Dell PowerEdge XE9685L	NVIDIA B200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	Subset of LongBench, LongDataCollections, Ruler, GovReport
	553 tokens/sec	8x H200	Nebius H200	NVIDIA H200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B	51,737 tokens/sec	4x GB200	4x GB200-186GB_aarch64, TensorRT	NVIDIA GB200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	OpenOrca (max_seq_len=1024)
	102,909 tokens/sec	8x B200	ThinkSystem SR680a V3	NVIDIA B200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	OpenOrca (max_seq_len=1024)
	35,317 tokens/sec	8x H200	Dell PowerEdge XE9680	NVIDIA H200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	OpenOrca (max_seq_len=1024)
Llama3.1 8B	146,960 tokens/sec	8x B200	ThinkSystem SR780a V3	NVIDIA B200	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	CNN Dailymail (v3.0.0, max_seq_len=2048)
	66,037 tokens/sec	8x H200	HPE Cray XD670	NVIDIA H200	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	CNN Dailymail (v3.0.0, max_seq_len=2048)
Whisper	22,273 samples/sec	4x GB200	BM.GPU.GB200.4	NVIDIA GB200	99% of FP32 and 99.9% of FP32 (WER=2.0671%)	LibriSpeech
	45,333 samples/sec	8x B200	NVIDIA DGX B200	NVIDIA B200	99% of FP32 and 99.9% of FP32 (WER=2.0671%)	LibriSpeech
	34,451 samples/sec	8x H200	HPE Cray XD670	NVIDIA H200	99% of FP32 and 99.9% of FP32 (WER=2.0671%)	LibriSpeech
Stable Diffusion XL	33 samples/sec	8x B200	NVIDIA DGX B200	NVIDIA B200	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	Subset of coco-2014 val
	19 samples/sec	8x H200	QuantaGrid D74H-7U	NVIDIA H200	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	Subset of coco-2014 val
RGAT	651,230 samples/sec	8x B200	NVIDIA DGX B200	NVIDIA B200	99% of FP32 (72.86%)	IGBH
RetinaNet	14,997 samples/sec	8x H200	HPE Cray XD670	NVIDIA H200	99% of FP32 (0.3755 mAP)	OpenImages (800x800)
DLRMv2	647,861 samples/sec	8x H200	QuantaGrid D74H-7U	NVIDIA H200	99% of FP32 and 99.9% of FP32 (AUC=80.31%)	Synthetic Multihot Criteo Dataset

Server Scenario, Closed Division

Network	Throughput	GPU	Server	GPU Version	Target Accuracy	MLPerf Server Latency Constraints (ms)	Dataset
DeepSeek R1	209,328 tokens/sec	72x GB300	72x GB300-288GB_aarch64, TensorRT	NVIDIA GB300	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 2000 ms/80 ms	mlperf_deepseek_r1
	167,578 tokens/sec	72x GB200	72x GB200-186GB_aarch64, TensorRT	NVIDIA GB200	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 2000 ms/80 ms	mlperf_deepseek_r1
	18,592 tokens/sec	8x B200	NVIDIA DGX B200	NVIDIA B200	99% of FP16 (exact match 81.9132%)	TTFT/TPOT: 2000 ms/80 ms	mlperf_deepseek_r1
Llama3.1 405B	12,248 tokens/sec	72x GB300	72x GB300-288GB_aarch64, TensorRT	NVIDIA GB300	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	11,614 tokens/sec	72x GB200	72x GB200-186GB_aarch64, TensorRT	NVIDIA GB200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	1,280 tokens/sec	8x B200	Nebius B200	NVIDIA B200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	296 tokens/sec	8x H200	QuantaGrid D74H-7U	NVIDIA H200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama3.1 405B Interactive	9,921 tokens/sec	72x GB200	72x GB200-186GB_aarch64, TensorRT	NVIDIA GB200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 4500 ms/80 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	771 tokens/sec	8x B200	Nebius B200	NVIDIA B200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 4500 ms/80 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	203 tokens/sec	8x H200	Nebius H200	NVIDIA H200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 4500 ms/80 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B	49,360 tokens/sec	4x GB200	4x GB200-186GB_aarch64, TensorRT	NVIDIA GB200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
	101,611 tokens/sec	8x B200	Nebius B200	NVIDIA B200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
	34,194 tokens/sec	8x H200	ASUSTeK ESC N8 H200	NVIDIA H200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
Llama2 70B Interactive	29,746 tokens/sec	4x GB200	4x GB200-186GB_aarch64, TensorRT	NVIDIA GB200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
	62,851 tokens/sec	8x B200	G894-SD1	NVIDIA B200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
	23,080 tokens/sec	8x H200	Nebius H200	NVIDIA H200	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
Llama3.1 8B	128,794 tokens/sec	8x B200	Dell PowerEdge XE9685L	NVIDIA B200	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	TTFT/TPOT: 2000 ms/100 ms	CNN Dailymail (v3.0.0, max_seq_len=2048)
	64,915 tokens/sec	8x H200	HPE Cray XD670	NVIDIA H200	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	TTFT/TPOT: 2000 ms/100 ms	CNN Dailymail (v3.0.0, max_seq_len=2048)
Llama3.1 8B Interactive	122,269 tokens/sec	8x B200	AS-4126GS-NBR-LCC	NVIDIA B200	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	TTFT/TPOT: 500 ms/30 ms	CNN Dailymail (v3.0.0, max_seq_len=2048)
	54,118 tokens/sec	8x H200	QuantaGrid D74H-7U	NVIDIA H200	99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	TTFT/TPOT: 500 ms/30 ms	CNN Dailymail (v3.0.0, max_seq_len=2048)
Stable Diffusion XL	29 queries/sec	8x B200	Supermicro SYS-422GA-NBRT-LCC	NVIDIA B200	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	20 s	Subset of coco-2014 val
	18 queries/sec	8x H200	QuantaGrid D74H-7U	NVIDIA H200	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	20 s	Subset of coco-2014 val
RetinaNet	14,406 queries/sec	8x H200	ASUSTeK ESC N8 H200	NVIDIA H200	99% of FP32 (0.3755 mAP)	100 ms	OpenImages (800x800)
DLRMv2	591,162 queries/sec	8x H200	ASUSTeK ESC N8 H200	NVIDIA H200	99% of FP32 (AUC=80.31%)	60 ms	Synthetic Multihot Criteo Dataset

MLPerf™ v5.1 Inference Closed: DeepSeek R1 99% of FP16, Llama3.1 405B 99% of FP16, Llama2 70B Interactive 99.9% of FP32, Llama2 70B 99.9% of FP32, Stable Diffusion XL, Whisper, RetinaNet, RGAT, DLRM 99% of FP32 accuracy target: 5.1-0007, 5.1-0009, 5.1-0026, 5.1-0028, 5.1-0046, 5.1-0049, 5.1-0060, 5.1-0061, 5.1-0062, 5.1-0069, 5.1-0070, 5.1-0071, 5.1-0072, 5.1-0073, 5.1-0075, 5.1-0077, 5.1-0079, 5.1-0086. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Llama3.1 8B Max Sequence Length = 2,048
Llama2 70B Max Sequence Length = 1,024
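As a rough illustration of how a TTFT/TPOT pair from the Server Scenario table might be checked against per-request measurements, the sketch below compares a tail percentile of each metric with its limit. The 99th-percentile criterion and the nearest-rank method are assumptions made for this example, and the sample latencies are invented; consult the MLPerf rules for the authoritative definition.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_constraints(ttft_ms: Sequence[float],
                      tpot_ms: Sequence[float],
                      ttft_limit_ms: float,
                      tpot_limit_ms: float,
                      pct: float = 99.0) -> bool:
    """True if the chosen percentile of TTFT and TPOT are both within limits."""
    return (percentile(ttft_ms, pct) <= ttft_limit_ms
            and percentile(tpot_ms, pct) <= tpot_limit_ms)


# Limits taken from the Llama2 70B Interactive row: TTFT/TPOT 450 ms/40 ms.
# The per-request samples below are hypothetical.
sample_ttft_ms = [310.0, 355.0, 402.0, 431.0]
sample_tpot_ms = [28.0, 31.5, 36.0, 38.2]

print(meets_constraints(sample_ttft_ms, sample_tpot_ms, 450.0, 40.0))
```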

LLM Inference Performance of NVIDIA Data Center Products

GPT OSS 120B - Max Throughput

Model	Attention	MoE	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
GPT OSS 120B	TP4	EP4	1,024	2,048	84,611 output tokens/sec	4x GB200	NVIDIA GB200 NVL72	FP4	TensorRT-LLM 0.21	NVIDIA GB200

Attention: Tensor Parallelism = 4
MoE: Expert Parallelism = 4
Input tokens not included in TPS calculations

DeepSeek R1 - Max Throughput

Model	Attention	MoE	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
DeepSeek R1 0528	TP8	EP8	1,024	2,048	43,146 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 0.20	NVIDIA B200

B200 Inference Performance - Max Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Qwen3 235B A22B	1	8	128	2048	66,057 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 235B A22B	1	8	128	4096	39,496 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 235B A22B	1	8	2048	128	7,329 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 235B A22B	1	8	5000	500	8,190 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 235B A22B	1	8	500	2000	57,117 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 235B A22B	1	8	1000	1000	42,391 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 235B A22B	1	8	1000	2000	34,105 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 235B A22B	1	8	2048	2048	26,854 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 235B A22B	1	8	20000	2000	4,453 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200

Qwen3 30B A3B	1	1	128	2048	37,844 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 30B A3B	1	1	128	4096	24,953 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 30B A3B	1	1	2048	128	6,251 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 30B A3B	1	1	5000	500	6,142 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 30B A3B	1	1	500	2000	27,817 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 30B A3B	1	1	1000	1000	25,828 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 30B A3B	1	1	1000	2000	22,051 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 30B A3B	1	1	2048	2048	17,554 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Qwen3 30B A3B	1	1	20000	2000	2,944 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200

Llama v4 Maverick	1	8	128	2048	112,676 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Maverick	1	8	128	4096	68,170 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Maverick	1	8	2048	128	18,088 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Maverick	1	8	1000	1000	79,617 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Maverick	1	8	1000	2000	63,766 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Maverick	1	8	2048	2048	52,195 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Maverick	1	8	20000	2000	12,678 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200

Llama v4 Scout	1	1	128	2048	4,481 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Scout	1	1	128	4096	8,932 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Scout	1	1	2048	128	3,137 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Scout	1	1	5000	500	2,937 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Scout	1	1	500	2000	11,977 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Scout	1	1	1000	1000	10,591 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Scout	1	1	1000	2000	9,356 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Scout	1	1	2048	2048	7,152 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v4 Scout	1	1	20000	2000	1,644 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200

DeepSeek R1	1	8	128	2048	62,599 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
DeepSeek R1	1	8	128	4096	44,046 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
DeepSeek R1	1	8	1000	1000	37,634 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
DeepSeek R1	1	8	2048	2048	28,852 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200

Llama v3.3 70B	1	1	128	2048	9,922 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.3 70B	1	1	128	4096	6,831 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.3 70B	1	1	2048	128	1,339 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.3 70B	1	1	5000	500	1,459 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.3 70B	1	1	500	2000	7,762 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.3 70B	1	1	1000	1000	7,007 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.3 70B	1	1	1000	2000	6,737 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 0.19.0	NVIDIA B200
Llama v3.3 70B	1	1	2048	2048	4,783 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.3 70B	1	1	20000	2000	665 output tokens/sec	1x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200

Llama v3.1 405B	1	4	128	2048	8,020 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.1 405B	1	4	128	4096	6,345 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.1 405B	1	4	2048	128	749 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.1 405B	1	4	5000	500	1,048 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.1 405B	1	4	500	2000	6,244 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.1 405B	1	4	1000	1000	5,209 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.1 405B	1	4	1000	2000	4,933 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.1 405B	1	4	2048	2048	4,212 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200
Llama v3.1 405B	1	4	20000	2000	672 output tokens/sec	4x B200	DGX B200	FP4	TensorRT-LLM 1.0	NVIDIA B200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, please read the Llama v3.1 405B blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
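The formula in the note above (tokens/s = total generated tokens / total latency) can be written out as a short sketch. The function names and the example numbers are hypothetical; the second function is included only to contrast with a rate that excludes the first-token time, which is not how the note says the table values are computed.

```python
def output_tokens_per_sec(total_generated_tokens: int,
                          total_latency_s: float) -> float:
    """Throughput inclusive of the time to generate the first token."""
    if total_latency_s <= 0:
        raise ValueError("total_latency_s must be positive")
    return total_generated_tokens / total_latency_s


def decode_only_tokens_per_sec(total_generated_tokens: int,
                               total_latency_s: float,
                               ttft_s: float) -> float:
    """Contrast case (an assumption for illustration): rate over the
    tokens generated after the first one, excluding first-token time."""
    decode_time_s = total_latency_s - ttft_s
    if decode_time_s <= 0:
        raise ValueError("total latency must exceed TTFT")
    return (total_generated_tokens - 1) / decode_time_s


# Hypothetical single request: 2,048 output tokens in 10 s, with a 2 s TTFT.
inclusive = output_tokens_per_sec(2048, 10.0)
decode_only = decode_only_tokens_per_sec(2048, 10.0, 2.0)
print(f"inclusive: {inclusive:.1f} tokens/s, decode only: {decode_only:.1f} tokens/s")
```

Because the inclusive rate divides by the full latency, it is always the lower of the two for the same request.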

RTX PRO 6000 Blackwell Server Edition Inference Performance - Max Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Llama v4 Scout	4	1	128	128	17,857 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout	4	1	128	2048	9,491 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout	2	2	128	4096	6,281 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout	4	1	2048	128	3,391 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout	4	1	5000	500	2,496 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout	4	1	500	2000	9,253 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout	4	1	1000	1000	8,121 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout	4	1	1000	2000	6,980 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v4 Scout	4	1	2048	2048	4,939 output tokens/sec	4x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition

Llama v3.3 70B	2	1	128	2048	4,776 output tokens/sec	2x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B	2	1	128	4096	2,960 output tokens/sec	2x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B	2	1	500	2000	4,026 output tokens/sec	2x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B	2	1	1000	1000	3,658 output tokens/sec	2x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B	2	1	1000	2000	3,106 output tokens/sec	2x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B	2	1	2048	2048	2,243 output tokens/sec	2x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.3 70B	2	1	20000	2000	312 output tokens/sec	2x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition

Llama v3.1 405B	8	1	128	128	4,866 output tokens/sec	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B	8	1	128	2048	3,132 output tokens/sec	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B	8	1	2048	128	588 output tokens/sec	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B	8	1	5000	500	616 output tokens/sec	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B	8	1	500	2000	2,468 output tokens/sec	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B	8	1	1000	1000	2,460 output tokens/sec	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B	8	1	1000	2000	2,009 output tokens/sec	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 405B	8	1	2048	2048	1,485 output tokens/sec	8x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition

Llama v3.1 8B	1	1	128	128	22,757 output tokens/sec	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B	1	1	128	4096	7,585 output tokens/sec	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B	1	1	2048	128	2,653 output tokens/sec	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B	1	1	5000	500	2,283 output tokens/sec	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B	1	1	500	2000	10,612 output tokens/sec	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B	1	1	1000	2000	8,000 output tokens/sec	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B	1	1	2048	2048	5,423 output tokens/sec	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition
Llama v3.1 8B	1	1	20000	2000	756 output tokens/sec	1x RTX PRO 6000	Supermicro SYS-521GE-TNRT	FP4	TensorRT-LLM 0.21.0	NVIDIA RTX PRO 6000 Blackwell Server Edition

H200 Inference Performance - Max Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Qwen3 235B A22B	1	8	128	2048	42,821 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Qwen3 235B A22B	1	8	128	4096	26,852 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Qwen3 235B A22B	1	8	2048	128	3,331 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Qwen3 235B A22B	1	8	5000	500	3,623 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Qwen3 235B A22B	1	8	500	2000	28,026 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Qwen3 235B A22B	1	8	1000	1000	23,789 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Qwen3 235B A22B	1	8	1000	2000	22,061 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Qwen3 235B A22B	1	8	2048	2048	16,672 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Qwen3 235B A22B	1	8	20000	2000	1,876 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200

Llama v4 Maverick	1	8	128	2048	40,572 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Maverick	1	8	128	4096	24,616 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Maverick	1	8	2048	128	7,307 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Maverick	1	8	5000	500	8,456 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Maverick	1	8	500	2000	37,835 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Maverick	1	8	1000	1000	31,782 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Maverick	1	8	1000	2000	34,734 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Maverick	1	8	2048	2048	20,957 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Maverick	1	8	20000	2000	4,106 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200

Llama v4 Scout	1	4	128	2048	34,316 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Scout	1	4	128	4096	21,332 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Scout	1	4	2048	128	3,699 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Scout	1	4	5000	500	4,605 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Scout	1	4	500	2000	24,630 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Scout	1	4	1000	1000	21,636 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Scout	1	4	1000	2000	18,499 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Scout	1	4	2048	2048	14,949 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v4 Scout	1	4	20000	2000	2,105 output tokens/sec	4x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200

Llama v3.3 70B	1	1	128	2048	4,336 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.3 70B	1	1	128	4096	2,872 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.3 70B	1	1	2048	128	442 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.3 70B	1	1	5000	500	566 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.3 70B	1	1	500	2000	3,666 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.3 70B	1	1	1000	1000	2,909 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.3 70B	1	1	1000	2000	2,994 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.3 70B	1	1	2048	2048	2,003 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.3 70B	1	1	20000	2000	283 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200

Llama v3.1 405B	1	8	128	2048	5,661 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.19.0	NVIDIA H200
Llama v3.1 405B	1	8	128	4096	5,167 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.19.0	NVIDIA H200
Llama v3.1 405B	1	8	2048	128	456 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 405B	1	8	5000	500	650 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 405B	1	8	500	2000	4,724 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 405B	1	8	1000	1000	3,330 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 405B	1	8	1000	2000	3,722 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 405B	1	8	2048	2048	2,948 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 405B	1	8	20000	2000	505 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200

Llama v3.1 8B	1	1	128	2048	26,221 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 8B	1	1	128	4096	18,027 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 8B	1	1	2048	128	3,538 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 8B	1	1	5000	500	3,902 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 8B	1	1	500	2000	20,770 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 8B	1	1	1000	1000	17,744 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 8B	1	1	1000	2000	16,828 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 8B	1	1	2048	2048	12,194 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200
Llama v3.1 8B	1	1	20000	2000	1,804 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 1.0	NVIDIA H200

H100 Inference Performance - Max Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Llama v3.3 70B	1	2	128	2048	6,651 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.3 70B	1	2	128	4096	4,199 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.3 70B	1	2	2048	128	762 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.3 70B	1	2	5000	500	898 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.3 70B	1	2	500	2000	5,222 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.3 70B	1	2	1000	1000	4,205 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.3 70B	1	2	1000	2000	4,146 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.3 70B	1	2	2048	2048	3,082 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.3 70B	1	2	20000	2000	437 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB

Llama v3.1 405B	1	8	128	2048	4,340 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 405B	1	8	128	4096	3,116 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 405B	1	8	2048	128	453 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 405B	1	8	5000	500	610 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 405B	1	8	500	2000	3,994 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 405B	1	8	1000	1000	2,919 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 405B	1	8	1000	2000	2,895 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 405B	1	8	2048	2048	2,296 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 405B	1	8	20000	2000	345 output tokens/sec	8x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB

Llama v3.1 8B	1	1	128	2048	22,714 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 8B	1	1	128	4096	14,325 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 8B	1	1	2048	128	3,450 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 8B	1	1	5000	500	3,459 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 8B	1	1	500	2000	17,660 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 8B	1	1	1000	1000	15,220 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 8B	1	1	1000	2000	13,899 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 8B	1	1	2048	2048	9,305 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB
Llama v3.1 8B	1	1	20000	2000	1,351 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 1.0	H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism
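In the tables above, the GPU count of each row equals PP × TP. The sketch below parses one tab-separated row (copied from the RTX PRO 6000 table), checks that relationship, and derives a per-GPU throughput. The helper names are hypothetical, and the per-GPU figure is simple arithmetic on the published value rather than a separately measured result.

```python
def parse_gpu_count(gpu_field: str) -> int:
    """Extract the leading count from a field such as '4x RTX PRO 6000'."""
    count, _, _ = gpu_field.partition("x")
    return int(count)


def parse_throughput(throughput_field: str) -> int:
    """Extract the number from a field such as '6,281 output tokens/sec'."""
    return int(throughput_field.split()[0].replace(",", ""))


# Columns: Model, PP, TP, Input Length, Output Length, Throughput, GPU,
# Server, Precision, Framework, GPU Version
row = ("Llama v4 Scout\t2\t2\t128\t4096\t6,281 output tokens/sec\t"
       "4x RTX PRO 6000\tSupermicro SYS-521GE-TNRT\tFP4\t"
       "TensorRT-LLM 0.21.0\tNVIDIA RTX PRO 6000 Blackwell Server Edition")

fields = row.split("\t")
pp, tp = int(fields[1]), int(fields[2])
gpu_count = parse_gpu_count(fields[6])
throughput = parse_throughput(fields[5])

# Pipeline stages times tensor-parallel ranks should match the GPU count.
assert pp * tp == gpu_count

per_gpu = throughput / gpu_count
print(f"{fields[0]}: PP={pp}, TP={tp}, {gpu_count} GPUs, "
      f"{per_gpu:.2f} output tokens/sec per GPU")
```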

L40S Inference Performance - Max Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Llama v4 Scout	2	2	128	2048	1,105 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	2	2	128	4096	707 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	4	1	2048	128	561 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	4	1	5000	500	307 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	2	2	500	2000	1,093 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	2	2	1000	1000	920 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	2	2	1000	2000	884 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v4 Scout	2	2	2048	2048	615 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S

Llama v3.3 70B	4	1	128	2048	1,694 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	2	2	128	4096	972 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	4	1	500	2000	1,413 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	4	1	1000	1000	1,498 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	4	1	1000	2000	1,084 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.3 70B	4	1	2048	2048	773 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S

Llama v3.1 8B	1	1	128	128	8,471 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	1	1	128	4096	2,888 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	1	1	2048	128	1,017 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	1	1	5000	500	863 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	1	1	500	2000	4,032 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	1	1	1000	2000	3,134 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	1	1	2048	2048	2,148 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S
Llama v3.1 8B	1	1	20000	2000	280 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.21.0	NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism
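
In the rows above, the GPU count matches PP x TP (for example, PP 2 and TP 2 on 4x L40S), so a per-GPU figure can be derived by dividing the reported throughput by that product. The following is a minimal sketch in Python under that assumption; the function name per_gpu_throughput is illustrative and not part of any NVIDIA tool, and the sample rows are copied from the L40S table.

```python
def per_gpu_throughput(tokens_per_sec: float, pp: int, tp: int) -> float:
    """Divide aggregate output tokens/sec by the GPU count, assumed to be PP * TP."""
    gpus = pp * tp
    if gpus <= 0:
        raise ValueError("PP and TP must be positive integers")
    return tokens_per_sec / gpus


# (model, PP, TP, input length, output length, output tokens/sec) from the L40S table.
l40s_rows = [
    ("Llama v4 Scout", 2, 2, 128, 2048, 1105.0),
    ("Llama v3.3 70B", 4, 1, 128, 2048, 1694.0),
    ("Llama v3.1 8B", 1, 1, 128, 4096, 2888.0),
]

for model, pp, tp, isl, osl, throughput in l40s_rows:
    per_gpu = per_gpu_throughput(throughput, pp, tp)
    print(f"{model} ({isl}/{osl}): {per_gpu:.2f} output tokens/sec per GPU")
```

Per-GPU numbers only compare like with like when input and output lengths match, since throughput varies strongly with sequence length in these tables.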

Inference Performance of NVIDIA Data Center Products

B200 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	18,517 images/sec	39 images/sec/watt	0.43	1x B200	DGX B200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA B200
	128	57,280 images/sec	58 images/sec/watt	2.23	1x B200	DGX B200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA B200
EfficientNet-B0	8	10,861 images/sec	30 images/sec/watt	0.74	1x B200	DGX B200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA B200
	128	28,889 images/sec	41 images/sec/watt	4.43	1x B200	DGX B200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA B200
EfficientNet-B4	8	2,634 images/sec	5 images/sec/watt	3.04	1x B200	DGX B200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA B200
	128	4,101 images/sec	5 images/sec/watt	31.21	1x B200	DGX B200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA B200
HF Swin Base	8	6,062 samples/sec	14 samples/sec/watt	1.32	1x B200	DGX B200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA B200
	32	11,319 samples/sec	19 samples/sec/watt	2.83	1x B200	DGX B200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA B200
HF Swin Large	8	4,742 samples/sec	10 samples/sec/watt	1.69	1x B200	DGX B200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA B200
	32	7,479 samples/sec	11 samples/sec/watt	4.28	1x B200	DGX B200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA B200
HF ViT Base	8	11,267 samples/sec	22 samples/sec/watt	0.71	1x B200	DGX B200	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA B200
	64	21,688 samples/sec	29 samples/sec/watt	2.95	1x B200	DGX B200	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA B200
HF ViT Large	8	5,171 samples/sec	8 samples/sec/watt	1.55	1x B200	DGX B200	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA B200
	64	8,485 samples/sec	10 samples/sec/watt	7.54	1x B200	DGX B200	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA B200
QuartzNet	8	7,787 samples/sec	24 samples/sec/watt	1.03	1x B200	DGX B200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA B200
	128	25,034 samples/sec	47 samples/sec/watt	5.11	1x B200	DGX B200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA B200
RetinaNet-RN34	8	3,318 images/sec	8 images/sec/watt	2.41	1x B200	DGX B200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA B200

HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
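
For the B200 rows sampled below, the reported latency is close to the time needed to process one batch at the reported throughput, that is, batch size divided by throughput. This Python sketch checks that relationship; it is an observation about these rows, not a statement of how the latency was measured, and the helper name estimated_latency_ms is illustrative.

```python
def estimated_latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Time to process one batch at the given throughput, in milliseconds."""
    return batch_size / throughput_per_sec * 1000.0


# (network, batch size, throughput per second, reported latency in ms) from the B200 table.
b200_rows = [
    ("ResNet-50v1.5", 8, 18517.0, 0.43),
    ("ResNet-50v1.5", 128, 57280.0, 2.23),
    ("EfficientNet-B4", 128, 4101.0, 31.21),
    ("HF ViT Large", 64, 8485.0, 7.54),
]

for network, batch, throughput, reported in b200_rows:
    estimate = estimated_latency_ms(batch, throughput)
    print(f"{network} (batch {batch}): estimated {estimate:.2f} ms, reported {reported} ms")
```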

H200 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	21,253 images/sec	67 images/sec/watt	0.38	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
	128	65,328 images/sec	107 images/sec/watt	1.96	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
EfficientNet-B0	8	17,243 images/sec	77 images/sec/watt	0.46	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
	128	57,387 images/sec	122 images/sec/watt	2.23	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
EfficientNet-B4	8	4,613 images/sec	14 images/sec/watt	1.73	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
	128	9,018 images/sec	15 images/sec/watt	14.19	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
HF Swin Base	8	5,040 samples/sec	11 samples/sec/watt	1.59	1x H200	DGX H200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA H200
	32	8,175 samples/sec	12 samples/sec/watt	3.91	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
HF Swin Large	8	3,387 samples/sec	6 samples/sec/watt	2.36	1x H200	DGX H200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA H200
	32	4,720 samples/sec	7 samples/sec/watt	6.78	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
HF ViT Base	8	8,847 samples/sec	19 samples/sec/watt	0.9	1x H200	DGX H200	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA H200
	64	15,611 samples/sec	23 samples/sec/watt	4.1	1x H200	DGX H200	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA H200
HF ViT Large	8	3,667 samples/sec	6 samples/sec/watt	2.18	1x H200	DGX H200	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA H200
	64	5,459 samples/sec	8 samples/sec/watt	11.72	1x H200	DGX H200	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA H200
QuartzNet	8	7,012 samples/sec	25 samples/sec/watt	1.14	1x H200	DGX H200	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA H200
	128	34,359 samples/sec	90 samples/sec/watt	3.73	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
RetinaNet-RN34	8	3,025 images/sec	9 images/sec/watt	2.64	1x H200	DGX H200	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA H200
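
Because the Efficiency column is throughput per watt, dividing throughput by efficiency gives a rough implied power draw. The Python sketch below does this for two H200 rows; the result is approximate because efficiency is rounded to whole numbers in the table, and this section does not state which power figure (GPU or system) was measured. The name implied_power_watts is illustrative.

```python
def implied_power_watts(throughput_per_sec: float, efficiency_per_watt: float) -> float:
    """Approximate power implied by throughput and the rounded per-watt efficiency."""
    if efficiency_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_per_sec / efficiency_per_watt


# (network, batch size, images/sec, images/sec/watt) from the H200 table.
h200_rows = [
    ("ResNet-50v1.5", 8, 21253.0, 67.0),
    ("ResNet-50v1.5", 128, 65328.0, 107.0),
]

for network, batch, throughput, efficiency in h200_rows:
    watts = implied_power_watts(throughput, efficiency)
    print(f"{network} (batch {batch}): roughly {watts:.0f} W implied")
```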

GH200 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	21,420 images/sec	61 images/sec/watt	0.37	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
	128	66,276 images/sec	105 images/sec/watt	1.93	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
EfficientNet-B0	8	17,198 images/sec	68 images/sec/watt	0.47	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
	128	57,736 images/sec	116 images/sec/watt	2.22	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
EfficientNet-B4	8	4,622 images/sec	13 images/sec/watt	1.73	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
	128	9,015 images/sec	15 images/sec/watt	14.2	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
HF Swin Base	8	5,023 samples/sec	11 samples/sec/watt	1.59	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
	32	8,046 samples/sec	12 samples/sec/watt	3.98	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
HF Swin Large	8	3,351 samples/sec	6 samples/sec/watt	2.39	1x GH200	NVIDIA P3880	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA GH200
	32	4,502 samples/sec	7 samples/sec/watt	7.11	1x GH200	NVIDIA P3880	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA GH200
HF ViT Base	8	8,746 samples/sec	18 samples/sec/watt	0.91	1x GH200	NVIDIA P3880	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA GH200
	64	15,167 samples/sec	23 samples/sec/watt	4.22	1x GH200	NVIDIA P3880	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA GH200
HF ViT Large	8	3,360 samples/sec	6 samples/sec/watt	2.38	1x GH200	NVIDIA P3880	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA GH200
	64	5,165 samples/sec	8 samples/sec/watt	12.39	1x GH200	NVIDIA P3880	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA GH200
QuartzNet	8	7,038 samples/sec	24 samples/sec/watt	1.14	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
	128	34,280 samples/sec	82 samples/sec/watt	3.73	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200
RetinaNet-RN34	8	2,955 images/sec	5 images/sec/watt	2.71	1x GH200	NVIDIA P3880	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA GH200

H100 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	21,912 images/sec	65 images/sec/watt	0.37	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
	128	56,829 images/sec	119 images/sec/watt	2.25	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
EfficientNet-B0	8	17,208 images/sec	63 images/sec/watt	0.46	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
	128	52,455 images/sec	191 images/sec/watt	2.44	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
EfficientNet-B4	8	4,419 images/sec	13 images/sec/watt	1.81	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
	128	8,701 images/sec	14 images/sec/watt	14.71	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
HF Swin Base	8	5,124 samples/sec	9 samples/sec/watt	1.56	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
	32	7,348 samples/sec	11 samples/sec/watt	4.35	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
HF Swin Large	8	3,147 samples/sec	6 samples/sec/watt	2.54	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
	32	4,392 samples/sec	6 samples/sec/watt	7.29	1x H100	DGX H100	25.04-py3	Mixed	Synthetic	TensorRT 10.9	H100 SXM5-80GB
HF ViT Base	8	8,494 samples/sec	17 samples/sec/watt	0.94	1x H100	DGX H100	25.04-py3	FP8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
	64	14,968 samples/sec	22 samples/sec/watt	4.28	1x H100	DGX H100	25.04-py3	FP8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
HF ViT Large	8	3,399 samples/sec	5 samples/sec/watt	2.35	1x H100	DGX H100	25.04-py3	FP8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
	64	5,195 samples/sec	8 samples/sec/watt	12.32	1x H100	DGX H100	25.04-py3	FP8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
QuartzNet	8	7,002 samples/sec	23 samples/sec/watt	1.14	1x H100	DGX H100	25.04-py3	Mixed	Synthetic	TensorRT 10.9	H100 SXM5-80GB
	128	34,881 samples/sec	95 samples/sec/watt	3.67	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
RetinaNet-RN34	8	2,764 images/sec	15 images/sec/watt	2.89	1x H100	DGX H100	25.04-py3	INT8	Synthetic	TensorRT 10.9	H100 SXM5-80GB
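
The batch-size pairs in these tables illustrate the throughput and latency trade-off described at the top of the page. As a small worked example in Python, the sketch below compares the two ResNet-50v1.5 rows from the H100 table; the helper batch_scaling is a hypothetical name introduced for illustration.

```python
def batch_scaling(small, large):
    """Return (throughput gain, latency growth) when moving from the small to the large batch."""
    (_, thr_small, lat_small), (_, thr_large, lat_large) = small, large
    return thr_large / thr_small, lat_large / lat_small


# (batch size, images/sec, latency in ms) for ResNet-50v1.5 on 1x H100, from the table above.
resnet50_bs8 = (8, 21912.0, 0.37)
resnet50_bs128 = (128, 56829.0, 2.25)

gain, growth = batch_scaling(resnet50_bs8, resnet50_bs128)
print(f"throughput x{gain:.2f}, latency x{growth:.2f}")
```

For this pair, a 16x larger batch yields roughly 2.6x the throughput at roughly 6x the latency, which is why the appropriate batch size depends on the latency budget of the deployment.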

L40S Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	23,025 images/sec	71 images/sec/watt	0.35	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
	32	29,073 images/sec	84 images/sec/watt	4.4	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
EfficientDet-D0	8	4,640 images/sec	16 images/sec/watt	1.72	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
EfficientNet-B0	8	20,504 images/sec	96 images/sec/watt	0.39	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
	32	42,553 images/sec	127 images/sec/watt	3.01	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
EfficientNet-B4	8	5,135 images/sec	17 images/sec/watt	1.56	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
	16	4,066 images/sec	12 images/sec/watt	31.48	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
HF Swin Base	8	3,812 samples/sec	11 samples/sec/watt	2.1	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
	16	4,236 samples/sec	12 samples/sec/watt	7.55	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
HF Swin Large	8	1,939 samples/sec	6 samples/sec/watt	4.12	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA L40S
	16	2,027 samples/sec	6 samples/sec/watt	15.79	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
HF ViT Base	8	6,247 samples/sec	18 samples/sec/watt	1.28	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA L40S
HF ViT Large	8	1,979 samples/sec	6 samples/sec/watt	4.04	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	FP8	Synthetic	TensorRT 10.9	NVIDIA L40S
QuartzNet	8	7,570 samples/sec	31 samples/sec/watt	1.06	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	Mixed	Synthetic	TensorRT 10.9	NVIDIA L40S
	128	22,478 samples/sec	65 samples/sec/watt	5.69	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S
RetinaNet-RN34	8	1,477 images/sec	6 images/sec/watt	5.42	1x L40S	Supermicro SYS-521GE-TNRT	25.04-py3	INT8	Synthetic	TensorRT 10.9	NVIDIA L40S


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.


AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
