doc: update cuda_graph_config usage part in DS R1 docs #5796
Conversation
/bot run

PR_Github #11154 [ run ] triggered by Bot

PR_Github #11154 [ run ] completed with state
Pull Request Overview
This PR replaces deprecated CUDA Graph flags (`cuda_graph_padding_enabled` and `cuda_graph_batch_sizes`) with the new nested `cuda_graph_config` structure across tests and documentation.
- Updated integration tests to use the new `cuda_graph_config` dict in model configs (a hedged before/after sketch follows this list).
- Revised tech blog and best practice docs to show `cuda_graph_config` usage.
- Ensured consistency of config examples across YAML snippets.
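To make the migration concrete, here is a minimal before/after sketch. The deprecated flat flag names come from the overview above; the nested key names (`padding_enabled`, `batch_sizes`) and the example batch sizes are taken from the review comments quoted later in this thread, so treat the exact schema as an assumption rather than the authoritative API.

```yaml
# Before: deprecated flat flags (names from the PR overview)
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes: [256, 512, 896]
```

```yaml
# After: nested map (key names assumed from the review comments below)
cuda_graph_config:
  padding_enabled: true
  batch_sizes:
    - 256
    - 512
    - 896
```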
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/integration/defs/perf/pytorch_model_config.py | Replaced flat CUDA Graph flags with nested cuda_graph_config. |
| docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md | Updated inline code example to reference cuda_graph_config. |
| docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md | Refactored YAML snippets to use the new cuda_graph_config map. |
Comments suppressed due to low confidence (3)
docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md:154
- The inline code snippet includes a literal `\n`, which may not render as intended. Consider using a fenced code block or multi-line inline code for `cuda_graph_config` to improve readability and rendering.
This had a significant **22% E2E performance impact** for throughput scenarios. CUDA Graphs allow capturing a sequence of CUDA operations and launching them as a single unit, drastically reducing kernel launch overheads. This is particularly beneficial for models with many small kernels, and especially on the PyTorch flow, because the Python host code normally executes more slowly than C++. Since a CUDA Graph freezes the kernel launch parameters, which are normally tied to the tensor shapes, it can only be safely used with static shapes, meaning that different CUDA Graphs need to be captured for different batch sizes. Each graph has some cost in memory usage and capture time, so we cannot capture a CUDA Graph for every possible batch size. For non-captured batch sizes, PyTorch eager mode code is executed. TensorRT-LLM has a CUDA Graph padding feature, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio: it tries to pad a batch to the nearest size with a captured graph. Normally you should enable CUDA Graph padding to increase the hit rate, but the padding itself has some overhead due to wasted token computation. Users can opt out of the padding feature to measure the perf benefits by setting `cuda_graph_config:\n padding_enabled: False`; see the API in [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41).
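As the comment suggests, the inline snippet from the quoted paragraph renders more clearly as a fenced block; this is just the same two keys reflowed:

```yaml
# Opt out of CUDA Graph padding (from the quoted blog paragraph)
cuda_graph_config:
  padding_enabled: False
```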
docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md:198
- The YAML example drops the leading hyphens used for mapping keys and may not be valid YAML as shown. Update the snippet to present a proper mapping, e.g.:

```yaml
use_cuda_graph: true
cuda_graph_config:
  padding_enabled: true
  batch_sizes:
    - 896
    - 512
    - 256
```
docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md:265
- Similarly, in the second YAML snippet, ensure consistent YAML mapping syntax rather than list-like hyphens before each key. Use a code fence to clearly separate document structure from list items.
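For illustration only, a minimal sketch of the syntax issue this comment raises; the actual snippet at line 265 is not quoted here, so the keys are borrowed from the earlier example. The first YAML document shows the list-like form (each hyphenated line parses as a single-key map in a sequence), the second the consistent mapping the comment asks for:

```yaml
# List-like hyphens before each key: parses as a sequence of maps
- use_cuda_graph: true
- cuda_graph_config:
    padding_enabled: true
---
# Consistent mapping syntax, as the comment requests
use_cuda_graph: true
cuda_graph_config:
  padding_enabled: true
```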
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
ecebf3d to f1235b4

/bot reuse-pipeline
PR_Github #11242 [ reuse-pipeline ] triggered by Bot

PR_Github #11242 [ reuse-pipeline ] completed with state
…ghput_on_NVIDIA_Blackwell_GPUs.md

Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
f1235b4 to a077921

/bot reuse-pipeline
PR_Github #11250 [ reuse-pipeline ] triggered by Bot

PR_Github #11250 [ reuse-pipeline ] completed with state
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Yuxin <yuxinz@nvidia.com>
Fix typos in cuda_graph_config.