doc: update cuda_graph_config usage part in DS R1 docs #5796
Conversation
/bot run

PR_Github #11154 [ run ] triggered by Bot

PR_Github #11154 [ run ] completed with state
Pull Request Overview
This PR replaces deprecated CUDA Graph flags (`cuda_graph_padding_enabled` and `cuda_graph_batch_sizes`) with the new nested `cuda_graph_config` structure across tests and documentation.
- Updated integration tests to use the new `cuda_graph_config` dict in model configs (a hedged before/after sketch follows this list).
- Revised tech blog and best practice docs to show `cuda_graph_config` usage.
- Ensured consistency of config examples across YAML snippets.
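To make the migration concrete, here is a minimal before/after sketch. The deprecated flat flag names come from the overview above; the nested key names (`padding_enabled`, `batch_sizes`) and the example batch sizes are taken from the review comments quoted later in this thread, so treat the exact schema as an assumption rather than the authoritative API.

```yaml
# Before: deprecated flat flags (names from the PR overview)
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes: [256, 512, 896]
```

```yaml
# After: nested map (key names assumed from the review comments below)
cuda_graph_config:
  padding_enabled: true
  batch_sizes:
    - 256
    - 512
    - 896
```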
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/integration/defs/perf/pytorch_model_config.py | Replaced flat CUDA Graph flags with nested cuda_graph_config. |
| docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md | Updated inline code example to reference cuda_graph_config. |
| docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md | Refactored YAML snippets to use the new cuda_graph_config map. |
Comments suppressed due to low confidence (3)
docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md:154
- The inline code snippet includes a literal `\n`, which may not render as intended. Consider using a fenced code block or multi-line inline code for `cuda_graph_config` to improve readability and rendering.
This had a significant **22% E2E performance impact** for throughput scenarios. CUDA Graphs allow capturing a sequence of CUDA operations and launching them as a single unit, drastically reducing kernel launch overheads. This is particularly beneficial for models with many small kernels, and especially on the PyTorch flow, because the Python host code normally executes more slowly than C++. Since a CUDA Graph freezes the kernel launch parameters, which are normally tied to the tensor shapes, it can only be safely used with static shapes, meaning that different CUDA Graphs need to be captured for different batch sizes. Each graph has some cost in memory usage and capture time, so we cannot capture a CUDA Graph for every possible batch size. For non-captured batch sizes, PyTorch eager mode code is executed. TensorRT-LLM has a CUDA Graph padding feature, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio: it tries to pad a batch to the nearest size with a captured graph. Normally you should enable CUDA Graph padding to increase the hit rate, but the padding itself has some overhead due to wasted token computation. Users can opt out of the padding feature to measure the perf benefits by setting `cuda_graph_config:\n padding_enabled: False`; see the API in [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41).
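As the comment suggests, the inline snippet from the quoted paragraph renders more clearly as a fenced block; this is just the same two keys reflowed:

```yaml
# Opt out of CUDA Graph padding (from the quoted blog paragraph)
cuda_graph_config:
  padding_enabled: False
```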
docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md:198
- The YAML example drops the leading hyphens used for mapping keys and may not be valid YAML as shown. Update the snippet to present a proper mapping, e.g.:

```yaml
use_cuda_graph: true
cuda_graph_config:
  padding_enabled: true
  batch_sizes:
    - 896
    - 512
    - 256
```
docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md:265
- Similarly, in the second YAML snippet, ensure consistent YAML mapping syntax rather than list-like hyphens before each key. Use a code fence to clearly separate document structure from list items.
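For illustration only, a minimal sketch of the syntax issue this comment raises; the actual snippet at line 265 is not quoted here, so the keys are borrowed from the earlier example. The first YAML document shows the list-like form (each hyphenated line parses as a single-key map in a sequence), the second the consistent mapping the comment asks for:

```yaml
# List-like hyphens before each key: parses as a sequence of maps
- use_cuda_graph: true
- cuda_graph_config:
    padding_enabled: true
---
# Consistent mapping syntax, as the comment requests
use_cuda_graph: true
cuda_graph_config:
  padding_enabled: true
```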
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
ecebf3d to f1235b4

/bot reuse-pipeline
PR_Github #11242 [ reuse-pipeline ] triggered by Bot

PR_Github #11242 [ reuse-pipeline ] completed with state
…ghput_on_NVIDIA_Blackwell_GPUs.md

Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
f1235b4 to a077921

/bot reuse-pipeline
PR_Github #11250 [ reuse-pipeline ] triggered by Bot

PR_Github #11250 [ reuse-pipeline ] completed with state
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Yuxin <yuxinz@nvidia.com>
Fix typos in cuda_graph_config.