@@ -2,14 +2,10 @@
 
 The LLM API is a high-level Python API designed to streamline LLM inference workflows.
 
-It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo).
 
 While the LLM API simplifies inference workflows with a high-level interface, it is also designed with flexibility in mind. Under the hood, it uses a PyTorch-native and modular backend, making it easy to customize, extend, or experiment with the runtime.
 
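The parallelism strategies mentioned above are typically exposed directly on the `LLM` constructor rather than through separate tooling. A minimal sketch, assuming the documented `tensor_parallel_size` argument (exact knob names can vary between releases):

```python
# Sketch: shard one model across two GPUs with tensor parallelism.
# Assumes the `tensor_parallel_size` argument of the LLM constructor;
# pipeline parallelism is configured analogously where supported.
from tensorrt_llm import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any supported HF ID or local path
    tensor_parallel_size=2,                      # split the weights across 2 GPUs
)
```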
-## Table of Contents
-- [Quick Start Example](#quick-start-example)
-- [Supported Models](#supported-models)
-- [Tips and Troubleshooting](#tips-and-troubleshooting)
 
 ## Quick Start Example
 A simple inference example with TinyLlama using the LLM API:
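The example block itself falls between the two hunks shown here; for orientation, a minimal sketch of such a quick start, assuming the standard `LLM` and `SamplingParams` entry points:

```python
# Sketch of the TinyLlama quick start (assumes the standard
# tensorrt_llm entry points; the canonical example lives in the docs).
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # First use downloads the checkpoint from Hugging Face.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    for output in llm.generate(prompts, sampling_params):
        print(f"{output.prompt!r} -> {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```

As the trailing context of the next hunk shows, `model=` also accepts a local checkpoint path in place of a Hugging Face model ID.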
@@ -53,39 +49,6 @@ llm = LLM(model=<local_path_to_model>) |
 > **Note:** Some models require accepting specific [license agreements](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). Make sure you have agreed to the terms and authenticated with Hugging Face before downloading.
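One way to authenticate from Python, assuming the `huggingface_hub` package (the `huggingface-cli login` command or an `HF_TOKEN` environment variable work equally well):

```python
# Sketch: authenticate before downloading gated checkpoints such as Llama.
from huggingface_hub import login

login(token="hf_...")  # paste a token from https://huggingface.co/settings/tokens
```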
 
 
-## Supported Models
-
-
-| Models | [Model Class Name](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/_torch/models) | HuggingFace Model ID Example | Modality |
-| :--- | :---: | :--- | :---: |
-| BERT-based | `BertForSequenceClassification` | `textattack/bert-base-uncased-yelp-polarity` | L |
-| DeepSeek-V3 | `DeepseekV3ForCausalLM` | `deepseek-ai/DeepSeek-V3` | L |
-| Gemma3 | `Gemma3ForCausalLM` | `google/gemma-3-1b-it` | L |
-| HyperCLOVAX-SEED-Vision | `HCXVisionForCausalLM` | `naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B` | L + V |
-| VILA | `LlavaLlamaModel` | `Efficient-Large-Model/NVILA-8B` | L + V |
-| LLaVA-NeXT | `LlavaNextForConditionalGeneration` | `llava-hf/llava-v1.6-mistral-7b-hf` | L + V |
-| Llama 3 <br> Llama 3.1 <br> Llama 2 <br> LLaMA | `LlamaForCausalLM` | `meta-llama/Meta-Llama-3.1-70B` | L |
-| Llama 4 Scout <br> Llama 4 Maverick | `Llama4ForConditionalGeneration` | `meta-llama/Llama-4-Scout-17B-16E-Instruct` <br> `meta-llama/Llama-4-Maverick-17B-128E-Instruct` | L + V |
-| Mistral | `MistralForCausalLM` | `mistralai/Mistral-7B-v0.1` | L |
-| Mixtral | `MixtralForCausalLM` | `mistralai/Mixtral-8x7B-v0.1` | L |
-| Llama 3.2 | `MllamaForConditionalGeneration` | `meta-llama/Llama-3.2-11B-Vision` | L |
-| Nemotron-3 <br> Nemotron-4 <br> Minitron | `NemotronForCausalLM` | `nvidia/Minitron-8B-Base` | L |
-| Nemotron-H | `NemotronHForCausalLM` | `nvidia/Nemotron-H-8B-Base-8K` <br> `nvidia/Nemotron-H-47B-Base-8K` <br> `nvidia/Nemotron-H-56B-Base-8K` | L |
-| LlamaNemotron <br> LlamaNemotron Super <br> LlamaNemotron Ultra | `NemotronNASForCausalLM` | `nvidia/Llama-3_1-Nemotron-51B-Instruct` <br> `nvidia/Llama-3_3-Nemotron-Super-49B-v1` <br> `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | L |
-| QwQ, Qwen2 | `Qwen2ForCausalLM` | `Qwen/Qwen2-7B-Instruct` | L |
-| Qwen2-based | `Qwen2ForProcessRewardModel` | `Qwen/Qwen2.5-Math-PRM-7B` | L |
-| Qwen2-based | `Qwen2ForRewardModel` | `Qwen/Qwen2.5-Math-RM-72B` | L |
-| Qwen2-VL | `Qwen2VLForConditionalGeneration` | `Qwen/Qwen2-VL-7B-Instruct` | L + V |
-| Qwen2.5-VL | `Qwen2_5_VLForConditionalGeneration` | `Qwen/Qwen2.5-VL-7B-Instruct` | L + V |
-
-
-- **L**: Language model only
-- **L + V**: Language and Vision multimodal support
-- Llama 3.2 accepts vision input, but our support is currently limited to text only.
-
-> **Note:** For the most up-to-date list of supported models, refer to the [TensorRT-LLM model definitions](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/_torch/models).
-
-
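The HuggingFace Model ID column above plugs straight into the `model` argument from the quick start; a minimal sketch using one of the listed language-only entries:

```python
# Sketch: any Model ID from the table above can be passed as `model=`;
# Qwen2-7B-Instruct is used here purely as an example.
from tensorrt_llm import LLM

llm = LLM(model="Qwen/Qwen2-7B-Instruct")  # handled by Qwen2ForCausalLM
```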
 ## Tips and Troubleshooting
 
 The following tips typically help new LLM API users who are already familiar with TensorRT-LLM's other APIs: