vLLM AI inference serving example #566
Conversation
/cc @janetkuo
ai/vllm-deployment/README.md:

```markdown
## 📚 Table of Contents

- [Prerequisites](#prerequisites)
```
nit: this link is broken (the section title below has an emoji); same for some other table of contents links.
Fixed
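For context on the nit: GitHub derives in-page anchors from heading text, so an emoji in a heading changes the generated slug and a plain `#prerequisites` link stops resolving. A minimal sketch of the mismatch and one possible fix (how the PR actually resolved it is not shown here):

```markdown
<!-- Broken: the emoji becomes part of the generated anchor, so this link misses -->
## ⚙️ Prerequisites
- [Prerequisites](#prerequisites)

<!-- One possible fix: drop the emoji so the slug matches the link -->
## Prerequisites
- [Prerequisites](#prerequisites)
```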
ai/vllm-deployment/README.md (outdated):

```markdown
## ⚙️ Prerequisites

- Kubernetes cluster which has access to Nvidia GPU's (tested with GKE autopilot cluster v1.32).
```
This clarifies that GKE was used for testing and that the example can be adapted to other Kubernetes clusters with NVIDIA GPUs. The node selectors being commented out is helpful.
It would be even better to briefly explain what a user on a different platform might need to adjust (e.g., node labels for GPU type).
Given the direction that the community is moving towards, this example could switch to use DRA, but that can be updated in a follow-up PR.
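To make the platform-specific adjustment concrete, a hedged sketch of a GPU-targeting `nodeSelector`: the GKE label is a documented node label, while the commented alternative assumes NVIDIA GPU Feature Discovery is installed on the cluster.

```yaml
# Sketch: selecting nodes by GPU type; labels are provider-dependent.
spec:
  nodeSelector:
    # On GKE, the accelerator type is exposed as a node label:
    cloud.google.com/gke-accelerator: nvidia-l4
    # On other platforms labels differ; e.g. with NVIDIA GPU Feature
    # Discovery installed (an assumption about the setup):
    # nvidia.com/gpu.product: NVIDIA-L4
```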
I've added a section on Cloud Provider Prerequisites (as a sub-section within the Prerequisites section), which specifies what is necessary from the three largest cloud providers. Please let me know what you think.
Thanks for the fix! I find the new section a bit long, so suggest merging into this bullet point so we don't spend too much text talking about cloud configurations.
```suggestion
- A Kubernetes cluster with access to NVIDIA GPUs. This example was tested on GKE, but can be adapted for other cloud providers like EKS and AKS by ensuring you have a GPU-enabled node pool and have deployed the Nvidia device plugin.
```
I think this is fixed. Please have a look.
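Since the suggested bullet above assumes the NVIDIA device plugin is already deployed, a minimal sketch of doing so; the version tag and manifest path are assumptions, so check the NVIDIA/k8s-device-plugin releases for current values:

```shell
# Deploy the NVIDIA device plugin DaemonSet (version tag is an assumption):
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

# Verify that nodes now advertise nvidia.com/gpu as an allocatable resource:
kubectl describe nodes | grep -A 3 'nvidia.com/gpu'
```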
```yaml
spec:
  containers:
  - name: inference-server
    image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01
```
Is there an equivalent official vLLM image available on a more vendor-neutral registry? If using this specific image is necessary (e.g., due to optimizations or pre-configurations crucial for the example), this dependency and the reason for it should be clearly documented in the README.md under "Prerequisites."
Good point. I've updated to the more generic version of this image: vllm/vllm-openai:latest
Force-pushed 162dd69 to a60f29d
janetkuo left a comment
A few nits and lgtm otherwise
ai/vllm-deployment/README.md (outdated):

```markdown
## ⚙️ Prerequisites

- Kubernetes cluster which has access to Nvidia GPU's (tested with GKE autopilot cluster v1.32).
- Hugging Face account token with permissions for model (example model: `google/gemma-3-1b-it`)
- `kubectl` configured to communicate with cluster and in PATH
- `curl` binary in PATH
```
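As a hedged sketch of the Hugging Face token prerequisite in practice, one way to hand the token to the cluster is a Secret; the secret and key names below are assumptions and must match whatever `vllm-deployment.yaml` actually references:

```shell
# Store the Hugging Face token in a Secret (names are assumptions):
export HF_TOKEN=<your-hugging-face-token>
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token="$HF_TOKEN"
```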
```suggestion
**Note for GKE users:** To target specific GPU types, you can uncomment the GKE-specific `nodeSelector` in `vllm-deployment.yaml`.
```
I think this is fixed. Please have a look.
```yaml
spec:
  containers:
  - name: inference-server
    image: vllm/vllm-openai:latest
```
Please use a version tag instead of latest, ref https://github.com/kubernetes/examples/blob/master/guidelines.md#manifests-and-configuration
I think this is fixed. I tested by pulling the image. Please have a look.
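For anyone reproducing the pin, a sketch of one way to resolve a tag to its immutable digest using standard Docker commands (the tag matches the one named in the suggestion further below):

```shell
# Pull the tagged image, then read back the registry digest it resolved to:
docker pull vllm/vllm-openai:v0.10.0
docker inspect --format='{{index .RepoDigests 0}}' vllm/vllm-openai:v0.10.0
# -> vllm/vllm-openai@sha256:05a31dc4...
```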
```yaml
spec:
  containers:
  - name: inference-server
    image: vllm/vllm-openai@sha256:05a31dc4185b042e91f4d2183689ac8a87bd845713d5c3f987563c5899878271
```
SHA isn't readable, so suggest adding a comment here for clarity
```diff
-    image: vllm/vllm-openai@sha256:05a31dc4185b042e91f4d2183689ac8a87bd845713d5c3f987563c5899878271
+    # vllm/vllm-openai:v0.10.0
+    image: vllm/vllm-openai@sha256:05a31dc4185b042e91f4d2183689ac8a87bd845713d5c3f987563c5899878271
```
Good idea--done.
janetkuo left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janetkuo, seans3

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.