Following the page vLLM | NVIDIA NGC, I start the HTTP inference server inside the container and it comes up OK:
(APIServer pid=157) INFO: Started server process [157]
(APIServer pid=157) INFO: Waiting for application startup.
(APIServer pid=157) INFO: Application startup complete.
Then I open a new console and run the client:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is NVIDIA famous for?"}]
  }'
But it reports the error:
Unable to round-trip http request to upstream: dial tcp 0.0.0.0:8000: connect: connection refused
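(Note: that "Unable to round-trip http request to upstream" wording looks like it comes from an HTTP proxy rather than from curl itself, which reports connection failures differently. A quick way to tell whether the server is down or something in between is refusing the connection is to probe the port directly from the host while bypassing any proxy. A minimal sketch, assuming the server was started on the default port 8000:

# Check whether anything is listening on port 8000 on the host
ss -ltnp | grep 8000

# Probe the OpenAI-compatible server directly, bypassing any configured HTTP proxy
curl --noproxy '*' http://127.0.0.1:8000/v1/models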
Hi,
Could you share the log when starting the inference server so we can know more about the issue?
Thanks.
root@74ffe1684e65:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85 2>&1 | tee Llama-3.1-8B-Instruct-FP8.txt
Llama-3.1-8B-Instruct-FP8.txt (16.7 KB)
What is your docker run command? I think --net=host helps.
For example:
docker run --name vllm --rm -it --network host \
  --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=16g \
  -e VLLM_USE_V1=1 -e VLLM_WORKER_MULTIPROC=0 \
  -v "$HOME/.cache:/root/.cache" \
  nvcr.io/nvidia/vllm:25.09-py3
I don't know if it's required, but I tell vLLM the host and port to be sure:
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 512 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.25 \
  --kv-cache-dtype=auto
Then this query works for me:
curl -X 'POST' 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What are Chihuahuas famous for?"}]
  }' | jq
First, my docker command is the same as on the page:
docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.09-py3
Then, following your instructions exactly, it still reports the error:
Unable to round-trip http request to upstream: dial tcp 127.0.0.1:8000: connect: connection refused
There is a specific configuration for the Docker proxy:
nvidia@ThorA:~$ cat .docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": "http://192.168.2.211:39355",
      "httpsProxy": "http://192.168.2.211:39355",
      "noProxy": "127.0.0.0/8"
    }
  }
}
Is that the reason? But I do need the proxy; without it I get connection errors.
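(A note on how that can interact with the curl test: the proxies block in ~/.docker/config.json makes Docker inject HTTP_PROXY/HTTPS_PROXY into the containers it starts, and if the shell where you run curl also has the proxy set, a request addressed to 0.0.0.0:8000 or localhost that is not covered by no_proxy is forwarded to the proxy, which cannot reach a port on your local machine. A sketch of how to keep the proxy for downloads but bypass it for the local server, assuming the server listens on port 8000:

# Bypass the proxy just for this one request
curl --noproxy '*' http://127.0.0.1:8000/v1/models

# Or exclude loopback addresses from proxying for the whole shell session
export NO_PROXY=localhost,127.0.0.1,0.0.0.0
export no_proxy=$NO_PROXY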
The following may work with your proxy. I don't have a proxy to test with.
# Replace with your actual proxy and any networks you don't want proxied.
# You don't have to do the exports, but it may help.
export HTTP_PROXY=http://192.168.2.211:39355
export HTTPS_PROXY=http://192.168.2.211:39355
export NO_PROXY=127.0.0.1,localhost,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12
docker run -d --rm \
  --name vllm \
  --runtime nvidia \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 6678:6678 \
  nvcr.io/nvidia/vllm:25.09-py3
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 6678 \
  --model nvidia/Llama-3.1-8B-Instruct-FP8
curl http://localhost:6678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.2
  }' | jq
If localhost doesn't work in the query, try 127.0.0.1 or 0.0.0.0.
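Before sending a full chat completion, it can also help to confirm the server is reachable at all. A minimal sketch, assuming the server from the commands above is listening on port 6678:

# Lightweight reachability checks against the OpenAI-compatible server
curl http://127.0.0.1:6678/health          # returns 200 once the server is up
curl http://127.0.0.1:6678/v1/models | jq  # lists the served model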
Hi,
We just tested the same command on our local device and it works, so the command itself is fine.
$ docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.09-py3
root@8d0005b63337:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85
...
(APIServer pid=362) INFO: Started server process [362]
(APIServer pid=362) INFO: Waiting for application startup.
(APIServer pid=362) INFO: Application startup complete.
Here are the contents of the file you mentioned on our side, which doesn't have the proxy setting.
$ cat .docker/config.json
{
  "auths": {
    "nvcr.io": {
      "auth": "xxx"
    }
  }
}
Could you try @whitesscott's suggestion to see if you can access the vLLM server?
Thanks.
Docker and the server inside Docker are OK on my device too.
Please try connecting to the server with a client, as the page describes:
Open a new console and run the client:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is NVIDIA famous for?"}]
  }'
On my side, it reports the error:
Unable to round-trip http request to upstream: dial tcp 0.0.0.0:8000: connect: connection refused
nvidia@ThorA:~$ export HTTP_PROXY=http://192.168.2.211:39355
nvidia@ThorA:~$ export HTTPS_PROXY=http://192.168.2.211:39355
nvidia@ThorA:~$ export NO_PROXY=127.0.0.1,localhost,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12
nvidia@ThorA:~$ docker run -d --rm --name vllm --runtime nvidia --gpus all --ulimit memlock=-1 --ulimit stack=67108864 -p 6678:6678 nvcr.io/nvidia/vllm:25.09-py3
1770b6d7d9f72039d8dd45913b1541af225dd9a8dbb39814c0f76be8af8a8640
nvidia@ThorA:~$ python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 6678 --model nvidia/Llama-3.1-8B-Instruct-FP8
/usr/bin/python: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm.entrypoints')
nvidia@ThorA:~$
The Docker container is running in the background now.
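(The ModuleNotFoundError above happens because the python command was run on the host rather than inside the container; the vllm package only exists inside the image. One way to reach the already-running background container is docker exec, a sketch assuming it is still up and named vllm as in the earlier command:

# Attach a shell to the running container, then start the server from inside it
docker exec -it vllm bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 6678 --model nvidia/Llama-3.1-8B-Instruct-FP8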
Try the following for now, just to get it working, without running the container as a -d daemon.
docker run -it --rm --net=host --runtime nvidia --privileged \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v $HOME/.cache:/root/.cache \
  -v $PWD:/workspace \
  --workdir /workspace \
  -e HF_TOKEN \
  nvcr.io/nvidia/vllm:25.09-py3
Then, inside the vLLM Docker container, confirm this path exists:
ls /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/
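Another quick check (a sketch; it only assumes the vllm package shipped in the image) is to import the entrypoint module directly:

# Should print the installed vLLM version and raise no ImportError
python -c "import vllm; import vllm.entrypoints.openai.api_server; print(vllm.__version__)"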
Still inside the container, try this:
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  --host 0.0.0.0 \
  --port 6678 \
  --tensor-parallel-size 1 \
  --max-model-len 512 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.25 \
  --kv-cache-dtype=auto
After you see:
(APIServer pid=157) INFO: Started server process [157]
(APIServer pid=157) INFO: Waiting for application startup.
(APIServer pid=157) INFO: Application startup complete.
In a second terminal:
curl http://0.0.0.0:6678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.2
  }' | jq
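If you only want the assistant's reply rather than the full JSON, the standard OpenAI-style response shape lets you pull it out with jq (a small usage sketch):

# Extract just the generated text from the chat completion response
curl -s http://0.0.0.0:6678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/Llama-3.1-8B-Instruct-FP8", "messages": [{"role":"user","content":"Say hello in one sentence."}]}' \
  | jq -r '.choices[0].message.content'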