Following the page vLLM | NVIDIA NGC, I start the HTTP inference server inside the container and it comes up OK:
(APIServer pid=157) INFO: Started server process [157]
(APIServer pid=157) INFO: Waiting for application startup.
(APIServer pid=157) INFO: Application startup complete.
Then I open a new console and run the client:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is NVIDIA famous for?"}]
  }'
But it reports the error:
Unable to round-trip http request to upstream: dial tcp 0.0.0.0:8000: connect: connection refused
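(Note: that "Unable to round-trip http request to upstream" wording looks like it comes from an HTTP proxy rather than from curl itself, which reports connection failures differently. A quick way to tell whether the server is down or something in between is refusing the connection is to probe the port directly from the host while bypassing any proxy. A minimal sketch, assuming the server was started on the default port 8000:

# Check whether anything is listening on port 8000 on the host
ss -ltnp | grep 8000

# Probe the OpenAI-compatible server directly, bypassing any configured HTTP proxy
curl --noproxy '*' http://127.0.0.1:8000/v1/models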
Hi,
Could you share the log when starting the inference server so we can know more about the issue?
Thanks.
root@74ffe1684e65:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85 2>&1 | tee Llama-3.1-8B-Instruct-FP8.txt
Llama-3.1-8B-Instruct-FP8.txt (16.7 KB)
What is your docker run command? I think --net=host helps.
For example:
docker run --name vllm --rm -it --network host \
  --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=16g \
  -e VLLM_USE_V1=1 -e VLLM_WORKER_MULTIPROC=0 \
  -v "$HOME/.cache:/root/.cache" \
  nvcr.io/nvidia/vllm:25.09-py3
I don't know if it's required, but I tell vLLM the host and port to be sure:
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 512 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.25 \
  --kv-cache-dtype=auto
Then this query works for me:
curl -X 'POST' 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What are Chihuahuas famous for?"}]
  }' | jq
First, my docker command is the same as on the page:
docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.09-py3
Then, following your instructions exactly, it still reports the error:
Unable to round-trip http request to upstream: dial tcp 127.0.0.1:8000: connect: connection refused
There is a specific configuration for the Docker proxy:
nvidia@ThorA:~$ cat .docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": "http://192.168.2.211:39355",
      "httpsProxy": "http://192.168.2.211:39355",
      "noProxy": "127.0.0.0/8"
    }
  }
}
Is that the reason? But I do need the proxy; without it I get connection errors.
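(A note on how that can interact with the curl test: the proxies block in ~/.docker/config.json makes Docker inject HTTP_PROXY/HTTPS_PROXY into the containers it starts, and if the shell where you run curl also has the proxy set, a request addressed to 0.0.0.0:8000 or localhost that is not covered by no_proxy is forwarded to the proxy, which cannot reach a port on your local machine. A sketch of how to keep the proxy for downloads but bypass it for the local server, assuming the server listens on port 8000:

# Bypass the proxy just for this one request
curl --noproxy '*' http://127.0.0.1:8000/v1/models

# Or exclude loopback addresses from proxying for the whole shell session
export NO_PROXY=localhost,127.0.0.1,0.0.0.0
export no_proxy=$NO_PROXY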
The following may work with your proxy. I don't have a proxy to test with.
# Replace with your actual proxy and any networks you don't want proxied.
# You don't have to do the exports, but it may help.
export HTTP_PROXY=http://192.168.2.211:39355
export HTTPS_PROXY=http://192.168.2.211:39355
export NO_PROXY=127.0.0.1,localhost,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12
docker run -d --rm \
  --name vllm \
  --runtime nvidia \
  --gpus all \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 6678:6678 \
  nvcr.io/nvidia/vllm:25.09-py3
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 6678 \
  --model nvidia/Llama-3.1-8B-Instruct-FP8
curl http://localhost:6678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.2
  }' | jq
If localhost doesn't work in the query, try 127.0.0.1 or 0.0.0.0.
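Before sending a full chat completion, it can also help to confirm the server is reachable at all. A minimal sketch, assuming the server from the commands above is listening on port 6678:

# Lightweight reachability checks against the OpenAI-compatible server
curl http://127.0.0.1:6678/health          # returns 200 once the server is up
curl http://127.0.0.1:6678/v1/models | jq  # lists the served model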
Hi,
We just tested the same command on our local device and it works, so the command itself is fine.
$ docker run --gpus all -it --rm nvcr.io/nvidia/vllm:25.09-py3
root@8d0005b63337:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85
...
(APIServer pid=362) INFO: Started server process [362]
(APIServer pid=362) INFO: Waiting for application startup.
(APIServer pid=362) INFO: Application startup complete.
Here are the contents of the file you mentioned on our side, which doesn't have the proxy setting.
$ cat .docker/config.json
{
  "auths": {
    "nvcr.io": {
      "auth": "xxx"
    }
  }
}
Could you try @whitesscott's suggestion to see if you can access the vLLM server?
Thanks.
Docker and the server inside Docker are OK on my device too.
Please try connecting to the server with a client, as the page describes:
Open a new console and run the client:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is NVIDIA famous for?"}]
  }'
On my side, it reports the error:
Unable to round-trip http request to upstream: dial tcp 0.0.0.0:8000: connect: connection refused
nvidia@ThorA:~$ export HTTP_PROXY=http://192.168.2.211:39355
nvidia@ThorA:~$ export HTTPS_PROXY=http://192.168.2.211:39355
nvidia@ThorA:~$ export NO_PROXY=127.0.0.1,localhost,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12
nvidia@ThorA:~$ docker run -d --rm --name vllm --runtime nvidia --gpus all --ulimit memlock=-1 --ulimit stack=67108864 -p 6678:6678 nvcr.io/nvidia/vllm:25.09-py3
1770b6d7d9f72039d8dd45913b1541af225dd9a8dbb39814c0f76be8af8a8640
nvidia@ThorA:~$ python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 6678 --model nvidia/Llama-3.1-8B-Instruct-FP8
/usr/bin/python: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm.entrypoints')
nvidia@ThorA:~$
The Docker container is running in the background now.
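(The ModuleNotFoundError above happens because the python command was run on the host rather than inside the container; the vllm package only exists inside the image. One way to reach the already-running background container is docker exec, a sketch assuming it is still up and named vllm as in the earlier command:

# Attach a shell to the running container, then start the server from inside it
docker exec -it vllm bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 6678 --model nvidia/Llama-3.1-8B-Instruct-FP8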
Try the following for now, just to get it working, without running the container as a -d daemon.
docker run -it --rm --net=host --runtime nvidia --privileged \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v $HOME/.cache:/root/.cache \
  -v $PWD:/workspace \
  --workdir /workspace \
  -e HF_TOKEN \
  nvcr.io/nvidia/vllm:25.09-py3
Then, inside the vLLM Docker container, confirm this path exists:
ls /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/
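Another quick check (a sketch; it only assumes the vllm package shipped in the image) is to import the entrypoint module directly:

# Should print the installed vLLM version and raise no ImportError
python -c "import vllm; import vllm.entrypoints.openai.api_server; print(vllm.__version__)"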
Still inside the container, try this:
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  --host 0.0.0.0 \
  --port 6678 \
  --tensor-parallel-size 1 \
  --max-model-len 512 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.25 \
  --kv-cache-dtype=auto
After you see:
(APIServer pid=157) INFO: Started server process [157]
(APIServer pid=157) INFO: Waiting for application startup.
(APIServer pid=157) INFO: Application startup complete.
In a second terminal:
curl http://0.0.0.0:6678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.2
  }' | jq
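If you only want the assistant's reply rather than the full JSON, the standard OpenAI-style response shape lets you pull it out with jq (a small usage sketch):

# Extract just the generated text from the chat completion response
curl -s http://0.0.0.0:6678/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/Llama-3.1-8B-Instruct-FP8", "messages": [{"role":"user","content":"Say hello in one sentence."}]}' \
  | jq -r '.choices[0].message.content'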