
[Bug]: Tensor dimension mismatch when loading Qwen3-Reranker-4B with tensor parallel > 1 #20670

@yurhett

Description


🐛 Describe the bug

When loading the Qwen3-Reranker-4B model with tensor parallelism enabled (tensor_parallel_size=2), model initialization fails with a tensor dimension mismatch error.

Environment

  • vLLM version: 0.9.2
  • Model: Qwen/Qwen3-Reranker-4B
  • GPU configuration: 2 GPUs with tensor parallelism
  • CUDA version: 12.9

Steps to reproduce

  1. Run vLLM with the following configuration (an equivalent Python-API sketch follows below):

--model Qwen/Qwen3-Reranker-4B \
  --task score \
  --enforce_eager True \
  --served_model_name Qwen/Qwen3-Reranker-4B-30k \
  --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}' \
  --tensor_parallel_size 2 \
  --gpu_memory_utilization 0.97
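For anyone reproducing offline, the following is a rough Python-API equivalent of the CLI invocation above. It is a sketch only: it mirrors the reported flags through the vllm.LLM constructor rather than the exact OpenAI API server entrypoint used in the log.

from vllm import LLM

# Rough offline equivalent of the failing server invocation above (sketch only).
llm = LLM(
    model="Qwen/Qwen3-Reranker-4B",
    task="score",
    enforce_eager=True,
    tensor_parallel_size=2,            # fails during weight loading
    gpu_memory_utilization=0.97,
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)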

Expected behavior

The model should load successfully with tensor parallelism across 2 GPUs.

Actual behavior

The model fails to load with the following error:

RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1 

The full stack trace shows that the error occurs in the load_weights_using_from_2_way_softmax function when it attempts to copy weights into the score layer:

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax model.score.weight.data.copy_(weight) RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1 

Possible workaround

The model loads successfully when using tensor_parallel_size=1 (no tensor parallelism).
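For completeness, a minimal sketch of the working single-GPU path: the same assumed constructor arguments as the sketch under Steps to reproduce, with only tensor_parallel_size changed to 1, followed by a scoring call in the style of the documented LLM.score usage.

from vllm import LLM

# Workaround sketch: identical configuration except tensor_parallel_size=1,
# the setting reported to load successfully.
llm = LLM(
    model="Qwen/Qwen3-Reranker-4B",
    task="score",
    enforce_eager=True,
    tensor_parallel_size=1,
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)

# Example scoring call (query, document); prints a single relevance score.
outputs = llm.score("What is the capital of China?", "Beijing is the capital of China.")
print(outputs[0].outputs.score)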

Full Log:

INFO 07-09 00:56:59 [__init__.py:244] Automatically detected platform cuda. INFO 07-09 00:57:02 [api_server.py:1395] vLLM API server version 0.9.2 INFO 07-09 00:57:02 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': 'Qwen/Qwen3-Reranker-4B', 'task': 'score', 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen3-Reranker-4B-30k'], 'hf_overrides': {'architectures': ['Qwen3ForSequenceClassification'], 'classifier_from_token': ['no', 'yes'], 'is_original_qwen3_reranker': True}, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.97} INFO 07-09 00:57:09 [config.py:1472] Using max model len 40960 INFO 07-09 00:57:09 [arg_utils.py:1596] (Disabling) chunked prefill by default INFO 07-09 00:57:09 [arg_utils.py:1599] (Disabling) prefix caching by default WARNING 07-09 00:57:09 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used INFO 07-09 00:57:09 [config.py:4601] Only "last" pooling supports chunked prefill and prefix caching; disabling both. INFO 07-09 00:57:14 [__init__.py:244] Automatically detected platform cuda. INFO 07-09 00:57:17 [core.py:526] Waiting for init message from front-end. INFO 07-09 00:57:17 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='Qwen/Qwen3-Reranker-4B', speculative_config=None, tokenizer='Qwen/Qwen3-Reranker-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Reranker-4B-30k, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type=None, normalize=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null} WARNING 07-09 00:57:17 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 07-09 00:57:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_86991475'), local_subscribe_addr='ipc:///tmp/d4a143f0-8d21-4c46-a4fd-667a62dba5dd', remote_subscribe_addr=None, remote_addr_ipv6=False) INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda. INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda. 
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8b4da7d2'), local_subscribe_addr='ipc:///tmp/03a0090e-ee69-42d1-a8d5-d900afccb3b0', remote_subscribe_addr=None, remote_addr_ipv6=False) (VllmWorker rank=1 pid=213) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f40c3804'), local_subscribe_addr='ipc:///tmp/1fe57a13-1ef5-480f-a578-f67b4f015124', remote_subscribe_addr=None, remote_addr_ipv6=False) (VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2 (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2 (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2 (VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2 (VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json (VllmWorker rank=1 pid=213) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (VllmWorker rank=0 pid=212) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_468dc0a1'), local_subscribe_addr='ipc:///tmp/0cdef032-c307-4c3a-b4df-1c75d7eca008', remote_subscribe_addr=None, remote_addr_ipv6=False) (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0 (VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1 (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling. (VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling. (VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B... (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B... (VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch... (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch... (VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine. (VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine. 
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt'] (VllmWorker rank=0 pid=212) Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s] (VllmWorker rank=1 pid=213) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt'] (VllmWorker rank=0 pid=212) Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.33it/s] (VllmWorker rank=0 pid=212) Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.38it/s] (VllmWorker rank=0 pid=212) Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.37it/s] (VllmWorker rank=0 pid=212) (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] WorkerProc failed to start. (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] Traceback (most recent call last): (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] worker = WorkerProc(*args, **kwargs) (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 358, in __init__ (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.worker.load_model() (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in load_model (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.model_runner.load_model() (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1776, in load_model (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.model = model_loader.load_model( (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.load_weights(model, model_config) (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] loaded_weights = model.load_weights( (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^ (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 256, in load_weights (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] return seq_cls_model_loader(self, weights) (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 
[multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 375, in seq_cls_model_loader (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] return SEQ_CLS_LOAD_METHODS[method](model, weights) (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] model.score.weight.data.copy_(weight) (VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1 84487e05a96e:213:213 [1] NCCL INFO cudaDriverVersion 12090 84487e05a96e:213:213 [1] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0> 84487e05a96e:213:213 [1] NCCL INFO NCCL version 2.26.2+cuda12.2 84487e05a96e:213:213 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin. 84487e05a96e:213:213 [1] NCCL INFO NET/IB : No device found. 84487e05a96e:213:213 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0> 84487e05a96e:213:213 [1] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0> 84487e05a96e:213:213 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 84487e05a96e:213:213 [1] NCCL INFO Using network Socket 84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init START 84487e05a96e:213:213 [1] NCCL INFO RAS client listening socket at ::1<28028> 84487e05a96e:213:213 [1] NCCL INFO Bootstrap timings total 0.001015 (create 0.000035, send 0.000131, recv 0.000371, ring 0.000024, delay 0.000000) 84487e05a96e:213:213 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. [2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order [2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order 84487e05a96e:213:213 [1] NCCL INFO comm 0xe54d5d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 84487e05a96e:213:213 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 84487e05a96e:213:213 [1] NCCL INFO P2P Chunksize set to 131072 84487e05a96e:213:262 [1] NCCL INFO [Proxy Service] Device 1 CPU core 7 84487e05a96e:213:265 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 9 84487e05a96e:213:213 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct 84487e05a96e:213:213 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct 84487e05a96e:213:213 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 84487e05a96e:213:213 [1] NCCL INFO Connected all trees 84487e05a96e:213:266 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 12 84487e05a96e:213:213 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 84487e05a96e:213:213 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer 84487e05a96e:213:213 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init COMPLETE 84487e05a96e:213:213 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00) 84487e05a96e:212:212 [0] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0> 84487e05a96e:212:212 [0] NCCL INFO cudaDriverVersion 12090 84487e05a96e:212:212 [0] NCCL INFO NCCL version 2.26.2+cuda12.2 84487e05a96e:212:212 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin. 84487e05a96e:212:212 [0] NCCL INFO NET/IB : No device found. 84487e05a96e:212:212 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0> 84487e05a96e:212:212 [0] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0> 84487e05a96e:212:212 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 84487e05a96e:212:212 [0] NCCL INFO Using network Socket 84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init START 84487e05a96e:212:212 [0] NCCL INFO RAS client listening socket at ::1<28028> 84487e05a96e:212:212 [0] NCCL INFO Bootstrap timings total 0.001006 (create 0.000029, send 0.000126, recv 0.000313, ring 0.000026, delay 0.000000) 84487e05a96e:212:212 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. [2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order [2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order 84487e05a96e:212:212 [0] NCCL INFO comm 0xe64fff0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 84487e05a96e:212:212 [0] NCCL INFO Channel 00/02 : 0 1 84487e05a96e:212:212 [0] NCCL INFO Channel 01/02 : 0 1 84487e05a96e:212:212 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 84487e05a96e:212:212 [0] NCCL INFO P2P Chunksize set to 131072 84487e05a96e:212:212 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0 84487e05a96e:212:264 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 9 84487e05a96e:212:263 [0] NCCL INFO [Proxy Service] Device 0 CPU core 7 84487e05a96e:212:212 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct 84487e05a96e:212:212 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct 84487e05a96e:212:212 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1 84487e05a96e:212:212 [0] NCCL INFO Connected all trees 84487e05a96e:212:267 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 13 84487e05a96e:212:212 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 84487e05a96e:212:212 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer 84487e05a96e:212:212 [0] NCCL INFO CC Off, workFifoBytes 1048576 84487e05a96e:212:212 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. 
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init COMPLETE 84487e05a96e:212:212 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00) [rank0]:[W709 00:57:29.669899852 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) ERROR 07-09 00:57:30 [core.py:586] EngineCore failed to start. ERROR 07-09 00:57:30 [core.py:586] Traceback (most recent call last): ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core ERROR 07-09 00:57:30 [core.py:586] engine_core = EngineCoreProc(*args, **kwargs) ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__ ERROR 07-09 00:57:30 [core.py:586] super().__init__(vllm_config, executor_class, log_stats, ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__ ERROR 07-09 00:57:30 [core.py:586] self.model_executor = executor_class(vllm_config) ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__ ERROR 07-09 00:57:30 [core.py:586] self._init_executor() ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor ERROR 07-09 00:57:30 [core.py:586] self.workers = WorkerProc.wait_for_ready(unready_workers) ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready ERROR 07-09 00:57:30 [core.py:586] raise e from None ERROR 07-09 00:57:30 [core.py:586] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause. 
Process EngineCore_0: Traceback (most recent call last): File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 590, in run_engine_core raise e File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core engine_core = EngineCoreProc(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__ super().__init__(vllm_config, executor_class, log_stats, File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__ self.model_executor = executor_class(vllm_config) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__ self._init_executor() File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor self.workers = WorkerProc.wait_for_ready(unready_workers) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready raise e from None Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause. Traceback (most recent call last): File "<frozen runpy>", line 198, in _run_module_as_main File "<frozen runpy>", line 88, in _run_code File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1495, in <module> uvloop.run(run_server(args)) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run return __asyncio.run( ^^^^^^^^^^^^^^ File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run return runner.run(main) ^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper return await main ^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server await run_server_worker(listen_address, sock, args, **uvicorn_kwargs) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker async with build_async_engine_client(args, client_config) as engine_client: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client async with build_async_engine_client_from_engine_args( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args async_llm = AsyncLLM.from_vllm_config( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config return cls( ^^^^ File 
"/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__ self.engine_core = EngineCoreClient.make_async_mp_client( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client return AsyncMPClient(*client_args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 666, in __init__ super().__init__( File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 403, in __init__ with launch_core_engines(vllm_config, executor_class, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__ next(self.gen) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines wait_for_engine_startup( File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup raise RuntimeError("Engine core initialization failed. " RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {} /usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' 

Your current environment

The output of python collect_env.py
Collecting environment information... ============================== System Info ============================== OS : Ubuntu 22.04.5 LTS (x86_64) GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version : Could not collect CMake version : version 4.0.3 Libc version : glibc-2.35 ============================== PyTorch Info ============================== PyTorch version : 2.7.0+cu128 Is debug build : False CUDA used to build PyTorch : 12.8 ROCM used to build PyTorch : N/A ============================== Python Environment ============================== Python version : 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime) Python platform : Linux-5.15.0-43-generic-x86_64-with-glibc2.35 ============================== CUDA / GPU Info ============================== Is CUDA available : True CUDA runtime version : 12.8.93 CUDA_MODULE_LOADING set to : LAZY GPU models and configuration : GPU 0: NVIDIA GeForce RTX 3080 GPU 1: NVIDIA GeForce RTX 3080 Nvidia driver version : 575.64.03 cuDNN version : Could not collect HIP runtime version : N/A MIOpen runtime version : N/A Is XNNPACK available : True ============================== CPU Info ============================== Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz CPU family: 6 Model: 85 Thread(s) per core: 1 Core(s) per socket: 16 Socket(s): 2 Stepping: 7 BogoMIPS: 6000.00 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities L1d cache: 1 MiB (32 instances) L1i cache: 1 MiB (32 instances) L2 cache: 128 MiB (32 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-15 NUMA node1 CPU(s): 16-31 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Mitigation; TSX disabled ============================== Versions of relevant libraries ============================== [pip3] flashinfer-python==0.2.6.post1+cu128torch2.7 [pip3] numpy==2.2.6 [pip3] nvidia-cublas-cu12==12.8.3.14 [pip3] nvidia-cuda-cupti-cu12==12.8.57 [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 [pip3] nvidia-cuda-runtime-cu12==12.8.57 [pip3] nvidia-cudnn-cu12==9.7.1.26 [pip3] nvidia-cufft-cu12==11.3.3.41 [pip3] nvidia-cufile-cu12==1.13.0.11 [pip3] nvidia-curand-cu12==10.3.9.55 [pip3] nvidia-cusolver-cu12==11.7.2.55 [pip3] nvidia-cusparse-cu12==12.5.7.53 [pip3] nvidia-cusparselt-cu12==0.6.3 [pip3] 
nvidia-nccl-cu12==2.26.2 [pip3] nvidia-nvjitlink-cu12==12.8.61 [pip3] nvidia-nvtx-cu12==12.8.55 [pip3] pyzmq==27.0.0 [pip3] torch==2.7.0+cu128 [pip3] torchaudio==2.7.0+cu128 [pip3] torchvision==0.22.0+cu128 [pip3] transformers==4.53.1 [pip3] triton==3.3.0 [conda] Could not collect ============================== vLLM Info ============================== ROCM Version : Could not collect Neuron SDK Version : N/A vLLM Version : 0.9.2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X PHB 0-31 0-1 N/A GPU1 PHB X 0-31 0-1 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ============================== Environment Variables ============================== NVIDIA_VISIBLE_DEVICES=all NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 NCCL_VERSION=2.25.1-1 NVIDIA_DRIVER_CAPABILITIES=compute,utility NCCL_DEBUG=INFO NVIDIA_PRODUCT_NAME=CUDA 
VLLM_USAGE_SOURCE=production-docker-image CUDA_VERSION=12.8.1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True LD_LIBRARY_PATH=/usr/local/cuda/lib64 NCCL_CUMEM_ENABLE=0 PYTORCH_NVML_BASED_CUDA_CHECK=1 TORCHINDUCTOR_COMPILE_THREADS=1 CUDA_MODULE_LOADING=LAZY

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
