🐛 Describe the bug
When trying to load the Qwen3-Reranker-4B model with tensor parallelism enabled (tensor_parallel_size=2), the model initialization fails due to a tensor dimension mismatch error.
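For context (my own reading of the numbers, not verified against the vLLM source): Qwen3-4B has hidden_size = 2560, so with tensor_parallel_size=2 each rank's classification head apparently keeps only a 2560 / 2 = 1280-wide slice of the hidden dimension, while the 2-way-softmax loader builds the full 2560-wide score vector and tries to copy it in unchanged. A minimal sketch of the shape mismatch (hypothetical standalone PyTorch, not vLLM code):

```python
import torch

hidden_size = 2560                   # Qwen3-4B hidden size (assumption from the error message)
tp_size = 2                          # tensor_parallel_size
local_dim = hidden_size // tp_size   # 1280 columns kept per TP rank

# Full-width score vector built from the "no"/"yes" lm_head rows.
full_weight = torch.zeros(1, hidden_size)

# Score parameter on one TP rank, with its input dimension sharded.
score_weight = torch.zeros(1, local_dim)

try:
    score_weight.copy_(full_weight)  # same kind of copy the loader attempts
except RuntimeError as e:
    # "The size of tensor a (1280) must match the size of tensor b (2560) ..."
    print(e)
```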
Environment
- vLLM version: 0.9.2
- Model: Qwen/Qwen3-Reranker-4B
- GPU configuration: 2 GPUs with tensor parallelism
- CUDA version: 12.9
Steps to reproduce
- Run vLLM with the following configuration:
--model Qwen/Qwen3-Reranker-4B --task score --enforce_eager True --served_model_name Qwen/Qwen3-Reranker-4B-30k --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}' --tensor_parallel_size 2 --gpu_memory_utilization 0.97
Expected behavior
The model should load successfully with tensor parallelism across 2 GPUs.
Actual behavior
The model fails to load with the following error:
RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
The full stack trace shows the error occurs in the load_weights_using_from_2_way_softmax function when attempting to copy weights to the score layer:
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax
model.score.weight.data.copy_(weight)
RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
Possible workaround
The model loads successfully when using tensor_parallel_size=1 (no tensor parallelism).
Full Log:
INFO 07-09 00:56:59 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:02 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-09 00:57:02 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': 'Qwen/Qwen3-Reranker-4B', 'task': 'score', 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen3-Reranker-4B-30k'], 'hf_overrides': {'architectures': ['Qwen3ForSequenceClassification'], 'classifier_from_token': ['no', 'yes'], 'is_original_qwen3_reranker': True}, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.97}
INFO 07-09 00:57:09 [config.py:1472] Using max model len 40960
INFO 07-09 00:57:09 [arg_utils.py:1596] (Disabling) chunked prefill by default
INFO 07-09 00:57:09 [arg_utils.py:1599] (Disabling) prefix caching by default
WARNING 07-09 00:57:09 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 07-09 00:57:09 [config.py:4601] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
INFO 07-09 00:57:14 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:17 [core.py:526] Waiting for init message from front-end.
INFO 07-09 00:57:17 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='Qwen/Qwen3-Reranker-4B', speculative_config=None, tokenizer='Qwen/Qwen3-Reranker-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Reranker-4B-30k, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type=None, normalize=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 07-09 00:57:17 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-09 00:57:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_86991475'), local_subscribe_addr='ipc:///tmp/d4a143f0-8d21-4c46-a4fd-667a62dba5dd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8b4da7d2'), local_subscribe_addr='ipc:///tmp/03a0090e-ee69-42d1-a8d5-d900afccb3b0', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f40c3804'), local_subscribe_addr='ipc:///tmp/1fe57a13-1ef5-480f-a578-f67b4f015124', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=213) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=212) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_468dc0a1'), local_subscribe_addr='ipc:///tmp/0cdef032-c307-4c3a-b4df-1c75d7eca008', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B...
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt']
(VllmWorker rank=0 pid=212) Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt']
(VllmWorker rank=0 pid=212) Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.33it/s]
(VllmWorker rank=0 pid=212) Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.38it/s]
(VllmWorker rank=0 pid=212) Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.37it/s]
(VllmWorker rank=0 pid=212)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.worker.load_model()
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.model_runner.load_model()
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1776, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.model = model_loader.load_model(
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.load_weights(model, model_config)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] loaded_weights = model.load_weights(
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 256, in load_weights
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] return seq_cls_model_loader(self, weights)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 375, in seq_cls_model_loader
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] return SEQ_CLS_LOAD_METHODS[method](model, weights)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] model.score.weight.data.copy_(weight)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
84487e05a96e:213:213 [1] NCCL INFO cudaDriverVersion 12090
84487e05a96e:213:213 [1] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
84487e05a96e:213:213 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
84487e05a96e:213:213 [1] NCCL INFO NET/IB : No device found.
84487e05a96e:213:213 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
84487e05a96e:213:213 [1] NCCL INFO Using network Socket
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init START
84487e05a96e:213:213 [1] NCCL INFO RAS client listening socket at ::1<28028>
84487e05a96e:213:213 [1] NCCL INFO Bootstrap timings total 0.001015 (create 0.000035, send 0.000131, recv 0.000371, ring 0.000024, delay 0.000000)
84487e05a96e:213:213 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
[2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order
[2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order
84487e05a96e:213:213 [1] NCCL INFO comm 0xe54d5d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
84487e05a96e:213:213 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
84487e05a96e:213:213 [1] NCCL INFO P2P Chunksize set to 131072
84487e05a96e:213:262 [1] NCCL INFO [Proxy Service] Device 1 CPU core 7
84487e05a96e:213:265 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 9
84487e05a96e:213:213 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
84487e05a96e:213:213 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
84487e05a96e:213:213 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
84487e05a96e:213:213 [1] NCCL INFO Connected all trees
84487e05a96e:213:266 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 12
84487e05a96e:213:213 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
84487e05a96e:213:213 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
84487e05a96e:213:213 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init COMPLETE
84487e05a96e:213:213 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00)
84487e05a96e:212:212 [0] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO cudaDriverVersion 12090
84487e05a96e:212:212 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
84487e05a96e:212:212 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
84487e05a96e:212:212 [0] NCCL INFO NET/IB : No device found.
84487e05a96e:212:212 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
84487e05a96e:212:212 [0] NCCL INFO Using network Socket
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init START
84487e05a96e:212:212 [0] NCCL INFO RAS client listening socket at ::1<28028>
84487e05a96e:212:212 [0] NCCL INFO Bootstrap timings total 0.001006 (create 0.000029, send 0.000126, recv 0.000313, ring 0.000026, delay 0.000000)
84487e05a96e:212:212 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
[2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order
[2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order
84487e05a96e:212:212 [0] NCCL INFO comm 0xe64fff0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
84487e05a96e:212:212 [0] NCCL INFO Channel 00/02 : 0 1
84487e05a96e:212:212 [0] NCCL INFO Channel 01/02 : 0 1
84487e05a96e:212:212 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
84487e05a96e:212:212 [0] NCCL INFO P2P Chunksize set to 131072
84487e05a96e:212:212 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
84487e05a96e:212:264 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 9
84487e05a96e:212:263 [0] NCCL INFO [Proxy Service] Device 0 CPU core 7
84487e05a96e:212:212 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
84487e05a96e:212:212 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
84487e05a96e:212:212 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
84487e05a96e:212:212 [0] NCCL INFO Connected all trees
84487e05a96e:212:267 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 13
84487e05a96e:212:212 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
84487e05a96e:212:212 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
84487e05a96e:212:212 [0] NCCL INFO CC Off, workFifoBytes 1048576
84487e05a96e:212:212 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init COMPLETE
84487e05a96e:212:212 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00)
[rank0]:[W709 00:57:29.669899852 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 07-09 00:57:30 [core.py:586] EngineCore failed to start.
ERROR 07-09 00:57:30 [core.py:586] Traceback (most recent call last):
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
ERROR 07-09 00:57:30 [core.py:586] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
ERROR 07-09 00:57:30 [core.py:586] super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
ERROR 07-09 00:57:30 [core.py:586] self.model_executor = executor_class(vllm_config)
ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-09 00:57:30 [core.py:586] self._init_executor()
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
ERROR 07-09 00:57:30 [core.py:586] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
ERROR 07-09 00:57:30 [core.py:586] raise e from None
ERROR 07-09 00:57:30 [core.py:586] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1495, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
return AsyncMPClient(*client_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 666, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 403, in __init__
with launch_core_engines(vllm_config, executor_class,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
wait_for_engine_startup(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Your current environment
The output of python collect_env.py
Collecting environment information...
============================== System Info ==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : version 4.0.3
Libc version : glibc-2.35
============================== PyTorch Info ==============================
PyTorch version : 2.7.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A
============================== Python Environment ==============================
Python version : 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-5.15.0-43-generic-x86_64-with-glibc2.35
============================== CUDA / GPU Info ==============================
Is CUDA available : True
CUDA runtime version : 12.8.93
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA GeForce RTX 3080
GPU 1: NVIDIA GeForce RTX 3080
Nvidia driver version : 575.64.03
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
============================== CPU Info ==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 2
Stepping: 7
BogoMIPS: 6000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 128 MiB (32 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
============================== Versions of relevant libraries ==============================
[pip3] flashinfer-python==0.2.6.post1+cu128torch2.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchvision==0.22.0+cu128
[pip3] transformers==4.53.1
[pip3] triton==3.3.0
[conda] Could not collect
============================== vLLM Info ==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.2
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB 0-31 0-1 N/A
GPU1 PHB X 0-31 0-1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
============================== Environment Variables ==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
NCCL_VERSION=2.25.1-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.8.1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
LD_LIBRARY_PATH=/usr/local/cuda/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page (https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.