
Conversation

@simon-mo (Collaborator) commented Aug 17, 2025

This is failing in the nightly test: https://buildkite.com/vllm/ci/builds/27350/steps/canvas?sid=0198b8bc-aa83-48ad-baca-8c2f92971518#0198b9f4-4abb-4ae2-9b5b-95f123fe5447/43-3098. It is caused by a missed refactoring in 8ad7285#diff-4cd107be9da84ea39a46c289542823c40f5627b86e4815ef2ee2e84e9971e8c9.
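In short, the refactor made `moe` a required constructor argument of `GPTQMarlinMoEMethod`, but the `get_moe_quant_method` call site in `gptq_marlin.py` still constructs the method from the cloned config alone, so Mixtral GPTQ models crash during model load. A minimal, self-contained sketch of the failure mode (the class and function names mirror the traceback below; the bodies and signatures are illustrative, not the actual vLLM code):

```python
# Illustrative stand-ins only -- not the real vLLM classes.

class MoEConfig:
    """Stand-in for the MoE configuration object introduced by the refactor."""


class GPTQMarlinMoEMethod:
    # After the refactor, `moe` is a required positional argument.
    def __init__(self, quant_config, moe: MoEConfig):
        self.quant_config = quant_config
        self.moe = moe


def get_moe_quant_method(cloned_config):
    # Pre-fix call site: the new `moe` argument is never forwarded.
    return GPTQMarlinMoEMethod(cloned_config)


try:
    get_moe_quant_method({"quant": "gptq_marlin"})
except TypeError as e:
    # Prints: GPTQMarlinMoEMethod.__init__() missing 1 required
    # positional argument: 'moe'
    print(e)
```

This is the same TypeError that ends the worker traceback in the first log below; the second log shows the same test passing after the fix.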

(vllm) simonmo@gcp5-h100-3-9:~/vllm/tests/weight_loading$ cd .. (vllm) simonmo@gcp5-h100-3-9:~/vllm/tests$ bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt === SKIPPING MODEL: #compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-quantized, main === === SKIPPING MODEL: #compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-channel-quantized, main === === SKIPPING MODEL: #compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W8A16-quantized, main === === SKIPPING MODEL: #compressed-tensors, nm-testing/test-w4a16-mixtral-actorder-group, main === === SKIPPING MODEL: #gptq_marlin, TheBloke/Mixtral-8x7B-v0.1-GPTQ, main === === SKIPPING MODEL: #gptq_marlin, TheBloke/Mixtral-8x7B-v0.1-GPTQ, gptq-8bit-128g-actorder_True === === SKIPPING MODEL: #awq_marlin, casperhansen/deepseek-coder-v2-instruct-awq, main === === SKIPPING MODEL: #compressed-tensors, RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16, main === === RUNNING MODEL: gptq_marlin, /mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq, main === INFO 08-17 22:39:02 [__init__.py:241] Automatically detected platform cuda. ===================================================================================================== test session starts ===================================================================================================== platform linux -- Python 3.11.11, pytest-8.4.1, pluggy-1.6.0 rootdir: /home/simonmo/vllm configfile: pyproject.toml plugins: anyio-4.10.0 collecting ... WARNING 08-17 22:39:04 [interface.py:523] Current platform cuda does not have '__test__' attribute. WARNING 08-17 22:39:04 [interface.py:523] Current platform cuda does not have '__bases__' attribute. WARNING 08-17 22:39:04 [interface.py:523] Current platform cuda does not have '__test__' attribute. collected 1 item weight_loading/test_weight_loading.py INFO 08-17 22:39:04 [utils.py:326] non-default args: {'model': '/mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq', 'trust_remote_code': True, 'seed': 0, 'max_model_len': 1024, 'tensor_parallel_size': 2, 'block_size': 16, 'disable_log_stats': True, 'revision': 'main', 'quantization': 'gptq_marlin', 'enable_chunked_prefill': False} The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored. INFO 08-17 22:39:25 [__init__.py:711] Resolved architecture: MixtralForCausalLM INFO 08-17 22:39:25 [__init__.py:1750] Using max model len 1024 INFO 08-17 22:39:26 [gptq_marlin.py:170] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel. INFO 08-17 22:39:28 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=16384. (EngineCore_0 pid=3927913) INFO 08-17 22:39:29 [core.py:636] Waiting for init message from front-end. 
(EngineCore_0 pid=3927913) INFO 08-17 22:39:29 [core.py:74] Initializing a V1 LLM engine (v0.10.1.dev680+g79899b63f.d20250817) with config: model='/mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq', speculative_config=None, tokenizer='/mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config={}, tokenizer_revision=main, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null} (EngineCore_0 pid=3927913) WARNING 08-17 22:39:29 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 104 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. 
(EngineCore_0 pid=3927913) INFO 08-17 22:39:29 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_e1bb4bde'), local_subscribe_addr='ipc:///tmp/24bcfdb6-c2d9-4704-857b-a395797956ce', remote_subscribe_addr=None, remote_addr_ipv6=False) (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:32 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_7d2c161b'), local_subscribe_addr='ipc:///tmp/9ea0a084-3806-4339-a969-eff904cab021', remote_subscribe_addr=None, remote_addr_ipv6=False) (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:32 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_ed5ae9e9'), local_subscribe_addr='ipc:///tmp/15b28b24-1117-48b1-851a-954ce547864e', remote_subscribe_addr=None, remote_addr_ipv6=False) (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:33 [__init__.py:1418] Found nccl from library libnccl.so.2 (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:33 [pynccl.py:70] vLLM is using nccl==2.26.2 (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:33 [__init__.py:1418] Found nccl from library libnccl.so.2 (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:33 [pynccl.py:70] vLLM is using nccl==2.26.2 (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:34 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report. (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:34 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report. (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:34 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_94808a41'), local_subscribe_addr='ipc:///tmp/e0305bbc-cdda-4ebc-92fe-87e19baeef8a', remote_subscribe_addr=None, remote_addr_ipv6=False) (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:34 [parallel_state.py:1134] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0 (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:34 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1 (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:34 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling. (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:34 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling. (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:34 [gpu_model_runner.py:1951] Starting to load model /mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq... (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:34 [gpu_model_runner.py:1951] Starting to load model /mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq... (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:34 [gpu_model_runner.py:1983] Loading model from scratch... 
(EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:34 [gptq_marlin.py:266] Using MacheteLinearKernel for GPTQMarlinLinearMethod (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:35 [gpu_model_runner.py:1983] Loading model from scratch... (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:35 [gptq_marlin.py:266] Using MacheteLinearKernel for GPTQMarlinLinearMethod (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) INFO 08-17 22:39:35 [cuda.py:328] Using Flash Attention backend on V1 engine. (EngineCore_0 pid=3927913) (VllmWorker TP0 pid=3927919) INFO 08-17 22:39:35 [cuda.py:328] Using Flash Attention backend on V1 engine. (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] WorkerProc failed to start. (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] Traceback (most recent call last): (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/v1/executor/multiproc_executor.py", line 533, in worker_main (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] worker = WorkerProc(*args, **kwargs) (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/v1/executor/multiproc_executor.py", line 402, in __init__ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] self.worker.load_model() (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/v1/worker/gpu_worker.py", line 212, in load_model (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] self.model_runner.load_model(eep_scale_up=eep_scale_up) (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/v1/worker/gpu_model_runner.py", line 1984, in load_model (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] self.model = model_loader.load_model( (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] model = initialize_model(vllm_config=vllm_config, (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] return model_class(vllm_config=vllm_config, prefix=prefix) (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 
[multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/models/mixtral.py", line 442, in __init__ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] self.model = MixtralModel(vllm_config=vllm_config, (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/compilation/decorators.py", line 183, in __init__ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs) (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/models/mixtral.py", line 278, in __init__ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] self.start_layer, self.end_layer, self.layers = make_layers( (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/models/utils.py", line 640, in make_layers (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] [PPMissingLayer() for _ in range(start_layer)] + [ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/models/utils.py", line 641, in <listcomp> (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}")) (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/models/mixtral.py", line 280, in <lambda> (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] lambda prefix: MixtralDecoderLayer( (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/models/mixtral.py", line 217, in __init__ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] self.block_sparse_moe = MixtralMoE( (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/models/mixtral.py", line 89, in __init__ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 
[multiproc_executor.py:559] self.experts = FusedMoE(num_experts=num_experts, (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 845, in __init__ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] else quant_config.get_quant_method(self, prefix)) (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 191, in get_quant_method (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] return get_moe_quant_method(self, layer, prefix, (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] File "/home/simonmo/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 59, in get_moe_quant_method (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] return moe_method_cls(cloned_config) (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) (VllmWorker TP1 pid=3927921) ERROR 08-17 22:39:35 [multiproc_executor.py:559] TypeError: GPTQMarlinMoEMethod.__init__() missing 1 required positional argument: 'moe' (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] EngineCore failed to start. 
(EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] Traceback (most recent call last): (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] File "/home/simonmo/vllm/vllm/v1/engine/core.py", line 691, in run_engine_core (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] engine_core = EngineCoreProc(*args, **kwargs) (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] File "/home/simonmo/vllm/vllm/v1/engine/core.py", line 492, in __init__ (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] super().__init__(vllm_config, executor_class, log_stats, (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] File "/home/simonmo/vllm/vllm/v1/engine/core.py", line 80, in __init__ (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] self.model_executor = executor_class(vllm_config) (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] File "/home/simonmo/vllm/vllm/executor/executor_base.py", line 54, in __init__ (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] self._init_executor() (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] File "/home/simonmo/vllm/vllm/v1/executor/multiproc_executor.py", line 96, in _init_executor (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] self.workers = WorkerProc.wait_for_ready(unready_workers) (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] File "/home/simonmo/vllm/vllm/v1/executor/multiproc_executor.py", line 472, in wait_for_ready (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] raise e from None (EngineCore_0 pid=3927913) ERROR 08-17 22:39:35 [core.py:700] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause. 
(EngineCore_0 pid=3927913) Process EngineCore_0: (EngineCore_0 pid=3927913) Traceback (most recent call last): (EngineCore_0 pid=3927913) File "/home/simonmo/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap (EngineCore_0 pid=3927913) self.run() (EngineCore_0 pid=3927913) File "/home/simonmo/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/multiprocessing/process.py", line 108, in run (EngineCore_0 pid=3927913) self._target(*self._args, **self._kwargs) (EngineCore_0 pid=3927913) File "/home/simonmo/vllm/vllm/v1/engine/core.py", line 704, in run_engine_core (EngineCore_0 pid=3927913) raise e (EngineCore_0 pid=3927913) File "/home/simonmo/vllm/vllm/v1/engine/core.py", line 691, in run_engine_core (EngineCore_0 pid=3927913) engine_core = EngineCoreProc(*args, **kwargs) (EngineCore_0 pid=3927913) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) File "/home/simonmo/vllm/vllm/v1/engine/core.py", line 492, in __init__ (EngineCore_0 pid=3927913) super().__init__(vllm_config, executor_class, log_stats, (EngineCore_0 pid=3927913) File "/home/simonmo/vllm/vllm/v1/engine/core.py", line 80, in __init__ (EngineCore_0 pid=3927913) self.model_executor = executor_class(vllm_config) (EngineCore_0 pid=3927913) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) File "/home/simonmo/vllm/vllm/executor/executor_base.py", line 54, in __init__ (EngineCore_0 pid=3927913) self._init_executor() (EngineCore_0 pid=3927913) File "/home/simonmo/vllm/vllm/v1/executor/multiproc_executor.py", line 96, in _init_executor (EngineCore_0 pid=3927913) self.workers = WorkerProc.wait_for_ready(unready_workers) (EngineCore_0 pid=3927913) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_0 pid=3927913) File "/home/simonmo/vllm/vllm/v1/executor/multiproc_executor.py", line 472, in wait_for_ready (EngineCore_0 pid=3927913) raise e from None (EngineCore_0 pid=3927913) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause. /home/simonmo/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' F ========================================================================================================== FAILURES =========================================================================================================== _____________________________________________________________________________________________________ test_weight_loading _____________________________________________________________________________________________________ vllm_runner = <class 'tests.conftest.VllmRunner'> @pytest.mark.skipif( MODEL_NAME == "casperhansen/deepseek-coder-v2-instruct-awq", reason="OOM in the CI") @pytest.mark.skipif( not current_platform.has_device_capability(int(MIN_CAPABILITY)), reason="Current system does not have minimum capability.") def test_weight_loading(vllm_runner): """ Test parameter weight loading with tp>1. """ # MoE models need fp16. 
NEEDS_FP16 = (QUANTIZATION == "gptq" or MODEL_NAME == "nm-testing/test-w4a16-mixtral-actorder-group") > with vllm_runner( model_name=MODEL_NAME, revision=REVISION, dtype=torch.half if NEEDS_FP16 else "auto", quantization=None if QUANTIZATION == "None" else QUANTIZATION, max_model_len=MAX_MODEL_LEN, tensor_parallel_size=2) as model: weight_loading/test_weight_loading.py:33: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ conftest.py:788: in __init__ self.llm = LLM( ../vllm/entrypoints/llm.py:285: in __init__ self.llm_engine = LLMEngine.from_engine_args( ../vllm/engine/llm_engine.py:490: in from_engine_args return engine_cls.from_vllm_config( ../vllm/v1/engine/llm_engine.py:127: in from_vllm_config return cls(vllm_config=vllm_config, ../vllm/v1/engine/llm_engine.py:104: in __init__ self.engine_core = EngineCoreClient.make_client( ../vllm/v1/engine/core_client.py:80: in make_client return SyncMPClient(vllm_config, executor_class, log_stats) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ../vllm/v1/engine/core_client.py:600: in __init__ super().__init__( ../vllm/v1/engine/core_client.py:446: in __init__ with launch_core_engines(vllm_config, executor_class, ../../.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py:144: in __exit__ next(self.gen) ../vllm/v1/engine/utils.py:706: in launch_core_engines wait_for_engine_startup( _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ handshake_socket = <zmq.Socket(zmq.ROUTER) at 0x790624e0b1c0 closed> addresses = EngineZmqAddresses(inputs=['ipc:///tmp/552117b7-314e-42a5-b7bf-92b58a2233b8'], outputs=['ipc:///tmp/47c499c8-a43c-4172-8278-3bfbc1954613'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None) core_engines = [<vllm.v1.engine.utils.CoreEngine object at 0x790624c40810>] parallel_config = ParallelConfig(pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, data_parallel_size_local=1, dat...r', sd_worker_cls='auto', worker_extension_cls='', world_size=2, rank=0, enable_multimodal_encoder_data_parallel=False) cache_config = CacheConfig(block_size=16, gpu_memory_utilization=0.9, swap_space=4.0, cache_dtype='auto', is_attention_free=False, nu...he_dtype='auto', mamba_ssm_cache_dtype='auto', num_gpu_blocks=None, num_cpu_blocks=None, kv_sharing_fast_prefill=False) proc_manager = <vllm.v1.engine.utils.CoreEngineProcManager object at 0x7906255a98d0>, coord_process = None def wait_for_engine_startup( handshake_socket: zmq.Socket, addresses: EngineZmqAddresses, core_engines: list[CoreEngine], parallel_config: ParallelConfig, cache_config: CacheConfig, proc_manager: Optional[CoreEngineProcManager], coord_process: Optional[Process], ): # Wait for engine core process(es) to send ready messages. 
local_count = parallel_config.data_parallel_size_local remote_count = len(core_engines) - local_count # [local, remote] counts conn_pending, start_pending = [local_count, remote_count], [0, 0] poller = zmq.Poller() poller.register(handshake_socket, zmq.POLLIN) remote_should_be_headless = not parallel_config.data_parallel_hybrid_lb \ and not parallel_config.data_parallel_external_lb if proc_manager is not None: for sentinel in proc_manager.sentinels(): poller.register(sentinel, zmq.POLLIN) if coord_process is not None: poller.register(coord_process.sentinel, zmq.POLLIN) while any(conn_pending) or any(start_pending): events = poller.poll(STARTUP_POLL_PERIOD_MS) if not events: if any(conn_pending): logger.debug( "Waiting for %d local, %d remote core engine proc(s) " "to connect.", *conn_pending) if any(start_pending): logger.debug( "Waiting for %d local, %d remote core engine proc(s) " "to start.", *start_pending) continue if len(events) > 1 or events[0][0] != handshake_socket: # One of the local core processes exited. finished = proc_manager.finished_procs() if proc_manager else {} if coord_process is not None and coord_process.exitcode is not None: finished[coord_process.name] = coord_process.exitcode > raise RuntimeError("Engine core initialization failed. " "See root cause above. " f"Failed core proc(s): {finished}") E RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_0': 1} ../vllm/v1/engine/utils.py:759: RuntimeError ====================================================================================================== warnings summary ======================================================================================================= tests/weight_loading/test_weight_loading.py::test_weight_loading <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute tests/weight_loading/test_weight_loading.py::test_weight_loading <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html =================================================================================================== short test summary info =================================================================================================== FAILED weight_loading/test_weight_loading.py::test_weight_loading - RuntimeError: Engine core initialization failed. See root cause above. 
Failed core proc(s): {'EngineCore_0': 1} =============================================================================================== 1 failed, 2 warnings in 31.53s ================================================================================================ sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute === FAILED MODEL: gptq_marlin, /mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq, main === (vllm) simonmo@gcp5-h100-3-9:~/vllm/tests$ (vllm) simonmo@gcp5-h100-3-9:~/vllm/tests$ bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt === SKIPPING MODEL: #compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-quantized, main === === SKIPPING MODEL: #compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-channel-quantized, main === === SKIPPING MODEL: #compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W8A16-quantized, main === === SKIPPING MODEL: #compressed-tensors, nm-testing/test-w4a16-mixtral-actorder-group, main === === SKIPPING MODEL: #gptq_marlin, TheBloke/Mixtral-8x7B-v0.1-GPTQ, main === === SKIPPING MODEL: #gptq_marlin, TheBloke/Mixtral-8x7B-v0.1-GPTQ, gptq-8bit-128g-actorder_True === === SKIPPING MODEL: #awq_marlin, casperhansen/deepseek-coder-v2-instruct-awq, main === === SKIPPING MODEL: #compressed-tensors, RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16, main === === RUNNING MODEL: gptq_marlin, /mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq, main === INFO 08-17 22:47:47 [__init__.py:241] Automatically detected platform cuda. ===================================================================================================== test session starts ===================================================================================================== platform linux -- Python 3.11.11, pytest-8.4.1, pluggy-1.6.0 rootdir: /home/simonmo/vllm configfile: pyproject.toml plugins: anyio-4.10.0 collecting ... WARNING 08-17 22:47:51 [interface.py:523] Current platform cuda does not have '__test__' attribute. WARNING 08-17 22:47:51 [interface.py:523] Current platform cuda does not have '__bases__' attribute. WARNING 08-17 22:47:51 [interface.py:523] Current platform cuda does not have '__test__' attribute. collected 1 item weight_loading/test_weight_loading.py INFO 08-17 22:47:51 [utils.py:326] non-default args: {'model': '/mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq', 'trust_remote_code': True, 'seed': 0, 'max_model_len': 1024, 'tensor_parallel_size': 2, 'block_size': 16, 'disable_log_stats': True, 'revision': 'main', 'quantization': 'gptq_marlin', 'enable_chunked_prefill': False} The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored. INFO 08-17 22:48:10 [__init__.py:711] Resolved architecture: MixtralForCausalLM INFO 08-17 22:48:10 [__init__.py:1750] Using max model len 1024 INFO 08-17 22:48:12 [gptq_marlin.py:170] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel. INFO 08-17 22:48:13 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=16384. (EngineCore_0 pid=3933741) INFO 08-17 22:48:14 [core.py:636] Waiting for init message from front-end. 
(EngineCore_0 pid=3933741) INFO 08-17 22:48:14 [core.py:74] Initializing a V1 LLM engine (v0.10.1.dev680+g79899b63f.d20250817) with config: model='/mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq', speculative_config=None, tokenizer='/mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config={}, tokenizer_revision=main, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null} (EngineCore_0 pid=3933741) WARNING 08-17 22:48:14 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 104 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. 
(EngineCore_0 pid=3933741) INFO 08-17 22:48:14 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_5386025c'), local_subscribe_addr='ipc:///tmp/5f0a429a-cbb7-4bce-90a8-e68e2be84de0', remote_subscribe_addr=None, remote_addr_ipv6=False) (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:16 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_76513ef8'), local_subscribe_addr='ipc:///tmp/3f25a6a4-17cf-42ca-b372-f1ed7b30ac7e', remote_subscribe_addr=None, remote_addr_ipv6=False) (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:16 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_cbda7948'), local_subscribe_addr='ipc:///tmp/04a01f1c-2824-4de3-ace8-060f5270a615', remote_subscribe_addr=None, remote_addr_ipv6=False) (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:17 [__init__.py:1418] Found nccl from library libnccl.so.2 (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:17 [__init__.py:1418] Found nccl from library libnccl.so.2 (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:17 [pynccl.py:70] vLLM is using nccl==2.26.2 (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:17 [pynccl.py:70] vLLM is using nccl==2.26.2 (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:18 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report. (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:18 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report. (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_4b9f0dcb'), local_subscribe_addr='ipc:///tmp/0919eae2-522b-4664-b922-c9b731dcb512', remote_subscribe_addr=None, remote_addr_ipv6=False) (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:18 [parallel_state.py:1134] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0 (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:18 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1 (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:18 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling. (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:18 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling. (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:18 [gpu_model_runner.py:1951] Starting to load model /mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq... (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:18 [gpu_model_runner.py:1951] Starting to load model /mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq... (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:18 [gpu_model_runner.py:1983] Loading model from scratch... 
(EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:18 [gptq_marlin.py:266] Using MacheteLinearKernel for GPTQMarlinLinearMethod (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:18 [cuda.py:328] Using Flash Attention backend on V1 engine. (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:18 [gpu_model_runner.py:1983] Loading model from scratch... (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:18 [gptq_marlin.py:266] Using MacheteLinearKernel for GPTQMarlinLinearMethod (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:18 [cuda.py:328] Using Flash Attention backend on V1 engine. Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.62s/it] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.62s/it] (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:21 [default_loader.py:262] Loading weights took 2.63 seconds (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:21 [default_loader.py:262] Loading weights took 2.60 seconds (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:22 [gpu_model_runner.py:2005] Model loading took 11.0802 GiB and 3.106979 seconds (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:22 [gpu_model_runner.py:2005] Model loading took 11.0802 GiB and 3.001256 seconds (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:38 [backends.py:548] Using cache directory: /home/simonmo/.cache/vllm/torch_compile_cache/719717795c/rank_1_0/backbone for vLLM's torch.compile (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:38 [backends.py:548] Using cache directory: /home/simonmo/.cache/vllm/torch_compile_cache/719717795c/rank_0_0/backbone for vLLM's torch.compile (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:38 [backends.py:559] Dynamo bytecode transform time: 15.90 s (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:38 [backends.py:559] Dynamo bytecode transform time: 15.94 s (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:48:43 [backends.py:194] Cache the graph for dynamic shape for later use (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:48:43 [backends.py:194] Cache the graph for dynamic shape for later use (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:49:18 [backends.py:215] Compiling a graph for dynamic shape takes 39.53 s (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:49:18 [backends.py:215] Compiling a graph for dynamic shape takes 40.07 s (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) WARNING 08-17 22:49:20 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/simonmo/vllm/vllm/model_executor/layers/fused_moe/configs/E=8,N=8192,device_name=NVIDIA_H100_80GB_HBM3.json'] (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) WARNING 08-17 22:49:20 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! 
Config file not found at ['/home/simonmo/vllm/vllm/model_executor/layers/fused_moe/configs/E=8,N=8192,device_name=NVIDIA_H100_80GB_HBM3.json'] (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:49:23 [monitor.py:34] torch.compile takes 55.48 s in total (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:49:23 [monitor.py:34] torch.compile takes 55.97 s in total (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:50:03 [gpu_worker.py:276] Available KV cache memory: 56.80 GiB (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:50:03 [gpu_worker.py:276] Available KV cache memory: 56.80 GiB (EngineCore_0 pid=3933741) INFO 08-17 22:50:04 [kv_cache_utils.py:849] GPU KV cache size: 930,592 tokens (EngineCore_0 pid=3933741) INFO 08-17 22:50:04 [kv_cache_utils.py:853] Maximum concurrency for 1,024 tokens per request: 894.80x (EngineCore_0 pid=3933741) INFO 08-17 22:50:04 [kv_cache_utils.py:849] GPU KV cache size: 930,592 tokens (EngineCore_0 pid=3933741) INFO 08-17 22:50:04 [kv_cache_utils.py:853] Maximum concurrency for 1,024 tokens per request: 894.80x Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:03<00:00, 19.14it/s] (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:50:08 [custom_all_reduce.py:196] Registering 4355 cuda graph addresses (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:50:08 [custom_all_reduce.py:196] Registering 4355 cuda graph addresses (EngineCore_0 pid=3933741) (VllmWorker TP1 pid=3933749) INFO 08-17 22:50:08 [gpu_model_runner.py:2706] Graph capturing finished in 4 secs, took 0.80 GiB (EngineCore_0 pid=3933741) (VllmWorker TP0 pid=3933747) INFO 08-17 22:50:08 [gpu_model_runner.py:2706] Graph capturing finished in 4 secs, took 0.80 GiB (EngineCore_0 pid=3933741) INFO 08-17 22:50:08 [core.py:214] init engine (profile, create kv cache, warmup model) took 106.06 seconds INFO 08-17 22:50:08 [llm.py:298] Supported_tasks: ['generate'] Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 1301.70it/s] Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 63.15it/s, est. 
speed input: 126.39 toks/s, output: 1263.90 toks/s] [([1, 382, 904, 481, 22062, 28725, 543, 11471, 2189, 340, 2635, 22796, 955, 427, 5589, 1772, 264, 543, 16111, 18687, 28725, 10129], 'Hoy en día, la mayoría de las personas que se dedican a la fotografía, ya'), ([1, 317, 28733, 4202, 365, 19322, 2487, 393, 28742, 830, 3772, 536, 634, 7585, 20408, 340, 543, 6380, 911, 340, 305, 28742], "e-PRO Bâtiment L'annuaire des professionnels de la construction et de l'"), ([1, 305, 327, 733, 28740, 28725, 28705, 28750, 28725, 28705, 28770, 28725, 28705, 28781, 28725, 28705, 28782, 28725, 28705, 28784, 28725, 28705], 'l = [1, 2, 3, 4, 5, 6, '), ([1, 305, 327, 733, 28740, 28725, 28705, 28750, 28725, 28705, 28770, 28725, 28705, 28781, 28725, 28705, 28782, 28725, 28705, 28784, 28725, 28705], 'l = [1, 2, 3, 4, 5, 6, '), ([1, 289, 2432, 28709, 28723, 3380, 857, 1065, 28730, 1441, 614, 28723, 8676, 647, 908, 325, 6351, 28731, 371, 13, 2287, 464], "odoo.define('pos_retail.models', function (require) {\n '"), ([1, 259, 415, 28705, 28750, 28734, 28740, 28783, 28733, 28750, 28734, 28740, 28774, 2052, 879, 349, 805, 298, 264, 1598, 1149, 28808], ' The 2018-2019 school year is off to a great start!'), ([1, 275, 527, 4449, 1508, 6222, 28723, 675, 28748, 28719, 5002, 28748, 266, 19736, 8863, 28748, 23682, 28748, 1013, 28713, 28748, 12586], 'wget https://github.com/microsoft/onnxruntime/archive/refs/tags'), ([1, 289, 2432, 28709, 28723, 3380, 857, 1065, 28730, 1441, 614, 28723, 8676, 647, 908, 325, 6351, 28731, 371, 13, 2287, 464], "odoo.define('pos_retail.models', function (require) {\n '"), ([1, 408, 18552, 13, 13, 2287, 851, 4597, 5876, 272, 875, 354, 272, 28705, 28750, 28757, 28733, 28770, 28757, 28733, 28750, 28757], 'r"""\n\n This module contains the class for the 2D-3D-2D'), ([1, 305, 327, 733, 28740, 28725, 28705, 28750, 28725, 28705, 28770, 28725, 28705, 28781, 28725, 28705, 28782, 28725, 28705, 28784, 28725, 28705], 'l = [1, 2, 3, 4, 5, 6, '), ([1, 281, 28770, 28723, 3371, 618, 1056, 28748, 1056, 28723, 3371, 548, 908, 28732, 1056, 28731, 371, 13, 13, 28705, 963, 9829], 'd3.json("data/data.json", function(data) {\n\n var margin'), ([1, 918, 28792, 335, 1661, 7453, 12362, 28732, 3606, 20244, 28705, 28774, 4753, 3409, 25531, 3597, 28793, 28767, 13, 13, 28755, 28723], '![if (!IE)|(gt IE 9)]><![endif]>\n\nM.')] . ====================================================================================================== warnings summary ======================================================================================================= tests/weight_loading/test_weight_loading.py::test_weight_loading <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute tests/weight_loading/test_weight_loading.py::test_weight_loading <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ========================================================================================== 1 passed, 2 warnings in 138.33s (0:02:18) ========================================================================================== sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute === PASSED MODEL: gptq_marlin, /mnt/localdisk/the-bloke-mixtral-8x7b-v0.1-gptq, main === 
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a critical bug that caused gptq_marlin weight loading to fail, as observed in nightly tests. The fix is a single-line change in vllm/model_executor/layers/quantization/gptq_marlin.py that correctly passes the moe_config argument when instantiating the GPTQMarlinMoEMethod. This resolves the TypeError and aligns the implementation with the method's signature. The change is correct, targeted, and effectively resolves the reported crash. No further issues were found in this change.
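For reference, the shape of that one-line fix is to also pass the layer's MoE configuration when constructing the MoE method. The sketch below is illustrative only: the `moe_config` attribute name follows the review text above, and the surrounding signatures are assumptions rather than the actual contents of vllm/model_executor/layers/quantization/gptq_marlin.py.

```python
# Illustrative stand-ins only -- not the real vLLM source.

class MoEConfig:
    """Stand-in for the FusedMoE layer's MoE configuration."""


class FusedMoELayer:
    def __init__(self):
        self.moe_config = MoEConfig()


class GPTQMarlinMoEMethod:
    def __init__(self, quant_config, moe: MoEConfig):
        self.quant_config = quant_config
        self.moe = moe


def get_moe_quant_method(cloned_config, layer):
    # Fixed call site: the required `moe` argument is now supplied from the
    # layer's MoE configuration (previously it was omitted entirely).
    return GPTQMarlinMoEMethod(cloned_config, layer.moe_config)


method = get_moe_quant_method({"quant": "gptq_marlin"}, FusedMoELayer())
assert isinstance(method.moe, MoEConfig)
```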

@simon-mo merged commit 0fc8fa7 into vllm-project:main Aug 17, 2025
7 of 11 checks passed
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
cyang49 pushed a commit to cyang49/vllm that referenced this pull request Aug 20, 2025
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
