[Bug]: ValidationError when loading fp8-dynamic model with empty "sparsity_config" #12044

@lksj92hs

Description

Your current environment

OS: Ubuntu Server 22.04 LTS
GPU: Nvidia H200
Driver: 550.127.08

The output of `pip list`
```
Package                            Version
---------------------------------  -------------------
aiohappyeyeballs  2.4.4
aiohttp  3.11.11
aiohttp-cors  0.7.0
aiosignal  1.3.2
airportsdata  20241001
annotated-types  0.7.0
anyio  4.8.0
argcomplete  3.4.0
astor  0.8.1
attrs  24.3.0
bitsandbytes  0.45.0
blake3  1.0.2
cachetools  5.5.0
certifi  2024.12.14
charset-normalizer  3.4.1
click  8.1.8
cloudpickle  3.1.0
colorful  0.5.6
compressed-tensors  0.8.1
datasets  3.2.0
depyf  0.18.0
dill  0.3.8
diskcache  5.6.3
distlib  0.3.9
distro  1.9.0
einops  0.8.0
fastapi  0.115.6
filelock  3.16.1
flashinfer  0.1.6+cu121torch2.4
frozenlist  1.5.0
fsspec  2024.9.0
gguf  0.10.0
google-api-core  2.24.0
google-auth  2.37.0
googleapis-common-protos  1.66.0
grpcio  1.57.0
grpcio-tools  1.57.0
h11  0.14.0
httpcore  1.0.7
httptools  0.6.4
httpx  0.28.1
huggingface-hub  0.24.5
idna  3.10
importlib_metadata  8.5.0
iniconfig  2.0.0
interegular  0.3.3
Jinja2  3.1.5
jiter  0.8.2
jsonschema  4.23.0
jsonschema-specifications  2024.10.1
lark  1.2.2
linkify-it-py  2.0.3
llvmlite  0.43.0
lm-format-enforcer  0.10.9
markdown-it-py  3.0.0
MarkupSafe  3.0.2
mdit-py-plugins  0.4.2
mdurl  0.1.2
memray  1.15.0
mistral_common  1.5.1
mpmath  1.3.0
msgpack  1.1.0
msgspec  0.19.0
multidict  6.1.0
multiprocess  0.70.16
nest-asyncio  1.6.0
networkx  3.4.2
numba  0.60.0
numpy  1.26.4
nvidia-cublas-cu12  12.4.5.8
nvidia-cuda-cupti-cu12  12.4.127
nvidia-cuda-nvrtc-cu12  12.4.127
nvidia-cuda-runtime-cu12  12.4.127
nvidia-cudnn-cu12  9.1.0.70
nvidia-cufft-cu12  11.2.1.3
nvidia-curand-cu12  10.3.5.147
nvidia-cusolver-cu12  11.6.1.9
nvidia-cusparse-cu12  12.3.1.170
nvidia-ml-py  12.560.30
nvidia-nccl-cu12  2.21.5
nvidia-nvjitlink-cu12  12.4.127
nvidia-nvtx-cu12  12.4.127
openai  1.59.7
opencensus  0.11.4
opencensus-context  0.1.3
opencv-python-headless  4.10.0.84
outlines  0.1.11
outlines_core  0.1.26
packaging  24.2
pandas  2.2.3
partial-json-parser  0.2.1.1.post5
pillow  10.4.0
pip  24.3.1
platformdirs  4.3.6
pluggy  1.5.0
prometheus_client  0.21.1
prometheus-fastapi-instrumentator  7.0.0
propcache  0.2.1
proto-plus  1.25.0
protobuf  4.25.5
psutil  6.1.1
py-cpuinfo  9.0.0
py-spy  0.4.0
py3nvml  0.2.7
pyairports  2.1.1
pyarrow  18.1.0
pyasn1  0.6.1
pyasn1_modules  0.4.1
pybind11  2.13.6
pycountry  24.6.1
pydantic  2.10.5
pydantic_core  2.27.2
Pygments  2.19.1
PyJWT  2.7.0
pytest  8.3.4
python-dateutil  2.9.0.post0
python-dotenv  1.0.1
pytz  2024.2
PyYAML  6.0.2
pyzmq  26.2.0
ray  2.40.0
referencing  0.35.1
regex  2024.11.6
requests  2.32.3
rich  13.9.4
rpds-py  0.22.3
rsa  4.9
safetensors  0.5.2
sentencepiece  0.2.0
setuptools  75.8.0
six  1.17.0
smart-open  7.1.0
sniffio  1.3.1
starlette  0.41.3
sympy  1.13.1
textual  1.0.0
tiktoken  0.7.0
tokenizers  0.21.0
torch  2.5.1
torchvision  0.20.1
tqdm  4.67.1
transformers  4.48.0
triton  3.1.0
typing_extensions  4.12.2
tzdata  2024.2
uc-micro-py  1.0.3
urllib3  2.3.0
uvicorn  0.34.0
uvloop  0.21.0
virtualenv  20.28.1
vllm  0.6.6.post1
watchfiles  1.0.4
websockets  14.1
wheel  0.45.1
wrapt  1.17.2
xformers  0.0.28.post3
xgrammar  0.1.9
xmltodict  0.14.2
xxhash  3.5.0
yarl  1.18.3
zipp  3.21.0
```

Model Input Dumps

No response

🐛 Describe the bug

When starting vllm like this:

```
python -m vllm.entrypoints.openai.api_server --model /models/llama-3.3-70b-instruct-fp8-dynamic --host localhost --port 10000
```

The following error occurs:

```
INFO 01-14 13:24:33 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-14 13:24:33 api_server.py:713] args: Namespace(host='localhost', port=10000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/textgen_cache/models/llama-3.3-70b-instruct-fp8-dynamic', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 01-14 13:24:33 api_server.py:199] Started engine process with PID 11114
INFO 01-14 13:24:37 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
WARNING 01-14 13:24:38 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 01-14 13:24:38 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 774, in <module>
    uvloop.run(run_server(args))
  File "/home/ubuntu/venv/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
  File "/usr/lib/python3.11/asyncio/runners.py", line 120, in run
    return self._loop.run_until_complete(task)
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/ubuntu/venv/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 740, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.11/contextlib.py", line 204, in __aenter__
    return await anext(self.gen)
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 118, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.11/contextlib.py", line 204, in __aenter__
    return await anext(self.gen)
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 1238, in create_engine_config
    config = VllmConfig(
  File "<string>", line 18, in __init__
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/config.py", line 3114, in __post_init__
    self.quant_config = VllmConfig._get_quantization_config(
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/config.py", line 3058, in _get_quantization_config
    quant_config = get_quant_config(model_config, load_config)
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 151, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 96, in from_config
    sparsity_scheme_map = cls._sparsity_scheme_map_from_config(
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 119, in _sparsity_scheme_map_from_config
    sparsity_config = SparsityCompressionConfig.model_validate(
  File "/home/ubuntu/venv/lib/python3.11/site-packages/pydantic/main.py", line 627, in model_validate
    return cls.__pydantic_validator__.validate_python(
pydantic_core._pydantic_core.ValidationError: 1 validation error for SparsityCompressionConfig
format
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
```
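The failure is easy to reproduce outside vLLM. The snippet below uses a minimal stand-in pydantic model (not the real compressed-tensors `SparsityCompressionConfig`); the only assumption, taken from the error message above, is that `format` is a required field, so validating the empty dict from `"sparsity_config": {}` fails the same way:

```python
from pydantic import BaseModel, ValidationError


# Hypothetical stand-in for SparsityCompressionConfig; the only property
# borrowed from the traceback is that "format" is a required field.
class SparsityConfigStandIn(BaseModel):
    format: str


try:
    # This mirrors what vLLM 0.6.6 does with the model's "sparsity_config": {}
    SparsityConfigStandIn.model_validate({})
except ValidationError as err:
    # Prints: 1 validation error ... format Field required [type=missing, ...]
    print(err)
```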

The model in question is cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic from Hugging Face: https://huggingface.co/cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic/blob/main/config.json

The error does not occur with vLLM 0.6.4.post1 or 0.6.5; it starts happening with 0.6.6.

When the line containing "sparsity_config": {} is removed from the model's config.json, the error no longer occurs and the model works fine, even with 0.6.6.post1. While that can serve as a workaround, it would be better to fix this in vLLM, since there are potentially many published models with an empty sparsity_config; a sketch of the kind of guard that could handle it is shown below. The model's full config.json follows after that, together with a small script for stripping the key locally.
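For illustration only, here is a minimal sketch of such a guard, assuming a helper that receives the raw quantization_config dict and treats a missing and an empty "sparsity_config" the same way. The names are hypothetical and this is not the actual vLLM code:

```python
from typing import Any, Optional

from pydantic import BaseModel


class SparsityConfigStandIn(BaseModel):
    """Hypothetical stand-in for SparsityCompressionConfig (see above)."""
    format: str


def parse_sparsity_config(
        hf_quant_config: dict[str, Any]) -> Optional[SparsityConfigStandIn]:
    """Illustrative guard: skip validation when "sparsity_config" is
    missing *or* empty, instead of validating the empty dict."""
    sparsity_config = hf_quant_config.get("sparsity_config")
    if not sparsity_config:  # covers both None and {}
        return None
    return SparsityConfigStandIn.model_validate(sparsity_config)


# With the guard, the empty dict from the model's config.json is ignored:
assert parse_sparsity_config({"sparsity_config": {}}) is None
```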

{ "_name_or_path": "/output/Llama-3.3-70B-Instruct-FP8-Dynamic", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": [ 128001, 128008, 128009 ], "head_dim": 128, "hidden_act": "silu", "hidden_size": 8192, "initializer_range": 0.02, "intermediate_size": 28672, "max_position_embeddings": 131072, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 64, "num_hidden_layers": 80, "num_key_value_heads": 8, "pretraining_tp": 1, "quantization_config": { "config_groups": { "group_0": { "input_activations": { "actorder": null, "block_structure": null, "dynamic": true, "group_size": null, "num_bits": 8, "observer": null, "observer_kwargs": {}, "strategy": "token", "symmetric": true, "type": "float" }, "output_activations": null, "targets": [ "Linear" ], "weights": { "actorder": null, "block_structure": null, "dynamic": false, "group_size": null, "num_bits": 8, "observer": "minmax", "observer_kwargs": {}, "strategy": "channel", "symmetric": true, "type": "float" } } }, "format": "float-quantized", "global_compression_ratio": 1.463543865167781, "ignore": [ "lm_head" ], "kv_cache_scheme": null, "quant_method": "compressed-tensors", "quantization_status": "compressed", "sparsity_config": {} }, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 8.0, "high_freq_factor": 4.0, "low_freq_factor": 1.0, "original_max_position_embeddings": 8192, "rope_type": "llama3" }, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.47.0", "use_cache": true, "vocab_size": 128256 } 

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
