
Conversation

@pavanimajety (Collaborator) commented Dec 19, 2024

Datatype and wrapper changes for FlashInfer 0.2.0

Related: #11194

@github-actions commented:

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run the other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@pavanimajety marked this pull request as draft on December 19, 2024 01:00
…anged Signed-off-by: Pavani Majety <pmajety@nvidia.com>
@pavanimajety force-pushed the flashinfer-0.2-changes branch from c1e4b21 to 5439e7d on December 19, 2024 02:29
@pavanimajety marked this pull request as ready for review on December 19, 2024 02:37
@JaheimLee commented:

I found that FlashInfer 0.2.0 uses more memory on rank 0 when tp > 1. I built it from source in AOT mode. Is that normal?
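
For reference, a minimal sketch of how one could compare per-GPU usage after engine startup; it assumes two visible CUDA devices and uses only torch.cuda.mem_get_info:

import torch

# Print how much memory is in use on each visible GPU, to compare rank 0
# against the other tensor-parallel ranks (mem_get_info reports device-wide
# free/total bytes, so it also counts memory held by worker processes).
for dev in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(dev)
    used_gib = (total_bytes - free_bytes) / 1024**3
    total_gib = total_bytes / 1024**3
    print(f"cuda:{dev}: {used_gib:.2f} GiB used of {total_gib:.2f} GiB")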

@pavanimajety (Collaborator, Author) commented:

@JaheimLee Seems like we have a fix; we'll update to FlashInfer 0.2.0.post1. Thanks!
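
A quick way to confirm which FlashInfer build actually ends up installed; this is a minimal sketch that assumes the distribution is published under the name "flashinfer" (adjust the name if your wheel differs):

import importlib.metadata

try:
    # Distribution name assumed to be "flashinfer"; change it if the wheel
    # you installed uses a different name.
    print("flashinfer version:", importlib.metadata.version("flashinfer"))
except importlib.metadata.PackageNotFoundError:
    print("flashinfer is not installed in this environment")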

@JaheimLee commented:

> @JaheimLee Seems like we have a fix; we'll update to FlashInfer 0.2.0.post1. Thanks!

I still have this problem, and I got another error:

INFO 12-25 22:54:21 config.py:478] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 12-25 22:54:22 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-25 22:54:22 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 12-25 22:54:22 config.py:1216] Defaulting to use mp for distributed inference
WARNING 12-25 22:54:22 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 12-25 22:54:22 config.py:604] Async output processing is not supported on the current platform type cuda.
INFO 12-25 22:54:22 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, num_scheduler_steps=8, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 12-25 22:54:22 multiproc_worker_utils.py:280] CUDA was previously initialized. We must use the `spawn` multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing for more information.
WARNING 12-25 22:54:22 multiproc_worker_utils.py:312] Reducing Torch parallelism from 28 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 12-25 22:54:22 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 12-25 22:54:22 selector.py:155] Using Flashinfer backend.
WARNING 12-25 22:54:22 registry.py:262] `mm_limits` has already been set for model=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, and will be overwritten by the new values.
INFO 12-25 22:54:37 config.py:478] This model supports multiple tasks: {'generate', 'classify', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 12-25 22:54:38 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-25 22:54:38 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 12-25 22:54:38 config.py:1216] Defaulting to use mp for distributed inference
WARNING 12-25 22:54:38 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 12-25 22:54:38 config.py:604] Async output processing is not supported on the current platform type cuda.
INFO 12-25 22:54:38 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, num_scheduler_steps=8, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 131, in _main
    prepare(preparation_data)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 287, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/data/lijinghui/uv_projects/LLM/test.py", line 19, in <module>
    llm = LLM(
          ^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/utils.py", line 990, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 230, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 532, in from_engine_args
    engine = cls(
             ^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
    super().__init__(*args, **kwargs)
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 36, in __init__
    self._init_executor()
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_gpu_executor.py", line 58, in _init_executor
    worker = ProcessWorkerWrapper(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 167, in __init__
    self.process.start()
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 140, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html
Exception ignored in: <function LLM.__del__ at 0x7f1a6c5fc180>
Traceback (most recent call last):
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 236, in __del__
    if self.llm_engine and hasattr(self.llm_engine, "shutdown"):
       ^^^^^^^^^^^^^^^
AttributeError: 'LLM' object has no attribute 'llm_engine'
ERROR 12-25 22:54:39 multiproc_worker_utils.py:123] Worker VllmWorkerProcess pid 2759640 died, exit code: 1
INFO 12-25 22:54:39 multiproc_worker_utils.py:127] Killing local vLLM worker processes

Here is my code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "1"

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen2.5-72B-Instruct-AWQ"
model_path = os.path.join("/data/pretrained_models", model_name)

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Pass the default decoding hyperparameters of Qwen2.5-7B-Instruct
# max_tokens is for the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7,
                                 top_p=0.8,
                                 repetition_penalty=1.05,
                                 max_tokens=512)

# Input the model name or path. Can be GPTQ or AWQ models.
llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.97,
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
    enforce_eager=True,
    enable_prefix_caching=True,
    num_scheduler_steps=8,
)

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
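
The RuntimeError above is CPython's standard spawn-mode message: because vLLM falls back to the 'spawn' start method here, the worker processes re-import test.py, so engine construction has to sit behind a main guard. A minimal sketch of that change, reusing the arguments from the script above (whether it also affects the rank-0 memory growth is a separate question):

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "1"

from vllm import LLM, SamplingParams


def main():
    # Engine construction spawns the tensor-parallel workers, so it must run
    # only in the parent process, not during the re-import done by 'spawn'.
    llm = LLM(
        model="/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ",
        gpu_memory_utilization=0.97,
        tensor_parallel_size=2,
        kv_cache_dtype="fp8",
        enforce_eager=True,
        enable_prefix_caching=True,
        num_scheduler_steps=8,
    )
    sampling_params = SamplingParams(temperature=0.7, top_p=0.8,
                                     repetition_penalty=1.05, max_tokens=512)
    outputs = llm.generate(["Tell me something about large language models."],
                           sampling_params)
    for output in outputs:
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()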
@mergify (bot) commented Mar 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @pavanimajety.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify (bot) added the needs-rebase label on Mar 13, 2025
@mgoin (Member) commented Apr 22, 2025

Out of date

@mgoin closed this on Apr 22, 2025

3 participants