[MM][Feat] Add support for audio in video in Qwen2.5-Omni #26334

wwl2755 · 2025-10-07T05:13:00Z

Fix some of #23888

Enable audio in video in Qwen2.5-Omni in V1 engine.

Same purpose as #26156, but using a different and simpler method from @ywang96. The basic idea is to create two placeholders, one for video and one for audio, with the same start_idx, and to use "is_embed" to differentiate them.

Basic flow

<|im_start|>user\n<|vision_bos|><|VIDEO|><|vision_eos|>Describe the content of the video<|im_end|>  # no audio placeholder in the prompt

-> "video": [
       PlaceholderFeaturesInfo(
           start_idx=4,
           tokens=[151659, 151655, 151655, 151654, 151654, 151660],
           is_embed=[False, True, True, False, False, False]
       )
   ]

-> "audio": [
       PlaceholderFeaturesInfo(
           start_idx=4,
           tokens=[151659, 151655, 151655, 151654, 151654, 151660],
           is_embed=[False, False, False, True, True, False]
       )
   ]

-> <|im_start|>user\n<|vision_bos|><|audio_bos|><|VIDEO|>*2<|AUDIO|>*2<|audio_eos|><|vision_eos|>Describe the content of the video<|im_end|>
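The two-placeholder idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the vLLM implementation: the `Placeholder` dataclass and its `embed_positions` helper are hypothetical stand-ins for `PlaceholderFeaturesInfo`, and the token IDs and masks are copied from the example above.

```python
from dataclasses import dataclass


@dataclass
class Placeholder:
    """Hypothetical stand-in for PlaceholderFeaturesInfo (not the vLLM class)."""

    start_idx: int
    tokens: list[int]
    is_embed: list[bool]

    def embed_positions(self) -> list[int]:
        # Absolute prompt positions that this modality fills with embeddings.
        return [self.start_idx + i for i, flag in enumerate(self.is_embed) if flag]


tokens = [151659, 151655, 151655, 151654, 151654, 151660]
video = Placeholder(4, tokens, [False, True, True, False, False, False])
audio = Placeholder(4, tokens, [False, False, False, True, True, False])

video_pos = video.embed_positions()
audio_pos = audio.embed_positions()

# Both placeholders cover the same token span, but their masks never overlap,
# so each position receives an embedding from at most one modality.
overlap = set(video_pos) & set(audio_pos)
print(video_pos, audio_pos, overlap)
```

Because the masks are disjoint, the shared start_idx does not cause the video and audio embeddings to compete for the same positions.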

Known limitation

This PR assumes that the number of video items exactly matches the number of audio items when use_audio_in_video is enabled, as in the example.

Test

python examples/offline_inference/qwen2_5_omni/only_thinker.py -q use_audio_in_video

INFO 10-09 04:02:38 [llm.py:340] Supported_tasks: ['generate']
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.42s/it]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.14s/it, est. speed input: 2370.76 toks/s, output: 80.69 toks/s]

The video shows a baby sitting on a bed, wearing glasses, and holding a book. The baby seems to be looking at the book and turning the pages. I'm not sure what the baby says, but it could be something like "book" or "read". So, the text of what the baby says is "book" or "read". If you have any other questions about the video or anything else, feel free to let me know.

mergify · 2025-10-07T05:13:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wwl2755.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify · 2025-10-08T12:08:37Z

Documentation preview: https://vllm--26334.org.readthedocs.build/en/26334/

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

wwl2755 · 2025-10-09T03:58:35Z

vllm/model_executor/models/qwen2_5_omni_thinker.py

 use_audio_in_video = all(
 item["use_audio_in_video"].data for item in video_items
 )


This existing code seems to assume all video inputs should have a paired audio to enable use_audio_in_video.

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


chatgpt-codex-connector · 2025-10-09T04:07:51Z

vllm/v1/worker/gpu_model_runner.py

 second_per_grid_ts.append(t)
 if (t := mm_input.get("audio_feature_lengths")) is not None:
 audio_feature_lengths.append(t)
- if mm_input.get("use_audio_in_video") is True:
- use_audio_in_video = True
+ # Check for use_audio_in_video
+ use_audio_in_video_value = mm_input.get("use_audio_in_video")
+ if use_audio_in_video_value is not None:
+ use_audio_in_video = bool(use_audio_in_video_value.item())


Preserve any use_audio_in_video flag across batch

The new loop in _init_mrope_positions overwrites use_audio_in_video on every multimodal item (use_audio_in_video = bool(use_audio_in_video_value.item())). When a batch mixes requests that require audio-in-video with ones that do not, the last item processed can reset the flag to False, so get_mrope_input_positions is called without audio-in-video handling even though earlier requests needed it. This yields incorrect rotary positions for those prompts. The flag should be accumulated (e.g., OR’ed) instead of overwritten so that any request enabling audio-in-video keeps the global flag true.
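A minimal sketch of the difference between overwriting and accumulating the flag, assuming plain Python values rather than the tensors used in the diff (so there is no `.item()` call). The function names `flag_overwrite` and `flag_accumulate` and the `mixed` list are hypothetical and do not appear in vLLM.

```python
def flag_overwrite(values):
    # Mirrors the pattern flagged in the review: each item replaces the flag.
    flag = False
    for v in values:
        if v is not None:
            flag = bool(v)
    return flag


def flag_accumulate(values):
    # Alternative suggested in the review: once True, the flag stays True.
    flag = False
    for v in values:
        if v is not None:
            flag = flag or bool(v)
    return flag


mixed = [True, None, False]  # hypothetical per-item flag values
print(flag_overwrite(mixed))   # False
print(flag_accumulate(mixed))  # True
```

With mixed inputs the two strategies disagree, which is the situation the review comment describes. When every item carries the same value, both return the same result.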


How to handle use_audio_in_video and non-use_audio_in_video items mixed in a single request is an open problem. This PR's scope assumes that all video items have the same value for this field.

wwl2755 · 2025-10-09T04:29:39Z

This should be ready for review. Please feel free to take a look when you are free~ @DarkLight1337 @ywang96 @Isotr0py

DarkLight1337 · 2025-10-09T04:43:47Z

vllm/model_executor/models/qwen2_5_omni_thinker.py

+ (
+ prompt_ids,
+ mm_placeholders,
+ ) = self._apply_prompt_updates(


Suggested change

- (
- prompt_ids,
- mm_placeholders,
- ) = self._apply_prompt_updates(
+ prompt_ids, mm_placeholders = self._apply_prompt_updates(

Nit: Avoid unnecessary lines. Same below, and can also do the same for self._validate_mm_placeholders

DarkLight1337 · 2025-10-09T04:44:38Z

vllm/model_executor/models/qwen2_5_omni_thinker.py

+ if num_audios != num_videos:
+ raise ValueError(
+ f"use_audio_in_video requires equal number of audio and video items, "
+ f"got audio={num_audios}, video={num_videos}"


Suggested change

- f"got audio={num_audios}, video={num_videos}"
+ f"got {num_audios=}, {num_videos=}"
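For readers unfamiliar with the `=` specifier in f-strings (available since Python 3.8), the short sketch below shows what each spelling produces. The counts are hypothetical values chosen for illustration.

```python
num_audios, num_videos = 1, 2  # hypothetical counts

explicit = f"got audio={num_audios}, video={num_videos}"
self_doc = f"got {num_audios=}, {num_videos=}"

print(explicit)  # got audio=1, video=2
print(self_doc)  # got num_audios=1, num_videos=2
```

The second form prints the variable names themselves, so the message stays in sync if the variables are renamed.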

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify · 2025-10-14T04:31:27Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wwl2755.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify bot added documentation Improvements or additions to documentation qwen Related to Qwen models v1 labels Oct 7, 2025

mergify bot added the needs-rebase label Oct 7, 2025

wwl2755 force-pushed the mm-omni-2 branch from 243bba6 to acb006d Compare October 7, 2025 05:28

mergify bot removed the needs-rebase label Oct 7, 2025

init

4fdbf8e

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

wwl2755 force-pushed the mm-omni-2 branch from acb006d to 4fdbf8e Compare October 7, 2025 05:46

fix

ab88a46

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

DarkLight1337 mentioned this pull request Oct 7, 2025

[V0 Deprecation] Remove VLLM_USE_V1 from docs and scripts #26336

Merged

5 tasks

wwl2755 added 2 commits October 9, 2025 02:59

cleanup

db41805

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

cleanup

8df7bc3

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

wwl2755 commented Oct 9, 2025


add validation for matched number

1eec6a0

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

wwl2755 marked this pull request as ready for review October 9, 2025 04:04

wwl2755 requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat, sighingnow and ywang96 as code owners October 9, 2025 04:04

chatgpt-codex-connector bot reviewed Oct 9, 2025


DarkLight1337 reviewed Oct 9, 2025


comment

9693739

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify bot added the needs-rebase label Oct 14, 2025

merge from main

d8030fa

Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>

mergify bot removed the needs-rebase label Oct 16, 2025

DarkLight1337 mentioned this pull request Oct 16, 2025

[RFC]: Multi-modality Support on vLLM #4194

Open

54 tasks
