-
- Notifications
You must be signed in to change notification settings - Fork 10.6k
[MM][Feat] Add support for audio in video in Qwen2.5-Omni #26334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This pull request has merge conflicts that must be resolved before it can be |
Documentation preview: https://vllm--26334.org.readthedocs.build/en/26334/ |
use_audio_in_video = all( | ||
item["use_audio_in_video"].data for item in video_items | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This existing code seems to assume all video inputs should have a paired audio to enable use_audio_in_video
.
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
second_per_grid_ts.append(t) | ||
if (t := mm_input.get("audio_feature_lengths")) is not None: | ||
audio_feature_lengths.append(t) | ||
if mm_input.get("use_audio_in_video") is True: | ||
use_audio_in_video = True | ||
# Check for use_audio_in_video | ||
use_audio_in_video_value = mm_input.get("use_audio_in_video") | ||
if use_audio_in_video_value is not None: | ||
use_audio_in_video = bool(use_audio_in_video_value.item()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve any
use_audio_in_video
flag across batch
The new loop in _init_mrope_positions
overwrites use_audio_in_video
on every multimodal item (use_audio_in_video = bool(use_audio_in_video_value.item())
). When a batch mixes requests that require audio-in-video with ones that do not, the last item processed can reset the flag to False
, so get_mrope_input_positions
is called without audio-in-video handling even though earlier requests needed it. This yields incorrect rotary positions for those prompts. The flag should be accumulated (e.g., OR’ed) instead of overwritten so that any request enabling audio-in-video keeps the global flag true.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to handle use_audio_in_video
and non_use_audio_in_video
fixed in a request is a problem. This PR's scope is to assume all video items have the same attribute in this field.
This should be ready to review. Please free feel to take a look when you are free~ @DarkLight1337 @ywang96 @Isotr0py |
( | ||
prompt_ids, | ||
mm_placeholders, | ||
) = self._apply_prompt_updates( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
( | |
prompt_ids, | |
mm_placeholders, | |
) = self._apply_prompt_updates( | |
prompt_ids, mm_placeholders = self._apply_prompt_updates( |
Nit: Avoid unnecessary lines. Same below, and can also do the same for self._validate_mm_placeholders
if num_audios != num_videos: | ||
raise ValueError( | ||
f"use_audio_in_video requires equal number of audio and video items, " | ||
f"got audio={num_audios}, video={num_videos}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"got audio={num_audios}, video={num_videos}" | |
f"got {num_audios=}, {num_videos=}" |
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
Fix some of #23888
Enable
audio in video
in Qwen2.5-Omni in V1 engine.Same purpose as #26156, but using a different and simpler method from @ywang96 . Basic idea is to create two placeholders for video and audio with the same start_idx, but use "is_embed" to differetiate them.
Basic flow
Known limitation
This PR assumes the number of video and audio would exactly match to enable
use_audio_in_video
as in the example.Test