[Bugfix] Multiple fixes for gpt-oss Chat Completion prompting #28729
base: main
Conversation
This fixes multiple issues with how we were prompting gpt-oss models for Chat Completion requests. A summary of the fixes:
- We were not setting the recipient to "assistant" on tool responses.
- We were not mapping the tool_call_id to the proper function name on tool responses, resulting in the model getting confused about which tool response matched which tool call.
- When the model generates tool calls during a turn, we return any model output to the commentary channel as the `content` of that tool call response. We were not properly restoring this content as a commentary message when those tool calls were passed back in via Chat Completion messages.
- When the model generates reasoning messages during a tool call turn, we return that as a `reasoning` property on the Chat Completion response. We were not properly restoring that reasoning content as a message to the analysis channel when subsequent turns passed that back in.
- When non-tool reasoning responses from the model were passed back in via the `reasoning` extra_body in subsequent turns, we were not properly turning those back into analysis messages. This is the same problem we had with reasoning in tool calls, but with a slightly different fix for the non-tool-call path.
- We were not properly marking the channel as `final` for non-tool, non-reasoning model output when generating the Harmony messages for subsequent turns.
- We were not dropping analysis messages that precede a model output to the `final` channel in all cases, resulting in us often showing the model analysis messages from previous turns where it had already generated a final output, leading to increased token usage, context pollution, and not following the best practices for handling raw chain of thought outlined by OpenAI.

I added a series of tests that pass in Chat Completion requests and thoroughly validate the exact Harmony messages generated from those requests to test for all of the cases above.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Code Review
This pull request introduces a comprehensive set of fixes for prompting gpt-oss models for Chat Completion, addressing several issues related to tool calls, reasoning, and multi-turn conversations. The changes significantly improve the model's performance on function calling benchmarks. The addition of a thorough test suite to validate these fixes is a great contribution. The code is well-structured, and the fixes in vllm/entrypoints/harmony_utils.py are robust. I have one suggestion to improve the code by removing a side effect, which will enhance maintainability.
| Before this fix, when running the bfcl multi_turn and single_turn suites (4,441 tests) I received 8 HarmonyErrors. So, I can also confidently say that by fixing these issues in how we were prompting the gpt-oss-120b model, we are able to reduce the frequency with which it does not follow its Harmony format in its responses. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
| Are we able to get this merged before v0.11.1? Looks very critical. |
| Can this be merged please 🙏? |
| I'm going to make some minor changes to address the review feedback from the two code review bots and to isolate this change entirely from the Responses API path so that we can merge this quicker without worry of it impacting Responses handling in any way. I'll get that done, re-tested, and push the updates today. |
Instead of doing my own workaround for Pydantic's poor handling of Iterable fields, re-use the existing maybe_serialize_tool_calls method that the MistralTokenizer work added to turn these ValidatorIterator instances into actual lists. Signed-off-by: Ben Browning <bbrownin@redhat.com>
This isolates the fixes made for turning Chat Completion Requests into Harmony Messages from the code path that Responses uses for previous_input_messages so that Chat Completions support can be iterated and improved without worrying about regressing Responses previous input handling. Signed-off-by: Ben Browning <bbrownin@redhat.com>
Note that even without this change, the gpt-oss support on latest merged main branch is substantially better than what v0.11.0 or v0.10.2 shipped with. I don't know the timing or cutoff for fixes in v0.11.1, but even if this does not make that release there will still be measurably better gpt-oss support in v0.11.1. With that said, I support getting this in if the window for additional fixes is still open for v0.11.1. |
| Ok, from my perspective I've addressed the review bot comments, entirely isolated these changes to only impact the Chat Completions path, expanded the existing test_harmony_utils.py test suite so that it tests both the Chat Completions and Responses paths with a common set of tests while moving logic specific to each into its own test class, and re-run my BFCL evaluation on gpt-oss-120b with the latest changes to ensure the numbers are approximately equal to before (approximately because there is a bit of run-to-run variance in this test suite: even though the inputs are identical in every run, the outputs are not entirely deterministic). I'll keep an eye out for any other review feedback or comments required to get this in. |
| While I agree that tool calling and various other GPT-OSS quirks had already improved from basically unusable to a usable state before this PR, this should be designated as a milestone for v0.11.1. |
| /cc @yeqcharlotte PTAL. |
alecsolder left a comment
This is great! I think we're finally landing on an implementation that takes the learnings from all of us over the past few months :)
for i, msg in enumerate(msgs):
    if (
        i < last_assistant_final_index
        and msg.author.role == "assistant"
I don't think you should restrict this to the assistant role; they don't in harmony. Primarily this is to save context and remove tool calls and tool call outputs once there is a message to the final channel after them, and requiring the role to be assistant would prevent that.
It's a little confusing, but if we want to provide a reference for this, we can use this. It says the index of the "last assistant message", where "message" is the important word: the message type in the Responses API means the final channel, so this is where our logic comes from.
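For concreteness, one way to express the behavior described here, with no role restriction. This is a sketch of the suggestion, not the PR's code, and assumes the openai_harmony `Message` objects used in this diff:

```python
from openai_harmony import Message


def drop_superseded_messages(msgs: list[Message]) -> list[Message]:
    """Drop analysis and commentary messages (reasoning, tool calls, and tool
    outputs) that precede the last final-channel message, regardless of role."""
    last_final_index = max(
        (i for i, m in enumerate(msgs) if m.channel == "final"), default=-1
    )
    return [
        m
        for i, m in enumerate(msgs)
        if not (i < last_final_index and m.channel in ("analysis", "commentary"))
    ]
```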
        for item in content
        if isinstance(item, dict) and item.get("type") == "text"
    )
    if content:
What is this case for? If tool_calls is populated, I can understand reasoning being populated, but not this (which would imply it is more like a "final" message).
I think you're right that this content more closely matches the final content, but now you made me double-check how we generate the outputs, and we have some issues in the mapping of reasoning and tool call outputs when generating Chat Completion responses. In my head, the content here matched commentary channel responses from tool calls while the reasoning key matched analysis channel responses. However, that's not necessarily the case, and it looks like there are some edge cases in our reasoning and tool call parsers for gpt-oss models that can sometimes end up with this, sometimes end up with analysis or commentary messages missed entirely, or most often end up with analysis messages as reasoning and final messages as the tool call content.
In practice, this still seems to work quite well as-is. But, we can likely pick up a bit more accuracy by doing a once-over of Chat Completions reasoning/tool parsers and chat_completion_*_generator paths for harmony models to ensure streaming/non-streaming cases are all aligned with exactly how we parse input. I probably need a new test suite here that starts with Harmony messages, goes through our response parsing to Chat Completion responses, sends those back in as Chat Completion requests, and ensures the Harmony messages generated from that match what we originally sent.
Part of what worries me here is the response generation side, where it makes assumptions about which messages to map as reasoning vs content solely based on the order of the output messages, without checking any channels:
vllm/vllm/entrypoints/harmony_utils.py
Lines 522 to 534 in 63fed55
if len(output_msgs) == 0:
    # The generation has stopped during reasoning.
    reasoning = parser.current_content
    final_content = None
elif len(output_msgs) == 1:
    # The generation has stopped during final message.
    reasoning = output_msgs[0].content[0].text
    final_content = parser.current_content
else:
    reasoning_msg = output_msgs[:-1]
    final_msg = output_msgs[-1]
    reasoning = "\n".join([msg.content[0].text for msg in reasoning_msg])
    final_content = final_msg.content[0].text
Then, we overwrite the content from that if using a tool call parser at
vllm/vllm/entrypoints/openai/serving_chat.py
Lines 1340 to 1351 in 63fed55
tool_call_info = tool_parser.extract_tool_calls(
    "",
    request=request,
    token_ids=token_ids,  # type: ignore
)
content = tool_call_info.content
message = ChatMessage(
    role=role,
    reasoning=reasoning,
    content=content,
    tool_calls=tool_call_info.tool_calls,
)
The tool call parser at https://github.com/vllm-project/vllm/blob/63fed5550609b96b578d2512aefced09efe76e1e/vllm/entrypoints/openai/tool_parsers/openai_tool_parser.py only looks at harmony messages with function recipients or to the final channel.
So, we may end up mixing analysis and commentary channel messages in reasoning or even dropping one of those entirely when making the Chat Completion response. To get this request mapping to Harmony messages exactly right, I'll need to clean up and pair this with some adjustments to the Harmony message to response mapping path so I can do this deterministically.
That will take a bit of time to match up, add some round-trip tests to ensure it doesn't deviate in the future, and re-run the eval. I can tackle this next week, either as part of this PR or a follow-up, based on timing preferences.
analysis_msg = Message.from_role_and_content(
    Role.ASSISTANT, reasoning_content
)
analysis_msg = analysis_msg.with_channel("analysis")
This runs counter to how I feel it should work, but I think it is correct.
Trying to understand OpenAI's implementation: when reasoning is output to the commentary channel, that is meant to signal that the reasoning should be shown to the user as-is (vs. needing a summarization), i.e. they want to provide the actual reasoning for the tool call so the user can approve it or not, rather than only a summary. But when it is provided as input again, that distinction no longer matters, so they just always change it back to analysis so that it can be dropped.
tool_call_id = chat_msg.get("tool_call_id", "")
name = tool_id_names.get(tool_call_id, "")
content = chat_msg.get("content", "") or ""
if isinstance(content, list):
nit: this could be pulled out because it is repeated multiple times
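For what it's worth, the shared helper being suggested could look something like this; the name and exact join behavior are illustrative, not taken from the PR:

```python
def get_text_content(content) -> str:
    """Flatten Chat Completion content (a string or a list of content parts)
    into plain text, keeping only the "text"-type parts."""
    if isinstance(content, list):
        return "".join(
            item.get("text", "")
            for item in content
            if isinstance(item, dict) and item.get("type") == "text"
        )
    return content or ""
```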
msg = Message.from_role_and_contents(role, contents)
# Send non-tool assistant messages to the final channel if they don't have a
# channel already.
if not msg.channel:
When would this if case not trigger? I think from_role_and_contents always returns a None channel.
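A quick check of that claim could look like the following, assuming the openai_harmony constructors used in this diff; building `contents` from `TextContent` here is my assumption, not the PR's code:

```python
from openai_harmony import Message, Role, TextContent

# A freshly constructed message has no channel until with_channel() is called,
# so the guard below it should always fire for these messages.
msg = Message.from_role_and_contents(Role.ASSISTANT, [TextContent(text="hi")])
assert not msg.channel
assert msg.with_channel("final").channel == "final"
```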
msg = Message.from_role_and_content(Role.ASSISTANT, arguments)
msg = msg.with_channel("commentary")
msg = msg.with_recipient(f"functions.{name}")
msg = msg.with_content_type("json")
I think this could use a comment since it differs from the OpenAI implementation; however, I think there has been enough proof at this point that it increases benchmark scores, so it should be kept.
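For example, the requested comment could sit directly above the lines quoted above; the wording is only a suggestion, and `arguments`/`name` come from the enclosing loop in the diff:

```python
# NOTE: Replaying prior-turn tool calls this way differs from the OpenAI
# reference Harmony rendering, but it has measured better on function-calling
# benchmarks (see the discussion on this PR), so we keep it.
msg = Message.from_role_and_content(Role.ASSISTANT, arguments)
msg = msg.with_channel("commentary")
msg = msg.with_recipient(f"functions.{name}")
msg = msg.with_content_type("json")
```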
def parse_input_to_harmony_message(chat_msg) -> list[Message]:
    """
    Parse a message from request.preview_input_messages in the Responsees API to
nit: previous?
if not msg.channel:
    msg = msg.with_channel("final")
msgs.append(msg)
# For user/system/developer messages, add them directly even if no content.
I think system and developer cases here need more complex logic.
I think from a high level:
- We should never have more than 1 system message
- We should never have more than 1 developer message
(Can this validation be added to a test?)
Since we create a system message and a developer message every time in _make_request_with_harmony, I'm worried about what happens with system messages here.
Since there is no "instructions" field like in Responses, we need to merge the developer message created with the included tools with the developer message here, without duplicating the tools, for example. Complicated, and definitely worth the tests I think.
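One hedged sketch of the validation being asked for, written as a plain check over whatever list of Harmony messages a test renders; the helper name is illustrative, not an existing vLLM API:

```python
from openai_harmony import Message


def assert_single_system_and_developer(msgs: list[Message]) -> None:
    # At most one Harmony system message and one developer message should be
    # rendered, no matter how many chat messages were merged into them.
    # Roles compared as strings, mirroring msg.author.role usage in this diff.
    assert sum(1 for m in msgs if m.author.role == "system") <= 1
    assert sum(1 for m in msgs if m.author.role == "developer") <= 1
```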
Purpose
This fixes multiple issues with how we were prompting gpt-oss models for Chat Completion requests. A summary of the fixes:
- We were not setting the recipient to "assistant" on tool responses.
- We were not mapping the tool_call_id to the proper function name on tool responses, resulting in the model getting confused about which tool response matched which tool call (see the sketch after this list).
- When the model generates tool calls during a turn, we return any model output to the commentary channel as the `content` of that tool call response. We were not properly restoring this content as a commentary message when those tool calls were passed back in via Chat Completion messages.
- When the model generates reasoning messages during a tool call turn, we return that as a `reasoning` property on the Chat Completion response. We were not properly restoring that reasoning content as a message to the analysis channel when subsequent turns passed that back in.
- When non-tool reasoning responses from the model were passed back in via the `reasoning` extra_body in subsequent turns, we were not properly turning those back into analysis messages. This is the same problem we had with reasoning in tool calls, but with a slightly different fix for the non-tool-call path.
- We were not properly marking the channel as `final` for non-tool, non-reasoning model output when generating the Harmony messages for subsequent turns.
- We were not dropping analysis messages that precede a model output to the `final` channel in all cases, resulting in us often showing the model analysis messages from previous turns where it had already generated a final output, leading to increased token usage, context pollution, and not following the best practices for handling raw chain of thought outlined by OpenAI.

I added a series of tests that pass in Chat Completion requests and thoroughly validate the exact Harmony messages generated from those requests to test for all of the cases above.
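For context, a minimal sketch of what the first two fixes amount to when building the Harmony message for a tool result. The helper name and the `tool_id_names` lookup are illustrative rather than the PR's actual code, and it assumes the openai_harmony `Author`/`Message` helpers used elsewhere in this diff:

```python
from openai_harmony import Author, Message, Role


def tool_response_to_harmony(chat_msg: dict, tool_id_names: dict[str, str]) -> Message:
    # Map the tool_call_id back to the function name the model actually called,
    # so the model can tell which tool output answers which tool call.
    name = tool_id_names.get(chat_msg.get("tool_call_id", ""), "")
    content = chat_msg.get("content", "") or ""
    return (
        Message.from_author_and_content(
            Author.new(Role.TOOL, f"functions.{name}"), content
        )
        .with_channel("commentary")
        # The missing piece before this fix: address the tool output back to
        # the assistant.
        .with_recipient("assistant")
    )
```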
Test Plan
Unit Tests
Berkeley Function Calling Leaderboard v4 Results
Run vLLM built with this fix:
Run `bfcl generate` against that vLLM:
Test Result
Unit Tests
Berkeley Function Calling Leaderboard v4 Results
Baseline, from vLLM main as of Nov 12, 2025
commit 3044195
45.0% overall accuracy
With this fix on top of the same commit from main