feat: add base_model eval type and completions support #1069
Conversation
Summary of Changes

Hello @TrinityXI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly extends the evaluation framework by introducing a new `base_model` evaluation type with support for the `/completions` endpoint.
Code Review
This pull request introduces support for evaluating base models via the /completions endpoint by adding a new EvalType.BASE_MODEL. The implementation is comprehensive, covering prompt construction, streaming support, and response normalization.
My review focuses on a few key areas:
- A potential breaking change in the default configuration for existing service types.
- Missing feature support for `logprobs` in completion responses.
- Minor improvements for robustness and correctness in utility functions.
Overall, this is a great addition that expands the evaluation capabilities of the framework. The suggested changes aim to improve correctness and avoid unintended side effects.
    elif self.eval_type in (EvalType.SERVICE, EvalType.BASE_MODEL):
        self.generation_config = {
            'temperature': 0.0,
            'max_tokens': 512,
        }
This change introduces a default max_tokens: 512 for EvalType.SERVICE in addition to the new EvalType.BASE_MODEL. Previously, EvalType.SERVICE did not have a default max_tokens, so it would use the model's default. This is a behavioral change that could unexpectedly truncate outputs for existing users of EvalType.SERVICE. The pull request description states "Zero breaking changes to existing server/openai_api behavior", but this change seems to contradict that. To avoid breaking changes, you might want to apply this default only to EvalType.BASE_MODEL.
Suggested change:

    elif self.eval_type == EvalType.SERVICE:
        self.generation_config = {
            'temperature': 0.0,
        }
    elif self.eval_type == EvalType.BASE_MODEL:
        self.generation_config = {
            'temperature': 0.0,
            'max_tokens': 512,
        }
    if not hasattr(self, '_valid_completion_params'):
        self._valid_completion_params = get_supported_params(self.client.completions.create)
It's good practice to cache the result of get_supported_params to avoid repeated and potentially expensive reflection calls. However, this caching is implemented on the instance (self). If multiple instances of OpenAIBaseModelAPI are created, this check will be performed for each one. Consider caching this at the class level to optimize further, as the supported parameters for self.client.completions.create will be the same across all instances of this class.
Suggested change:

    if not hasattr(OpenAIBaseModelAPI, '_valid_completion_params'):
        OpenAIBaseModelAPI._valid_completion_params = get_supported_params(self.client.completions.create)
| """Flatten chat messages into a simple text prompt for completions API.""" | ||
| parts: List[str] = [] | ||
| for message in messages: | ||
| role = getattr(message, 'role', 'user') |
Using getattr(message, 'role', 'user') is defensive, but ChatMessage is a Union of types that all define a role attribute. Relying on getattr with a default might mask potential issues where a message object is missing its role, which would cause it to silently default to 'user' and lead to incorrect prompt formatting. It would be more robust and clearer to directly access message.role. If there's a scenario where role can be missing, it might be better to handle that case explicitly or adjust the type hints.
Suggested change:

    role = message.role
    ChatCompletionChoice(
        message=ChatMessageAssistant(content=(choice.text or ''), model=response.model, source='generate'),
        stop_reason=as_stop_reason(choice.finish_reason),
        logprobs=None,
The logprobs field is hardcoded to None. The OpenAI completions API can return log probabilities if they are requested in generation_config. To fully support the features of the completions endpoint, it would be beneficial to parse the logprobs from the choice object and populate the logprobs field in ChatCompletionChoice. This would be consistent with how chat_choices_from_openai handles logprobs for chat completions.
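For illustration, the legacy completions logprobs object exposes parallel lists (`tokens`, `token_logprobs`, `top_logprobs`) that can be mapped into the chat-style per-token structure. The helper below is only a sketch: its name is invented and it returns plain dicts rather than the framework's actual `Logprobs` types.

```python
from typing import Any, Dict, List, Optional


def completion_logprobs_to_chat_style(logprobs: Optional[Any]) -> Optional[Dict[str, List[Dict[str, Any]]]]:
    """Convert legacy /completions logprobs (parallel lists) into a
    chat.completions-style {'content': [{'token', 'logprob', 'top_logprobs'}]}
    shape so downstream code can treat both endpoints uniformly."""
    if logprobs is None or not getattr(logprobs, 'tokens', None):
        return None

    token_logprobs = logprobs.token_logprobs or []
    top_logprobs = logprobs.top_logprobs or []
    content: List[Dict[str, Any]] = []
    for i, token in enumerate(logprobs.tokens):
        # top_logprobs is a list of {token: logprob} dicts aligned with tokens.
        top = top_logprobs[i] if i < len(top_logprobs) and top_logprobs[i] else {}
        content.append({
            'token': token,
            'logprob': token_logprobs[i] if i < len(token_logprobs) else None,
            'top_logprobs': [{'token': t, 'logprob': lp} for t, lp in top.items()],
        })
    return {'content': content}
```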
    CompletionChoice(
        finish_reason=finish_reasons.get(index, 'stop'),
        index=index,
        logprobs=None,
Similar to the non-streaming case, logprobs are hardcoded to None when reconstructing choices from a stream. The streaming chunks can contain logprob information which is currently being discarded. To provide full feature support, please consider collecting and reconstructing the logprobs from the stream as well.
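One possible approach, sketched here with assumed names and the legacy completions streaming chunk shape, is to concatenate the parallel logprob lists per choice index while consuming the stream and then convert the merged result as in the non-streaming case:

```python
from collections import defaultdict
from typing import Any, Dict, List


def accumulate_stream_logprobs(chunks: List[Any]) -> Dict[int, Dict[str, List[Any]]]:
    """Merge per-chunk logprobs from a /completions stream, keyed by choice
    index; each chunk only carries logprobs for the tokens it emitted, so the
    parallel lists are concatenated in arrival order."""
    merged: Dict[int, Dict[str, List[Any]]] = defaultdict(
        lambda: {'tokens': [], 'token_logprobs': [], 'top_logprobs': []}
    )
    for chunk in chunks:
        for choice in getattr(chunk, 'choices', None) or []:
            lp = getattr(choice, 'logprobs', None)
            if lp is None:
                continue
            merged[choice.index]['tokens'].extend(lp.tokens or [])
            merged[choice.index]['token_logprobs'].extend(lp.token_logprobs or [])
            merged[choice.index]['top_logprobs'].extend(lp.top_logprobs or [])
    return dict(merged)
```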
    choices=choices,
    created=last_chunk.created,
    model=last_chunk.model,
    object=getattr(last_chunk, 'object', 'completion'),
The object attribute of the reconstructed Completion object is set using getattr(last_chunk, 'object', 'completion'). According to the OpenAI API documentation, the object type for a completion object is 'text_completion'. Using 'completion' might be incorrect. For consistency with the OpenAI spec and with how collect_stream_response handles chat completions (it sets object='chat.completion'), this should probably be set to 'text_completion'. The object for a streaming chunk is also 'text_completion', so you could just use last_chunk.object.
Suggested change:

    object='text_completion',
Yunnglin left a comment

@TrinityXI Thanks for the PR! A few things to address as commented:

    pip install pre-commit
    pre-commit install
    pre-commit run --all-files
New Feature: Support for `base_model` Evaluation Type

`EvalType.BASE_MODEL` is now introduced to support evaluation of non-chat (base/completion-style) models via the OpenAI-compatible `/completions` endpoint, clearly distinguished from the existing `server`/`openai_api` type that uses `/chat/completions`.

Key Implementation Details
- New eval type: `EvalType.BASE_MODEL`
- New model API class: `OpenAIBaseModelAPI`, using the completions call chain
- Chat messages are flattened into a plain-text prompt (with a trailing `assistant:` suffix preserved to trigger continuation); a sketch of this flattening follows this list
- Requests are sent to the `/completions` endpoint
- Responses are normalized into the standard `ModelOutput` structure, ensuring seamless integration with existing evaluation pipelines
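The prompt flattening can be pictured with the minimal sketch below. It uses plain role/content dicts and an invented function name; the actual implementation works on the framework's ChatMessage objects and may format roles differently.

```python
from typing import Dict, List


def flatten_messages_to_prompt(messages: List[Dict[str, str]]) -> str:
    """Flatten chat-style messages into one text prompt for /completions."""
    parts: List[str] = [f"{m.get('role', 'user')}: {m.get('content', '')}" for m in messages]
    # The trailing 'assistant:' cue asks the base model to continue the
    # dialogue instead of receiving a structured chat request.
    parts.append('assistant:')
    return '\n\n'.join(parts)


if __name__ == '__main__':
    print(flatten_messages_to_prompt([
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'What is 2 + 2?'},
    ]))
```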
Default Configuration & Compatibility

- Default `max_tokens=512` (can be overridden via `generation_config`, e.g., `max_tokens=2048`)
- API URL handling covers both `/chat/completions` and `/completions` to prevent duplication
- Chat-only parameters such as `tools`, `tool_choice`, etc., are ignored
- Parameters not supported by `/completions` requests are automatically moved to `extra_body` for compatibility (see the sketch after this list)
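The `extra_body` handling in the last bullet can be sketched roughly as follows; the helper name here is invented, and `get_supported_params` in the PR presumably inspects the client signature in a similar, though not necessarily identical, way.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def split_completion_params(create_fn: Callable[..., Any], params: Dict[str, Any]) -> Dict[str, Any]:
    """Keep kwargs accepted by create_fn's signature and route the rest
    through extra_body so OpenAI-compatible servers still receive them."""
    supported = set(inspect.signature(create_fn).parameters)
    kwargs = {k: v for k, v in params.items() if k in supported}
    extras = {k: v for k, v in params.items() if k not in supported}
    if extras:
        kwargs.setdefault('extra_body', {}).update(extras)
    return kwargs


if __name__ == '__main__':
    def fake_create(model: str, prompt: str, temperature: float = 0.0,
                    extra_body: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        return {'model': model, 'prompt': prompt, 'temperature': temperature, 'extra_body': extra_body}

    # top_k / repetition_penalty are not in the signature, so they move to extra_body.
    print(split_completion_params(fake_create, {'temperature': 0.0, 'top_k': 20, 'repetition_penalty': 1.05}))
```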
Usage Example (CLI)

      --eval-type base_model \
      --api-url https://api.example.com/v1 \
      --api-key sk-xxx \
      --generation-config '{"max_tokens": 2048, "temperature": 0}'
Important Note for Few-Shot Evaluation

When using few-shot prompting with the `/completions` endpoint, it is strongly recommended to explicitly specify `stop_seqs` to prevent the model from continuing to generate subsequent examples:
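For example, a generation config along these lines (key names and stop strings are illustrative only; `stop_seqs` follows the wording above, while the raw OpenAI completions parameter is `stop`):

```python
# Hypothetical few-shot generation config for a base model evaluation.
generation_config = {
    'temperature': 0.0,
    'max_tokens': 2048,
    # Cut generation before the model starts writing the next few-shot
    # example on its own; the exact strings depend on the prompt template.
    'stop_seqs': ['\n\nQuestion:', '\n\n'],
}
```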
Scope of Changes

- Files touched: `constants.py`, `config.py`, `models/model_apis.py`, `models/openai_compatible.py`, `models/utils/openai.py`
- Zero breaking changes to existing `server`/`openai_api` behavior

This addition enables accurate and standardized evaluation of base (non-instruction-tuned) models through OpenAI-compatible inference services, closing a long-standing gap in evaluation coverage.