
Conversation

@mushenL (Contributor) commented Dec 23, 2025

add refcoco benchmark

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @mushenL, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the RefCOCO benchmark into the evalscope framework, significantly expanding its capabilities for evaluating multi-modal models on referring expression comprehension. It provides a robust set of tools for assessing model performance across various tasks, from generating descriptive captions to precisely localizing objects via bounding box predictions, all within a flexible and configurable evaluation environment.

Highlights

  • New Benchmark Addition: The RefCOCO benchmark has been added to the system, enabling evaluation for referring expression comprehension tasks.
  • Comprehensive Evaluation Metrics: New evaluation functions are introduced to calculate metrics for both natural language generation (Bleu, METEOR, ROUGE_L, CIDEr) and bounding box prediction (IoU, ACC@thresholds, Center_ACC).
  • Flexible Evaluation Modes: The RefCOCO benchmark supports multiple evaluation modes ('bbox', 'seg', 'bbox_rec') configurable via an 'eval_mode' parameter, allowing tailored assessment based on the task; a minimal configuration sketch follows this list.
  • Data Adaptation and Prompt Generation: A dedicated data adapter (RefCOCOAdapter) is implemented to transform raw RefCOCO dataset records into evaluation samples, including specific prompt formatting for bounding box regression and image processing.
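
As referenced above, here is a minimal sketch of how the new benchmark might be configured and run. It assumes evalscope's standard TaskConfig/run_task entry points; the dataset registry name 'refcoco', the dataset_args/extra_params keys, and the model name are assumptions inferred from this PR's description and may differ from the merged implementation.

```python
# Hypothetical usage sketch: the dataset name and config keys below are
# assumptions based on this PR's description, not a confirmed API.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-VL-7B-Instruct',   # any multi-modal model the framework can serve
    datasets=['refcoco'],                   # assumed registry name for the new benchmark
    dataset_args={
        'refcoco': {
            # 'eval_mode' selects among 'bbox', 'seg', and 'bbox_rec' per the PR highlights
            'extra_params': {'eval_mode': 'bbox_rec'},
        }
    },
    limit=10,  # evaluate only a few samples while smoke-testing
)
run_task(task_cfg=task_cfg)
```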


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request adds a new benchmark for RefCOCO. The changes include an evaluation library for calculating scores and a data adapter for processing the dataset. My review focuses on improving code robustness, correctness, and maintainability. I've suggested handling potential division-by-zero errors, using proper exception handling, replacing string-based exceptions with actual exception objects, and improving code clarity by using _ for unused variables and translating comments to English. I also recommend re-enabling a commented-out try-except block to prevent crashes during metric calculation.
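
As a concrete illustration of the division-by-zero concern in the box metrics, here is a minimal, self-contained sketch of how IoU, ACC@threshold, and Center_ACC are commonly defined for [x1, y1, x2, y2] boxes. The helper names are hypothetical and are not taken from this PR's evaluation library.

```python
# Illustrative sketch only: hypothetical helpers showing the usual definitions
# of IoU, ACC@threshold, and Center_ACC for [x1, y1, x2, y2] boxes. These are
# not the functions added in this PR.
def compute_iou(pred, gt):
    # Intersection rectangle between the two boxes.
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2, y2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter
    # Guard against division by zero when both boxes are degenerate.
    return inter / union if union > 0 else 0.0


def acc_at(pred, gt, threshold=0.5):
    # A prediction counts as correct when its IoU clears the threshold.
    return 1.0 if compute_iou(pred, gt) >= threshold else 0.0


def center_acc(pred, gt):
    # Correct when the predicted box's center lies inside the ground-truth box.
    cx, cy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    return 1.0 if gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3] else 0.0


print(compute_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.1429
```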

    segmentation = record.get('segmentation')
    image_data = refcoco_seg_doc_to_visual(image_data, segmentation)
else:
    raise 'Invalid eval mode parameter'

Severity: high

Raising a string literal like 'Invalid eval mode parameter' will result in a TypeError at runtime, not the exception you might expect. You should raise an actual Exception instance, such as ValueError.

Suggested change
- raise 'Invalid eval mode parameter'
+ raise ValueError('Invalid eval mode parameter')
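
As a standalone illustration of this point (not code from this PR), Python 3 refuses to raise a bare string, so callers would see a TypeError rather than the intended error:

```python
# Standalone illustration; not code from this PR.
try:
    raise 'Invalid eval mode parameter'   # a plain string is not an exception
except TypeError as exc:
    print(exc)  # "exceptions must derive from BaseException"

try:
    raise ValueError('Invalid eval mode parameter')   # the suggested fix
except ValueError as exc:
    print(exc)  # callers can now catch the intended error type
```
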
elif self.eval_mode in ['bbox', 'seg']:
    target = record.get('answer')
else:
    raise 'Invalid eval mode parameter'

Severity: high

Raising a string literal like 'Invalid eval mode parameter' will result in a TypeError at runtime, not the exception you might expect. You should raise an actual Exception instance, such as ValueError.

Suggested change
- raise 'Invalid eval mode parameter'
+ raise ValueError('Invalid eval mode parameter')
Comment on lines 167 to 175
# try:
results = process_results(doc, filtered_prediction)
score.value.update(results)

score.main_score_name = doc['eval_mode']

# except Exception as e:
# logger.error(f'Error calculating ref_coco metrics: {e}')
# score.value = {}

Severity: high

The try...except block is commented out. This is risky as any error during metric calculation in process_results will crash the evaluation for this sample. It's better to enable this error handling to gracefully manage failures and log them, allowing the evaluation to continue with other samples.

Suggested change
- # try:
- results = process_results(doc, filtered_prediction)
- score.value.update(results)
- score.main_score_name = doc['eval_mode']
- # except Exception as e:
- #     logger.error(f'Error calculating ref_coco metrics: {e}')
- #     score.value = {}
+ try:
+     results = process_results(doc, filtered_prediction)
+     score.value.update(results)
+     score.main_score_name = doc['eval_mode']
+ except Exception as e:
+     logger.error(f'Error calculating ref_coco metrics: {e}')
+     score.value = {}

def extract_answer(self, prediction: str, task_state: TaskState):
    if task_state.metadata['eval_mode'] == 'bbox_rec':
        # 匹配[a, b, c, d]格式的答案

Severity: medium

The comment is in Chinese. For consistency and to make the code accessible to a wider audience, please write comments in English.

Suggested change
- # 匹配[a, b, c, d]格式的答案
+ # Match answers in the format [a, b, c, d]
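
For context, extracting an [a, b, c, d] answer from free-form model output is typically done with a regular expression along these lines. This is an illustrative sketch with a hypothetical pattern and helper name, not the extraction code used in this PR.

```python
import re

# Illustrative sketch only; the actual pattern in this PR may differ.
_BOX_PATTERN = re.compile(
    r'\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,'
    r'\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]'
)

def extract_box(prediction: str):
    """Return the last [x1, y1, x2, y2] found in the prediction, or None."""
    matches = _BOX_PATTERN.findall(prediction)
    if not matches:
        return None
    return [float(v) for v in matches[-1]]

print(extract_box('The object is at [120, 45, 300.5, 210].'))  # [120.0, 45.0, 300.5, 210.0]
```
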
@Yunnglin changed the title from "Yk 20251223" to "[Benchmark] Add RefCOCO" on Dec 24, 2025
@Yunnglin merged commit 2c0e014 into modelscope:main on Dec 24, 2025
3 checks passed
