[Task] V*-Bench (Visual Star Benchmark) #683
Conversation
- Updated Python version requirement in pyproject.toml to >=3.12.
- Removed specific version constraint for protobuf in dependencies.
- Added 'uv.lock' to .gitignore.
- Modified example script to change the model task from 'mmmu_pro' to 'mme' and updated comments for clarity.
- Changed `use_flash_attention_2` parameter to `attn_implementation` for better flexibility in attention methods.
- Added validation for `attn_implementation` to ensure only valid options are accepted.
- Updated model loading to dynamically include the attention implementation in its arguments.
- Introduced new V* benchmark task files, including a default template, utility functions, and specific task configurations for direct-attributes and relative-position metrics.
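A minimal sketch of the validation described above, assuming the three attention backends that HF Transformers currently accepts; the exact code in `qwen2_5_vl.py` may differ:

```python
# Hypothetical sketch of the attn_implementation validation; names and
# structure are assumptions, not the PR's exact code.
VALID_ATTN_IMPLEMENTATIONS = {"flash_attention_2", "sdpa", "eager"}

def attn_kwargs(attn_implementation: str | None) -> dict:
    """Build the extra from_pretrained kwargs for the chosen attention backend."""
    if attn_implementation is None:
        return {}
    if attn_implementation not in VALID_ATTN_IMPLEMENTATIONS:
        raise ValueError(
            f"attn_implementation must be one of "
            f"{sorted(VALID_ATTN_IMPLEMENTATIONS)}, got {attn_implementation!r}"
        )
    return {"attn_implementation": attn_implementation}
```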
Pull Request Overview
This pull request refactors and adds configuration for the V*-Bench benchmark while updating some project dependencies and model loading parameters.
- Updated Python version requirement and dependency specifications in pyproject.toml
- Added benchmark YAML configuration files and supporting evaluation functions in utils.py
- Modified model initialization and generation settings in the Qwen2.5-VL implementation and updated the example launch script
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pyproject.toml | Upgraded Python requirement and removed version pin for protobuf |
| lmms_eval/tasks/vstar_bench/*.yaml | Added YAML configuration files for different V*-Bench sub-tasks |
| lmms_eval/tasks/vstar_bench/utils.py | Introduced processing, answer extraction, and result aggregation functions |
| lmms_eval/models/qwen2_5_vl.py | Updated model loading, attention implementation validation, and generation logic |
| examples/models/qwen25vl.sh | Adjusted example parameters for model arguments |
Comments suppressed due to low confidence (3)
pyproject.toml:21
- Upgrading the Python requirement to 3.12 may limit compatibility for users on older versions. Please confirm that the upgrade is intentional and that the project has no dependency on earlier Python versions.
  `requires-python = ">=3.12"`

lmms_eval/models/qwen2_5_vl.py:308
- Setting temperature and top_p to None when do_sample is false may cause issues if the underlying generate function expects numerical values. Please verify that None is accepted for these parameters in the API.
  `if current_gen_kwargs["temperature"] > 0:`

examples/models/qwen25vl.sh:16
- The change of the interleave_visuals flag from True to False is significant; please ensure that the documentation and any relevant comments are updated to reflect this new behavior.
  `--model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=12845056,use_flash_attention_2=True,interleave_visuals=False \`
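On the second point, one defensive pattern is to omit the sampling parameters entirely for greedy decoding rather than passing None; a hypothetical sketch, not the PR's actual handling:

```python
def build_sampling_args(gen_kwargs: dict) -> dict:
    """Return generate() kwargs, omitting temperature/top_p for greedy decoding
    rather than passing None (hypothetical pattern, not the PR's exact code)."""
    if gen_kwargs.get("do_sample", False) and gen_kwargs.get("temperature", 0) > 0:
        return {
            "do_sample": True,
            "temperature": gen_kwargs["temperature"],
            "top_p": gen_kwargs.get("top_p", 1.0),
        }
    return {"do_sample": False}  # greedy: leave sampling params unset
```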
| "eva-decord; platform_system == 'Darwin'", | ||
| "zss", | ||
| "protobuf==3.20", | ||
| "protobuf", |
Copilot AI · May 26, 2025
Removing the explicit version pin for protobuf could lead to unintended issues if newer releases introduce breaking changes. Consider pinning a known compatible version to ensure project stability.
| "protobuf", | |
| "protobuf==4.24.3", |
- Introduced the VLMs Are Blind benchmark to evaluate visual reasoning capabilities of Vision-Language Models through path-counting tasks in subway diagrams.
- Added task configuration files, utility functions, and a comprehensive README for task instructions and dataset details.
- Implemented both standard and lite versions of the benchmark for varied evaluation speeds.

V*-Bench (Visual Star Benchmark)
Overview
V*-Bench is a visual question-answering benchmark designed to evaluate multimodal language models' capabilities in visual perception and reasoning. The benchmark focuses on assessing models' ability to accurately identify and reason about visual attributes in images through multiple-choice questions.
Dataset Details
Dataset: `lmms-lab/vstar-bench` (hosted on the Hugging Face Hub)

Task Categories
The benchmark includes two main categories:
- Direct Attributes (`vstar_bench_direct_attributes`)
- Relative Position (`vstar_bench_relative_position`)

Evaluation
Metrics
Accuracy on the multiple-choice questions, reported per category and overall.
Running the Benchmark
To evaluate a model on V*-Bench:
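A representative invocation, modeled on the `examples/models/qwen25vl.sh` script touched in this PR. The model arguments are illustrative, and the group task name `vstar_bench` is an assumption; the two sub-tasks can also be run individually:

```bash
python3 -m lmms_eval \
    --model=qwen2_5_vl \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=12845056,attn_implementation=flash_attention_2 \
    --tasks=vstar_bench \
    --batch_size=1 \
    --log_samples \
    --output_path=./logs/
```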
Configuration
The benchmark uses the following configuration:
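In lmms-eval, such settings are conventionally declared under a `generation_kwargs` key in the task YAML; a hypothetical fragment matching the values listed below (the PR's actual file contents may differ):

```yaml
# Hypothetical fragment of the vstar_bench default template YAML;
# field names follow lmms-eval conventions, not the PR's verbatim file.
generation_kwargs:
  max_new_tokens: 16
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
```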
Generation Settings:
- max_new_tokens: 16
- temperature: 0
- top_p: 1.0
- num_beams: 1
- do_sample: false

Prompt Template:
Implementation Details
Answer Extraction
The evaluation system extracts answer letters (A, B, C, or D) from model responses using multiple patterns to handle various response formats:
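A sketch of such multi-pattern extraction; the patterns here are hypothetical, and the actual regexes in `utils.py` may differ:

```python
import re

# Hypothetical extraction patterns; the actual regexes in
# lmms_eval/tasks/vstar_bench/utils.py may differ.
ANSWER_PATTERNS = [
    r"^\(?([A-D])\)?\.?\s*$",                   # bare letter: "A", "(B)", "C."
    r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\)?",   # "Answer: C", "the answer is (D)"
    r"^\(?([A-D])\)?[\s.,:]",                   # leading letter: "A. The cup ..."
]

def extract_answer_letter(response: str) -> str | None:
    """Return the first answer letter (A-D) matched by any pattern, else None."""
    response = response.strip()
    for pattern in ANSWER_PATTERNS:
        match = re.search(pattern, response)
        if match:
            return match.group(1).upper()
    return None
```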
Aggregation
Results are aggregated both by category and overall, providing detailed performance metrics for different aspects of visual understanding.
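A minimal sketch of per-category plus overall aggregation, assuming each per-sample result carries a `category` and a boolean `correct` field; this mirrors the described behavior, not the exact code in `utils.py`:

```python
from collections import defaultdict

def aggregate_results(results: list[dict]) -> dict[str, float]:
    """Compute accuracy per category and overall (hypothetical structure)."""
    totals = defaultdict(lambda: [0, 0])  # category -> [num_correct, num_total]
    for r in results:
        bucket = totals[r["category"]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    scores = {cat: correct / total for cat, (correct, total) in totals.items()}
    scores["overall"] = sum(int(r["correct"]) for r in results) / len(results)
    return scores
```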
File Structure
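Based on the files listed in this PR, the task directory likely looks like the following; the exact filenames are assumptions following lmms-eval conventions:

```
lmms_eval/tasks/vstar_bench/
├── _default_template_yaml               # shared defaults for the sub-tasks
├── vstar_bench_direct_attributes.yaml
├── vstar_bench_relative_position.yaml
└── utils.py                             # processing, answer extraction, aggregation
```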
References