
Conversation

@Luodian (Contributor) commented May 26, 2025

V*-Bench (Visual Star Benchmark)

Overview

V*-Bench is a visual question-answering benchmark designed to evaluate multimodal language models' capabilities in visual perception and reasoning. The benchmark focuses on assessing models' ability to accurately identify and reason about visual attributes in images through multiple-choice questions.

Dataset Details

  • Dataset: lmms-lab/vstar-bench
  • Size: 191 test samples
  • Format: Multiple-choice questions with 4 options (A, B, C, D)
  • Modalities: Image + Text

Task Categories

The benchmark includes two main categories:

  1. Direct Attributes (vstar_bench_direct_attributes)

    • Questions about direct visual properties such as colors, objects, counts, and characteristics
    • Examples: "What is the color of the glove?", "What is the breed of the dog?", "How many people are in the image?"
  2. Relative Position (vstar_bench_relative_position)

    • Questions about spatial relationships and positioning of objects within images
    • Evaluates understanding of spatial concepts and object relationships

Evaluation

Metrics

  • Overall Accuracy: Percentage of correctly answered questions across all categories
  • Category-specific Accuracy: Accuracy for each individual category (direct_attributes, relative_position)

Running the Benchmark

To evaluate a model on V*-Bench:

# Run the full benchmark
lmms-eval --model <model_name> --tasks vstar_bench --output_path ./results

# Run specific categories
lmms-eval --model <model_name> --tasks vstar_bench_direct_attributes --output_path ./results
lmms-eval --model <model_name> --tasks vstar_bench_relative_position --output_path ./results

Configuration

The benchmark uses the following configuration (an illustrative mapping to generation arguments is sketched after the list):

  • Generation Settings:

    • max_new_tokens: 16
    • temperature: 0
    • top_p: 1.0
    • num_beams: 1
    • do_sample: false
  • Prompt Template:

    • Post-prompt: "\nAnswer with the option's letter from the given choices directly."
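
These settings amount to deterministic, greedy decoding with a short output budget, since the expected answer is a single option letter. As an illustration only, and not the exact code in this PR, they map onto standard Hugging Face generation keyword arguments roughly as follows:

# Illustrative mapping of the settings above onto Hugging Face generation kwargs;
# the exact dictionary assembled inside lmms-eval may differ.
gen_kwargs = {
    "max_new_tokens": 16,  # the expected answer is a single option letter
    "temperature": 0,      # combined with do_sample=False this is greedy decoding
    "top_p": 1.0,
    "num_beams": 1,
    "do_sample": False,
}
# output_ids = model.generate(**inputs, **gen_kwargs)  # assuming a transformers model and prepared inputs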

Implementation Details

Answer Extraction

The evaluation system extracts answer letters (A, B, C, or D) from model responses using multiple patterns to handle various response formats (a sketch of such extraction follows the list):

  • Direct letter: "A"
  • With punctuation: "A.", "A)", "(A)"
  • Full answer format: "Answer: A", "The answer is A"
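
A minimal sketch of such pattern-based extraction is shown below; the function name and regular expressions are illustrative assumptions, and the actual logic in utils.py may differ:

import re

# Hypothetical sketch of letter extraction covering the formats listed above;
# the real implementation in utils.py may use different patterns.
def extract_answer_letter(response: str) -> str | None:
    text = response.strip()
    patterns = [
        r"^\(?([A-D])\)?[\.\)]?\s*$",                 # "A", "A.", "A)", "(A)"
        r"[Aa]nswer\s*(?:is)?\s*:?\s*\(?([A-D])\)?",  # "Answer: A", "The answer is A"
        r"\b([A-D])\b",                               # fallback: first standalone capital letter
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None

Under this sketch, extract_answer_letter("The answer is B") returns "B", while a response with no recognizable option letter yields None.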

Aggregation

Results are aggregated both by category and overall, providing detailed performance metrics for different aspects of visual understanding.
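
As an illustration of this aggregation step (the per-sample result format here is an assumption, and the actual functions in utils.py may differ):

from collections import defaultdict

# Illustrative aggregation over per-sample results; each result is assumed to carry
# a 'category' field (direct_attributes or relative_position) and a boolean 'correct' flag.
def aggregate_accuracy(results: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for item in results:
        totals[item["category"]] += 1
        correct[item["category"]] += int(item["correct"])
    scores = {category: correct[category] / totals[category] for category in totals}
    scores["overall"] = sum(correct.values()) / sum(totals.values())
    return scores

With the 191 samples split across the two categories, the overall score computed this way is the sample-weighted mean of the category accuracies.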

File Structure

vstar_bench/
├── __init__.py
├── README.md
├── _default_template_yaml         # Base configuration
├── vstar_bench.yaml              # Main task configuration
├── vstar_bench_direct_attributes.yaml
├── vstar_bench_relative_position.yaml
└── utils.py                      # Processing and evaluation functions


Luodian added 4 commits May 24, 2025 05:16
- Updated Python version requirement in pyproject.toml to >=3.12.
- Removed specific version constraint for protobuf in dependencies.
- Added 'uv.lock' to .gitignore.
- Modified example script to change model task from 'mmmu_pro' to 'mme' and updated comments for clarity.
- Changed `use_flash_attention_2` parameter to `attn_implementation` for better flexibility in attention methods.
- Added validation for `attn_implementation` to ensure only valid options are accepted.
- Updated model loading to dynamically include attention implementation in arguments.
- Introduced new V* benchmark task files, including default template, utility functions, and specific task configurations for direct attributes and relative position metrics.
@Luodian requested review from Copilot and kcz358 May 26, 2025 05:33
@Luodian (Contributor, Author) commented May 26, 2025

[Screenshot attached: PixPin_2025-05-26_13-33-19]

Copilot AI left a comment

Pull Request Overview

This pull request refactors and adds configuration for the V*-Bench benchmark while updating some project dependencies and model loading parameters.

  • Updated Python version requirement and dependency specifications in pyproject.toml
  • Added benchmark YAML configuration files and supporting evaluation functions in utils.py
  • Modified model initialization and generation settings in the Qwen2.5-VL implementation and updated the example launch script

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Summary per file:

  • pyproject.toml: Upgraded Python requirement and removed the version pin for protobuf
  • lmms_eval/tasks/vstar_bench/*.yaml: Added YAML configuration files for the V*-Bench sub-tasks
  • lmms_eval/tasks/vstar_bench/utils.py: Introduced processing, answer extraction, and result aggregation functions
  • lmms_eval/models/qwen2_5_vl.py: Updated model loading, attention implementation validation, and generation logic
  • examples/models/qwen25vl.sh: Adjusted example parameters for model arguments

Comments suppressed due to low confidence (3)

pyproject.toml:21

  • Upgrading the Python requirement to 3.12 may limit compatibility for users on older versions. Please confirm that the upgrade is intentional and that the project has no dependency on earlier Python versions.
requires-python = ">=3.12"

lmms_eval/models/qwen2_5_vl.py:308

  • Setting temperature and top_p to None when do_sample is false may cause issues if the underlying generate function expects numerical values. Please verify that None is accepted for these parameters in the API.
if current_gen_kwargs["temperature"] > 0:

examples/models/qwen25vl.sh:16

  • The change of the interleave_visuals flag from True to False is significant; please ensure that the documentation and any relevant comments are updated to reflect this new behavior.
--model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=12845056,use_flash_attention_2=True,interleave_visuals=False \

"eva-decord; platform_system == 'Darwin'",
"zss",
"protobuf==3.20",
"protobuf",

Copilot AI May 26, 2025

Removing the explicit version pin for protobuf could lead to unintended issues if newer releases introduce breaking changes. Consider pinning a known compatible version to ensure project stability.

Suggested change:
- "protobuf",
+ "protobuf==4.24.3",

- Introduced the VLMs Are Blind benchmark to evaluate visual reasoning capabilities of Vision-Language Models through path-counting tasks in subway diagrams.
- Added task configuration files, utility functions, and a comprehensive README for task instructions and dataset details.
- Implemented both standard and lite versions of the benchmark for varied evaluation speeds.
@Luodian merged commit cd1d194 into main May 26, 2025
2 checks passed