
Conversation

@Luodian (Contributor) commented May 26, 2025

V*-Bench (Visual Star Benchmark)

Overview

V*-Bench is a visual question-answering benchmark designed to evaluate multimodal language models' capabilities in visual perception and reasoning. The benchmark focuses on assessing models' ability to accurately identify and reason about visual attributes in images through multiple-choice questions.

Dataset Details

  • Dataset: lmms-lab/vstar-bench
  • Size: 191 test samples
  • Format: Multiple-choice questions with 4 options (A, B, C, D)
  • Modalities: Image + Text

Task Categories

The benchmark includes two main categories:

  1. Direct Attributes (vstar_bench_direct_attributes)

    • Questions about direct visual properties such as colors, objects, counts, and characteristics
    • Examples: "What is the color of the glove?", "What is the breed of the dog?", "How many people are in the image?"
  2. Relative Position (vstar_bench_relative_position)

    • Questions about spatial relationships and positioning of objects within images
    • Evaluates understanding of spatial concepts and object relationships

Evaluation

Metrics

  • Overall Accuracy: Percentage of correctly answered questions across all categories
  • Category-specific Accuracy: Accuracy for each individual category (direct_attributes, relative_position)

Running the Benchmark

To evaluate a model on V*-Bench:

# Run the full benchmark
lmms-eval --model <model_name> --tasks vstar_bench --output_path ./results

# Run specific categories
lmms-eval --model <model_name> --tasks vstar_bench_direct_attributes --output_path ./results
lmms-eval --model <model_name> --tasks vstar_bench_relative_position --output_path ./results

Configuration

The benchmark uses the following configuration (an illustrative mapping to generation arguments is sketched after the list):

  • Generation Settings:

    • max_new_tokens: 16
    • temperature: 0
    • top_p: 1.0
    • num_beams: 1
    • do_sample: false
  • Prompt Template:

    • Post-prompt: "\nAnswer with the option's letter from the given choices directly."
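
These settings amount to deterministic, greedy decoding with a short output budget, since the expected answer is a single option letter. As an illustration only, and not the exact code in this PR, they map onto standard Hugging Face generation keyword arguments roughly as follows:

# Illustrative mapping of the settings above onto Hugging Face generation kwargs;
# the exact dictionary assembled inside lmms-eval may differ.
gen_kwargs = {
    "max_new_tokens": 16,  # the expected answer is a single option letter
    "temperature": 0,      # combined with do_sample=False this is greedy decoding
    "top_p": 1.0,
    "num_beams": 1,
    "do_sample": False,
}
# output_ids = model.generate(**inputs, **gen_kwargs)  # assuming a transformers model and prepared inputs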

Implementation Details

Answer Extraction

The evaluation system extracts answer letters (A, B, C, or D) from model responses using multiple patterns to handle various response formats (a sketch of such extraction follows the list):

  • Direct letter: "A"
  • With punctuation: "A.", "A)", "(A)"
  • Full answer format: "Answer: A", "The answer is A"
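
A minimal sketch of such pattern-based extraction is shown below; the function name and regular expressions are illustrative assumptions, and the actual logic in utils.py may differ:

import re

# Hypothetical sketch of letter extraction covering the formats listed above;
# the real implementation in utils.py may use different patterns.
def extract_answer_letter(response: str) -> str | None:
    text = response.strip()
    patterns = [
        r"^\(?([A-D])\)?[\.\)]?\s*$",                 # "A", "A.", "A)", "(A)"
        r"[Aa]nswer\s*(?:is)?\s*:?\s*\(?([A-D])\)?",  # "Answer: A", "The answer is A"
        r"\b([A-D])\b",                               # fallback: first standalone capital letter
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None

Under this sketch, extract_answer_letter("The answer is B") returns "B", while a response with no recognizable option letter yields None.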

Aggregation

Results are aggregated both by category and overall, providing detailed performance metrics for different aspects of visual understanding.
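
As an illustration of this aggregation step (the per-sample result format here is an assumption, and the actual functions in utils.py may differ):

from collections import defaultdict

# Illustrative aggregation over per-sample results; each result is assumed to carry
# a 'category' field (direct_attributes or relative_position) and a boolean 'correct' flag.
def aggregate_accuracy(results: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for item in results:
        totals[item["category"]] += 1
        correct[item["category"]] += int(item["correct"])
    scores = {category: correct[category] / totals[category] for category in totals}
    scores["overall"] = sum(correct.values()) / sum(totals.values())
    return scores

With the 191 samples split across the two categories, the overall score computed this way is the sample-weighted mean of the category accuracies.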

File Structure

vstar_bench/
├── __init__.py
├── README.md
├── _default_template_yaml         # Base configuration
├── vstar_bench.yaml              # Main task configuration
├── vstar_bench_direct_attributes.yaml
├── vstar_bench_relative_position.yaml
└── utils.py                      # Processing and evaluation functions


Luodian added 4 commits May 24, 2025 05:16
- Updated Python version requirement in pyproject.toml to >=3.12.
- Removed specific version constraint for protobuf in dependencies.
- Added 'uv.lock' to .gitignore.
- Modified example script to change model task from 'mmmu_pro' to 'mme' and updated comments for clarity.
- Changed `use_flash_attention_2` parameter to `attn_implementation` for better flexibility in attention methods.
- Added validation for `attn_implementation` to ensure only valid options are accepted.
- Updated model loading to dynamically include attention implementation in arguments.
- Introduced new V* benchmark task files, including default template, utility functions, and specific task configurations for direct attributes and relative position metrics.
@Luodian requested review from Copilot and kcz358 May 26, 2025 05:33
@Luodian (Contributor, Author) commented May 26, 2025

[Screenshot attached: PixPin_2025-05-26_13-33-19]

Copilot AI left a comment

Pull Request Overview

This pull request refactors and adds configuration for the V*-Bench benchmark while updating some project dependencies and model loading parameters.

  • Updated Python version requirement and dependency specifications in pyproject.toml
  • Added benchmark YAML configuration files and supporting evaluation functions in utils.py
  • Modified model initialization and generation settings in the Qwen2.5-VL implementation and updated the example launch script

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Summary per file:

  • pyproject.toml: Upgraded Python requirement and removed the version pin for protobuf
  • lmms_eval/tasks/vstar_bench/*.yaml: Added YAML configuration files for the V*-Bench sub-tasks
  • lmms_eval/tasks/vstar_bench/utils.py: Introduced processing, answer extraction, and result aggregation functions
  • lmms_eval/models/qwen2_5_vl.py: Updated model loading, attention implementation validation, and generation logic
  • examples/models/qwen25vl.sh: Adjusted example parameters for model arguments

Comments suppressed due to low confidence (3)

pyproject.toml:21

  • Upgrading the Python requirement to 3.12 may limit compatibility for users on older versions. Please confirm that the upgrade is intentional and that the project has no dependency on earlier Python versions.
requires-python = ">=3.12"

lmms_eval/models/qwen2_5_vl.py:308

  • Setting temperature and top_p to None when do_sample is false may cause issues if the underlying generate function expects numerical values. Please verify that None is accepted for these parameters in the API.
if current_gen_kwargs["temperature"] > 0:

examples/models/qwen25vl.sh:16

  • The change of the interleave_visuals flag from True to False is significant; please ensure that the documentation and any relevant comments are updated to reflect this new behavior.
--model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=12845056,use_flash_attention_2=True,interleave_visuals=False \

"eva-decord; platform_system == 'Darwin'",
"zss",
"protobuf==3.20",
"protobuf",

Copilot AI May 26, 2025

Removing the explicit version pin for protobuf could lead to unintended issues if newer releases introduce breaking changes. Consider pinning a known compatible version to ensure project stability.

Suggested change:
- "protobuf",
+ "protobuf==4.24.3",

- Introduced the VLMs Are Blind benchmark to evaluate visual reasoning capabilities of Vision-Language Models through path-counting tasks in subway diagrams.
- Added task configuration files, utility functions, and a comprehensive README for task instructions and dataset details.
- Implemented both standard and lite versions of the benchmark for varied evaluation speeds.
@Luodian merged commit cd1d194 into main May 26, 2025
2 checks passed