[Feat] fix tasks and vllm to reproduce better results. #774
Conversation
…ng steps during evaluation
…reamline configuration
…rom evaluation process
Walkthrough

This update refactors model and evaluation logic, improves configuration management, and enhances robustness across code and YAML configurations. Key changes include parameter renaming and environment variable usage, migration to the official OpenAI SDK, improved error handling, and significant simplification of generation parameter configurations in multiple YAML files.
Sequence Diagram(s)

sequenceDiagram
participant User
participant MathVerseEvaluator
participant OpenAI/Azure SDK
User->>MathVerseEvaluator: get_chat_response(prompt, ...)
MathVerseEvaluator->>OpenAI/Azure SDK: chat.completions.create(...)
OpenAI/Azure SDK-->>MathVerseEvaluator: response (choices[0].message.content)
MathVerseEvaluator-->>User: parsed response
sequenceDiagram
participant User
participant VLLM_Model
User->>VLLM_Model: __init__(model=..., tensor_parallel_size=..., data_parallel_size=...)
VLLM_Model->>VLLM_Model: Configure via env vars, check parallel constraints
VLLM_Model->>VLLM_Model: Initialize client after accelerator setup
User->>VLLM_Model: generate_until(...)
VLLM_Model->>VLLM_Model: Build messages (images first, then text)
VLLM_Model->>User: Generation result
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
| {"role": "user", "content": prompt}, | ||
| ] | ||
| payload = {"model": self.gpt_model, "messages": messages, "temperature": temperature, "max_tokens": max_tokens, "n": n} | ||
| payload = {"model": self.gpt_model, "messages": messages, "temperature": temperature, "max_tokens": max_tokens} |
Bug: API Type Mismatch Causes Undefined Attribute
The gpt_model attribute is no longer set in the __init__ method and is only defined at the class level for the Azure API type. Consequently, when using the OpenAI API, self.gpt_model is undefined, leading to an AttributeError when accessed in the get_chat_response method.
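For illustration, a minimal sketch of one way to guard against this in the evaluator's constructor (the constructor signature and the MODEL_VERSION fallback are assumptions, not taken from the PR):

```python
import os


class MathVerseEvaluator:
    def __init__(self, quick_extract=False):
        self.quick_extract = quick_extract
        # Reuse a class-level gpt_model if the Azure branch defined one;
        # otherwise fall back to an environment variable so the OpenAI
        # branch never hits an AttributeError.
        self.gpt_model = getattr(self, "gpt_model", None) or os.getenv("MODEL_VERSION", "gpt-4o-2024-11-20")
```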
Actionable comments posted: 2
♻️ Duplicate comments (1)
lmms_eval/tasks/mathverse/mathverse_evals.py (1)
85-92: Bug: self.gpt_model is undefined for the OpenAI API type.

When using the OpenAI API type, self.gpt_model is not defined, which will cause an AttributeError when accessed in the get_chat_response method at line 101. The gpt_model attribute is only set at the class level for the Azure API type.

Apply this fix to define gpt_model for both API types:

  if API_TYPE == "openai":
      API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
      API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
      headers = {
          "Authorization": f"Bearer {API_KEY}",
          "Content-Type": "application/json",
      }
      client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+     gpt_model = os.getenv("MODEL_VERSION", "gpt-4o-2024-11-20")
🧹 Nitpick comments (7)
lmms_eval/tasks/mathvista/mathvista_testmini_cot.yaml (1)
11-13: Same 16 384-token cap – replicate the capacity check

Replicating the concern from the other YAMLs: ensure the models used for the cot split accept such long completions; otherwise lower the value or add explicit early-stop criteria.
lmms_eval/tasks/mathvista/mathvista_testmini_format.yaml (1)
10-12: High max_new_tokens without explicit do_sample: false

With sampling keys removed, vLLM defaults to do_sample=False, but being explicit avoids future surprises if defaults change.

  generation_kwargs:
+     do_sample: false
      max_new_tokens: 16384

lmms_eval/tasks/mathvision/mathvision_testmini.yaml (1)
10-12: Large generation window – revisit for mini split

The testmini subset is meant for quick sanity checks; a 16 k token ceiling defeats the purpose. Recommend keeping the shorter limit (e.g. 1 024) here to keep CI runs lightweight.

lmms_eval/tasks/mathverse/utils.py (1)
82-88: File saving removal streamlines the evaluation workflow.

Removing the file saving operations (raw results, evaluated results dictionary, and scores JSON files) streamlines the evaluation process and reduces potential I/O bottlenecks. However, consider the impact on debugging and result inspection capabilities.
If result persistence is needed for debugging or analysis, consider adding an optional parameter to control file saving:
  def mathverse_aggregate_results_eval(results, args, *, calculate_gain=False, random_scores=None, save_results=False):
      split_flag = results[0]["metadata"]["split"]
      problem_version = results[0]["problem_version"].lower().replace(" ", "_")
      results_dict, scores = mathverse_evaluator.eval_results(results, config)
      if save_results:
          # Optional file saving logic here
          pass
      if scores["average"]["accuracy"] == 0:
          return None
      return scores["average"]["accuracy"]

lmms_eval/models/llava_onevision.py (2)
543-543: Fix code style issue.

Static analysis correctly identifies a minor style issue.

- if "image_aspect_ratio" in gen_kwargs.keys():
+ if "image_aspect_ratio" in gen_kwargs:
735-741: Consider consistency with the other generate method.

The parameter handling here differs slightly from the generate_until method. In generate_until, sampling parameters are only removed when do_sample=False, but here they're explicitly set when do_sample=True and then potentially removed later.

For consistency, consider using the same pattern as generate_until, where sampling parameters are conditionally handled:

- # Only set temperature and top_p if do_sample is True
- if gen_kwargs.get("do_sample", False):
-     if "temperature" not in gen_kwargs:
-         gen_kwargs["temperature"] = 1.0  # Default temperature for sampling
-     if "top_p" not in gen_kwargs:
-         gen_kwargs["top_p"] = 1.0  # Default top_p for sampling
+ # When do_sample=False, remove sampling-related parameters to avoid warnings
+ if not gen_kwargs.get("do_sample", False):
+     gen_kwargs.pop("temperature", None)
+     gen_kwargs.pop("top_p", None)
+     gen_kwargs.pop("top_k", None)

lmms_eval/tasks/mia_bench/utils.py (1)
153-257: Excellent robustification of the score parsing logic.

The rewritten process_rawscore function significantly improves error handling and robustness with:
- Comprehensive input validation
- Well-designed regex patterns for flexible score extraction
- Score clamping to ensure valid ranges
- Detailed logging for debugging
- Graceful fallback mechanisms
Two minor improvements based on static analysis:
  # Line 226: Rename unused loop variable
- for i, component in enumerate(component_type):
+ for component in component_type:

  # Line 234: Remove unnecessary .keys()
- score_dict["total_score"] = sum(v for k, v in score_dict.items() if k != "total_score") / len([k for k in score_dict.keys() if k != "total_score"])
+ score_dict["total_score"] = sum(v for k, v in score_dict.items() if k != "total_score") / len([k for k in score_dict if k != "total_score"])
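For context, a minimal sketch of the kind of regex-based extraction with clamping described above (the pattern, component names, and score range are illustrative assumptions, not the PR's exact implementation):

```python
import re


def extract_scores(text, components, max_score=10):
    """Pull 'name: score' pairs out of free-form judge output, clamping to a valid range."""
    scores = {}
    for name in components:
        # Tolerate markdown, asterisks, and varying separators around the score.
        match = re.search(rf"{re.escape(name)}\W*?(\d+(?:\.\d+)?)", text, flags=re.IGNORECASE)
        value = float(match.group(1)) if match else 0.0
        scores[name] = min(max(value, 0.0), max_score)  # clamp into [0, max_score]
    return scores


# Example usage
print(extract_scores("**Accuracy**: 8\nRelevance - 11", ["Accuracy", "Relevance"]))
# {'Accuracy': 8.0, 'Relevance': 10.0}
```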
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- examples/models/vllm_qwen2vl.sh (1 hunks)
- lmms_eval/models/llava_onevision.py (3 hunks)
- lmms_eval/models/vllm.py (9 hunks)
- lmms_eval/tasks/k12/k12.yaml (0 hunks)
- lmms_eval/tasks/mathverse/mathverse_evals.py (2 hunks)
- lmms_eval/tasks/mathverse/mathverse_testmini.yaml (1 hunks)
- lmms_eval/tasks/mathverse/utils.py (1 hunks)
- lmms_eval/tasks/mathvision/mathvision_reason_test.yaml (0 hunks)
- lmms_eval/tasks/mathvision/mathvision_reason_testmini.yaml (0 hunks)
- lmms_eval/tasks/mathvision/mathvision_test.yaml (1 hunks)
- lmms_eval/tasks/mathvision/mathvision_testmini.yaml (1 hunks)
- lmms_eval/tasks/mathvista/mathvista_test.yaml (1 hunks)
- lmms_eval/tasks/mathvista/mathvista_testmini_cot.yaml (1 hunks)
- lmms_eval/tasks/mathvista/mathvista_testmini_format.yaml (1 hunks)
- lmms_eval/tasks/mathvista/mathvista_testmini_solution.yaml (1 hunks)
- lmms_eval/tasks/mathvista/utils.py (0 hunks)
- lmms_eval/tasks/mia_bench/utils.py (2 hunks)
- lmms_eval/tasks/mmmu/mmmu_val_thinking.yaml (0 hunks)
💤 Files with no reviewable changes (5)
- lmms_eval/tasks/mathvision/mathvision_reason_testmini.yaml
- lmms_eval/tasks/mathvista/utils.py
- lmms_eval/tasks/mathvision/mathvision_reason_test.yaml
- lmms_eval/tasks/k12/k12.yaml
- lmms_eval/tasks/mmmu/mmmu_val_thinking.yaml
🧰 Additional context used
🧬 Code Graph Analysis (3)
lmms_eval/tasks/mathverse/mathverse_evals.py (1)
lmms_eval/tasks/mathvista/mathvista_evals.py (1)
get_chat_response(183-218)
lmms_eval/tasks/mathverse/utils.py (1)
lmms_eval/tasks/mathverse/mathverse_evals.py (1)
MathVerseEvaluator(76-300)
lmms_eval/models/vllm.py (1)
lmms_eval/models/qwen2_5_vl.py (4)
model (139-144)
device (159-160)
batch_size (155-156)
flatten (173-178)
🪛 Ruff (0.12.2)
lmms_eval/tasks/mathverse/mathverse_evals.py
85-85: Using .strip() with multi-character strings is misleading
(B005)
lmms_eval/tasks/mia_bench/utils.py
226-226: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
234-234: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
lmms_eval/models/llava_onevision.py
543-543: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (18)
lmms_eval/tasks/mathvision/mathvision_test.yaml (1)
10-12: Token limit raised to 16 384 → verify model/context capacity & evaluation cost
max_new_tokens jumps from 1 024 to 16 384. Many open-weights models (e.g. Llama 2/3, Phi-3, Gemma, most vision-LLMs) top out at 8 192 or less. vLLM aborts or silently truncates when the request exceeds the model's context window, and such a long limit can explode GPU memory and evaluation time if the stop sequence is not hit.

Please double-check:

- The models targeted by this task actually support ≥ 16 384 tokens.
- The evaluation harness enforces earlier stopping (e.g. via generate_until) so we don't pay for pathological generations.
- Runtime / cost budgets remain acceptable.

If the larger limit is only a safeguard, consider a safer cap (e.g. 8 192) or dynamically deriving it from model.config.max_position_embeddings.
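As a sketch of the dynamic derivation suggested above (attribute names vary by architecture; this is an assumption-laden illustration, not project code):

```python
def safe_max_new_tokens(model, prompt_len, requested=16384, fallback=8192):
    """Cap generation length by what the model's context window can actually hold."""
    config = getattr(model, "config", None)
    context_window = getattr(config, "max_position_embeddings", fallback)
    # Leave room for the prompt itself; never return a non-positive budget.
    return max(1, min(requested, context_window - prompt_len))
```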
lmms_eval/tasks/mathvista/mathvista_test.yaml (1)

10-12: Potential OOM / latency spike due to 16 k max_new_tokens

Full-set MathVista evaluation is already compute-heavy. Unless your models demonstrably need >1 k output tokens, the new ceiling will:
- Increase evaluation latency linearly.
- Risk CUDA OOM on 24–48 GB GPUs.
Consider profiling with 1 k vs 16 k limits before merging.
lmms_eval/tasks/mathverse/mathverse_testmini.yaml (2)
23-23: LGTM: Configuration simplification improves reproducibility.

The simplification of generation parameters (removing stochastic sampling controls as mentioned in the AI summary) aligns well with the PR objective of improving result reproducibility. Deterministic generation settings will produce more consistent results across runs.

30-30: Appropriate model-specific configuration.

Adding image_aspect_ratio: original to the llava model-specific generation kwargs is appropriate for this visual reasoning task, ensuring images are processed with their original aspect ratios.

lmms_eval/tasks/mathvista/mathvista_testmini_solution.yaml (1)
11-11: Significant increase in token limit is appropriate for solution tasks.

Increasing max_new_tokens from 1024 to 16384 is well-suited for the "solution" shot type, as mathematical solutions often require detailed step-by-step explanations that can be quite lengthy. This change, combined with the removal of sampling parameters (as noted in the AI summary), supports the PR objective of improving reproducibility while ensuring adequate capacity for comprehensive solutions.

examples/models/vllm_qwen2vl.sh (1)
13-13: Parameter rename improves consistency.

The change from model_version to model in the model arguments is more intuitive and aligns with the refactoring mentioned in the AI summary. This standardization improves the API consistency across the codebase.

lmms_eval/tasks/mathverse/utils.py (1)
22-22: Evaluator initialization simplified appropriately.

The simplified initialization of mathverse_evaluator aligns with the migration to the official OpenAI SDK. Based on the relevant code snippets, the MathVerseEvaluator class now handles API client initialization internally using environment variables, which is a cleaner approach.

lmms_eval/models/llava_onevision.py (2)
539-551: Improved generation parameter handling prevents warnings.

The refined approach to handling generation kwargs is well-designed:

- Unconditionally setting max_new_tokens ensures a reasonable default.
- Only removing sampling parameters when do_sample=False prevents unnecessary warnings.
- This aligns with the PR's goal of improving reproducibility by handling generation parameters more robustly.
764-767: Consistent parameter cleanup approach.

The parameter cleanup logic here matches the approach used in the generate_until method, which is good for consistency.

lmms_eval/tasks/mia_bench/utils.py (3)
47-47: Good use of environment variable for model version configuration.

The change to source GPT_EVAL_MODEL_NAME from the MODEL_VERSION environment variable with a sensible default improves configurability and aligns with the broader PR's standardization of model version references.
149-149: Excellent improvement to prompt clarity for consistent parsing.

The explicit instruction to output scores in an exact format without markdown or asterisks will significantly improve parsing reliability downstream. This change directly addresses potential parsing issues.
260-342: Excellent defensive programming implementation.

The rewritten mia_bench_process_results function demonstrates best practices in error handling:
- Comprehensive input validation
- Error handling at every potential failure point
- Consistent logging for debugging
- Graceful degradation with fallback zero scores
- Clear structure that aids maintainability
This robust implementation ensures the evaluation pipeline won't crash due to unexpected inputs or API failures.
lmms_eval/tasks/mathverse/mathverse_evals.py (1)
97-132: Good migration to OpenAI SDK with preserved error handling.

The refactored get_chat_response method successfully migrates to the official SDK while preserving important functionality:
- Retry logic with patience parameter
- Rate limit handling
- Automatic prompt size reduction on length errors
- Proper response extraction from the SDK response object
Note: This implementation will work correctly once the self.gpt_model AttributeError is fixed.
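A compressed sketch of the retry pattern described above, using the official SDK (the function name, patience count, and backoff interval are assumptions for illustration, not the PR's exact code):

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chat_with_retries(model, prompt, patience=5, sleep_s=10, max_tokens=1024):
    messages = [{"role": "user", "content": prompt}]
    while patience > 0:
        try:
            response = client.chat.completions.create(model=model, messages=messages, max_tokens=max_tokens)
            return response.choices[0].message.content.strip()
        except Exception as exc:  # rate limits, transient server errors, etc.
            patience -= 1
            if "maximum context length" in str(exc).lower():
                # Shrink the prompt and retry rather than resending it verbatim.
                prompt = prompt[: len(prompt) // 2]
                messages = [{"role": "user", "content": prompt}]
            time.sleep(sleep_s)
    return ""
```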
lmms_eval/models/vllm.py (5)

23-24: Good improvements to configurability and parameter naming.

The changes enhance the class design (a configuration sketch follows below):

- Environment variables NUM_SECONDS_TO_SLEEP and WORKERS provide runtime configurability.
- Renaming model_version to model is more intuitive and consistent with common ML frameworks.
- Addition of the data_parallel_size parameter enables data parallelism support.

Also applies to: 149-152
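A minimal sketch of the environment-variable pattern referred to above (the variable names match the review; the default values here are illustrative assumptions):

```python
import os

# Tunables can be overridden at runtime without touching the code.
NUM_SECONDS_TO_SLEEP = int(os.getenv("NUM_SECONDS_TO_SLEEP", "10"))
WORKERS = int(os.getenv("WORKERS", "8"))
```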
209-213: Clear documentation of parallelism constraints.

The code properly enforces and documents the current limitation that data parallelism cannot be used with tensor parallelism. The TODO comment indicates this is a known limitation that may be addressed in future versions.
214-222: Well-structured VLLM client initialization.

The client initialization is properly placed after accelerator setup and includes (see the sketch below):

- Correct use of the renamed model parameter.
- Fixed seed for reproducibility (aligns with PR objectives).
- Appropriate distributed executor backend configuration.
- Proper forwarding of additional kwargs.
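For orientation, a rough sketch of what such an initialization can look like with the vLLM Python API (keyword availability varies by vLLM version, and the model name and values shown are assumptions, not the PR's configuration):

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",   # the renamed `model` argument
    tensor_parallel_size=2,              # multi-GPU tensor parallelism
    seed=1234,                           # fixed seed for reproducible sampling
    distributed_executor_backend="mp",   # or "ray", depending on the setup
)
```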
307-310: Simplified generation parameters align with PR objectives.

The removal of default temperature and top_p parameters and keeping only max_tokens is consistent with the broader PR effort to simplify generation configurations across task YAML files.
333-336: Content ordering change: images before text.

The message content now adds images before text, which may affect model behavior. This could be intentional for better performance with vision-language models.
Was this ordering change intentional? Some vision-language models may be sensitive to the order of image and text inputs in the prompt.
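To make the ordering concrete, a hedged sketch of an OpenAI-style chat message where image parts precede the text part (the helper name and image reference are placeholders, not project code):

```python
def build_message(image_urls, prompt):
    # Images first, then the text prompt, mirroring the ordering discussed above.
    content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Example usage with a placeholder image reference
messages = build_message(["data:image/png;base64,<...>"], "Describe the figure.")
```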
  self.chat_template = chat_template
  self.min_image_pixels = min_image_pixels
  self.data_parallel_size = data_parallel_size
  # Qwen 2/2.5-VL models enforce minimum image dimensions
- self._enforce_image_resize = self._is_qwen_vl_model(model_version)
+ self._enforce_image_resize = self._is_qwen_vl_model(model)

  # Load chat template during initialization
  self.chat_template = None
Bug: Duplicate and conflicting chat_template initialization.
The self.chat_template is assigned the parameter value on line 166 but then immediately set to None on line 173, discarding the initial assignment. This appears to be an error.
Remove the redundant assignment:
  self.model = model
  self.max_frame_num = max_frame_num
- self.chat_template = chat_template
  self.min_image_pixels = min_image_pixels
  self.data_parallel_size = data_parallel_size
  # Qwen 2/2.5-VL models enforce minimum image dimensions
  self._enforce_image_resize = self._is_qwen_vl_model(model)

  # Load chat template during initialization
  self.chat_template = None

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

  self.model = model
  self.max_frame_num = max_frame_num
  self.min_image_pixels = min_image_pixels
  self.data_parallel_size = data_parallel_size
  # Qwen 2/2.5-VL models enforce minimum image dimensions
  self._enforce_image_resize = self._is_qwen_vl_model(model)
  # Load chat template during initialization
  self.chat_template = None
🤖 Prompt for AI Agents
In lmms_eval/models/vllm.py between lines 166 and 173, the attribute self.chat_template is first assigned the parameter value and then immediately overwritten to None, which discards the initial assignment. Remove the redundant assignment so that self.chat_template is initialized exactly once, matching the suggested fix above (drop the earlier parameter assignment and keep the deliberate None set before the chat template is loaded).
| "Authorization": f"Bearer {API_KEY}", | ||
| "Content-Type": "application/json", | ||
| } | ||
| client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions")) |
Incorrect use of .rstrip() with multi-character string.
The .rstrip("chat/completions") call removes any combination of the characters in "chat/completions" from the end, not the exact string. This could lead to unexpected behavior.
Use .removesuffix() for exact string removal (Python 3.9+):
- client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+ client = OpenAI(api_key=API_KEY, base_url=API_URL.removesuffix("/chat/completions"))

Or for older Python versions:

- client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+ base_url = API_URL[:-len("/chat/completions")] if API_URL.endswith("/chat/completions") else API_URL
+ client = OpenAI(api_key=API_KEY, base_url=base_url)

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

- client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+ client = OpenAI(api_key=API_KEY, base_url=API_URL.removesuffix("/chat/completions"))
🧰 Tools
🪛 Ruff (0.12.2)
85-85: Using .strip() with multi-character strings is misleading
(B005)
🤖 Prompt for AI Agents
In lmms_eval/tasks/mathverse/mathverse_evals.py at line 85, the use of
.rstrip("chat/completions") is incorrect because it removes any combination of
those characters from the end rather than the exact substring. Replace
.rstrip("chat/completions") with .removesuffix("chat/completions") if using
Python 3.9 or newer, or implement a conditional check to remove the exact suffix
for older Python versions to ensure only the intended substring is removed from
API_URL.
Before you open a pull request, please check whether a similar issue already exists or has been closed before.

When you open a pull request, please be sure to include the following:

If you encounter lint warnings, you can use the following scripts to reformat the code.

Thank you for your contributions!
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Chores