[Feat] fix tasks and vllm to reproduce better results. #774
Conversation
…ng steps during evaluation
…reamline configuration
…rom evaluation process
Walkthrough

This update refactors model and evaluation logic, improves configuration management, and enhances robustness across code and YAML configurations. Key changes include parameter renaming and environment variable usage, migration to the official OpenAI SDK, improved error handling, and significant simplification of generation parameter configurations in multiple YAML files.
Sequence Diagram(s)

sequenceDiagram
participant User
participant MathVerseEvaluator
participant OpenAI/Azure SDK
User->>MathVerseEvaluator: get_chat_response(prompt, ...)
MathVerseEvaluator->>OpenAI/Azure SDK: chat.completions.create(...)
OpenAI/Azure SDK-->>MathVerseEvaluator: response (choices[0].message.content)
MathVerseEvaluator-->>User: parsed response
sequenceDiagram
participant User
participant VLLM_Model
User->>VLLM_Model: __init__(model=..., tensor_parallel_size=..., data_parallel_size=...)
VLLM_Model->>VLLM_Model: Configure via env vars, check parallel constraints
VLLM_Model->>VLLM_Model: Initialize client after accelerator setup
User->>VLLM_Model: generate_until(...)
VLLM_Model->>VLLM_Model: Build messages (images first, then text)
VLLM_Model->>User: Generation result
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
| {"role": "user", "content": prompt}, | ||
| ] | ||
| payload = {"model": self.gpt_model, "messages": messages, "temperature": temperature, "max_tokens": max_tokens, "n": n} | ||
| payload = {"model": self.gpt_model, "messages": messages, "temperature": temperature, "max_tokens": max_tokens} |
Bug: API Type Mismatch Causes Undefined Attribute
The gpt_model attribute is no longer set in the __init__ method and is only defined at the class level for the Azure API type. Consequently, when using the OpenAI API, self.gpt_model is undefined, leading to an AttributeError when accessed in the get_chat_response method.
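For illustration, a minimal sketch of one way to guard against this in the evaluator's constructor (the constructor signature and the MODEL_VERSION fallback are assumptions, not taken from the PR):

```python
import os


class MathVerseEvaluator:
    def __init__(self, quick_extract=False):
        self.quick_extract = quick_extract
        # Reuse a class-level gpt_model if the Azure branch defined one;
        # otherwise fall back to an environment variable so the OpenAI
        # branch never hits an AttributeError.
        self.gpt_model = getattr(self, "gpt_model", None) or os.getenv("MODEL_VERSION", "gpt-4o-2024-11-20")
```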
Actionable comments posted: 2
♻️ Duplicate comments (1)
lmms_eval/tasks/mathverse/mathverse_evals.py (1)
85-92: Bug: self.gpt_model is undefined for the OpenAI API type.

When using the OpenAI API type, self.gpt_model is not defined, which will cause an AttributeError when accessed in the get_chat_response method at line 101. The gpt_model attribute is only set at the class level for the Azure API type.

Apply this fix to define gpt_model for both API types:

  if API_TYPE == "openai":
      API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
      API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
      headers = {
          "Authorization": f"Bearer {API_KEY}",
          "Content-Type": "application/json",
      }
      client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+     gpt_model = os.getenv("MODEL_VERSION", "gpt-4o-2024-11-20")
🧹 Nitpick comments (7)
lmms_eval/tasks/mathvista/mathvista_testmini_cot.yaml (1)
11-13: Same 16 384-token cap – replicate the capacity check

Replicating the concern from the other YAMLs: ensure the models used for the cot split accept such long completions; otherwise lower the value or add explicit early-stop criteria.
lmms_eval/tasks/mathvista/mathvista_testmini_format.yaml (1)
10-12: High max_new_tokens without explicit do_sample: false

With sampling keys removed, vLLM defaults to do_sample=False, but being explicit avoids future surprises if defaults change.

  generation_kwargs:
+     do_sample: false
      max_new_tokens: 16384

lmms_eval/tasks/mathvision/mathvision_testmini.yaml (1)
10-12: Large generation window – revisit for mini split

The testmini subset is meant for quick sanity checks; a 16 k token ceiling defeats the purpose. Recommend keeping the shorter limit (e.g. 1 024) here to keep CI runs lightweight.

lmms_eval/tasks/mathverse/utils.py (1)
82-88: File saving removal streamlines the evaluation workflow.

Removing the file saving operations (raw results, evaluated results dictionary, and scores JSON files) streamlines the evaluation process and reduces potential I/O bottlenecks. However, consider the impact on debugging and result inspection capabilities.
If result persistence is needed for debugging or analysis, consider adding an optional parameter to control file saving:
  def mathverse_aggregate_results_eval(results, args, *, calculate_gain=False, random_scores=None, save_results=False):
      split_flag = results[0]["metadata"]["split"]
      problem_version = results[0]["problem_version"].lower().replace(" ", "_")
      results_dict, scores = mathverse_evaluator.eval_results(results, config)
      if save_results:
          # Optional file saving logic here
          pass
      if scores["average"]["accuracy"] == 0:
          return None
      return scores["average"]["accuracy"]

lmms_eval/models/llava_onevision.py (2)
543-543: Fix code style issue.

Static analysis correctly identifies a minor style issue.

- if "image_aspect_ratio" in gen_kwargs.keys():
+ if "image_aspect_ratio" in gen_kwargs:
735-741: Consider consistency with the other generate method.

The parameter handling here differs slightly from the generate_until method. In generate_until, sampling parameters are only removed when do_sample=False, but here they're explicitly set when do_sample=True and then potentially removed later.

For consistency, consider using the same pattern as generate_until, where sampling parameters are conditionally handled:

- # Only set temperature and top_p if do_sample is True
- if gen_kwargs.get("do_sample", False):
-     if "temperature" not in gen_kwargs:
-         gen_kwargs["temperature"] = 1.0  # Default temperature for sampling
-     if "top_p" not in gen_kwargs:
-         gen_kwargs["top_p"] = 1.0  # Default top_p for sampling
+ # When do_sample=False, remove sampling-related parameters to avoid warnings
+ if not gen_kwargs.get("do_sample", False):
+     gen_kwargs.pop("temperature", None)
+     gen_kwargs.pop("top_p", None)
+     gen_kwargs.pop("top_k", None)

lmms_eval/tasks/mia_bench/utils.py (1)
153-257: Excellent robustification of the score parsing logic.

The rewritten process_rawscore function significantly improves error handling and robustness with:
- Comprehensive input validation
- Well-designed regex patterns for flexible score extraction
- Score clamping to ensure valid ranges
- Detailed logging for debugging
- Graceful fallback mechanisms
Two minor improvements based on static analysis:
  # Line 226: Rename unused loop variable
- for i, component in enumerate(component_type):
+ for component in component_type:

  # Line 234: Remove unnecessary .keys()
- score_dict["total_score"] = sum(v for k, v in score_dict.items() if k != "total_score") / len([k for k in score_dict.keys() if k != "total_score"])
+ score_dict["total_score"] = sum(v for k, v in score_dict.items() if k != "total_score") / len([k for k in score_dict if k != "total_score"])
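For context, a minimal sketch of the kind of regex-based extraction with clamping described above (the pattern, component names, and score range are illustrative assumptions, not the PR's exact implementation):

```python
import re


def extract_scores(text, components, max_score=10):
    """Pull 'name: score' pairs out of free-form judge output, clamping to a valid range."""
    scores = {}
    for name in components:
        # Tolerate markdown, asterisks, and varying separators around the score.
        match = re.search(rf"{re.escape(name)}\W*?(\d+(?:\.\d+)?)", text, flags=re.IGNORECASE)
        value = float(match.group(1)) if match else 0.0
        scores[name] = min(max(value, 0.0), max_score)  # clamp into [0, max_score]
    return scores


# Example usage
print(extract_scores("**Accuracy**: 8\nRelevance - 11", ["Accuracy", "Relevance"]))
# {'Accuracy': 8.0, 'Relevance': 10.0}
```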
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- examples/models/vllm_qwen2vl.sh (1 hunks)
- lmms_eval/models/llava_onevision.py (3 hunks)
- lmms_eval/models/vllm.py (9 hunks)
- lmms_eval/tasks/k12/k12.yaml (0 hunks)
- lmms_eval/tasks/mathverse/mathverse_evals.py (2 hunks)
- lmms_eval/tasks/mathverse/mathverse_testmini.yaml (1 hunks)
- lmms_eval/tasks/mathverse/utils.py (1 hunks)
- lmms_eval/tasks/mathvision/mathvision_reason_test.yaml (0 hunks)
- lmms_eval/tasks/mathvision/mathvision_reason_testmini.yaml (0 hunks)
- lmms_eval/tasks/mathvision/mathvision_test.yaml (1 hunks)
- lmms_eval/tasks/mathvision/mathvision_testmini.yaml (1 hunks)
- lmms_eval/tasks/mathvista/mathvista_test.yaml (1 hunks)
- lmms_eval/tasks/mathvista/mathvista_testmini_cot.yaml (1 hunks)
- lmms_eval/tasks/mathvista/mathvista_testmini_format.yaml (1 hunks)
- lmms_eval/tasks/mathvista/mathvista_testmini_solution.yaml (1 hunks)
- lmms_eval/tasks/mathvista/utils.py (0 hunks)
- lmms_eval/tasks/mia_bench/utils.py (2 hunks)
- lmms_eval/tasks/mmmu/mmmu_val_thinking.yaml (0 hunks)
💤 Files with no reviewable changes (5)
- lmms_eval/tasks/mathvision/mathvision_reason_testmini.yaml
- lmms_eval/tasks/mathvista/utils.py
- lmms_eval/tasks/mathvision/mathvision_reason_test.yaml
- lmms_eval/tasks/k12/k12.yaml
- lmms_eval/tasks/mmmu/mmmu_val_thinking.yaml
🧰 Additional context used
🧬 Code Graph Analysis (3)
lmms_eval/tasks/mathverse/mathverse_evals.py (1)
lmms_eval/tasks/mathvista/mathvista_evals.py (1)
get_chat_response(183-218)
lmms_eval/tasks/mathverse/utils.py (1)
lmms_eval/tasks/mathverse/mathverse_evals.py (1)
MathVerseEvaluator(76-300)
lmms_eval/models/vllm.py (1)
lmms_eval/models/qwen2_5_vl.py (4)
model (139-144)
device (159-160)
batch_size (155-156)
flatten (173-178)
🪛 Ruff (0.12.2)
lmms_eval/tasks/mathverse/mathverse_evals.py
85-85: Using .strip() with multi-character strings is misleading
(B005)
lmms_eval/tasks/mia_bench/utils.py
226-226: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
234-234: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
lmms_eval/models/llava_onevision.py
543-543: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (18)
lmms_eval/tasks/mathvision/mathvision_test.yaml (1)
10-12: Token limit raised to 16 384 → verify model/context capacity & evaluation cost
max_new_tokens jumps from 1 024 to 16 384. Many open-weights models (e.g. Llama 2/3, Phi-3, Gemma, most vision-LLMs) top out at 8 192 or less. vLLM aborts or silently truncates when the request exceeds the model's context window, and such a long limit can explode GPU memory and evaluation time if the stop sequence is not hit.

Please double-check:

- The models targeted by this task actually support ≥ 16 384 tokens.
- The evaluation harness enforces earlier stopping (e.g. via generate_until) so we don't pay for pathological generations.
- Runtime / cost budgets remain acceptable.

If the larger limit is only a safeguard, consider a safer cap (e.g. 8 192) or dynamically deriving it from model.config.max_position_embeddings.
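As a sketch of the dynamic derivation suggested above (attribute names vary by architecture; this is an assumption-laden illustration, not project code):

```python
def safe_max_new_tokens(model, prompt_len, requested=16384, fallback=8192):
    """Cap generation length by what the model's context window can actually hold."""
    config = getattr(model, "config", None)
    context_window = getattr(config, "max_position_embeddings", fallback)
    # Leave room for the prompt itself; never return a non-positive budget.
    return max(1, min(requested, context_window - prompt_len))
```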
lmms_eval/tasks/mathvista/mathvista_test.yaml (1)

10-12: Potential OOM / latency spike due to 16 k max_new_tokens

Full-set MathVista evaluation is already compute-heavy. Unless your models demonstrably need >1 k output tokens, the new ceiling will:
- Increase evaluation latency linearly.
- Risk CUDA OOM on 24–48 GB GPUs.
Consider profiling with 1 k vs 16 k limits before merging.
lmms_eval/tasks/mathverse/mathverse_testmini.yaml (2)
23-23: LGTM: Configuration simplification improves reproducibility.

The simplification of generation parameters (removing stochastic sampling controls as mentioned in the AI summary) aligns well with the PR objective of improving result reproducibility. Deterministic generation settings will produce more consistent results across runs.

30-30: Appropriate model-specific configuration.

Adding image_aspect_ratio: original to the llava model-specific generation kwargs is appropriate for this visual reasoning task, ensuring images are processed with their original aspect ratios.

lmms_eval/tasks/mathvista/mathvista_testmini_solution.yaml (1)
11-11: Significant increase in token limit is appropriate for solution tasks.

Increasing max_new_tokens from 1024 to 16384 is well-suited for the "solution" shot type, as mathematical solutions often require detailed step-by-step explanations that can be quite lengthy. This change, combined with the removal of sampling parameters (as noted in the AI summary), supports the PR objective of improving reproducibility while ensuring adequate capacity for comprehensive solutions.

examples/models/vllm_qwen2vl.sh (1)
13-13: Parameter rename improves consistency.

The change from model_version to model in the model arguments is more intuitive and aligns with the refactoring mentioned in the AI summary. This standardization improves the API consistency across the codebase.

lmms_eval/tasks/mathverse/utils.py (1)
22-22: Evaluator initialization simplified appropriately.

The simplified initialization of mathverse_evaluator aligns with the migration to the official OpenAI SDK. Based on the relevant code snippets, the MathVerseEvaluator class now handles API client initialization internally using environment variables, which is a cleaner approach.

lmms_eval/models/llava_onevision.py (2)
539-551: Improved generation parameter handling prevents warnings.

The refined approach to handling generation kwargs is well-designed:

- Unconditionally setting max_new_tokens ensures a reasonable default.
- Only removing sampling parameters when do_sample=False prevents unnecessary warnings.
- This aligns with the PR's goal of improving reproducibility by handling generation parameters more robustly.
764-767: Consistent parameter cleanup approach.

The parameter cleanup logic here matches the approach used in the generate_until method, which is good for consistency.

lmms_eval/tasks/mia_bench/utils.py (3)
47-47: Good use of environment variable for model version configuration.

The change to source GPT_EVAL_MODEL_NAME from the MODEL_VERSION environment variable with a sensible default improves configurability and aligns with the broader PR's standardization of model version references.
149-149: Excellent improvement to prompt clarity for consistent parsing.

The explicit instruction to output scores in an exact format without markdown or asterisks will significantly improve parsing reliability downstream. This change directly addresses potential parsing issues.
260-342: Excellent defensive programming implementation.

The rewritten mia_bench_process_results function demonstrates best practices in error handling:
- Comprehensive input validation
- Error handling at every potential failure point
- Consistent logging for debugging
- Graceful degradation with fallback zero scores
- Clear structure that aids maintainability
This robust implementation ensures the evaluation pipeline won't crash due to unexpected inputs or API failures.
lmms_eval/tasks/mathverse/mathverse_evals.py (1)
97-132: Good migration to OpenAI SDK with preserved error handling.

The refactored get_chat_response method successfully migrates to the official SDK while preserving important functionality:
- Retry logic with patience parameter
- Rate limit handling
- Automatic prompt size reduction on length errors
- Proper response extraction from the SDK response object
Note: This implementation will work correctly once the self.gpt_model AttributeError is fixed.
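A compressed sketch of the retry pattern described above, using the official SDK (the function name, patience count, and backoff interval are assumptions for illustration, not the PR's exact code):

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chat_with_retries(model, prompt, patience=5, sleep_s=10, max_tokens=1024):
    messages = [{"role": "user", "content": prompt}]
    while patience > 0:
        try:
            response = client.chat.completions.create(model=model, messages=messages, max_tokens=max_tokens)
            return response.choices[0].message.content.strip()
        except Exception as exc:  # rate limits, transient server errors, etc.
            patience -= 1
            if "maximum context length" in str(exc).lower():
                # Shrink the prompt and retry rather than resending it verbatim.
                prompt = prompt[: len(prompt) // 2]
                messages = [{"role": "user", "content": prompt}]
            time.sleep(sleep_s)
    return ""
```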
lmms_eval/models/vllm.py (5)

23-24: Good improvements to configurability and parameter naming.

The changes enhance the class design (a configuration sketch follows below):

- Environment variables NUM_SECONDS_TO_SLEEP and WORKERS provide runtime configurability.
- Renaming model_version to model is more intuitive and consistent with common ML frameworks.
- Addition of the data_parallel_size parameter enables data parallelism support.

Also applies to: 149-152
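A minimal sketch of the environment-variable pattern referred to above (the variable names match the review; the default values here are illustrative assumptions):

```python
import os

# Tunables can be overridden at runtime without touching the code.
NUM_SECONDS_TO_SLEEP = int(os.getenv("NUM_SECONDS_TO_SLEEP", "10"))
WORKERS = int(os.getenv("WORKERS", "8"))
```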
209-213: Clear documentation of parallelism constraints.

The code properly enforces and documents the current limitation that data parallelism cannot be used with tensor parallelism. The TODO comment indicates this is a known limitation that may be addressed in future versions.
214-222: Well-structured VLLM client initialization.

The client initialization is properly placed after accelerator setup and includes (see the sketch below):

- Correct use of the renamed model parameter.
- Fixed seed for reproducibility (aligns with PR objectives).
- Appropriate distributed executor backend configuration.
- Proper forwarding of additional kwargs.
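For orientation, a rough sketch of what such an initialization can look like with the vLLM Python API (keyword availability varies by vLLM version, and the model name and values shown are assumptions, not the PR's configuration):

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",   # the renamed `model` argument
    tensor_parallel_size=2,              # multi-GPU tensor parallelism
    seed=1234,                           # fixed seed for reproducible sampling
    distributed_executor_backend="mp",   # or "ray", depending on the setup
)
```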
307-310: Simplified generation parameters align with PR objectives.

The removal of default temperature and top_p parameters and keeping only max_tokens is consistent with the broader PR effort to simplify generation configurations across task YAML files.
333-336: Content ordering change: images before text.

The message content now adds images before text, which may affect model behavior. This could be intentional for better performance with vision-language models.
Was this ordering change intentional? Some vision-language models may be sensitive to the order of image and text inputs in the prompt.
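To make the ordering concrete, a hedged sketch of an OpenAI-style chat message where image parts precede the text part (the helper name and image reference are placeholders, not project code):

```python
def build_message(image_urls, prompt):
    # Images first, then the text prompt, mirroring the ordering discussed above.
    content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Example usage with a placeholder image reference
messages = build_message(["data:image/png;base64,<...>"], "Describe the figure.")
```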
  self.chat_template = chat_template
  self.min_image_pixels = min_image_pixels
  self.data_parallel_size = data_parallel_size
  # Qwen 2/2.5-VL models enforce minimum image dimensions
- self._enforce_image_resize = self._is_qwen_vl_model(model_version)
+ self._enforce_image_resize = self._is_qwen_vl_model(model)

  # Load chat template during initialization
  self.chat_template = None
Bug: Duplicate and conflicting chat_template initialization.
The self.chat_template is assigned the parameter value on line 166 but then immediately set to None on line 173, discarding the initial assignment. This appears to be an error.
Remove the redundant assignment:
  self.model = model
  self.max_frame_num = max_frame_num
- self.chat_template = chat_template
  self.min_image_pixels = min_image_pixels
  self.data_parallel_size = data_parallel_size
  # Qwen 2/2.5-VL models enforce minimum image dimensions
  self._enforce_image_resize = self._is_qwen_vl_model(model)

  # Load chat template during initialization
  self.chat_template = None

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

  self.model = model
  self.max_frame_num = max_frame_num
  self.min_image_pixels = min_image_pixels
  self.data_parallel_size = data_parallel_size
  # Qwen 2/2.5-VL models enforce minimum image dimensions
  self._enforce_image_resize = self._is_qwen_vl_model(model)
  # Load chat template during initialization
  self.chat_template = None
🤖 Prompt for AI Agents
In lmms_eval/models/vllm.py between lines 166 and 173, the attribute self.chat_template is first assigned the parameter value and then immediately overwritten to None, which discards the initial assignment. Remove the redundant assignment so that self.chat_template is initialized exactly once, matching the suggested fix above (drop the earlier parameter assignment and keep the deliberate None set before the chat template is loaded).
| "Authorization": f"Bearer {API_KEY}", | ||
| "Content-Type": "application/json", | ||
| } | ||
| client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions")) |
Incorrect use of .rstrip() with multi-character string.
The .rstrip("chat/completions") call removes any combination of the characters in "chat/completions" from the end, not the exact string. This could lead to unexpected behavior.
Use .removesuffix() for exact string removal (Python 3.9+):
- client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+ client = OpenAI(api_key=API_KEY, base_url=API_URL.removesuffix("/chat/completions"))

Or for older Python versions:

- client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+ base_url = API_URL[:-len("/chat/completions")] if API_URL.endswith("/chat/completions") else API_URL
+ client = OpenAI(api_key=API_KEY, base_url=base_url)

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

- client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+ client = OpenAI(api_key=API_KEY, base_url=API_URL.removesuffix("/chat/completions"))
🧰 Tools
🪛 Ruff (0.12.2)
85-85: Using .strip() with multi-character strings is misleading
(B005)
🤖 Prompt for AI Agents
In lmms_eval/tasks/mathverse/mathverse_evals.py at line 85, the use of
.rstrip("chat/completions") is incorrect because it removes any combination of
those characters from the end rather than the exact substring. Replace
.rstrip("chat/completions") with .removesuffix("chat/completions") if using
Python 3.9 or newer, or implement a conditional check to remove the exact suffix
for older Python versions to ensure only the intended substring is removed from
API_URL.
Before you open a pull request, please check whether a similar issue already exists or has been closed before.

When you open a pull request, please be sure to include the following:

If you encounter lint warnings, you can use the following scripts to reformat the code.

Thank you for your contributions!
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Chores