[TASK] MMRefine Benchmark #793
Conversation
Walkthrough

A new evaluation task, "MMRefine", was integrated into the codebase. This includes documentation updates, a YAML configuration for the task, prompt definitions, an evaluator class for handling model interactions and scoring, and utility functions for processing and aggregating results. No existing public APIs were altered.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant MMRefineEvaluator
    participant ModelAPI
    participant Utils
    User->>MMRefineEvaluator: Provide problem and prediction
    MMRefineEvaluator->>ModelAPI: Send evaluation/refinement prompt
    ModelAPI-->>MMRefineEvaluator: Return model response
    MMRefineEvaluator->>MMRefineEvaluator: Parse response, compute scores
    MMRefineEvaluator-->>User: Return evaluation result
    User->>Utils: Aggregate multiple results
    Utils->>Utils: Compute metrics (RefScore, mRecall, etc.)
    Utils-->>User: Return aggregated metrics
```
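For orientation, here is a minimal sketch (not code from this PR; the per-category rates are invented purely for illustration) of how the aggregated metrics relate, following the formulas in `mmrefine_aggregate_results` shown later in this review:

```python
# Hypothetical per-category rates, for illustration only.
refinement_success = 0.30
error_correction_success = 0.10
error_detection_success = 0.15
false_error_detection = 0.05

# The aggregation step treats detection/correction as cumulative categories.
error_correction_success += refinement_success        # 0.40
error_detection_success += error_correction_success   # 0.55

ref_score = refinement_success - false_error_detection                  # 0.25
m_recall = (error_detection_success + (1 - false_error_detection)) / 2  # 0.75
```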
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested reviewers
Actionable comments posted: 16
🔭 Outside diff range comments (1)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (1)
127-161: Add type hints and docstring to eval_result method

```diff
-def eval_result(self, result, config):
+def eval_result(self, result: dict, config: dict) -> dict:
+    """Evaluate and classify a prediction result.
+
+    Args:
+        result: Result dictionary with prediction and problem details
+        config: Configuration dictionary with metadata
+
+    Returns:
+        Result dictionary with added evaluation scores and classification
+    """
```
🧹 Nitpick comments (7)
lmms_eval/tasks/mmrefine/prompts.py (2)
1-36: Add module docstring to document the purpose of these prompts

This module should have a docstring explaining its purpose and how these prompts are used in the MMRefine evaluation pipeline.
Add a module docstring at the beginning:
"""Prompt templates for the MMRefine evaluation task. This module contains standardized prompts used by the MMRefineEvaluator to interact with language models for evaluating and refining mathematical problem solutions. """
1-73: Consider adding inline documentation for each prompt constant

While the prompt names are descriptive, brief comments explaining the expected input/output format for each prompt would improve maintainability.
Example:
```python
# Prompt for mathematical expert to review and correct solutions
# Expects: {question}, {initial_solution}
# Returns: Structured assessment with correctness evaluation
REFINEMENT_PROMPT = """..."""
```

lmms_eval/tasks/mmrefine/mmrefine_evals.py (2)
131-134: Simplify conditional assignments using dict.get()

Use the more concise dict.get() method with default values.

```diff
-"answer": result["answer"] if "answer" in result else None,
+"answer": result.get("answer", None),
-"reference_feedback": result["reference_feedback"] if "reference_feedback" in result else None,
+"reference_feedback": result.get("reference_feedback", None),
```
144-159: Add type hints to nested function

The nested classify_results function should have type hints.

```diff
-def classify_results(row):
+def classify_results(row: dict) -> str:
```

lmms_eval/tasks/mmrefine/utils.py (3)
17-17: Rename unused loop variable

Loop variable i is not used within the loop body.

```diff
-for i, line in enumerate(raw_data):
+for _, line in enumerate(raw_data):
```

Or if enumeration is not needed:

```diff
-for i, line in enumerate(raw_data):
+for line in raw_data:
```
42-42: Simplify conditional assignments using dict.get()

Use the more concise dict.get() method.

```diff
-"answer": doc["answer"] if "answer" in doc else None,
+"answer": doc.get("answer", None),
-"reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
+"reference_feedback": doc.get("reference_feedback", None),
```

Also applies to: 46-46, 59-59, 64-64
98-103: Remove unnecessary .keys() in membership tests

The .keys() call is redundant when checking membership.

```diff
-if "Refinement Failure" in vc.keys()
+if "Refinement Failure" in vc
```

Apply this change to all occurrences on lines 98-103.
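As a side note, `vc` here is a pandas `value_counts()` Series, and membership tests on a Series check the index, so dropping `.keys()` is behavior-preserving. A quick standalone sketch (not code from the PR):

```python
import pandas as pd

# value_counts() returns a Series indexed by the category labels.
vc = pd.Series(
    ["Refinement Success", "Refinement Failure", "Refinement Success"]
).value_counts()

# `in` on a Series tests the index, so both forms are equivalent.
assert ("Refinement Failure" in vc.keys()) == ("Refinement Failure" in vc)
```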
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- docs/current_tasks.md (1 hunks)
- lmms_eval/tasks/mmrefine/mmrefine.yaml (1 hunks)
- lmms_eval/tasks/mmrefine/mmrefine_evals.py (1 hunks)
- lmms_eval/tasks/mmrefine/prompts.py (1 hunks)
- lmms_eval/tasks/mmrefine/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CLAUDE.md)
**/*.py: Type hints required for all code
Public APIs must have docstrings
Functions must be focused and small
Line length: 88 chars maximum
PEP 8 naming (snake_case for functions/variables)
Class names in PascalCase
Constants in UPPER_SNAKE_CASE
Document with docstrings
Use f-strings for formatting
Use early returns to avoid nested conditions
Use clear variable/function names (prefix handlers with "handle")
Use constants where possible instead of functions
Don't repeat yourself (DRY code)
Prefer functional, immutable approaches when not verbose
Define composing functions before their components
Mark issues in existing code with "TODO:" prefix
Use functional and stateless approaches where they improve clarity
Keep core logic clean and push implementation details to the edges
Ruff: Line length (88 chars), import sorting (I001), unused imports
Strings: use parentheses for line wrapping
Function calls: multi-line with proper indent
Imports: split into multiple lines
Explicit None checks for Optional
Type narrowing for strings
Ruff (Python) runs as pre-commit hook
Break strings with parentheses for line length
Multi-line function calls for line length
Split imports for line length
Add None checks for Optional types
Narrow string types
Document public APIs
Files:
- lmms_eval/tasks/mmrefine/prompts.py
- lmms_eval/tasks/mmrefine/mmrefine_evals.py
- lmms_eval/tasks/mmrefine/utils.py
**/*.{yaml,yml,json}
📄 CodeRabbit Inference Engine (CLAUDE.md)
Prettier (YAML/JSON) runs as pre-commit hook
Files:
lmms_eval/tasks/mmrefine/mmrefine.yaml
🧬 Code Graph Analysis (1)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (1)
lmms_eval/tasks/megabench/metrics/scoring/program_judge.py (1)
target(58-73)
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/mmrefine/mmrefine.yaml
[error] 42-42: no new line character at the end of file
(new-line-at-end-of-file)
🪛 Ruff (0.12.2)
lmms_eval/tasks/mmrefine/mmrefine_evals.py
4-4: pandas imported but unused
Remove unused import: pandas
(F401)
6-6: tqdm.tqdm imported but unused
Remove unused import: tqdm.tqdm
(F401)
21-21: Using .strip() with multi-character strings is misleading
(B005)
86-86: f-string without any placeholders
Remove extraneous f prefix
(F541)
89-89: Do not use bare except
(E722)
92-92: Do not use bare except
(E722)
103-103: Do not use bare except
(E722)
107-107: Do not use bare except
(E722)
111-111: Do not use bare except
(E722)
115-115: f-string without any placeholders
Remove extraneous f prefix
(F541)
131-131: Use result.get("answer", None) instead of an if block
Replace with result.get("answer", None)
(SIM401)
134-134: Use result.get("reference_feedback", None) instead of an if block
Replace with result.get("reference_feedback", None)
(SIM401)
lmms_eval/tasks/mmrefine/utils.py
2-2: os imported but unused
Remove unused import: os
(F401)
17-17: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
42-42: Use doc.get("answer", None) instead of an if block
Replace with doc.get("answer", None)
(SIM401)
46-46: Use doc.get("reference_feedback", None) instead of an if block
Replace with doc.get("reference_feedback", None)
(SIM401)
59-59: Use doc.get("answer", None) instead of an if block
Replace with doc.get("answer", None)
(SIM401)
64-64: Use doc.get("reference_feedback", None) instead of an if block
Replace with doc.get("reference_feedback", None)
(SIM401)
98-98: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
99-99: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
100-100: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
101-101: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
102-102: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
103-103: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
🔇 Additional comments (1)
docs/current_tasks.md (1)
70-70: LGTM!

The MMRefine task entry follows the existing documentation format correctly.
| "Authorization": f"Bearer {API_KEY}", | ||
| "Content-Type": "application/json", | ||
| } | ||
| client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions")) |
Potentially misleading use of .strip() with multi-character string
Using .strip("chat/completions") removes any combination of those characters from both ends, not the exact substring. Use .rstrip("/chat/completions") or string slicing instead.
```diff
-client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+client = OpenAI(api_key=API_KEY, base_url=API_URL.replace("/chat/completions", ""))
```
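As a quick illustration of the character-set semantics (the URL below is made up and is not the task's actual endpoint):

```python
url = "https://example.com/api/chat/completions"

# str.rstrip treats its argument as a set of characters, not a suffix,
# so it keeps stripping past the intended "/chat/completions".
print(url.rstrip("chat/completions"))         # https://example.

# Exact-suffix alternatives:
print(url.removesuffix("/chat/completions"))  # https://example.com/api  (Python 3.9+)
print(url.replace("/chat/completions", ""))   # https://example.com/api
```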
🧰 Tools
🪛 Ruff (0.12.2)
21-21: Using .strip() with multi-character strings is misleading
(B005)
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/mmrefine_evals.py at line 21, the use of
.rstrip("chat/completions") is incorrect because it strips any trailing run of the
characters 'c', 'h', 'a', 't', '/', 'o', 'm', 'p', 'l', 'e', 'i', 's', 'n' rather than
the exact substring. Replace it with API_URL.replace("/chat/completions", "") or
API_URL.removesuffix("/chat/completions") to correctly remove the exact suffix from
the API_URL string.
```python
def __init__(self, quick_extract=False):
    self.quick_extract = quick_extract
```
Add type hints to __init__ method
Type hints are required for all code per the guidelines.
```diff
-def __init__(self, quick_extract=False):
+def __init__(self, quick_extract: bool = False) -> None:
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/mmrefine_evals.py around lines 35 to 37, the
__init__ method lacks type hints. Add type hints by specifying the type of the
quick_extract parameter as bool and the return type as None to comply with the
code guidelines.
```python
def get_chat_response(self, prompt, temperature=0, max_tokens=256, n=1, patience=10000000, sleep_time=0):
    messages = [
        {"role": "user", "content": prompt},
    ]
    payload = {"model": self.gpt_model, "messages": messages, "temperature": temperature, "max_tokens": max_tokens}

    while patience > 0:
        patience -= 1
        try:
            response = self.client.chat.completions.create(**payload)
            if n == 1:
                prediction = response.choices[0].message.content.strip()
                if prediction and prediction != "":
                    return prediction
            else:
                prediction = [choice.message.content.strip() for choice in response.choices]
                if prediction and prediction[0] != "":
                    return prediction

        except Exception as e:
            if "Rate limit" not in str(e):
                eval_logger.error(e)

            if "Please reduce the length of the messages" in str(e):
                eval_logger.error("!!Reduce prompt size")
                # reduce input prompt and keep the tail
                new_size = int(len(prompt) * 0.9)
                new_start = len(prompt) - new_size
                prompt = prompt[new_start:]
                payload["messages"] = [
                    {"role": "user", "content": prompt},
                ]

        if sleep_time > 0:
            time.sleep(sleep_time)
    return ""
```
Add type hints and docstring to get_chat_response method
Methods need type hints and docstrings per the guidelines.
```diff
-def get_chat_response(self, prompt, temperature=0, max_tokens=256, n=1, patience=10000000, sleep_time=0):
+def get_chat_response(
+    self,
+    prompt: str,
+    temperature: float = 0,
+    max_tokens: int = 256,
+    n: int = 1,
+    patience: int = 10000000,
+    sleep_time: int = 0
+) -> str | list[str]:
+    """Get chat completion response from the configured API.
+
+    Args:
+        prompt: Input prompt for the model
+        temperature: Sampling temperature
+        max_tokens: Maximum tokens to generate
+        n: Number of completions to generate
+        patience: Number of retries before giving up
+        sleep_time: Seconds to sleep between retries
+
+    Returns:
+        Model response string or list of strings if n > 1
+    """
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/mmrefine_evals.py between lines 38 and 73, the
get_chat_response method lacks type hints and a docstring. Add appropriate type
hints for all parameters and the return type, specifying types like str, int,
and list or str for the return. Also, include a clear docstring describing the
method's purpose, parameters, and return value to improve code clarity and
maintainability.
```python
def mmrefine_doc_to_visual(doc):
    if doc["image"] is None or str(doc["image"]).strip() == "":
        return []
    if isinstance(doc["image"], Image.Image):
        return [doc["image"].convert("RGB")]
    return [Image.open(io.BytesIO(doc["image"]["bytes"])).convert("RGB")]
```
Add type hints and docstring to mmrefine_doc_to_visual function
Functions require type hints and docstrings per guidelines.
```diff
-def mmrefine_doc_to_visual(doc):
+def mmrefine_doc_to_visual(doc: dict) -> list[Image.Image]:
+    """Convert document image data to PIL Image objects.
+
+    Args:
+        doc: Document dictionary potentially containing image data
+
+    Returns:
+        List of RGB PIL Image objects, empty if no image
+    """
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py around lines 30 to 36, the function
mmrefine_doc_to_visual lacks type hints and a docstring. Add appropriate type
hints for the input parameter and return type, and include a clear docstring
describing the function's purpose, input parameter, and return value following
the project's documentation style.
```python
def mmrefine_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    problem = {
        "id": doc["id"],
        "question": doc["question"],
        "answer": doc["answer"] if "answer" in doc else None,
        "initial_solution": doc["initial_solution"],
        "solution_source": doc["solution_source"],
        "solution_label": doc["solution_label"],
        "reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
        "meta": doc["meta"],
    }
    query_prompt = mmrefine_evaluator.create_one_query(problem)
    return query_prompt
```
Add type hints and docstring to mmrefine_doc_to_text function
```diff
-def mmrefine_doc_to_text(doc, lmms_eval_specific_kwargs=None):
+def mmrefine_doc_to_text(
+    doc: dict,
+    lmms_eval_specific_kwargs: dict | None = None
+) -> str:
+    """Generate query prompt from document.
+
+    Args:
+        doc: Document dictionary with problem details
+        lmms_eval_specific_kwargs: Optional eval-specific arguments
+
+    Returns:
+        Query prompt string
+    """
```
🧰 Tools
🪛 Ruff (0.12.2)
42-42: Use doc.get("answer", None) instead of an if block
Replace with doc.get("answer", None)
(SIM401)
46-46: Use doc.get("reference_feedback", None) instead of an if block
Replace with doc.get("reference_feedback", None)
(SIM401)
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py around lines 38 to 51, the function
mmrefine_doc_to_text lacks type hints and a docstring. Add appropriate type
hints for the function parameters and return type, and include a clear docstring
describing the function's purpose, inputs, and outputs to improve code
readability and maintainability.
```python
def mmrefine_process_results(doc, results):
    prediction = results[0].strip()

    result = {
        "id": doc["id"],
        "question": doc["question"],
        "answer": doc["answer"] if "answer" in doc else None,
        "prediction": prediction,
        "initial_solution": doc["initial_solution"],
        "solution_source": doc["solution_source"],
        "solution_label": doc["solution_label"],
        "reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
        "meta": doc["meta"],
    }
    result = mmrefine_evaluator.eval_result(result, config)

    metric_dict = {}
    for _metric in [
        "Refinement Failure",
        "Error Detection Success",
        "Error Correction Success",
        "Refinement Success",
        "False Error Detection",
        "Validation Success",
        "RefScore",
        "mRecall",
    ]:
        _result = result.copy()
        metric_dict[_metric] = _result
    return metric_dict
```
Add type hints and docstring to mmrefine_process_results function
```diff
-def mmrefine_process_results(doc, results):
+def mmrefine_process_results(doc: dict, results: list) -> dict:
+    """Process prediction results and evaluate them.
+
+    Args:
+        doc: Document dictionary with problem details
+        results: List containing prediction string
+
+    Returns:
+        Dictionary mapping metric names to evaluated results
+    """
```
🧰 Tools
🪛 Ruff (0.12.2)
59-59: Use doc.get("answer", None) instead of an if block
Replace with doc.get("answer", None)
(SIM401)
64-64: Use doc.get("reference_feedback", None) instead of an if block
Replace with doc.get("reference_feedback", None)
(SIM401)
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py between lines 53 and 83, the function
mmrefine_process_results lacks type hints and a docstring. Add appropriate type
hints for the parameters and return type based on their expected data types,
such as dict for doc and results, and dict for the return. Also, include a
concise docstring describing the function's purpose, parameters, and return
value to improve code clarity and maintainability.
```python
def mmrefine_aggregate_results(results, args=None, **kwargs):
    # calculate total scores
    results_dict = {f"{result['id']}_{result['solution_source']}": result for result in results}
    results = pd.DataFrame(results)
    try:
        assert len(results) == len(results_dict)
    except AssertionError:
        eval_logger.error(f"Length mismatch: {len(results)} results vs {len(results_dict)} unique entries")

    n_correct_solutions = len(results[results["solution_label"]=="correct"])
    n_incorrect_solutions = len(results) - n_correct_solutions

    vc = results["eval_result"].value_counts()
    scores = {
        "Refinement Failure": vc.loc["Refinement Failure"] / n_incorrect_solutions if "Refinement Failure" in vc.keys() else 0.0,
        "Error Detection Success": vc.loc["Error Detection Success"] / n_incorrect_solutions if "Error Detection Success" in vc.keys() else 0.0,
        "Error Correction Success": vc.loc["Error Correction Success"] / n_incorrect_solutions if "Error Correction Success" in vc.keys() else 0.0,
        "Refinement Success": vc.loc["Refinement Success"] / n_incorrect_solutions if "Refinement Success" in vc.keys() else 0.0,
        "False Error Detection": vc.loc["False Error Detection"] / n_correct_solutions if "False Error Detection" in vc.keys() else 0.0,
        "Validation Success": vc.loc["Validation Success"] / n_correct_solutions if "Validation Success" in vc.keys() else 0.0,
    }

    scores["RefScore"] = scores["Refinement Success"] - scores["False Error Detection"]
    scores["Error Correction Success"] = scores["Error Correction Success"] + scores["Refinement Success"]
    scores["Error Detection Success"] = scores["Error Detection Success"] + scores["Error Correction Success"]
    scores["mRecall"] = (scores["Error Detection Success"] + (1 - scores["False Error Detection"])) / 2

    scores["results_dict"] = results_dict

    return scores
```
Add type hints and docstring to mmrefine_aggregate_results function
```diff
-def mmrefine_aggregate_results(results, args=None, **kwargs):
+def mmrefine_aggregate_results(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> dict:
+    """Aggregate evaluation results and compute metrics.
+
+    Args:
+        results: List of result dictionaries
+        args: Optional arguments object
+        **kwargs: Additional keyword arguments
+
+    Returns:
+        Dictionary with computed metrics and results
+    """
```
🧰 Tools
🪛 Ruff (0.12.2)
98-98: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
99-99: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
100-100: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
101-101: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
102-102: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
103-103: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py around lines 84 to 114, the function
mmrefine_aggregate_results lacks type hints and a docstring. Add a clear
docstring describing the function's purpose, parameters, and return value.
Include type hints for the parameters (results as a list of dictionaries, args
as optional) and the return type (a dictionary with string keys and float or
dict values) to improve code readability and maintainability.
lmms_eval/tasks/mmrefine/utils.py (Outdated)
```python
def mmrefine_aggregate_results_refinement_failure(results, args=None, **kwargs):
    return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Failure"]


def mmrefine_aggregate_results_error_detection_success(results, args=None, **kwargs):
    return mmrefine_aggregate_results(results, args, **kwargs)["Error Detection Success"]


def mmrefine_aggregate_results_error_correction_success(results, args=None, **kwargs):
    return mmrefine_aggregate_results(results, args, **kwargs)["Error Correction Success"]


def mmrefine_aggregate_results_refinement_success(results, args=None, **kwargs):
    return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Success"]


def mmrefine_aggregate_results_false_error_detection(results, args=None, **kwargs):
    return mmrefine_aggregate_results(results, args, **kwargs)["False Error Detection"]


def mmrefine_aggregate_results_validation_success(results, args=None, **kwargs):
    return mmrefine_aggregate_results(results, args, **kwargs)["Validation Success"]


def mmrefine_aggregate_results_refscore(results, args=None, **kwargs):
    scores = mmrefine_aggregate_results(results, args, **kwargs)
    results_dict = scores.pop("results_dict")

    path = generate_submission_file("mmrefine_results.json", args)
    with open(path, "w") as f:
        json.dump(results_dict, f, indent=4)

    path = generate_submission_file("mmrefine_scores.json", args)
    with open(path, "w") as f:
        json.dump(scores, f, indent=4)
    return scores["RefScore"]


def mmrefine_aggregate_results_mrecall(results, args=None, **kwargs):
    return mmrefine_aggregate_results(results, args, **kwargs)["mRecall"]
```
Add type hints to aggregation helper functions
All the aggregation helper functions need type hints.
Example for one function (apply to all):
```diff
-def mmrefine_aggregate_results_refinement_failure(results, args=None, **kwargs):
+def mmrefine_aggregate_results_refinement_failure(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py between lines 115 and 148, the aggregation
helper functions lack type hints. Add appropriate type hints to all these
functions, specifying the types of the parameters (e.g., results as a list or
dict, args as optional argument) and the return type (likely a float or dict
depending on the function). This will improve code clarity and type safety.
Actionable comments posted: 1
♻️ Duplicate comments (16)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (8)
124-124: Remove unnecessary f-string prefix

The string contains no placeholders.

```diff
-    eval_logger.error(f"Error in evaluating answer for problem")
+    eval_logger.error("Error in evaluating answer for problem")
```
31-31: Fix incorrect use of rstrip with multi-character string
.rstrip("chat/completions")removes any combination of those characters from the end, not the exact substring. Use.removesuffix()for Python 3.9+ or string replacement.-client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions")) +client = OpenAI(api_key=API_KEY, base_url=API_URL.removesuffix("/chat/completions"))For Python < 3.9 compatibility:
-client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions")) +base_url = API_URL[:-len("/chat/completions")] if API_URL.endswith("/chat/completions") else API_URL +client = OpenAI(api_key=API_KEY, base_url=base_url)
45-46: Add type hints to __init__ method

Type hints are required for all code per the guidelines.

```diff
-def __init__(self, quick_extract=False):
+def __init__(self, quick_extract: bool = False) -> None:
```
48-83: Add type hints and docstring to get_chat_response method

Public methods require type hints and docstrings per the guidelines.

```diff
-def get_chat_response(self, prompt, temperature=0, max_tokens=256, n=1, patience=10000000, sleep_time=0):
+def get_chat_response(
+    self,
+    prompt: str,
+    temperature: float = 0,
+    max_tokens: int = 256,
+    n: int = 1,
+    patience: int = 10000000,
+    sleep_time: int = 0
+) -> str | list[str]:
+    """Get chat completion response from the configured API.
+
+    Args:
+        prompt: Input prompt for the model
+        temperature: Sampling temperature (0-2)
+        max_tokens: Maximum tokens to generate
+        n: Number of completions to generate
+        patience: Number of retries before giving up
+        sleep_time: Seconds to sleep between retries
+
+    Returns:
+        Model response string or list of strings if n > 1
+    """
```
85-130: Add type hints and docstring to evaluate_answer method

This method needs proper type annotations and documentation.

```diff
-def evaluate_answer(self, problem, prediction):
+def evaluate_answer(self, problem: dict, prediction: str) -> dict:
+    """Evaluate a model's prediction against a problem.
+
+    Args:
+        problem: Problem dictionary with solution details
+        prediction: Model's predicted refinement
+
+    Returns:
+        Dictionary with evaluation scores
+    """
```
98-102: Replace bare except clauses with specific exceptions

Bare except clauses can hide unexpected errors. Catch specific exceptions.

```diff
-    except:
+    except (ValueError, TypeError):
         try:
             solution_correctness = self.get_chat_response(PARSING_PROMPT.format(target="Output", model_response=resp), temperature=0, max_tokens=256, n=1)
-        except:
+        except Exception as e:
+            eval_logger.error(f"Failed to parse solution correctness: {e}")
             solution_correctness = 0
```
112-121: Replace bare except clauses with specific exceptions

```diff
-    except:
+    except (ValueError, TypeError):
         error_detection = 0
     try:
         error_correction = int(self.get_chat_response(PARSING_PROMPT.format(target="Error Correction", model_response=resp), temperature=0, max_tokens=256, n=1).strip())
-    except:
+    except (ValueError, TypeError):
         error_correction = 0
     try:
         solution_correctness = int(self.get_chat_response(PARSING_PROMPT.format(target="Effectiveness and Correctness of the Feedback", model_response=resp), temperature=0, max_tokens=256, n=1).strip())
-    except:
+    except (ValueError, TypeError):
         solution_correctness = 0
```
132-134: Add type hints and docstring to create_one_query method

```diff
-def create_one_query(self, problem):
+def create_one_query(self, problem: dict) -> str:
+    """Create a refinement query prompt for a given problem.
+
+    Args:
+        problem: Problem dictionary with question and initial solution
+
+    Returns:
+        Formatted query prompt string
+    """
```

lmms_eval/tasks/mmrefine/utils.py (8)
41-45: Simplify conditional assignments with dict.get()

```diff
-    "answer": doc["answer"] if "answer" in doc else None,
+    "answer": doc.get("answer", None),
     "initial_solution": doc["initial_solution"],
     "solution_source": doc["solution_source"],
     "solution_label": doc["solution_label"],
-    "reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
+    "reference_feedback": doc.get("reference_feedback", None),
```
58-63: Simplify conditional assignments with dict.get()

```diff
-    "answer": doc["answer"] if "answer" in doc else None,
+    "answer": doc.get("answer", None),
     "prediction": prediction,
     "initial_solution": doc["initial_solution"],
     "solution_source": doc["solution_source"],
     "solution_label": doc["solution_label"],
-    "reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
+    "reference_feedback": doc.get("reference_feedback", None),
```
98-103: Remove unnecessary .keys() when checking dict membership

Use key in dict instead of key in dict.keys() for cleaner, more efficient code.

```diff
-    "Refinement Failure": vc.loc["Refinement Failure"] / n_incorrect_solutions if "Refinement Failure" in vc.keys() else 0.0,
-    "Error Detection Success": vc.loc["Error Detection Success"] / n_incorrect_solutions if "Error Detection Success" in vc.keys() else 0.0,
-    "Error Correction Success": vc.loc["Error Correction Success"] / n_incorrect_solutions if "Error Correction Success" in vc.keys() else 0.0,
-    "Refinement Success": vc.loc["Refinement Success"] / n_incorrect_solutions if "Refinement Success" in vc.keys() else 0.0,
-    "False Error Detection": vc.loc["False Error Detection"] / n_correct_solutions if "False Error Detection" in vc.keys() else 0.0,
-    "Validation Success": vc.loc["Validation Success"] / n_correct_solutions if "Validation Success" in vc.keys() else 0.0,
+    "Refinement Failure": vc.loc["Refinement Failure"] / n_incorrect_solutions if "Refinement Failure" in vc else 0.0,
+    "Error Detection Success": vc.loc["Error Detection Success"] / n_incorrect_solutions if "Error Detection Success" in vc else 0.0,
+    "Error Correction Success": vc.loc["Error Correction Success"] / n_incorrect_solutions if "Error Correction Success" in vc else 0.0,
+    "Refinement Success": vc.loc["Refinement Success"] / n_incorrect_solutions if "Refinement Success" in vc else 0.0,
+    "False Error Detection": vc.loc["False Error Detection"] / n_correct_solutions if "False Error Detection" in vc else 0.0,
+    "Validation Success": vc.loc["Validation Success"] / n_correct_solutions if "Validation Success" in vc else 0.0,
```
29-34: Add type hints and docstring to mmrefine_doc_to_visual function

Functions require type hints and docstrings per guidelines.

```diff
-def mmrefine_doc_to_visual(doc):
+def mmrefine_doc_to_visual(doc: dict) -> list[Image.Image]:
+    """Convert document image data to PIL Image objects.
+
+    Args:
+        doc: Document dictionary potentially containing image data
+
+    Returns:
+        List of RGB PIL Image objects, empty if no image
+    """
```
37-49: Add type hints and docstring to mmrefine_doc_to_text function

```diff
-def mmrefine_doc_to_text(doc, lmms_eval_specific_kwargs=None):
+def mmrefine_doc_to_text(
+    doc: dict,
+    lmms_eval_specific_kwargs: dict | None = None
+) -> str:
+    """Generate query prompt from document.
+
+    Args:
+        doc: Document dictionary with problem details
+        lmms_eval_specific_kwargs: Optional eval-specific arguments
+
+    Returns:
+        Query prompt string
+    """
```
52-81: Add type hints and docstring to mmrefine_process_results function

```diff
-def mmrefine_process_results(doc, results):
+def mmrefine_process_results(doc: dict, results: list) -> dict:
+    """Process prediction results and evaluate them.
+
+    Args:
+        doc: Document dictionary with problem details
+        results: List containing prediction string
+
+    Returns:
+        Dictionary mapping metric names to evaluated results
+    """
```
84-113: Add type hints and docstring to mmrefine_aggregate_results function

```diff
-def mmrefine_aggregate_results(results, args=None, **kwargs):
+def mmrefine_aggregate_results(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> dict:
+    """Aggregate evaluation results and compute metrics.
+
+    Args:
+        results: List of result dictionaries
+        args: Optional arguments object
+        **kwargs: Additional keyword arguments
+
+    Returns:
+        Dictionary with computed metrics and results
+    """
```
116-155: Add type hints to all aggregation helper functions

All the aggregation helper functions need type hints per guidelines.

```diff
-def mmrefine_aggregate_results_refinement_failure(results, args=None, **kwargs):
+def mmrefine_aggregate_results_refinement_failure(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Failure"]

-def mmrefine_aggregate_results_error_detection_success(results, args=None, **kwargs):
+def mmrefine_aggregate_results_error_detection_success(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Error Detection Success"]

-def mmrefine_aggregate_results_error_correction_success(results, args=None, **kwargs):
+def mmrefine_aggregate_results_error_correction_success(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Error Correction Success"]

-def mmrefine_aggregate_results_refinement_success(results, args=None, **kwargs):
+def mmrefine_aggregate_results_refinement_success(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Success"]

-def mmrefine_aggregate_results_false_error_detection(results, args=None, **kwargs):
+def mmrefine_aggregate_results_false_error_detection(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["False Error Detection"]

-def mmrefine_aggregate_results_validation_success(results, args=None, **kwargs):
+def mmrefine_aggregate_results_validation_success(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Validation Success"]

-def mmrefine_aggregate_results_refscore(results, args=None, **kwargs):
+def mmrefine_aggregate_results_refscore(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     scores = mmrefine_aggregate_results(results, args, **kwargs)
     # ... rest of implementation

-def mmrefine_aggregate_results_mrecall(results, args=None, **kwargs):
+def mmrefine_aggregate_results_mrecall(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["mRecall"]
```
🧹 Nitpick comments (2)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (1)
140-143: Simplify conditional assignments with dict.get()

Use dict.get() for cleaner code.

```diff
-    "answer": result["answer"] if "answer" in result else None,
+    "answer": result.get("answer", None),
     "initial_solution": result["initial_solution"],
     "solution_label": result["solution_label"],
-    "reference_feedback": result["reference_feedback"] if "reference_feedback" in result else None,
+    "reference_feedback": result.get("reference_feedback", None),
```

lmms_eval/tasks/mmrefine/utils.py (1)
16-19: Rename unused loop variable to underscore

Loop variable i is not used within the loop body.

```diff
-    for i, line in enumerate(raw_data):
+    for _, line in enumerate(raw_data):
```

Or if enumeration is not needed:

```diff
-    for i, line in enumerate(raw_data):
+    for line in raw_data:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- lmms_eval/tasks/mmrefine/mmrefine.yaml (1 hunks)
- lmms_eval/tasks/mmrefine/mmrefine_evals.py (1 hunks)
- lmms_eval/tasks/mmrefine/prompts.py (1 hunks)
- lmms_eval/tasks/mmrefine/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- lmms_eval/tasks/mmrefine/prompts.py
- lmms_eval/tasks/mmrefine/mmrefine.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit Inference Engine (CLAUDE.md)
**/*.py: Type hints required for all code
Public APIs must have docstrings
Functions must be focused and small
Line length: 88 chars maximum
PEP 8 naming (snake_case for functions/variables)
Class names in PascalCase
Constants in UPPER_SNAKE_CASE
Document with docstrings
Use f-strings for formatting
Use early returns to avoid nested conditions
Use clear variable/function names (prefix handlers with "handle")
Use constants where possible instead of functions
Don't repeat yourself (DRY code)
Prefer functional, immutable approaches when not verbose
Define composing functions before their components
Mark issues in existing code with "TODO:" prefix
Use functional and stateless approaches where they improve clarity
Keep core logic clean and push implementation details to the edges
Ruff: Line length (88 chars), import sorting (I001), unused imports
Strings: use parentheses for line wrapping
Function calls: multi-line with proper indent
Imports: split into multiple lines
Explicit None checks for Optional
Type narrowing for strings
Ruff (Python) runs as pre-commit hook
Break strings with parentheses for line length
Multi-line function calls for line length
Split imports for line length
Add None checks for Optional types
Narrow string types
Document public APIs
Files:
- lmms_eval/tasks/mmrefine/mmrefine_evals.py
- lmms_eval/tasks/mmrefine/utils.py
🧠 Learnings (3)
📚 Learning: applies to **/*.py : ruff: line length (88 chars), import sorting (i001), unused imports...
Learnt from: CR
PR: EvolvingLMMs-Lab/lmms-eval#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-04T08:07:14.042Z
Learning: Applies to **/*.py : Ruff: Line length (88 chars), import sorting (I001), unused imports
Applied to files:
- lmms_eval/tasks/mmrefine/mmrefine_evals.py
- lmms_eval/tasks/mmrefine/utils.py
📚 Learning: applies to **/*.py : type hints required for all code...
Learnt from: CR
PR: EvolvingLMMs-Lab/lmms-eval#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-04T08:07:14.042Z
Learning: Applies to **/*.py : Type hints required for all code
Applied to files:
- lmms_eval/tasks/mmrefine/mmrefine_evals.py
- lmms_eval/tasks/mmrefine/utils.py
📚 Learning: applies to **/*.py : use f-strings for formatting...
Learnt from: CR
PR: EvolvingLMMs-Lab/lmms-eval#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-04T08:07:14.042Z
Learning: Applies to **/*.py : Use f-strings for formatting
Applied to files:
lmms_eval/tasks/mmrefine/mmrefine_evals.py
🧬 Code Graph Analysis (1)
lmms_eval/tasks/mmrefine/utils.py (2)
lmms_eval/tasks/_task_utils/file_utils.py (1)
generate_submission_file (4-12)

lmms_eval/tasks/mmrefine/mmrefine_evals.py (3)

MMRefineEvaluator (16-169), create_one_query (132-134), eval_result (136-169)
🪛 Ruff (0.12.2)
lmms_eval/tasks/mmrefine/mmrefine_evals.py
31-31: Using .strip() with multi-character strings is misleading
(B005)
98-98: Do not use bare except
(E722)
101-101: Do not use bare except
(E722)
112-112: Do not use bare except
(E722)
116-116: Do not use bare except
(E722)
120-120: Do not use bare except
(E722)
124-124: f-string without any placeholders
Remove extraneous f prefix
(F541)
140-140: Use result.get("answer", None) instead of an if block
Replace with result.get("answer", None)
(SIM401)
143-143: Use result.get("reference_feedback", None) instead of an if block
Replace with result.get("reference_feedback", None)
(SIM401)
lmms_eval/tasks/mmrefine/utils.py
16-16: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
41-41: Use doc.get("answer", None) instead of an if block
Replace with doc.get("answer", None)
(SIM401)
45-45: Use doc.get("reference_feedback", None) instead of an if block
Replace with doc.get("reference_feedback", None)
(SIM401)
58-58: Use doc.get("answer", None) instead of an if block
Replace with doc.get("answer", None)
(SIM401)
63-63: Use doc.get("reference_feedback", None) instead of an if block
Replace with doc.get("reference_feedback", None)
(SIM401)
98-98: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
99-99: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
100-100: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
101-101: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
102-102: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
103-103: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
Hello.
This PR adds a new multimodal (vision) benchmark MMRefine.
arxiv: https://arxiv.org/abs/2506.04688
code: https://github.com/naver-ai/MMRefine
Thank you!
Summary by CodeRabbit
New Features
Documentation