
Conversation

@skyil7 (Contributor) commented Aug 7, 2025

Hello.

This PR adds a new multimodal (vision) benchmark, MMRefine.

arxiv: https://arxiv.org/abs/2506.04688
code: https://github.com/naver-ai/MMRefine

Thank you!

Summary by CodeRabbit

  • New Features

    • Added support for the MMRefine task, enabling evaluation and refinement of multimodal mathematical problem solutions.
    • Introduced new prompts and evaluation logic for assessing solution correctness, error detection, and error correction.
    • Implemented comprehensive metrics and aggregation functions for detailed result analysis.
  • Documentation

    • Updated documentation to include the MMRefine task with relevant links and task details.

coderabbitai bot (Contributor) commented Aug 7, 2025

Walkthrough

A new evaluation task "MMRefine" was integrated into the codebase. This includes documentation updates, a YAML configuration for the task, prompt definitions, an evaluator class for handling model interactions and scoring, and utility functions for processing and aggregating results. No existing public APIs were altered.

Changes

  • Documentation Update (docs/current_tasks.md): Added "MMRefine" to the image tasks section, linking to its homepage and positioning it between existing tasks.
  • Task Configuration (lmms_eval/tasks/mmrefine/mmrefine.yaml): Introduced a new YAML config for the MMRefine task, specifying dataset path, splits, output type, generation parameters, utility functions for document conversion, target extraction, result processing, evaluation metrics, and metadata.
  • Prompt Definitions (lmms_eval/tasks/mmrefine/prompts.py): Added multi-line string constants for the prompts used in MMRefine: a refinement prompt, evaluation prompts for correct/incorrect solutions, and a parsing prompt.
  • Evaluator Implementation (lmms_eval/tasks/mmrefine/mmrefine_evals.py): Introduced the MMRefineEvaluator class to handle model interaction, prompt construction, response parsing, scoring, and result classification for the MMRefine task. Includes retry logic, error handling, and logging.
  • Utility Functions (lmms_eval/tasks/mmrefine/utils.py): Added utility functions for document-to-visual/text conversion, result processing, aggregation of metrics, and writing outputs, plus several metric-specific aggregation helpers and logging for error conditions. A sketch of the expected document schema follows this list.
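
For orientation, here is a sketch of the per-document schema these utilities expect, inferred from the field accesses reviewed later in utils.py; the keys come from that code, while every value is a placeholder invented for illustration.

# Hypothetical example document; keys mirror the fields read by
# mmrefine_doc_to_text, mmrefine_process_results, and mmrefine_doc_to_visual.
doc = {
    "id": "sample-001",                # placeholder identifier
    "question": "What is 2 + 3?",      # problem statement, possibly referring to the image
    "answer": "5",                     # optional; the utils fall back to None when absent
    "initial_solution": "2 + 3 = 6",   # solution the model is asked to verify or refine
    "solution_source": "model-x",      # placeholder name of the solver that produced it
    "solution_label": "incorrect",     # "correct" or not, per the aggregation logic
    "reference_feedback": None,        # optional reference feedback, may be missing
    "meta": {},                        # passed through unchanged
    "image": None,                     # a PIL Image, {"bytes": ...}, or None per doc_to_visual
}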

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant MMRefineEvaluator
    participant ModelAPI
    participant Utils

    User->>MMRefineEvaluator: Provide problem and prediction
    MMRefineEvaluator->>ModelAPI: Send evaluation/refinement prompt
    ModelAPI-->>MMRefineEvaluator: Return model response
    MMRefineEvaluator->>MMRefineEvaluator: Parse response, compute scores
    MMRefineEvaluator-->>User: Return evaluation result

    User->>Utils: Aggregate multiple results
    Utils->>Utils: Compute metrics (RefScore, mRecall, etc.)
    Utils-->>User: Return aggregated metrics
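To make the "Compute metrics" step concrete, here is a small worked example of the score arithmetic. The formulas mirror mmrefine_aggregate_results in utils.py as reviewed below; the counts themselves are hypothetical.

# Hypothetical label counts; vc stands in for results["eval_result"].value_counts().
n_incorrect, n_correct = 100, 50
vc = {
    "Refinement Failure": 40, "Error Detection Success": 20,
    "Error Correction Success": 10, "Refinement Success": 30,
    "False Error Detection": 5, "Validation Success": 45,
}
scores = {
    "Refinement Failure": vc["Refinement Failure"] / n_incorrect,              # 0.40
    "Error Detection Success": vc["Error Detection Success"] / n_incorrect,    # 0.20
    "Error Correction Success": vc["Error Correction Success"] / n_incorrect,  # 0.10
    "Refinement Success": vc["Refinement Success"] / n_incorrect,              # 0.30
    "False Error Detection": vc["False Error Detection"] / n_correct,          # 0.10
    "Validation Success": vc["Validation Success"] / n_correct,                # 0.90
}
scores["RefScore"] = scores["Refinement Success"] - scores["False Error Detection"]  # 0.20
scores["Error Correction Success"] += scores["Refinement Success"]      # cumulative: 0.40
scores["Error Detection Success"] += scores["Error Correction Success"]  # cumulative: 0.60
scores["mRecall"] = (scores["Error Detection Success"] + (1 - scores["False Error Detection"])) / 2  # 0.75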

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested reviewers

  • Luodian

Poem

In the warren where code and math entwine,
A new task hops forth—MMRefine!
With prompts and scores, and careful review,
Rabbits crunch numbers, as only they do.
From docs to YAML, utilities align—
Another leap forward, in record time!
🐇✨


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 16

🔭 Outside diff range comments (1)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (1)

127-161: Add type hints and docstring to eval_result method

-def eval_result(self, result, config):
+def eval_result(self, result: dict, config: dict) -> dict:
+    """Evaluate and classify a prediction result.
+    
+    Args:
+        result: Result dictionary with prediction and problem details
+        config: Configuration dictionary with metadata
+        
+    Returns:
+        Result dictionary with added evaluation scores and classification
+    """
🧹 Nitpick comments (7)
lmms_eval/tasks/mmrefine/prompts.py (2)

1-36: Add module docstring to document the purpose of these prompts

This module should have a docstring explaining its purpose and how these prompts are used in the MMRefine evaluation pipeline.

Add a module docstring at the beginning:

"""Prompt templates for the MMRefine evaluation task.

This module contains standardized prompts used by the MMRefineEvaluator
to interact with language models for evaluating and refining mathematical
problem solutions.
"""

1-73: Consider adding inline documentation for each prompt constant

While the prompt names are descriptive, brief comments explaining the expected input/output format for each prompt would improve maintainability.

Example:

# Prompt for mathematical expert to review and correct solutions
# Expects: {question}, {initial_solution}
# Returns: Structured assessment with correctness evaluation
REFINEMENT_PROMPT = """..."""
lmms_eval/tasks/mmrefine/mmrefine_evals.py (2)

131-134: Simplify conditional assignments using dict.get()

Use the more concise dict.get() method with default values.

-"answer": result["answer"] if "answer" in result else None,
+"answer": result.get("answer", None),
-"reference_feedback": result["reference_feedback"] if "reference_feedback" in result else None,
+"reference_feedback": result.get("reference_feedback", None),

144-159: Add type hints to nested function

The nested classify_results function should have type hints.

-def classify_results(row):
+def classify_results(row: dict) -> str:
lmms_eval/tasks/mmrefine/utils.py (3)

17-17: Rename unused loop variable

Loop variable i is not used within the loop body.

-for i, line in enumerate(raw_data):
+for _, line in enumerate(raw_data):

Or if enumeration is not needed:

-for i, line in enumerate(raw_data):
+for line in raw_data:

42-42: Simplify conditional assignments using dict.get()

Use the more concise dict.get() method.

-"answer": doc["answer"] if "answer" in doc else None,
+"answer": doc.get("answer", None),
-"reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
+"reference_feedback": doc.get("reference_feedback", None),

Also applies to: 46-46, 59-59, 64-64


98-103: Remove unnecessary .keys() in membership tests

The .keys() call is redundant when checking membership.

-if "Refinement Failure" in vc.keys()
+if "Refinement Failure" in vc

Apply this change to all occurrences on lines 98-103.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a895c4 and 366f9b6.

📒 Files selected for processing (5)
  • docs/current_tasks.md (1 hunks)
  • lmms_eval/tasks/mmrefine/mmrefine.yaml (1 hunks)
  • lmms_eval/tasks/mmrefine/mmrefine_evals.py (1 hunks)
  • lmms_eval/tasks/mmrefine/prompts.py (1 hunks)
  • lmms_eval/tasks/mmrefine/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CLAUDE.md)

**/*.py: Type hints required for all code
Public APIs must have docstrings
Functions must be focused and small
Line length: 88 chars maximum
PEP 8 naming (snake_case for functions/variables)
Class names in PascalCase
Constants in UPPER_SNAKE_CASE
Document with docstrings
Use f-strings for formatting
Use early returns to avoid nested conditions
Use clear variable/function names (prefix handlers with "handle")
Use constants where possible instead of functions
Don't repeat yourself (DRY code)
Prefer functional, immutable approaches when not verbose
Define composing functions before their components
Mark issues in existing code with "TODO:" prefix
Use functional and stateless approaches where they improve clarity
Keep core logic clean and push implementation details to the edges
Ruff: Line length (88 chars), import sorting (I001), unused imports
Strings: use parentheses for line wrapping
Function calls: multi-line with proper indent
Imports: split into multiple lines
Explicit None checks for Optional
Type narrowing for strings
Ruff (Python) runs as pre-commit hook
Break strings with parentheses for line length
Multi-line function calls for line length
Split imports for line length
Add None checks for Optional types
Narrow string types
Document public APIs

Files:

  • lmms_eval/tasks/mmrefine/prompts.py
  • lmms_eval/tasks/mmrefine/mmrefine_evals.py
  • lmms_eval/tasks/mmrefine/utils.py
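
As a brief illustration of a few of these conventions (early returns, explicit None checks for Optional values, parenthesized string wrapping, and the "handle" prefix for handlers), here is a made-up helper that is not part of this PR:

from typing import Optional


def handle_feedback(doc: dict) -> Optional[str]:
    """Return formatted reference feedback for a document, or None if absent."""
    feedback = doc.get("reference_feedback")
    if feedback is None:  # explicit None check for the Optional value
        return None  # early return instead of nesting the rest of the function
    return (
        "Reference feedback: "  # parentheses break the string across lines
        f"{feedback.strip()}"
    )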
**/*.{yaml,yml,json}

📄 CodeRabbit Inference Engine (CLAUDE.md)

Prettier (YAML/JSON) runs as pre-commit hook

Files:

  • lmms_eval/tasks/mmrefine/mmrefine.yaml
🧬 Code Graph Analysis (1)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (1)
lmms_eval/tasks/megabench/metrics/scoring/program_judge.py (1)
  • target (58-73)
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/mmrefine/mmrefine.yaml

[error] 42-42: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff (0.12.2)
lmms_eval/tasks/mmrefine/mmrefine_evals.py

4-4: pandas imported but unused

Remove unused import: pandas

(F401)


6-6: tqdm.tqdm imported but unused

Remove unused import: tqdm.tqdm

(F401)


21-21: Using .strip() with multi-character strings is misleading

(B005)


86-86: f-string without any placeholders

Remove extraneous f prefix

(F541)


89-89: Do not use bare except

(E722)


92-92: Do not use bare except

(E722)


103-103: Do not use bare except

(E722)


107-107: Do not use bare except

(E722)


111-111: Do not use bare except

(E722)


115-115: f-string without any placeholders

Remove extraneous f prefix

(F541)


131-131: Use result.get("answer", None) instead of an if block

Replace with result.get("answer", None)

(SIM401)


134-134: Use result.get("reference_feedback", None) instead of an if block

Replace with result.get("reference_feedback", None)

(SIM401)

lmms_eval/tasks/mmrefine/utils.py

2-2: os imported but unused

Remove unused import: os

(F401)


17-17: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


42-42: Use doc.get("answer", None) instead of an if block

Replace with doc.get("answer", None)

(SIM401)


46-46: Use doc.get("reference_feedback", None) instead of an if block

Replace with doc.get("reference_feedback", None)

(SIM401)


59-59: Use doc.get("answer", None) instead of an if block

Replace with doc.get("answer", None)

(SIM401)


64-64: Use doc.get("reference_feedback", None) instead of an if block

Replace with doc.get("reference_feedback", None)

(SIM401)


98-98: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


99-99: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


100-100: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


101-101: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


102-102: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


103-103: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

🔇 Additional comments (1)
docs/current_tasks.md (1)

70-70: LGTM!

The MMRefine task entry follows the existing documentation format correctly.

"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json",
}
client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))

⚠️ Potential issue

Misleading use of .rstrip() with a multi-character string

.rstrip("chat/completions") removes any trailing combination of those characters, not the exact substring. Use .removesuffix("/chat/completions") (Python 3.9+), .replace("/chat/completions", ""), or string slicing instead.

-client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+client = OpenAI(api_key=API_KEY, base_url=API_URL.replace("/chat/completions", ""))
📝 Committable suggestion


Suggested change
client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
client = OpenAI(api_key=API_KEY, base_url=API_URL.replace("/chat/completions", ""))
🧰 Tools
🪛 Ruff (0.12.2)

21-21: Using .strip() with multi-character strings is misleading

(B005)

🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/mmrefine_evals.py at line 21, the use of
.rstrip("chat/completions") is incorrect because it removes any of the characters
'c', 'h', 'a', 't', '/', 'o', 'm', 'p', 'l', 'e', 'i', 'n', 's' from the end rather
than the exact substring. Replace it with .removesuffix("/chat/completions")
(Python 3.9+) or .replace("/chat/completions", "") to remove the exact suffix from
the API_URL string.
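
A quick illustration of the character-set semantics with a made-up URL; removesuffix (Python 3.9+) or replace removes the exact suffix instead:

url = "https://example.com/openai/chat/completions"  # hypothetical endpoint

# str.strip / str.rstrip treat the argument as a set of characters, not a substring:
url.rstrip("chat/completions")
# -> 'https://example.'   (also ate 'com/openai', since all of those characters are in the set)

url.removesuffix("/chat/completions")  # Python 3.9+: removes the exact suffix only
# -> 'https://example.com/openai'

url.replace("/chat/completions", "")   # the portable fix used in the suggestion above
# -> 'https://example.com/openai'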

Comment on lines +35 to +37
def __init__(self, quick_extract=False):
self.quick_extract = quick_extract


⚠️ Potential issue

Add type hints to __init__ method

Type hints are required for all code per the guidelines.

-def __init__(self, quick_extract=False):
+def __init__(self, quick_extract: bool = False) -> None:
📝 Committable suggestion


Suggested change
def __init__(self, quick_extract=False):
self.quick_extract = quick_extract
def __init__(self, quick_extract: bool = False) -> None:
self.quick_extract = quick_extract
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/mmrefine_evals.py around lines 35 to 37, the
__init__ method lacks type hints. Add type hints by specifying the type of the
quick_extract parameter as bool and the return type as None to comply with the
code guidelines.

Comment on lines +38 to +73
def get_chat_response(self, prompt, temperature=0, max_tokens=256, n=1, patience=10000000, sleep_time=0):
messages = [
{"role": "user", "content": prompt},
]
payload = {"model": self.gpt_model, "messages": messages, "temperature": temperature, "max_tokens": max_tokens}

while patience > 0:
patience -= 1
try:
response = self.client.chat.completions.create(**payload)
if n == 1:
prediction = response.choices[0].message.content.strip()
if prediction and prediction != "":
return prediction
else:
prediction = [choice.message.content.strip() for choice in response.choices]
if prediction and prediction[0] != "":
return prediction

except Exception as e:
if "Rate limit" not in str(e):
eval_logger.error(e)

if "Please reduce the length of the messages" in str(e):
eval_logger.error("!!Reduce prompt size")
# reduce input prompt and keep the tail
new_size = int(len(prompt) * 0.9)
new_start = len(prompt) - new_size
prompt = prompt[new_start:]
payload["messages"] = [
{"role": "user", "content": prompt},
]

if sleep_time > 0:
time.sleep(sleep_time)
return ""

⚠️ Potential issue

Add type hints and docstring to get_chat_response method

Methods need type hints and docstrings per the guidelines.

-def get_chat_response(self, prompt, temperature=0, max_tokens=256, n=1, patience=10000000, sleep_time=0):
+def get_chat_response(
+    self,
+    prompt: str,
+    temperature: float = 0,
+    max_tokens: int = 256,
+    n: int = 1,
+    patience: int = 10000000,
+    sleep_time: int = 0
+) -> str | list[str]:
+    """Get chat completion response from the configured API.
+    
+    Args:
+        prompt: Input prompt for the model
+        temperature: Sampling temperature
+        max_tokens: Maximum tokens to generate
+        n: Number of completions to generate
+        patience: Number of retries before giving up
+        sleep_time: Seconds to sleep between retries
+        
+    Returns:
+        Model response string or list of strings if n > 1
+    """
📝 Committable suggestion


Suggested change
def get_chat_response(self, prompt, temperature=0, max_tokens=256, n=1, patience=10000000, sleep_time=0):
messages = [
{"role": "user", "content": prompt},
]
payload = {"model": self.gpt_model, "messages": messages, "temperature": temperature, "max_tokens": max_tokens}
while patience > 0:
patience -= 1
try:
response = self.client.chat.completions.create(**payload)
if n == 1:
prediction = response.choices[0].message.content.strip()
if prediction and prediction != "":
return prediction
else:
prediction = [choice.message.content.strip() for choice in response.choices]
if prediction and prediction[0] != "":
return prediction
except Exception as e:
if "Rate limit" not in str(e):
eval_logger.error(e)
if "Please reduce the length of the messages" in str(e):
eval_logger.error("!!Reduce prompt size")
# reduce input prompt and keep the tail
new_size = int(len(prompt) * 0.9)
new_start = len(prompt) - new_size
prompt = prompt[new_start:]
payload["messages"] = [
{"role": "user", "content": prompt},
]
if sleep_time > 0:
time.sleep(sleep_time)
return ""
def get_chat_response(
self,
prompt: str,
temperature: float = 0,
max_tokens: int = 256,
n: int = 1,
patience: int = 10000000,
sleep_time: int = 0
) -> str | list[str]:
"""Get chat completion response from the configured API.
Args:
prompt: Input prompt for the model
temperature: Sampling temperature
max_tokens: Maximum tokens to generate
n: Number of completions to generate
patience: Number of retries before giving up
sleep_time: Seconds to sleep between retries
Returns:
Model response string or list of strings if n > 1
"""
messages = [
{"role": "user", "content": prompt},
]
payload = {
"model": self.gpt_model,
"messages": messages,
"temperature": temperature,
"max_tokens": max_tokens,
}
while patience > 0:
patience -= 1
try:
response = self.client.chat.completions.create(**payload)
if n == 1:
prediction = response.choices[0].message.content.strip()
if prediction and prediction != "":
return prediction
else:
prediction = [c.message.content.strip() for c in response.choices]
if prediction and prediction[0] != "":
return prediction
except Exception as e:
if "Rate limit" not in str(e):
eval_logger.error(e)
if "Please reduce the length of the messages" in str(e):
eval_logger.error("!!Reduce prompt size")
# reduce input prompt and keep the tail
new_size = int(len(prompt) * 0.9)
new_start = len(prompt) - new_size
prompt = prompt[new_start:]
payload["messages"] = [
{"role": "user", "content": prompt},
]
if sleep_time > 0:
time.sleep(sleep_time)
return ""
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/mmrefine_evals.py between lines 38 and 73, the
get_chat_response method lacks type hints and a docstring. Add appropriate type
hints for all parameters and the return type, specifying types like str, int,
and list or str for the return. Also, include a clear docstring describing the
method's purpose, parameters, and return value to improve code clarity and
maintainability.
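
For reference, a quick numeric check of the truncation-on-retry step in the code above; the prompt length here is hypothetical:

prompt = "x" * 1000                 # stand-in for an oversized prompt
new_size = int(len(prompt) * 0.9)   # 900 characters to keep
new_start = len(prompt) - new_size  # 100 characters dropped from the front
prompt = prompt[new_start:]         # retains the tail (most recent content)
assert len(prompt) == 900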

Comment on lines +30 to +36
def mmrefine_doc_to_visual(doc):
if doc["image"] is None or str(doc["image"]).strip() == "":
return []
if isinstance(doc["image"], Image.Image):
return [doc["image"].convert("RGB")]
return [Image.open(io.BytesIO(doc["image"]["bytes"])).convert("RGB")]


⚠️ Potential issue

Add type hints and docstring to mmrefine_doc_to_visual function

Functions require type hints and docstrings per guidelines.

-def mmrefine_doc_to_visual(doc):
+def mmrefine_doc_to_visual(doc: dict) -> list[Image.Image]:
+    """Convert document image data to PIL Image objects.
+    
+    Args:
+        doc: Document dictionary potentially containing image data
+        
+    Returns:
+        List of RGB PIL Image objects, empty if no image
+    """
📝 Committable suggestion


Suggested change
def mmrefine_doc_to_visual(doc):
if doc["image"] is None or str(doc["image"]).strip() == "":
return []
if isinstance(doc["image"], Image.Image):
return [doc["image"].convert("RGB")]
return [Image.open(io.BytesIO(doc["image"]["bytes"])).convert("RGB")]
def mmrefine_doc_to_visual(doc: dict) -> list[Image.Image]:
"""Convert document image data to PIL Image objects.
Args:
doc: Document dictionary potentially containing image data
Returns:
List of RGB PIL Image objects, empty if no image
"""
if doc["image"] is None or str(doc["image"]).strip() == "":
return []
if isinstance(doc["image"], Image.Image):
return [doc["image"].convert("RGB")]
return [Image.open(io.BytesIO(doc["image"]["bytes"])).convert("RGB")]
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py around lines 30 to 36, the function
mmrefine_doc_to_visual lacks type hints and a docstring. Add appropriate type
hints for the input parameter and return type, and include a clear docstring
describing the function's purpose, input parameter, and return value following
the project's documentation style.

Comment on lines +38 to +51
def mmrefine_doc_to_text(doc, lmms_eval_specific_kwargs=None):
problem = {
"id": doc["id"],
"question": doc["question"],
"answer": doc["answer"] if "answer" in doc else None,
"initial_solution": doc["initial_solution"],
"solution_source": doc["solution_source"],
"solution_label": doc["solution_label"],
"reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
"meta": doc["meta"],
}
query_prompt = mmrefine_evaluator.create_one_query(problem)
return query_prompt


⚠️ Potential issue

Add type hints and docstring to mmrefine_doc_to_text function

-def mmrefine_doc_to_text(doc, lmms_eval_specific_kwargs=None):
+def mmrefine_doc_to_text(
+    doc: dict,
+    lmms_eval_specific_kwargs: dict | None = None
+) -> str:
+    """Generate query prompt from document.
+    
+    Args:
+        doc: Document dictionary with problem details
+        lmms_eval_specific_kwargs: Optional eval-specific arguments
+        
+    Returns:
+        Query prompt string
+    """
📝 Committable suggestion


Suggested change
def mmrefine_doc_to_text(doc, lmms_eval_specific_kwargs=None):
problem = {
"id": doc["id"],
"question": doc["question"],
"answer": doc["answer"] if "answer" in doc else None,
"initial_solution": doc["initial_solution"],
"solution_source": doc["solution_source"],
"solution_label": doc["solution_label"],
"reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
"meta": doc["meta"],
}
query_prompt = mmrefine_evaluator.create_one_query(problem)
return query_prompt
def mmrefine_doc_to_text(
doc: dict,
lmms_eval_specific_kwargs: dict | None = None
) -> str:
"""Generate query prompt from document.
Args:
doc: Document dictionary with problem details
lmms_eval_specific_kwargs: Optional eval-specific arguments
Returns:
Query prompt string
"""
problem = {
"id": doc["id"],
"question": doc["question"],
"answer": doc["answer"] if "answer" in doc else None,
"initial_solution": doc["initial_solution"],
"solution_source": doc["solution_source"],
"solution_label": doc["solution_label"],
"reference_feedback": doc["reference_feedback"]
if "reference_feedback" in doc
else None,
"meta": doc["meta"],
}
query_prompt = mmrefine_evaluator.create_one_query(problem)
return query_prompt
🧰 Tools
🪛 Ruff (0.12.2)

42-42: Use doc.get("answer", None) instead of an if block

Replace with doc.get("answer", None)

(SIM401)


46-46: Use doc.get("reference_feedback", None) instead of an if block

Replace with doc.get("reference_feedback", None)

(SIM401)

🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py around lines 38 to 51, the function
mmrefine_doc_to_text lacks type hints and a docstring. Add appropriate type
hints for the function parameters and return type, and include a clear docstring
describing the function's purpose, inputs, and outputs to improve code
readability and maintainability.

Comment on lines +53 to +83
def mmrefine_process_results(doc, results):
prediction = results[0].strip()

result = {
"id": doc["id"],
"question": doc["question"],
"answer": doc["answer"] if "answer" in doc else None,
"prediction": prediction,
"initial_solution": doc["initial_solution"],
"solution_source": doc["solution_source"],
"solution_label": doc["solution_label"],
"reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
"meta": doc["meta"],
}
result = mmrefine_evaluator.eval_result(result, config)

metric_dict = {}
for _metric in [
"Refinement Failure",
"Error Detection Success",
"Error Correction Success",
"Refinement Success",
"False Error Detection",
"Validation Success",
"RefScore",
"mRecall",
]:
_result = result.copy()
metric_dict[_metric] = _result
return metric_dict


⚠️ Potential issue

Add type hints and docstring to mmrefine_process_results function

-def mmrefine_process_results(doc, results):
+def mmrefine_process_results(doc: dict, results: list) -> dict:
+    """Process prediction results and evaluate them.
+    
+    Args:
+        doc: Document dictionary with problem details
+        results: List containing prediction string
+        
+    Returns:
+        Dictionary mapping metric names to evaluated results
+    """
📝 Committable suggestion


Suggested change
def mmrefine_process_results(doc, results):
prediction = results[0].strip()
result = {
"id": doc["id"],
"question": doc["question"],
"answer": doc["answer"] if "answer" in doc else None,
"prediction": prediction,
"initial_solution": doc["initial_solution"],
"solution_source": doc["solution_source"],
"solution_label": doc["solution_label"],
"reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
"meta": doc["meta"],
}
result = mmrefine_evaluator.eval_result(result, config)
metric_dict = {}
for _metric in [
"Refinement Failure",
"Error Detection Success",
"Error Correction Success",
"Refinement Success",
"False Error Detection",
"Validation Success",
"RefScore",
"mRecall",
]:
_result = result.copy()
metric_dict[_metric] = _result
return metric_dict
def mmrefine_process_results(doc: dict, results: list) -> dict:
"""Process prediction results and evaluate them.
Args:
doc: Document dictionary with problem details
results: List containing prediction string
Returns:
Dictionary mapping metric names to evaluated results
"""
prediction = results[0].strip()
result = {
"id": doc["id"],
"question": doc["question"],
"answer": doc["answer"] if "answer" in doc else None,
"prediction": prediction,
"initial_solution": doc["initial_solution"],
"solution_source": doc["solution_source"],
"solution_label": doc["solution_label"],
"reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
"meta": doc["meta"],
}
result = mmrefine_evaluator.eval_result(result, config)
metric_dict = {}
for _metric in [
"Refinement Failure",
"Error Detection Success",
"Error Correction Success",
"Refinement Success",
"False Error Detection",
"Validation Success",
"RefScore",
"mRecall",
]:
_result = result.copy()
metric_dict[_metric] = _result
return metric_dict
🧰 Tools
🪛 Ruff (0.12.2)

59-59: Use doc.get("answer", None) instead of an if block

Replace with doc.get("answer", None)

(SIM401)


64-64: Use doc.get("reference_feedback", None) instead of an if block

Replace with doc.get("reference_feedback", None)

(SIM401)

🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py between lines 53 and 83, the function
mmrefine_process_results lacks type hints and a docstring. Add appropriate type
hints for the parameters and return type based on their expected data types,
such as dict for doc and results, and dict for the return. Also, include a
concise docstring describing the function's purpose, parameters, and return
value to improve code clarity and maintainability.

Comment on lines 84 to 114
def mmrefine_aggregate_results(results, args=None, **kwargs):
# calculate total scores
results_dict = {f"{result['id']}_{result['solution_source']}": result for result in results}
results = pd.DataFrame(results)
try:
assert len(results) == len(results_dict)
except AssertionError:
eval_logger.error(f"Length mismatch: {len(results)} results vs {len(results_dict)} unique entries")

n_correct_solutions = len(results[results["solution_label"]=="correct"])
n_incorrect_solutions = len(results) - n_correct_solutions

vc = results["eval_result"].value_counts()
scores = {
"Refinement Failure": vc.loc["Refinement Failure"] / n_incorrect_solutions if "Refinement Failure" in vc.keys() else 0.0,
"Error Detection Success": vc.loc["Error Detection Success"] / n_incorrect_solutions if "Error Detection Success" in vc.keys() else 0.0,
"Error Correction Success": vc.loc["Error Correction Success"] / n_incorrect_solutions if "Error Correction Success" in vc.keys() else 0.0,
"Refinement Success": vc.loc["Refinement Success"] / n_incorrect_solutions if "Refinement Success" in vc.keys() else 0.0,
"False Error Detection": vc.loc["False Error Detection"] / n_correct_solutions if "False Error Detection" in vc.keys() else 0.0,
"Validation Success": vc.loc["Validation Success"] / n_correct_solutions if "Validation Success" in vc.keys() else 0.0,
}

scores["RefScore"] = scores["Refinement Success"] - scores["False Error Detection"]
scores["Error Correction Success"] = scores["Error Correction Success"] + scores["Refinement Success"]
scores["Error Detection Success"] = scores["Error Detection Success"] + scores["Error Correction Success"]
scores["mRecall"] = (scores["Error Detection Success"] + (1 - scores["False Error Detection"])) / 2

scores["results_dict"] = results_dict

return scores


⚠️ Potential issue

Add type hints and docstring to mmrefine_aggregate_results function

-def mmrefine_aggregate_results(results, args=None, **kwargs):
+def mmrefine_aggregate_results(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> dict:
+    """Aggregate evaluation results and compute metrics.
+    
+    Args:
+        results: List of result dictionaries
+        args: Optional arguments object
+        **kwargs: Additional keyword arguments
+        
+    Returns:
+        Dictionary with computed metrics and results
+    """
📝 Committable suggestion


Suggested change
def mmrefine_aggregate_results(results, args=None, **kwargs):
# calculate total scores
results_dict = {f"{result['id']}_{result['solution_source']}": result for result in results}
results = pd.DataFrame(results)
try:
assert len(results) == len(results_dict)
except AssertionError:
eval_logger.error(f"Length mismatch: {len(results)} results vs {len(results_dict)} unique entries")
n_correct_solutions = len(results[results["solution_label"]=="correct"])
n_incorrect_solutions = len(results) - n_correct_solutions
vc = results["eval_result"].value_counts()
scores = {
"Refinement Failure": vc.loc["Refinement Failure"] / n_incorrect_solutions if "Refinement Failure" in vc.keys() else 0.0,
"Error Detection Success": vc.loc["Error Detection Success"] / n_incorrect_solutions if "Error Detection Success" in vc.keys() else 0.0,
"Error Correction Success": vc.loc["Error Correction Success"] / n_incorrect_solutions if "Error Correction Success" in vc.keys() else 0.0,
"Refinement Success": vc.loc["Refinement Success"] / n_incorrect_solutions if "Refinement Success" in vc.keys() else 0.0,
"False Error Detection": vc.loc["False Error Detection"] / n_correct_solutions if "False Error Detection" in vc.keys() else 0.0,
"Validation Success": vc.loc["Validation Success"] / n_correct_solutions if "Validation Success" in vc.keys() else 0.0,
}
scores["RefScore"] = scores["Refinement Success"] - scores["False Error Detection"]
scores["Error Correction Success"] = scores["Error Correction Success"] + scores["Refinement Success"]
scores["Error Detection Success"] = scores["Error Detection Success"] + scores["Error Correction Success"]
scores["mRecall"] = (scores["Error Detection Success"] + (1 - scores["False Error Detection"])) / 2
scores["results_dict"] = results_dict
return scores
def mmrefine_aggregate_results(
results: list[dict],
args: object | None = None,
**kwargs
) -> dict:
"""Aggregate evaluation results and compute metrics.
Args:
results: List of result dictionaries
args: Optional arguments object
**kwargs: Additional keyword arguments
Returns:
Dictionary with computed metrics and results
"""
# calculate total scores
results_dict = {f"{result['id']}_{result['solution_source']}": result for result in results}
results = pd.DataFrame(results)
try:
assert len(results) == len(results_dict)
except AssertionError:
eval_logger.error(
f"Length mismatch: {len(results)} results vs {len(results_dict)} unique entries"
)
n_correct_solutions = len(results[results["solution_label"] == "correct"])
n_incorrect_solutions = len(results) - n_correct_solutions
vc = results["eval_result"].value_counts()
scores = {
"Refinement Failure": (
vc.loc["Refinement Failure"] / n_incorrect_solutions
if "Refinement Failure" in vc.keys()
else 0.0
),
"Error Detection Success": (
vc.loc["Error Detection Success"] / n_incorrect_solutions
if "Error Detection Success" in vc.keys()
else 0.0
),
"Error Correction Success": (
vc.loc["Error Correction Success"] / n_incorrect_solutions
if "Error Correction Success" in vc.keys()
else 0.0
),
"Refinement Success": (
vc.loc["Refinement Success"] / n_incorrect_solutions
if "Refinement Success" in vc.keys()
else 0.0
),
"False Error Detection": (
vc.loc["False Error Detection"] / n_correct_solutions
if "False Error Detection" in vc.keys()
else 0.0
),
"Validation Success": (
vc.loc["Validation Success"] / n_correct_solutions
if "Validation Success" in vc.keys()
else 0.0
),
}
scores["RefScore"] = scores["Refinement Success"] - scores["False Error Detection"]
scores["Error Correction Success"] += scores["Refinement Success"]
scores["Error Detection Success"] += scores["Error Correction Success"]
scores["mRecall"] = (
scores["Error Detection Success"] + (1 - scores["False Error Detection"])
) / 2
scores["results_dict"] = results_dict
return scores
🧰 Tools
🪛 Ruff (0.12.2)

98-98: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


99-99: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


100-100: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


101-101: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


102-102: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


103-103: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py around lines 84 to 114, the function
mmrefine_aggregate_results lacks type hints and a docstring. Add a clear
docstring describing the function's purpose, parameters, and return value.
Include type hints for the parameters (results as a list of dictionaries, args
as optional) and the return type (a dictionary with string keys and float or
dict values) to improve code readability and maintainability.

Comment on lines 115 to 148
def mmrefine_aggregate_results_refinement_failure(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Failure"]

def mmrefine_aggregate_results_error_detection_success(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Error Detection Success"]

def mmrefine_aggregate_results_error_correction_success(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Error Correction Success"]

def mmrefine_aggregate_results_refinement_success(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Success"]

def mmrefine_aggregate_results_false_error_detection(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["False Error Detection"]

def mmrefine_aggregate_results_validation_success(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Validation Success"]

def mmrefine_aggregate_results_refscore(results, args=None, **kwargs):
scores = mmrefine_aggregate_results(results, args, **kwargs)
results_dict = scores.pop("results_dict")

path = generate_submission_file("mmrefine_results.json", args)
with open(path, "w") as f:
json.dump(results_dict, f, indent=4)

path = generate_submission_file("mmrefine_scores.json", args)
with open(path, "w") as f:
json.dump(scores, f, indent=4)
return scores["RefScore"]

def mmrefine_aggregate_results_mrecall(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["mRecall"]


⚠️ Potential issue

Add type hints to aggregation helper functions

All the aggregation helper functions need type hints.

Example for one function (apply to all):

-def mmrefine_aggregate_results_refinement_failure(results, args=None, **kwargs):
+def mmrefine_aggregate_results_refinement_failure(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
📝 Committable suggestion


Suggested change
def mmrefine_aggregate_results_refinement_failure(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Failure"]
def mmrefine_aggregate_results_error_detection_success(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Error Detection Success"]
def mmrefine_aggregate_results_error_correction_success(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Error Correction Success"]
def mmrefine_aggregate_results_refinement_success(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Success"]
def mmrefine_aggregate_results_false_error_detection(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["False Error Detection"]
def mmrefine_aggregate_results_validation_success(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["Validation Success"]
def mmrefine_aggregate_results_refscore(results, args=None, **kwargs):
scores = mmrefine_aggregate_results(results, args, **kwargs)
results_dict = scores.pop("results_dict")
path = generate_submission_file("mmrefine_results.json", args)
with open(path, "w") as f:
json.dump(results_dict, f, indent=4)
path = generate_submission_file("mmrefine_scores.json", args)
with open(path, "w") as f:
json.dump(scores, f, indent=4)
return scores["RefScore"]
def mmrefine_aggregate_results_mrecall(results, args=None, **kwargs):
return mmrefine_aggregate_results(results, args, **kwargs)["mRecall"]
def mmrefine_aggregate_results_refinement_failure(
results: list[dict],
args: object | None = None,
**kwargs
) -> float:
return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Failure"]
🤖 Prompt for AI Agents
In lmms_eval/tasks/mmrefine/utils.py between lines 115 and 148, the aggregation
helper functions lack type hints. Add appropriate type hints to all these
functions, specifying the types of the parameters (e.g., results as a list or
dict, args as optional argument) and the return type (likely a float or dict
depending on the function). This will improve code clarity and type safety.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (16)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (8)

124-124: Remove unnecessary f-string prefix

The string contains no placeholders.

-                eval_logger.error(f"Error in evaluating answer for problem")
+                eval_logger.error("Error in evaluating answer for problem")

31-31: Fix incorrect use of rstrip with multi-character string

.rstrip("chat/completions") removes any combination of those characters from the end, not the exact substring. Use .removesuffix() for Python 3.9+ or string replacement.

-client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+client = OpenAI(api_key=API_KEY, base_url=API_URL.removesuffix("/chat/completions"))

For Python < 3.9 compatibility:

-client = OpenAI(api_key=API_KEY, base_url=API_URL.rstrip("chat/completions"))
+base_url = API_URL[:-len("/chat/completions")] if API_URL.endswith("/chat/completions") else API_URL
+client = OpenAI(api_key=API_KEY, base_url=base_url)

45-46: Add type hints to __init__ method

Type hints are required for all code per the guidelines.

-def __init__(self, quick_extract=False):
+def __init__(self, quick_extract: bool = False) -> None:

48-83: Add type hints and docstring to get_chat_response method

Public methods require type hints and docstrings per the guidelines.

-def get_chat_response(self, prompt, temperature=0, max_tokens=256, n=1, patience=10000000, sleep_time=0):
+def get_chat_response(
+    self,
+    prompt: str,
+    temperature: float = 0,
+    max_tokens: int = 256,
+    n: int = 1,
+    patience: int = 10000000,
+    sleep_time: int = 0
+) -> str | list[str]:
+    """Get chat completion response from the configured API.
+    
+    Args:
+        prompt: Input prompt for the model
+        temperature: Sampling temperature (0-2)
+        max_tokens: Maximum tokens to generate
+        n: Number of completions to generate
+        patience: Number of retries before giving up
+        sleep_time: Seconds to sleep between retries
+        
+    Returns:
+        Model response string or list of strings if n > 1
+    """

85-130: Add type hints and docstring to evaluate_answer method

This method needs proper type annotations and documentation.

-def evaluate_answer(self, problem, prediction):
+def evaluate_answer(self, problem: dict, prediction: str) -> dict:
+    """Evaluate a model's prediction against a problem.
+    
+    Args:
+        problem: Problem dictionary with solution details
+        prediction: Model's predicted refinement
+        
+    Returns:
+        Dictionary with evaluation scores
+    """

98-102: Replace bare except clauses with specific exceptions

Bare except clauses can hide unexpected errors. Catch specific exceptions.

-        except:
+        except (ValueError, TypeError):
             try:
                 solution_correctness = self.get_chat_response(PARSING_PROMPT.format(target="Output", model_response=resp), temperature=0, max_tokens=256, n=1)
-            except:
+            except Exception as e:
+                eval_logger.error(f"Failed to parse solution correctness: {e}")
                 solution_correctness = 0

112-121: Replace bare except clauses with specific exceptions

-                except:
+                except (ValueError, TypeError):
                     error_detection = 0
                 try:
                     error_correction = int(self.get_chat_response(PARSING_PROMPT.format(target="Error Correction", model_response=resp), temperature=0, max_tokens=256, n=1).strip())
-                except:
+                except (ValueError, TypeError):
                     error_correction = 0
                 try:
                     solution_correctness = int(self.get_chat_response(PARSING_PROMPT.format(target="Effectiveness and Correctness of the Feedback", model_response=resp), temperature=0, max_tokens=256, n=1).strip())
-                except:
+                except (ValueError, TypeError):
                     solution_correctness = 0

132-134: Add type hints and docstring to create_one_query method

-def create_one_query(self, problem):
+def create_one_query(self, problem: dict) -> str:
+    """Create a refinement query prompt for a given problem.
+    
+    Args:
+        problem: Problem dictionary with question and initial solution
+        
+    Returns:
+        Formatted query prompt string
+    """
lmms_eval/tasks/mmrefine/utils.py (8)

41-45: Simplify conditional assignments with dict.get()

-        "answer": doc["answer"] if "answer" in doc else None,
+        "answer": doc.get("answer", None),
         "initial_solution": doc["initial_solution"],
         "solution_source": doc["solution_source"],
         "solution_label": doc["solution_label"],
-        "reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
+        "reference_feedback": doc.get("reference_feedback", None),

58-63: Simplify conditional assignments with dict.get()

-        "answer": doc["answer"] if "answer" in doc else None,
+        "answer": doc.get("answer", None),
         "prediction": prediction,
         "initial_solution": doc["initial_solution"],
         "solution_source": doc["solution_source"],
         "solution_label": doc["solution_label"],
-        "reference_feedback": doc["reference_feedback"] if "reference_feedback" in doc else None,
+        "reference_feedback": doc.get("reference_feedback", None),

98-103: Remove unnecessary .keys() when checking dict membership

Use key in dict instead of key in dict.keys() for cleaner, more efficient code.

-        "Refinement Failure": vc.loc["Refinement Failure"] / n_incorrect_solutions if "Refinement Failure" in vc.keys() else 0.0,
-        "Error Detection Success": vc.loc["Error Detection Success"] / n_incorrect_solutions if "Error Detection Success" in vc.keys() else 0.0,
-        "Error Correction Success": vc.loc["Error Correction Success"] / n_incorrect_solutions if "Error Correction Success" in vc.keys() else 0.0,
-        "Refinement Success": vc.loc["Refinement Success"] / n_incorrect_solutions if "Refinement Success" in vc.keys() else 0.0,
-        "False Error Detection": vc.loc["False Error Detection"] / n_correct_solutions if "False Error Detection" in vc.keys() else 0.0,
-        "Validation Success": vc.loc["Validation Success"] / n_correct_solutions if "Validation Success" in vc.keys() else 0.0,
+        "Refinement Failure": vc.loc["Refinement Failure"] / n_incorrect_solutions if "Refinement Failure" in vc else 0.0,
+        "Error Detection Success": vc.loc["Error Detection Success"] / n_incorrect_solutions if "Error Detection Success" in vc else 0.0,
+        "Error Correction Success": vc.loc["Error Correction Success"] / n_incorrect_solutions if "Error Correction Success" in vc else 0.0,
+        "Refinement Success": vc.loc["Refinement Success"] / n_incorrect_solutions if "Refinement Success" in vc else 0.0,
+        "False Error Detection": vc.loc["False Error Detection"] / n_correct_solutions if "False Error Detection" in vc else 0.0,
+        "Validation Success": vc.loc["Validation Success"] / n_correct_solutions if "Validation Success" in vc else 0.0,

29-34: Add type hints and docstring to mmrefine_doc_to_visual function

Functions require type hints and docstrings per guidelines.

-def mmrefine_doc_to_visual(doc):
+def mmrefine_doc_to_visual(doc: dict) -> list[Image.Image]:
+    """Convert document image data to PIL Image objects.
+    
+    Args:
+        doc: Document dictionary potentially containing image data
+        
+    Returns:
+        List of RGB PIL Image objects, empty if no image
+    """

37-49: Add type hints and docstring to mmrefine_doc_to_text function

-def mmrefine_doc_to_text(doc, lmms_eval_specific_kwargs=None):
+def mmrefine_doc_to_text(
+    doc: dict,
+    lmms_eval_specific_kwargs: dict | None = None
+) -> str:
+    """Generate query prompt from document.
+    
+    Args:
+        doc: Document dictionary with problem details
+        lmms_eval_specific_kwargs: Optional eval-specific arguments
+        
+    Returns:
+        Query prompt string
+    """

52-81: Add type hints and docstring to mmrefine_process_results function

-def mmrefine_process_results(doc, results):
+def mmrefine_process_results(doc: dict, results: list) -> dict:
+    """Process prediction results and evaluate them.
+    
+    Args:
+        doc: Document dictionary with problem details
+        results: List containing prediction string
+        
+    Returns:
+        Dictionary mapping metric names to evaluated results
+    """

84-113: Add type hints and docstring to mmrefine_aggregate_results function

-def mmrefine_aggregate_results(results, args=None, **kwargs):
+def mmrefine_aggregate_results(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> dict:
+    """Aggregate evaluation results and compute metrics.
+    
+    Args:
+        results: List of result dictionaries
+        args: Optional arguments object
+        **kwargs: Additional keyword arguments
+        
+    Returns:
+        Dictionary with computed metrics and results
+    """

116-155: Add type hints to all aggregation helper functions

All the aggregation helper functions need type hints per guidelines.

-def mmrefine_aggregate_results_refinement_failure(results, args=None, **kwargs):
+def mmrefine_aggregate_results_refinement_failure(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Failure"]

-def mmrefine_aggregate_results_error_detection_success(results, args=None, **kwargs):
+def mmrefine_aggregate_results_error_detection_success(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Error Detection Success"]

-def mmrefine_aggregate_results_error_correction_success(results, args=None, **kwargs):
+def mmrefine_aggregate_results_error_correction_success(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Error Correction Success"]

-def mmrefine_aggregate_results_refinement_success(results, args=None, **kwargs):
+def mmrefine_aggregate_results_refinement_success(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Refinement Success"]

-def mmrefine_aggregate_results_false_error_detection(results, args=None, **kwargs):
+def mmrefine_aggregate_results_false_error_detection(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["False Error Detection"]

-def mmrefine_aggregate_results_validation_success(results, args=None, **kwargs):
+def mmrefine_aggregate_results_validation_success(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["Validation Success"]

-def mmrefine_aggregate_results_refscore(results, args=None, **kwargs):
+def mmrefine_aggregate_results_refscore(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     scores = mmrefine_aggregate_results(results, args, **kwargs)
     # ... rest of implementation

-def mmrefine_aggregate_results_mrecall(results, args=None, **kwargs):
+def mmrefine_aggregate_results_mrecall(
+    results: list[dict],
+    args: object | None = None,
+    **kwargs
+) -> float:
     return mmrefine_aggregate_results(results, args, **kwargs)["mRecall"]
🧹 Nitpick comments (2)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (1)

140-143: Simplify conditional assignments with dict.get()

Use dict.get() for cleaner code.

-        "answer": result["answer"] if "answer" in result else None,
+        "answer": result.get("answer", None),
         "initial_solution": result["initial_solution"],
         "solution_label": result["solution_label"],
-        "reference_feedback": result["reference_feedback"] if "reference_feedback" in result else None,
+        "reference_feedback": result.get("reference_feedback", None),
lmms_eval/tasks/mmrefine/utils.py (1)

16-19: Rename unused loop variable to underscore

Loop variable i is not used within the loop body.

-    for i, line in enumerate(raw_data):
+    for _, line in enumerate(raw_data):

Or if enumeration is not needed:

-    for i, line in enumerate(raw_data):
+    for line in raw_data:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 366f9b6 and c5d2e87.

📒 Files selected for processing (4)
  • lmms_eval/tasks/mmrefine/mmrefine.yaml (1 hunks)
  • lmms_eval/tasks/mmrefine/mmrefine_evals.py (1 hunks)
  • lmms_eval/tasks/mmrefine/prompts.py (1 hunks)
  • lmms_eval/tasks/mmrefine/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lmms_eval/tasks/mmrefine/prompts.py
  • lmms_eval/tasks/mmrefine/mmrefine.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit Inference Engine (CLAUDE.md)

**/*.py: Type hints required for all code
Public APIs must have docstrings
Functions must be focused and small
Line length: 88 chars maximum
PEP 8 naming (snake_case for functions/variables)
Class names in PascalCase
Constants in UPPER_SNAKE_CASE
Document with docstrings
Use f-strings for formatting
Use early returns to avoid nested conditions
Use clear variable/function names (prefix handlers with "handle")
Use constants where possible instead of functions
Don't repeat yourself (DRY code)
Prefer functional, immutable approaches when not verbose
Define composing functions before their components
Mark issues in existing code with "TODO:" prefix
Use functional and stateless approaches where they improve clarity
Keep core logic clean and push implementation details to the edges
Ruff: Line length (88 chars), import sorting (I001), unused imports
Strings: use parentheses for line wrapping
Function calls: multi-line with proper indent
Imports: split into multiple lines
Explicit None checks for Optional
Type narrowing for strings
Ruff (Python) runs as pre-commit hook
Break strings with parentheses for line length
Multi-line function calls for line length
Split imports for line length
Add None checks for Optional types
Narrow string types
Document public APIs

Files:

  • lmms_eval/tasks/mmrefine/mmrefine_evals.py
  • lmms_eval/tasks/mmrefine/utils.py
🧠 Learnings (3)
📚 Learning: applies to **/*.py : ruff: line length (88 chars), import sorting (i001), unused imports...
Learnt from: CR
PR: EvolvingLMMs-Lab/lmms-eval#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-04T08:07:14.042Z
Learning: Applies to **/*.py : Ruff: Line length (88 chars), import sorting (I001), unused imports

Applied to files:

  • lmms_eval/tasks/mmrefine/mmrefine_evals.py
  • lmms_eval/tasks/mmrefine/utils.py
📚 Learning: applies to **/*.py : type hints required for all code...
Learnt from: CR
PR: EvolvingLMMs-Lab/lmms-eval#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-04T08:07:14.042Z
Learning: Applies to **/*.py : Type hints required for all code

Applied to files:

  • lmms_eval/tasks/mmrefine/mmrefine_evals.py
  • lmms_eval/tasks/mmrefine/utils.py
📚 Learning: applies to **/*.py : use f-strings for formatting...
Learnt from: CR
PR: EvolvingLMMs-Lab/lmms-eval#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-04T08:07:14.042Z
Learning: Applies to **/*.py : Use f-strings for formatting

Applied to files:

  • lmms_eval/tasks/mmrefine/mmrefine_evals.py
🧬 Code Graph Analysis (1)
lmms_eval/tasks/mmrefine/utils.py (2)
lmms_eval/tasks/_task_utils/file_utils.py (1)
  • generate_submission_file (4-12)
lmms_eval/tasks/mmrefine/mmrefine_evals.py (3)
  • MMRefineEvaluator (16-169)
  • create_one_query (132-134)
  • eval_result (136-169)
🪛 Ruff (0.12.2)
lmms_eval/tasks/mmrefine/mmrefine_evals.py

31-31: Using .strip() with multi-character strings is misleading

(B005)


98-98: Do not use bare except

(E722)


101-101: Do not use bare except

(E722)


112-112: Do not use bare except

(E722)


116-116: Do not use bare except

(E722)


120-120: Do not use bare except

(E722)


124-124: f-string without any placeholders

Remove extraneous f prefix

(F541)


140-140: Use result.get("answer", None) instead of an if block

Replace with result.get("answer", None)

(SIM401)


143-143: Use result.get("reference_feedback", None) instead of an if block

Replace with result.get("reference_feedback", None)

(SIM401)

lmms_eval/tasks/mmrefine/utils.py

16-16: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


41-41: Use doc.get("answer", None) instead of an if block

Replace with doc.get("answer", None)

(SIM401)


45-45: Use doc.get("reference_feedback", None) instead of an if block

Replace with doc.get("reference_feedback", None)

(SIM401)


58-58: Use doc.get("answer", None) instead of an if block

Replace with doc.get("answer", None)

(SIM401)


63-63: Use doc.get("reference_feedback", None) instead of an if block

Replace with doc.get("reference_feedback", None)

(SIM401)


98-98: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


99-99: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


100-100: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


101-101: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


102-102: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


103-103: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Luodian merged commit 30bd271 into EvolvingLMMs-Lab:main on Aug 12, 2025 (2 checks passed).