
Conversation

@KelvinDo183 (Collaborator) commented Sep 23, 2025

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you encounter lint warnings, you can run the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Thank you for your contributions!

Summary by CodeRabbit

  • New Features
    • Added SciBench evaluation (single- and multi-shot) with few-shot prompts and configurable pre/post prompts.
    • Automatic numeric answer extraction, normalization, and unit-aware comparison with ~5% tolerance.
    • Added MEDQA multiple-choice pipeline: prompt construction, choice inference, robust free-form answer parsing, and result processing.
    • Standardized accuracy metric (mean aggregation) for evaluation reports.


coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds SciBench (single- and multi-shot) and MEDQA task configurations plus two utility modules implementing prompt construction, parsing, normalization, target extraction, and result-processing hooks wired into evaluation configs.
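
To make the wiring concrete, here is a minimal sketch of the two hook shapes the walkthrough refers to, a prompt builder and a result processor, assuming SciBench-style fields (problem_text, answer_number) and the roughly 5% numeric tolerance mentioned in the summary. It is illustrative only, not the PR's actual code, which is quoted in full in the review comments below.

# Illustrative sketch only (not the PR's exact code): the rough shape of the
# doc_to_text / process_results hooks a task module exposes. Field names
# "problem_text" and "answer_number" and the 5% tolerance follow the summary.
import re
from math import isclose
from typing import Dict, List


def doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    """Build the prompt the model sees for one document."""
    pre = lmms_eval_specific_kwargs.get("pre_prompt", "")
    post = lmms_eval_specific_kwargs.get("post_prompt", "")
    return f"{pre}{doc['problem_text']}{post}"


def process_results(doc: Dict, results: List[str]) -> Dict[str, float]:
    """Score one completion: pull the first number and compare within 5%."""
    text = results[0] if results else ""
    match = re.search(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?", text)
    if match is None:
        return {"accuracy": 0.0}
    try:
        ok = isclose(float(match.group(0)), float(doc["answer_number"]), rel_tol=0.05)
    except (TypeError, ValueError):
        ok = False
    return {"accuracy": 1.0 if ok else 0.0}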

Changes

Cohort / File(s) | Summary
SciBench Task Configs
lmms_eval/tasks/scibench/scibench.yaml, lmms_eval/tasks/scibench/scibench_multishot.yaml
New YAML configs registering dataset_path/kwargs, test_split, task names, doc_to_text/doc_to_target bindings, lmms_eval-specific default prompts, accuracy metric (mean, higher_is_better), process_results hook, and metadata version. Multishot uses a FEWSHOT prompt binding.
SciBench Utilities
lmms_eval/tasks/scibench/utils.py
New utilities: FEWSHOT_PROMPT, prompt builders (scibench_doc_to_text, scibench_multishot_doc_to_text), boxed-answer extraction, numeric string cleaning/parsing (clean_number_string, parse_not, cal_not, remove_not), unit-aware comparison (equiv_with_unit), and scibench_process_results to compute accuracy with tolerance.
MEDQA Task Config
lmms_eval/tasks/medqa/medqa.yaml
New YAML config registering dataset_path/kwargs, test_split, task name, doc_to_text/doc_to_target/doc_to_choice bindings, lmms_eval-specific prompts, accuracy metric, process_results hook, and metadata version.
MEDQA Utilities
lmms_eval/tasks/medqa/utils.py
New utilities: medqa_doc_to_text, medqa_doc_to_target, medqa_doc_to_choice, medqa_process_results, and _parse_multi_choice_response for robust MCQ prompt creation, target extraction, response parsing, and accuracy calculation.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Runner
  participant Config as Task YAML
  participant Utils as Task Utils
  participant Model as LM

  Runner->>Config: Load task config (dataset, mappings, metrics)
  Runner->>Utils: doc_to_text(doc, lmms_eval_specific_kwargs)
  Utils-->>Runner: Prompt text
  Runner->>Model: Send prompt -> generate completion
  Model-->>Runner: Completion / answers
  Runner->>Utils: process_results(doc, [completion])
  Utils-->>Runner: {"metric_name": value}
  Runner->>Runner: Aggregate metrics (e.g., mean accuracy)
  note right of Runner: Reports final metric (higher_is_better)
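
The same flow in code form, as a hypothetical driver sketch mirroring the diagram (run_task and generate are placeholder names, not lmms_eval's actual API):

# Hypothetical driver sketch mirroring the sequence diagram; run_task and
# generate are placeholder names, not lmms_eval's actual API.
from statistics import mean
from typing import Callable, Dict, List


def run_task(
    docs: List[Dict],
    doc_to_text: Callable[[Dict, Dict], str],
    process_results: Callable[[Dict, List[str]], Dict[str, float]],
    generate: Callable[[str], str],
    kwargs: Dict,
) -> Dict[str, float]:
    """Build a prompt per doc, query the model, score, then mean-aggregate."""
    per_doc = []
    for doc in docs:
        prompt = doc_to_text(doc, kwargs)  # Runner -> Utils
        completion = generate(prompt)  # Runner -> Model
        per_doc.append(process_results(doc, [completion]))  # Runner -> Utils
    if not per_doc:
        return {}
    # Aggregate each metric with mean (accuracy is higher_is_better).
    return {key: mean(d[key] for d in per_doc) for key in per_doc[0]}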

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • Luodian
  • kcz358

Poem

I nibble prompts beneath the moon,
Boxed answers hum a quiet tune.
I parse, I check each unit right,
Multishot sparks the data night.
Hop, score, and stash a carrot bright—🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Description Check: ⚠️ Warning
Explanation: The PR description contains only the repository template text and provides no summary of the actual changes, rationale, tests, or completed checklist items; it therefore does not meet the repository's required description template. Required information, such as a detailed description of the added files (scibench configs and utils, and other new modules), testing/reproduction steps, and the checklist status, is missing. Because the description is the untouched template, reviewers cannot assess intent or testing from it.
Resolution: Please replace the template text with a detailed description that summarizes the changes (list added/modified files and their purpose, e.g., scibench configs, utils, medqa additions), state the motivation and expected behavior, include testing or reproduction steps and any configuration notes (dataset paths, kwargs), and mark the checklist items as completed before requesting review. Also mention any backward-incompatible changes or required follow-ups.
✅ Passed checks (2 passed)
Title Check: ✅ Passed. The title "add scibench(math) task" is concise and accurately reflects the primary change in the changeset, which introduces SciBench math configuration files and supporting utilities; it is a short, single sentence that highlights the main developer intention. The phrasing is clear and specific enough for someone scanning PR history to understand the primary purpose. No further clarification is strictly required for the title.
Docstring Coverage: ✅ Passed. No functions found in the changes. Docstring coverage check skipped.


coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (6)
lmms_eval/tasks/scibench/utils.py (5)

28-35: Guard optional unit and add a minimal docstring.

-def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."    
-    return f"{pre_prompt}{question}{post_prompt}"
+def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
+    """Single-shot prompt builder for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = doc["problem_text"]
+    unit = str(doc.get("unit", "") or "").strip()
+    if unit:
+        question += f" The unit of the answer is {unit}."
+    return f"{pre_prompt}{question}{post_prompt}"

118-125: Use pre_prompt/post_prompt in multishot prompt; add docstring.

-def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."    
+def scibench_multishot_doc_to_text(
+    doc: Dict, lmms_eval_specific_kwargs: Dict
+) -> str:
+    """Few-shot prompt builder for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = doc["problem_text"]
+    unit = str(doc.get("unit", "") or "").strip()
+    if unit:
+        question += f" The unit of the answer is {unit}."
+    return (
+        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
+        f"Answer: Let's think step by step.{post_prompt}"
+    )

88-103: Remove debug prints, type annotate, and use the unit parameter.

-def equiv_with_unit(model_output, answer, unit):
-    model_output=model_output.replace(',', '')
-    print("Model_output: ", model_output)
-    try:
-        ans=float(answer.strip())
-        first=isclose(float(model_output.strip()), ans, rel_tol=0.05)
-    except:
-        first=False
-    try: 
-        model=model_output.strip().split()[0]
-        second=isclose(float(model.strip()), ans, rel_tol=0.05)
-    except:
-        second=False
-    if first or second:
-        return True
-    return False
+def equiv_with_unit(model_output: str, answer: str, unit: str) -> bool:
+    """Compare numeric values, ignoring commas and an optional trailing unit."""
+    try:
+        ans = float(clean_number_string(answer))
+    except (TypeError, ValueError):
+        return False
+    candidates = [
+        model_output,
+        model_output.split()[0] if model_output.split() else model_output,
+    ]
+    for c in candidates:
+        try:
+            c_num = clean_number_string(c.replace(",", "").replace(unit, ""))
+            if isclose(float(c_num), ans, rel_tol=0.05):
+                return True
+        except (TypeError, ValueError):
+            continue
+    return False

105-107: Replace ambiguous Unicode minus with escape to satisfy Ruff (RUF001).

-def clean_number_string(s):
-    return s.replace(",", "").replace("−", "-").strip()
+def clean_number_string(s: str) -> str:
+    # \u2212 is the Unicode MINUS SIGN; normalize to ASCII hyphen-minus.
+    return s.replace(",", "").replace("\u2212", "-").strip()

5-26: Replace ambiguous × with ASCII x or LaTeX \times in FEWSHOT_PROMPT.

Ruff flags MULTIPLICATION SIGN (RUF001). It also improves copy/paste robustness.

-... P = (10.0 mol × 0.0821 L·atm/(mol·K) × 300 K) ÷ 4.860 L = 246.3 L·atm ÷ 4.860 L ≈ 50.7 atm. ...
+... P = (10.0 mol x 0.0821 L·atm/(mol·K) x 300 K) ÷ 4.860 L = 246.3 L·atm ÷ 4.860 L ≈ 50.7 atm. ...
-... Δμ = (8.314 J/(mol·K))(313.15K)ln(29.5/1.8). The pressure ratio 29.5/1.8 ≈ 16.39 gives ln(16.39) ≈ 2.797, so Δμ = 8.314 × 313.15 × 2.797 ≈ 7274.5 J/mol ...
+... Δμ = (8.314 J/(mol·K))(313.15K)ln(29.5/1.8). The pressure ratio 29.5/1.8 ≈ 16.39 gives ln(16.39) ≈ 2.797, so Δμ = 8.314 x 313.15 x 2.797 ≈ 7274.5 J/mol ...
-... the numerator at 45° as 1.697×10⁻² m/s² ...
+... the numerator at 45° as 1.697x10⁻² m/s² ...

(Apply similarly to any remaining × in the block.)

lmms_eval/tasks/scibench/scibench.yaml (1)

12-15: Clarify unit guidance to avoid boxed-unit leakage.

Pre-prompt bans units in the answer; doc_to_text also appends “The unit of the answer is …”. Consider clarifying “Do not include the unit inside the boxed number.”
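
One possible wording, purely as a hedged example (the real pre_prompt lives in scibench.yaml and may read differently):

# Hypothetical example wording only; the actual pre_prompt is defined in
# scibench.yaml and may differ.
PRE_PROMPT = (
    "Solve the following science problem. Put only the final numeric value in "
    "\\boxed{}; do not include the unit inside the boxed number. "
)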

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 036637e and 197a934.

📒 Files selected for processing (3)
  • lmms_eval/tasks/scibench/scibench.yaml (1 hunks)
  • lmms_eval/tasks/scibench/scibench_multishot.yaml (1 hunks)
  • lmms_eval/tasks/scibench/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{yml,yaml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Prettier (via pre-commit) to format YAML and JSON files

Files:

  • lmms_eval/tasks/scibench/scibench_multishot.yaml
  • lmms_eval/tasks/scibench/scibench.yaml
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/scibench/utils.py
🪛 Ruff (0.13.1)
lmms_eval/tasks/scibench/utils.py

10-10: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


10-10: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


17-17: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


17-17: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


68-68: Do not use bare except

(E722)


84-84: Consider moving this statement to an else block

(TRY300)


85-85: Do not use bare except

(E722)


88-88: Unused function argument: unit

(ARG001)


94-94: Do not use bare except

(E722)


99-99: Do not use bare except

(E722)


106-106: String contains ambiguous (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?

(RUF001)


119-119: Local variable pre_prompt is assigned to but never used

Remove assignment to unused variable pre_prompt

(F841)


120-120: Local variable post_prompt is assigned to but never used

Remove assignment to unused variable post_prompt

(F841)

🔇 Additional comments (2)
lmms_eval/tasks/scibench/scibench.yaml (1)

4-4: Confirm split selection; using train as the evaluation split is unusual.

Switch to test if available.

-test_split: train
+test_split: test
lmms_eval/tasks/scibench/scibench_multishot.yaml (1)

8-8: Confirm target field exists — resolved.
SciBench includes a numeric ground-truth field "answer_number" (e.g., 1.5, -1.49); keeping doc_to_target: "answer_number" is correct.

@@ -0,0 +1,23 @@
dataset_path: lmms-lab/SuperGPQA

⚠️ Potential issue

Wrong dataset referenced; points to SuperGPQA instead of SciBench.

This will evaluate the wrong corpus.

-dataset_path: lmms-lab/SuperGPQA
+dataset_path: lmms-lab/SciBench
📝 Committable suggestion


Suggested change
dataset_path: lmms-lab/SuperGPQA
dataset_path: lmms-lab/SciBench
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/scibench_multishot.yaml around line 1, the
dataset_path is set to lmms-lab/SuperGPQA which references the wrong corpus;
replace this value with the correct SciBench dataset path (e.g., the proper
repository/dataset identifier for SciBench used elsewhere in the project) so the
task evaluates the SciBench corpus instead of SuperGPQA, and verify the dataset
name matches other configs and is accessible.

Comment on lines 36 to 45
def extract_boxed_answers(text):
    # Find all boxed contents
    matches = re.findall(r'boxed{([^}]*)}', text)
    for m in matches:
        # Strip spaces
        candidate = m.strip()
        # Keep only the numeric ones (int or decimal, with optional sign)
        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
            return candidate
    return None

⚠️ Potential issue

Fix boxed-answer extraction: current regex never matches \boxed{} and breaks grading.

Pattern is missing the backslash and rejects scientific notation; accuracy will be 0 for valid outputs.

Apply this diff:

-def extract_boxed_answers(text):
-    # Find all boxed contents
-    matches = re.findall(r'boxed{([^}]*)}', text)
-    for m in matches:
-        # Strip spaces
-        candidate = m.strip()
-        # Keep only the numeric ones (int or decimal, with optional sign)
-        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
-            return candidate
-    return None
+from typing import Optional
+
+def extract_boxed_answers(text: str) -> Optional[str]:
+    """Return the first numeric token inside a \boxed{...} span, if any."""
+    matches = re.findall(r'\\boxed\{([^}]*)\}', text)
+    for m in matches:
+        candidate = m.strip()
+        # Supports ints/decimals and optional scientific notation (e/E).
+        mnum = re.search(
+            r'[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?', candidate
+        )
+        if mnum:
+            return mnum.group(0)
+    return None
📝 Committable suggestion


Suggested change
def extract_boxed_answers(text):
    # Find all boxed contents
    matches = re.findall(r'boxed{([^}]*)}', text)
    for m in matches:
        # Strip spaces
        candidate = m.strip()
        # Keep only the numeric ones (int or decimal, with optional sign)
        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
            return candidate
    return None

from typing import Optional

def extract_boxed_answers(text: str) -> Optional[str]:
    """Return the first numeric token inside a \boxed{...} span, if any."""
    matches = re.findall(r'\\boxed\{([^}]*)\}', text)
    for m in matches:
        candidate = m.strip()
        # Supports ints/decimals and optional scientific notation (e/E).
        mnum = re.search(
            r'[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?', candidate
        )
        if mnum:
            return mnum.group(0)
    return None
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 36 to 45, the regex used to
find boxed answers fails because it omits the backslash in `\boxed{}` and its
numeric check rejects scientific notation; update the extraction pattern to
match LaTeX boxed macros (e.g. use a raw pattern that searches for a literal
backslash like \\boxed\{([^}]*)\}) and replace the numeric validation with a
regex that accepts integers, decimals and scientific notation (e.g. allow
optional sign, digits with optional decimal point or leading decimal, and
optional exponent like [eE][+-]?\d+); keep trimming whitespace from the captured
group and return the numeric string when it matches the improved numeric
pattern, otherwise continue or return None as before.

Comment on lines 47 to 87
def remove_not(x):
    match_number = re.compile('[\$]?\ *10\^[{]?\ *-?[0-9]+\ *[}]?\ *[\$]?')
    result=re.findall(match_number, x)
    if len(result) !=0:
        return re.split(match_number, x)[-1]
    return None

def cal_not(inputs):
    try:
        x,ab=list(inputs)
        match_number = re.compile('10\^[{]?\ *-?[0-9]+\ *[}]?')
        ab=re.findall(match_number, ab)[0]
        ab=ab[ab.find('^')+1:]
        if '{' in ab:
            ab=ab[ab.find('{')+1:]
        if '}' in ab:
            ab=ab[:ab.find('}')]
        x=x.strip()
        out=float(x)*10**float(ab)
        # print(float(x)*10**float(ab))
        return str(out)
    except:
        print('error')
    return inputs

def parse_not(inputs):
    try:
        if not inputs:
            return '',''
        if '\\times' in inputs:
            x,ab=inputs.split('\\times')
        elif '\times' in inputs:
            x,ab=inputs.split('\times')
        elif '*' in inputs:
            x,ab=inputs.split('*')
        else:
            return inputs
        return x,ab
    except:
        return '',''


🛠️ Refactor suggestion

🧩 Analysis chain

Remove or fix unused helpers and bare except blocks.

These helpers are unused here and contain except: (E722). Prefer removal to reduce surface area.

Run to confirm they’re unused before removal:

If unused, delete them; otherwise, add type hints and narrow exceptions.


🏁 Script executed:

#!/bin/bash
rg -nP '\b(remove_not|cal_not|parse_not|equiv_with_unit)\b' -g '!**/test/**' -S -C2

Length of output: 148


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "rg version: $(rg --version 2>/dev/null || true)"
rg -nP '\b(remove_not|cal_not|parse_not|equiv_with_unit)\b' -S -C2 || true
echo "---- definitions ----"
rg -nP '^def\s+(remove_not|cal_not|parse_not)\s*\(' -n -S || true
echo "---- show lmms_eval/tasks/scibench/utils.py (if present) ----"
if [ -f "lmms_eval/tasks/scibench/utils.py" ]; then
  sed -n '1,240p' lmms_eval/tasks/scibench/utils.py || true
else
  echo "lmms_eval/tasks/scibench/utils.py not found"
fi

Length of output: 8084


Remove unused helpers and replace bare except blocks

remove_not, cal_not, parse_not are only defined in lmms_eval/tasks/scibench/utils.py (≈ lines 47–87) and contain bare except: handlers — delete them to reduce surface area. If you keep them, add type hints and replace bare except with specific exceptions (e.g., ValueError/IndexError) and return/raise explicit errors instead of printing.
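
If the helpers are kept rather than deleted, a sketch of what the recommendation points at could look like the following; this is only an illustration of typed helpers with narrow exception handling, not code from the PR:

# Sketch of the recommendation only: typed helpers with narrow exception
# handling instead of bare except and print; not part of the PR.
import re
from typing import Tuple


def parse_not(inputs: str) -> Tuple[str, str]:
    """Split "a \\times 10^{b}" style scientific notation into (mantissa, rest)."""
    if not inputs:
        return "", ""
    for sep in ("\\times", "*"):
        if sep in inputs:
            mantissa, rest = inputs.split(sep, maxsplit=1)
            return mantissa.strip(), rest.strip()
    return inputs, ""


def cal_not(mantissa: str, exponent_part: str) -> str:
    """Evaluate mantissa * 10**exponent; raise ValueError on malformed input."""
    match = re.search(r"10\^\{?\s*(-?\d+)\s*\}?", exponent_part)
    if match is None:
        raise ValueError(f"no 10^ exponent found in {exponent_part!r}")
    try:
        return str(float(mantissa) * 10 ** float(match.group(1)))
    except ValueError as exc:  # non-numeric mantissa
        raise ValueError(
            f"could not evaluate {mantissa!r} * 10^{match.group(1)}"
        ) from exc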

🧰 Tools
🪛 Ruff (0.13.1)

68-68: Do not use bare except

(E722)


84-84: Consider moving this statement to an else block

(TRY300)


85-85: Do not use bare except

(E722)

🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 47–87 the helper functions
remove_not, cal_not, and parse_not are unused and contain bare except: handlers;
either delete these three functions to reduce surface area, or (if you intend to
keep them) add precise type hints for inputs/returns, replace bare except
clauses with specific exceptions (e.g., ValueError, IndexError, AttributeError)
and handle errors explicitly by returning well-typed sentinel values or raising
informative exceptions instead of printing, ensure regex lookups check for
non-empty matches before indexing, and remove any silent prints so behavior is
deterministic and testable.

Comment on lines +108 to +116
def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
    pred = result[0]
    pred = extract_boxed_answers(pred)
    if pred:
        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
        score = 1 if res_equiv else 0
    else:
        score = 0
    return {"accuracy": score}

⚠️ Potential issue

Harden result processing: handle empty results and non‑string targets safely.

Avoids IndexError and type issues when answer_number is numeric.

-def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
-    pred = result[0]
-    pred = extract_boxed_answers(pred)
-    if pred:
-        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
-        score = 1 if res_equiv else 0
-    else:
-        score = 0
-    return {"accuracy": score}
+def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
+    """Compute accuracy with 5% tolerance using the first \boxed{} number."""
+    pred_text = result[0] if result else ""
+    boxed = extract_boxed_answers(pred_text)
+    if boxed is None:
+        return {"accuracy": 0}
+    try:
+        gold = float(clean_number_string(str(doc["answer_number"])))
+        pred_val = float(clean_number_string(boxed))
+        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
+    except (TypeError, ValueError):
+        return {"accuracy": 0}
📝 Committable suggestion


Suggested change
def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
    pred = result[0]
    pred = extract_boxed_answers(pred)
    if pred:
        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
        score = 1 if res_equiv else 0
    else:
        score = 0
    return {"accuracy": score}

def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
    """Compute accuracy with 5% tolerance using the first \boxed{} number."""
    pred_text = result[0] if result else ""
    boxed = extract_boxed_answers(pred_text)
    if boxed is None:
        return {"accuracy": 0}
    try:
        gold = float(clean_number_string(str(doc["answer_number"])))
        pred_val = float(clean_number_string(boxed))
        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
    except (TypeError, ValueError):
        return {"accuracy": 0}
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 108 to 116, the function
assumes result[0] exists and doc["answer_number"] is a string; update it to
first check that result is a non-empty list and that result[0] is a string
before processing, and coerce doc["answer_number"] to a string (or handle
numeric types) before cleaning/converting; wrap the float conversions in a
try/except (catch ValueError/TypeError) and return {"accuracy": 0} on any
parsing error so no IndexError or type errors can propagate.

coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lmms_eval/tasks/medqa/utils.py (1)

92-123: Make parsing deterministic and case-insensitive; fix last-occurrence logic without NumPy

Current logic can fail to detect lowercase letters and uses randomness when no match is found, making evaluations non-reproducible. Also, the "last occurrence" computation only considers the space-delimited letter pattern (" A "). Make it case-insensitive, deterministic, and compute the last occurrence across all patterns.

-def _parse_multi_choice_response(response: str, all_choices: List[str]) -> str:
-    # Clean punctuation around the response
-    for ch in [",", ".", "!", "?", ";", ":", "'"]:
-        response = response.strip(ch)
-    response = " " + response + " "
-
-    candidates = []
-    # (A) style
-    for c in all_choices:
-        if f"({c})" in response:
-            candidates.append(c)
-
-    # plain letter surrounded by spaces
-    if len(candidates) == 0:
-        for c in all_choices:
-            if f" {c} " in response:
-                candidates.append(c)
-
-    # A., B., etc.
-    if len(candidates) == 0:
-        for c in all_choices:
-            if f"{c}." in response:
-                candidates.append(c)
-
-    if len(candidates) == 0:
-        return random.choice(all_choices)
-    if len(candidates) > 1:
-        # choose the last occurrence to mitigate explanations mentioning multiple letters
-        start_indexes = [response.rfind(f" {can} ") for can in candidates]
-        return candidates[int(np.argmax(start_indexes))]
-    return candidates[0]
+def _parse_multi_choice_response(response: str, all_choices: List[str]) -> str:
+    # Normalize casing and pad to simplify boundary searches
+    resp = f" {str(response).upper()} "
+
+    def last_pos(c: str) -> int:
+        # Consider common patterns: (A), A., A), (A, plain " A "
+        patterns = [f"({c})", f"{c}.", f"{c})", f"({c}", f" {c} "]
+        return max(resp.rfind(pat) for pat in patterns)
+
+    best_choice = None
+    best_idx = -1
+    for c in [ch.upper() for ch in all_choices]:
+        idx = last_pos(c)
+        if idx > best_idx:
+            best_idx = idx
+            best_choice = c
+
+    # Deterministic fallback if nothing matched
+    return best_choice if best_idx != -1 else all_choices[0]
🧹 Nitpick comments (6)
lmms_eval/tasks/scibench/utils.py (3)

27-33: Add docstring and None‑safe unit handling; keep lines ≤88 chars

Prevents KeyError/AttributeError and documents the public API.

 def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return f"{pre_prompt}{question}{post_prompt}"
+    """Build the single-shot prompt for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = str(doc.get("problem_text", ""))
+    unit = str(doc.get("unit") or "").strip()
+    if unit:
+        question = f"{question} The unit of the answer is {unit}."
+    return f"{pre_prompt}{question}{post_prompt}"

110-111: Add type hints and docstring for public utility

-def clean_number_string(s):
-    return s.replace(",", "").replace("−", "-").strip()
+def clean_number_string(s: str) -> str:
+    """Normalize numeric strings: strip, remove commas, normalize minus sign."""
+    return s.replace(",", "").replace("−", "-").strip()

92-107: Remove unused equiv_with_unit (lmms_eval/tasks/scibench/utils.py:92)
Function prints to stdout and uses bare excepts — delete it. If needed later, reintroduce with type hints and no prints. Verified: whole-repo search (rg / git grep / find) found only the definition at lmms_eval/tasks/scibench/utils.py:92; no callers.

lmms_eval/tasks/medqa/utils.py (3)

1-5: Drop unnecessary imports; avoid randomness dependency

random and numpy are only used in parsing; both can be removed with a deterministic parser. This also addresses S311 and improves reproducibility.

Apply this diff:

-import random
-from typing import Any, Dict, List
-
-import numpy as np
+from typing import Any, Dict, List

32-35: Avoid KeyError on pre/post prompt; provide safe defaults

Use .get with default and keep behavior stable if keys are missing.

-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    prompt = f"{question}\n{options_block}"
-    return f"{pre_prompt}{prompt}{post_prompt}"
+    pre_prompt = str(lmms_eval_specific_kwargs.get("pre_prompt", ""))
+    post_prompt = str(lmms_eval_specific_kwargs.get("post_prompt", ""))
+    prompt = f"{question}\n{options_block}" if options_block else question
+    return f"{pre_prompt}{prompt}{post_prompt}"

66-76: Add docstring to public API

Public APIs must have docstrings.

-def medqa_doc_to_choice(doc: Dict[str, Any]) -> List[str]:
-    # Detect how many choices are present and return corresponding letters
+def medqa_doc_to_choice(doc: Dict[str, Any]) -> List[str]:
+    """
+    Infer present choice letters (A–E) from the options structure.
+    """
+    # Detect how many choices are present and return corresponding letters
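
For context, the choice inference mentioned above roughly means mapping however the options arrive (a dict keyed by letters or a plain list) onto the letter set A-E. A minimal sketch under that assumption, not the PR's exact implementation:

# Minimal sketch of choice-letter inference, assuming "options" is either a
# dict keyed by letters or a list; illustrative only, not the PR's exact code.
from typing import Any, Dict, List

LETTERS = ["A", "B", "C", "D", "E"]


def doc_to_choice(doc: Dict[str, Any]) -> List[str]:
    """Return the answer letters actually present for this question."""
    options = doc.get("options")
    if isinstance(options, dict):
        return [letter for letter in LETTERS if letter in options]
    if isinstance(options, list):
        return LETTERS[: min(len(options), len(LETTERS))]
    return LETTERS  # fall back to the full A-E set
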
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 197a934 and b906383.

📒 Files selected for processing (3)
  • lmms_eval/tasks/medqa/medqa.yaml (1 hunks)
  • lmms_eval/tasks/medqa/utils.py (1 hunks)
  • lmms_eval/tasks/scibench/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/medqa/utils.py
  • lmms_eval/tasks/scibench/utils.py
**/*.{yml,yaml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Prettier (via pre-commit) to format YAML and JSON files

Files:

  • lmms_eval/tasks/medqa/medqa.yaml
🪛 Ruff (0.13.1)
lmms_eval/tasks/medqa/utils.py

117-117: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

lmms_eval/tasks/scibench/utils.py

9-9: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


9-9: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


16-16: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


16-16: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


70-70: Do not use bare except

(E722)


87-87: Consider moving this statement to an else block

(TRY300)


88-88: Do not use bare except

(E722)


92-92: Unused function argument: unit

(ARG001)


98-98: Do not use bare except

(E722)


103-103: Do not use bare except

(E722)


111-111: String contains ambiguous (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?

(RUF001)


126-126: Local variable pre_prompt is assigned to but never used

Remove assignment to unused variable pre_prompt

(F841)


127-127: Local variable post_prompt is assigned to but never used

Remove assignment to unused variable post_prompt

(F841)

🔇 Additional comments (5)
lmms_eval/tasks/scibench/utils.py (4)

5-24: FEWSHOT_PROMPT content looks good for anchoring multi‑shot examples


48-53: Remove unused helpers and bare except: blocks

These introduce noise and lint errors (E722/F401) and aren’t used.

-def remove_not(x):
-    match_number = re.compile("[\$]?\ *10\^[{]?\ *-?[0-9]+\ *[}]?\ *[\$]?")
-    result = re.findall(match_number, x)
-    if len(result) != 0:
-        return re.split(match_number, x)[-1]
-    return None
-
-
-def cal_not(inputs):
-    try:
-        x, ab = list(inputs)
-        match_number = re.compile("10\^[{]?\ *-?[0-9]+\ *[}]?")
-        ab = re.findall(match_number, ab)[0]
-        ab = ab[ab.find("^") + 1 :]
-        if "{" in ab:
-            ab = ab[ab.find("{") + 1 :]
-        if "}" in ab:
-            ab = ab[: ab.find("}")]
-        x = x.strip()
-        out = float(x) * 10 ** float(ab)
-        # print(float(x)*10**float(ab))
-        return str(out)
-    except:
-        print("error")
-    return inputs
-
-
-def parse_not(inputs):
-    try:
-        if not inputs:
-            return "", ""
-        if "\\times" in inputs:
-            x, ab = inputs.split("\\times")
-        elif "\times" in inputs:
-            x, ab = inputs.split("\times")
-        elif "*" in inputs:
-            x, ab = inputs.split("*")
-        else:
-            return inputs
-        return x, ab
-    except:
-        return "", ""

Also applies to: 56-73, 75-90


36-45: Fix boxed‑answer extraction: missing backslash; accept scientific notation

Current regex never matches \boxed and rejects 1e-3; breaks grading.

-def extract_boxed_answers(text):
-    # Find all boxed contents
-    matches = re.findall(r"boxed{([^}]*)}", text)
-    for m in matches:
-        # Strip spaces
-        candidate = m.strip()
-        # Keep only the numeric ones (int or decimal, with optional sign)
-        if re.fullmatch(r"[-+]?\d*\.?\d+", candidate):
-            return candidate
-    return None
+def extract_boxed_answers(text: str) -> Optional[str]:
+    """Return the first numeric token inside a \boxed{...} span, if any."""
+    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
+    for m in matches:
+        candidate = m.strip()
+        num_pat = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"
+        mnum = re.search(num_pat, candidate)
+        if mnum:
+            return mnum.group(0)
+    return None

114-122: Harden result processing: empty results, typing, and 88‑char wrapping

Prevents IndexError/type errors; documents the public API.

-def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
-    pred = result[0]
-    pred = extract_boxed_answers(pred)
-    if pred:
-        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
-        score = 1 if res_equiv else 0
-    else:
-        score = 0
-    return {"accuracy": score}
+def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
+    """Compute accuracy with 5% tolerance using the first \\boxed{} number."""
+    pred_text = result[0] if result else ""
+    boxed = extract_boxed_answers(pred_text)
+    if boxed is None:
+        return {"accuracy": 0}
+    try:
+        gold = float(clean_number_string(str(doc["answer_number"])))
+        pred_val = float(clean_number_string(boxed))
+        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
+    except (TypeError, ValueError):
+        return {"accuracy": 0}
lmms_eval/tasks/medqa/medqa.yaml (1)

7-11: No change needed — !function utils.* resolves to the task-local utils.py
import_function in lmms_eval/utils.py builds module_path from os.path.dirname(loader.name), so !function utils.<fn> loads lmms_eval/tasks/medqa/utils.py (which defines the referenced functions).
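
The resolution mechanism described here can be pictured as a PyYAML constructor that imports the utils.py sitting next to the YAML file being parsed; the sketch below illustrates the idea and is not lmms_eval's actual implementation.

# Illustration of the resolution idea only (not lmms_eval's actual code):
# a PyYAML constructor that loads "utils.<fn>" from the utils.py that sits
# next to the YAML file being read (assumes that file exists).
import importlib.util
import os

import yaml


def _import_function(loader: yaml.SafeLoader, node: yaml.Node):
    module_name, function_name = loader.construct_scalar(node).split(".", 1)
    # loader.name is the path of the YAML file currently being read.
    module_path = os.path.join(os.path.dirname(loader.name), f"{module_name}.py")
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, function_name)


yaml.SafeLoader.add_constructor("!function", _import_function)

With such a constructor registered, loading medqa.yaml would hand back real Python callables for the doc_to_text, doc_to_target, doc_to_choice, and process_results bindings.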

import numpy as np


def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):

🛠️ Refactor suggestion

Add return type annotation for public API

Type hints are required. Annotate return type.

-def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):
+def medqa_doc_to_text(
+    doc: Dict[str, Any],
+    lmms_eval_specific_kwargs: Dict[str, Any],
+) -> str:
📝 Committable suggestion


Suggested change
def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):

def medqa_doc_to_text(
    doc: Dict[str, Any],
    lmms_eval_specific_kwargs: Dict[str, Any],
) -> str:
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around line 7, the function signature def
medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str,
Any]): is missing a return type annotation; update the signature to include an
explicit return type (e.g., -> str) that matches the function's actual return
value, and add any needed typing imports (from typing import Any, Dict,
Optional, Union) if you choose a more complex type.

Comment on lines +20 to +31
    options = doc.get("options")
    if isinstance(options, dict):
        # Keep only A-E in sorted letter order if present
        ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
    elif isinstance(options, list):
        letters = ["A", "B", "C", "D", "E"]
        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
    else:
        # Fallback: try to format if already string-like
        options_block = str(options) if options is not None else ""


⚠️ Potential issue

Fix potential IndexError when options is a list; wrap long lines

Enumerating beyond 5 options will raise IndexError. Use zip with fixed letters. Also wrap long joins to respect line length.

-    # Normalize options into A..E style lines
+    # Normalize options into A..E style lines
     options = doc.get("options")
     if isinstance(options, dict):
         # Keep only A-E in sorted letter order if present
         ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
-        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
+        options_block = "\n".join(
+            f"{k}. {str(options[k]).strip()}" for k in ordered_keys
+        )
     elif isinstance(options, list):
         letters = ["A", "B", "C", "D", "E"]
-        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
+        options_block = "\n".join(
+            f"{letter}. {str(opt).strip()}" for letter, opt in zip(letters, options)
+        )
     else:
         # Fallback: try to format if already string-like
         options_block = str(options) if options is not None else ""
📝 Committable suggestion


Suggested change
    options = doc.get("options")
    if isinstance(options, dict):
        # Keep only A-E in sorted letter order if present
        ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
    elif isinstance(options, list):
        letters = ["A", "B", "C", "D", "E"]
        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
    else:
        # Fallback: try to format if already string-like
        options_block = str(options) if options is not None else ""

    options = doc.get("options")
    if isinstance(options, dict):
        # Keep only A-E in sorted letter order if present
        ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
        options_block = "\n".join(
            f"{k}. {str(options[k]).strip()}" for k in ordered_keys
        )
    elif isinstance(options, list):
        letters = ["A", "B", "C", "D", "E"]
        options_block = "\n".join(
            f"{letter}. {str(opt).strip()}" for letter, opt in zip(letters, options)
        )
    else:
        # Fallback: try to format if already string-like
        options_block = str(options) if options is not None else ""
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 20 to 31, the branch that formats
options when options is a list can raise IndexError if there are more than five
options and also contains long unwrapped joins; replace the enumerate-based
indexing with a safe zip between a fixed list of letters ["A","B","C","D","E"]
and the options list so extra options are ignored without throwing, and when
constructing the options_block join, wrap or truncate long option strings (e.g.,
use a textwrap.fill or truncate each option to a max width before joining) to
keep line length reasonable.

Comment on lines +38 to +46
def medqa_doc_to_target(doc: Dict[str, Any]):
    """
    Return the ground-truth answer letter.

    MEDQA on HF commonly provides either:
    - "answer_idx": a letter like "A"/"B"/... OR
    - "answer": a full string like "C" or the option text. We prioritize letter if available.
    """
    # Prefer explicit answer letter field when present

🛠️ Refactor suggestion

Add return type; normalize answer letter casing

Guarantee uppercase to match choice letters consistently.

-def medqa_doc_to_target(doc: Dict[str, Any]):
+def medqa_doc_to_target(doc: Dict[str, Any]) -> str:
@@
-    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
-        return doc["answer_idx"].strip()
+    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
+        return doc["answer_idx"].strip().upper()
📝 Committable suggestion


Suggested change
def medqa_doc_to_target(doc: Dict[str, Any]):
    """
    Return the ground-truth answer letter.
    MEDQA on HF commonly provides either:
    - "answer_idx": a letter like "A"/"B"/... OR
    - "answer": a full string like "C" or the option text. We prioritize letter if available.
    """
    # Prefer explicit answer letter field when present

def medqa_doc_to_target(doc: Dict[str, Any]) -> str:
    """
    Return the ground-truth answer letter.
    MEDQA on HF commonly provides either:
    - "answer_idx": a letter like "A"/"B"/... OR
    - "answer": a full string like "C" or the option text. We prioritize letter if available.
    """
    # Prefer explicit answer letter field when present
    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
        return doc["answer_idx"].strip().upper()
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 38-46, add an explicit return
type for medqa_doc_to_target (str) and ensure the returned answer letter is
always uppercase: prefer doc.get("answer_idx") and if present return its
.upper(); otherwise take doc.get("answer"), strip whitespace, extract the first
non-empty character (or first token) and return it uppercased; if neither exists
return an empty string. Also ensure you handle non-string inputs by converting
to str before stripping/uppercasing.

Comment on lines +79 to +90
def medqa_process_results(doc: Dict[str, Any], result: List[str]):
    """
    Parse model output and compute accuracy against the gold letter.
    We robustly extract a single letter from the response.
    """
    response = result[0].strip()
    all_choices = medqa_doc_to_choice(doc)
    pred = _parse_multi_choice_response(response, all_choices)
    gt_ans = medqa_doc_to_target(doc)
    score = 1.0 if pred == gt_ans else 0.0
    return {"accuracy": score}


🛠️ Refactor suggestion

Type hints + guard empty result; ensure deterministic behavior

Add return type; handle empty result defensively.

-def medqa_process_results(doc: Dict[str, Any], result: List[str]):
+def medqa_process_results(doc: Dict[str, Any], result: List[str]) -> Dict[str, float]:
@@
-    response = result[0].strip()
+    response = str(result[0]).strip() if result else ""
📝 Committable suggestion


Suggested change
def medqa_process_results(doc: Dict[str, Any], result: List[str]):
    """
    Parse model output and compute accuracy against the gold letter.
    We robustly extract a single letter from the response.
    """
    response = result[0].strip()
    all_choices = medqa_doc_to_choice(doc)
    pred = _parse_multi_choice_response(response, all_choices)
    gt_ans = medqa_doc_to_target(doc)
    score = 1.0 if pred == gt_ans else 0.0
    return {"accuracy": score}

def medqa_process_results(doc: Dict[str, Any], result: List[str]) -> Dict[str, float]:
    """
    Parse model output and compute accuracy against the gold letter.
    We robustly extract a single letter from the response.
    """
    response = str(result[0]).strip() if result else ""
    all_choices = medqa_doc_to_choice(doc)
    pred = _parse_multi_choice_response(response, all_choices)
    gt_ans = medqa_doc_to_target(doc)
    score = 1.0 if pred == gt_ans else 0.0
    return {"accuracy": score}
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 79-90, the function lacks a
return type and doesn’t guard against an empty result list; add a return type
annotation (-> Dict[str, float]) and make the function defensive by checking if
not result or not result[0]. If the result is empty, return a deterministic
{"accuracy": 0.0}; otherwise proceed to compute response = result[0].strip(),
parse prediction, compare to gold, and return {"accuracy": score}.

Comment on lines +1 to +4
import re
from math import isclose
from typing import Dict, List, Tuple


⚠️ Potential issue

Silence RUF001 or replace ambiguous unicode; add Optional for types

Ruff flags the Unicode ×/− in literals. Easiest: per‑file suppression. Also prep for Optional return types.

+ # ruff: noqa: RUF001  # allow ×/− in prompt examples
 import re
 from math import isclose
-from typing import Dict, List, Tuple
+from typing import Dict, List, Optional
📝 Committable suggestion


Suggested change
import re
from math import isclose
from typing import Dict, List, Tuple
# ruff: noqa: RUF001 # allow ×/− in prompt examples
import re
from math import isclose
from typing import Dict, List, Optional
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py lines 1-4, Ruff warns about ambiguous
Unicode characters (×/−) and the comment requests preparing return types as
Optional; to fix, add a per-file Ruff suppression directive at the top (e.g., a
single-line ruff noqa comment targeting RUF001) or replace any ambiguous Unicode
multiplication/division/minus characters in literals with ASCII equivalents (x,
*, -) throughout the file, and update any function return type hints that may be
nullable to use typing.Optional[...] (import Optional if not already) so
signatures reflect possible None returns.

Comment on lines +125 to +131
def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = doc["problem_text"]
    if doc["unit"].strip():
        question = question + " The unit of the answer is " + doc["unit"] + "."
    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."

⚠️ Potential issue

Use pre/post prompts; add docstring; avoid unused vars (F841)

Also keeps lines within 88 chars.

-def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."
+def scibench_multishot_doc_to_text(
+    doc: Dict, lmms_eval_specific_kwargs: Dict
+) -> str:
+    """Build the multi-shot prompt for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = str(doc.get("problem_text", ""))
+    unit = str(doc.get("unit") or "").strip()
+    if unit:
+        question = f"{question} The unit of the answer is {unit}."
+    return (
+        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
+        "Answer: Let's think step by step."
+        f"{post_prompt}"
+    )
📝 Committable suggestion


Suggested change
def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = doc["problem_text"]
    if doc["unit"].strip():
        question = question + " The unit of the answer is " + doc["unit"] + "."
    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."

def scibench_multishot_doc_to_text(
    doc: Dict, lmms_eval_specific_kwargs: Dict
) -> str:
    """Build the multi-shot prompt for SciBench."""
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = str(doc.get("problem_text", ""))
    unit = str(doc.get("unit") or "").strip()
    if unit:
        question = f"{question} The unit of the answer is {unit}."
    return (
        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
        "Answer: Let's think step by step."
        f"{post_prompt}"
    )
🧰 Tools
🪛 Ruff (0.13.1)

126-126: Local variable pre_prompt is assigned to but never used

Remove assignment to unused variable pre_prompt

(F841)


127-127: Local variable post_prompt is assigned to but never used

Remove assignment to unused variable post_prompt

(F841)

🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 125-131, the function
currently defines pre_prompt and post_prompt but never uses them, lacks a
docstring, and has long concatenated strings; update the function to include a
short docstring describing parameters and return value, use pre_prompt before
the question and post_prompt after the answer (combine with FEWSHOT_PROMPT),
replace string concatenation with f-strings to avoid unused variable lint
(F841), and reflow string construction so no line exceeds ~88 characters.

@Luodian merged commit df477b1 into EvolvingLMMs-Lab:main Sep 25, 2025
2 checks passed