
Conversation

@KelvinDo183 (Collaborator) commented Sep 23, 2025

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you encounter lint warnings, you can run the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Thank you for your contributions!

Summary by CodeRabbit

  • New Features
    • Added SciBench evaluation (single- and multi-shot) with few-shot prompts and configurable pre/post prompts.
    • Automatic numeric answer extraction, normalization, and unit-aware comparison with ~5% tolerance.
    • Added MEDQA multiple-choice pipeline: prompt construction, choice inference, robust free-form answer parsing, and result processing.
    • Standardized accuracy metric (mean aggregation) for evaluation reports.


coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds SciBench (single- and multi-shot) and MEDQA task configurations plus two utility modules implementing prompt construction, parsing, normalization, target extraction, and result-processing hooks wired into evaluation configs.
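
To make the wiring concrete, here is a minimal sketch of the two hook shapes the walkthrough refers to, a prompt builder and a result processor, assuming SciBench-style fields (problem_text, answer_number) and the roughly 5% numeric tolerance mentioned in the summary. It is illustrative only, not the PR's actual code, which is quoted in full in the review comments below.

# Illustrative sketch only (not the PR's exact code): the rough shape of the
# doc_to_text / process_results hooks a task module exposes. Field names
# "problem_text" and "answer_number" and the 5% tolerance follow the summary.
import re
from math import isclose
from typing import Dict, List


def doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    """Build the prompt the model sees for one document."""
    pre = lmms_eval_specific_kwargs.get("pre_prompt", "")
    post = lmms_eval_specific_kwargs.get("post_prompt", "")
    return f"{pre}{doc['problem_text']}{post}"


def process_results(doc: Dict, results: List[str]) -> Dict[str, float]:
    """Score one completion: pull the first number and compare within 5%."""
    text = results[0] if results else ""
    match = re.search(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?", text)
    if match is None:
        return {"accuracy": 0.0}
    try:
        ok = isclose(float(match.group(0)), float(doc["answer_number"]), rel_tol=0.05)
    except (TypeError, ValueError):
        ok = False
    return {"accuracy": 1.0 if ok else 0.0}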

Changes

Cohort / File(s) | Summary
SciBench Task Configs
lmms_eval/tasks/scibench/scibench.yaml, lmms_eval/tasks/scibench/scibench_multishot.yaml
New YAML configs registering dataset_path/kwargs, test_split, task names, doc_to_text/doc_to_target bindings, lmms_eval-specific default prompts, accuracy metric (mean, higher_is_better), process_results hook, and metadata version. Multishot uses a FEWSHOT prompt binding.
SciBench Utilities
lmms_eval/tasks/scibench/utils.py
New utilities: FEWSHOT_PROMPT, prompt builders (scibench_doc_to_text, scibench_multishot_doc_to_text), boxed-answer extraction, numeric string cleaning/parsing (clean_number_string, parse_not, cal_not, remove_not), unit-aware comparison (equiv_with_unit), and scibench_process_results to compute accuracy with tolerance.
MEDQA Task Config
lmms_eval/tasks/medqa/medqa.yaml
New YAML config registering dataset_path/kwargs, test_split, task name, doc_to_text/doc_to_target/doc_to_choice bindings, lmms_eval-specific prompts, accuracy metric, process_results hook, and metadata version.
MEDQA Utilities
lmms_eval/tasks/medqa/utils.py
New utilities: medqa_doc_to_text, medqa_doc_to_target, medqa_doc_to_choice, medqa_process_results, and _parse_multi_choice_response for robust MCQ prompt creation, target extraction, response parsing, and accuracy calculation.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Runner
  participant Config as Task YAML
  participant Utils as Task Utils
  participant Model as LM

  Runner->>Config: Load task config (dataset, mappings, metrics)
  Runner->>Utils: doc_to_text(doc, lmms_eval_specific_kwargs)
  Utils-->>Runner: Prompt text
  Runner->>Model: Send prompt -> generate completion
  Model-->>Runner: Completion / answers
  Runner->>Utils: process_results(doc, [completion])
  Utils-->>Runner: {"metric_name": value}
  Runner->>Runner: Aggregate metrics (e.g., mean accuracy)
  note right of Runner: Reports final metric (higher_is_better)
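
The same flow in code form, as a hypothetical driver sketch mirroring the diagram (run_task and generate are placeholder names, not lmms_eval's actual API):

# Hypothetical driver sketch mirroring the sequence diagram; run_task and
# generate are placeholder names, not lmms_eval's actual API.
from statistics import mean
from typing import Callable, Dict, List


def run_task(
    docs: List[Dict],
    doc_to_text: Callable[[Dict, Dict], str],
    process_results: Callable[[Dict, List[str]], Dict[str, float]],
    generate: Callable[[str], str],
    kwargs: Dict,
) -> Dict[str, float]:
    """Build a prompt per doc, query the model, score, then mean-aggregate."""
    per_doc = []
    for doc in docs:
        prompt = doc_to_text(doc, kwargs)  # Runner -> Utils
        completion = generate(prompt)  # Runner -> Model
        per_doc.append(process_results(doc, [completion]))  # Runner -> Utils
    if not per_doc:
        return {}
    # Aggregate each metric with mean (accuracy is higher_is_better).
    return {key: mean(d[key] for d in per_doc) for key in per_doc[0]}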

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • Luodian
  • kcz358

Poem

I nibble prompts beneath the moon,
Boxed answers hum a quiet tune.
I parse, I check each unit right,
Multishot sparks the data night.
Hop, score, and stash a carrot bright—🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Description Check: ⚠️ Warning
Explanation: The PR description contains only the repository template text and provides no summary of the actual changes, rationale, tests, or completed checklist items; it therefore does not meet the repository's required description template. Required information, such as a detailed description of the added files (scibench configs and utils, and other new modules), testing/reproduction steps, and the checklist status, is missing. Because the description is the untouched template, reviewers cannot assess intent or testing from it.
Resolution: Please replace the template text with a detailed description that summarizes the changes (list added/modified files and their purpose, e.g., scibench configs, utils, medqa additions), state the motivation and expected behavior, include testing or reproduction steps and any configuration notes (dataset paths, kwargs), and mark the checklist items as completed before requesting review. Also mention any backward-incompatible changes or required follow-ups.
✅ Passed checks (2 passed)
Title Check: ✅ Passed. The title "add scibench(math) task" is concise and accurately reflects the primary change in the changeset, which introduces SciBench math configuration files and supporting utilities; it is a short, single sentence that highlights the main developer intention. The phrasing is clear and specific enough for someone scanning PR history to understand the primary purpose. No further clarification is strictly required for the title.
Docstring Coverage: ✅ Passed. No functions found in the changes. Docstring coverage check skipped.


coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (6)
lmms_eval/tasks/scibench/utils.py (5)

28-35: Guard optional unit and add a minimal docstring.

-def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."    
-    return f"{pre_prompt}{question}{post_prompt}"
+def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
+    """Single-shot prompt builder for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = doc["problem_text"]
+    unit = str(doc.get("unit", "") or "").strip()
+    if unit:
+        question += f" The unit of the answer is {unit}."
+    return f"{pre_prompt}{question}{post_prompt}"

118-125: Use pre_prompt/post_prompt in multishot prompt; add docstring.

-def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."    
+def scibench_multishot_doc_to_text(
+    doc: Dict, lmms_eval_specific_kwargs: Dict
+) -> str:
+    """Few-shot prompt builder for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = doc["problem_text"]
+    unit = str(doc.get("unit", "") or "").strip()
+    if unit:
+        question += f" The unit of the answer is {unit}."
+    return (
+        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
+        f"Answer: Let's think step by step.{post_prompt}"
+    )

88-103: Remove debug prints, type annotate, and use the unit parameter.

-def equiv_with_unit(model_output, answer, unit):
-    model_output=model_output.replace(',', '')
-    print("Model_output: ", model_output)
-    try:
-        ans=float(answer.strip())
-        first=isclose(float(model_output.strip()), ans, rel_tol=0.05)
-    except:
-        first=False
-    try: 
-        model=model_output.strip().split()[0]
-        second=isclose(float(model.strip()), ans, rel_tol=0.05)
-    except:
-        second=False
-    if first or second:
-        return True
-    return False
+def equiv_with_unit(model_output: str, answer: str, unit: str) -> bool:
+    """Compare numeric values, ignoring commas and an optional trailing unit."""
+    try:
+        ans = float(clean_number_string(answer))
+    except (TypeError, ValueError):
+        return False
+    candidates = [
+        model_output,
+        model_output.split()[0] if model_output.split() else model_output,
+    ]
+    for c in candidates:
+        try:
+            c_num = clean_number_string(c.replace(",", "").replace(unit, ""))
+            if isclose(float(c_num), ans, rel_tol=0.05):
+                return True
+        except (TypeError, ValueError):
+            continue
+    return False

105-107: Replace ambiguous Unicode minus with escape to satisfy Ruff (RUF001).

-def clean_number_string(s):
-    return s.replace(",", "").replace("−", "-").strip()
+def clean_number_string(s: str) -> str:
+    # \u2212 is the Unicode MINUS SIGN; normalize to ASCII hyphen-minus.
+    return s.replace(",", "").replace("\u2212", "-").strip()

5-26: Replace ambiguous × with ASCII x or LaTeX \times in FEWSHOT_PROMPT.

Ruff flags MULTIPLICATION SIGN (RUF001). It also improves copy/paste robustness.

-... P = (10.0 mol × 0.0821 L·atm/(mol·K) × 300 K) ÷ 4.860 L = 246.3 L·atm ÷ 4.860 L ≈ 50.7 atm. ...
+... P = (10.0 mol x 0.0821 L·atm/(mol·K) x 300 K) ÷ 4.860 L = 246.3 L·atm ÷ 4.860 L ≈ 50.7 atm. ...
-... Δμ = (8.314 J/(mol·K))(313.15K)ln(29.5/1.8). The pressure ratio 29.5/1.8 ≈ 16.39 gives ln(16.39) ≈ 2.797, so Δμ = 8.314 × 313.15 × 2.797 ≈ 7274.5 J/mol ...
+... Δμ = (8.314 J/(mol·K))(313.15K)ln(29.5/1.8). The pressure ratio 29.5/1.8 ≈ 16.39 gives ln(16.39) ≈ 2.797, so Δμ = 8.314 x 313.15 x 2.797 ≈ 7274.5 J/mol ...
-... the numerator at 45° as 1.697×10⁻² m/s² ...
+... the numerator at 45° as 1.697x10⁻² m/s² ...

(Apply similarly to any remaining × in the block.)

lmms_eval/tasks/scibench/scibench.yaml (1)

12-15: Clarify unit guidance to avoid boxed-unit leakage.

Pre-prompt bans units in the answer; doc_to_text also appends “The unit of the answer is …”. Consider clarifying “Do not include the unit inside the boxed number.”
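
One possible wording, purely as a hedged example (the real pre_prompt lives in scibench.yaml and may read differently):

# Hypothetical example wording only; the actual pre_prompt is defined in
# scibench.yaml and may differ.
PRE_PROMPT = (
    "Solve the following science problem. Put only the final numeric value in "
    "\\boxed{}; do not include the unit inside the boxed number. "
)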

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 036637e and 197a934.

📒 Files selected for processing (3)
  • lmms_eval/tasks/scibench/scibench.yaml (1 hunks)
  • lmms_eval/tasks/scibench/scibench_multishot.yaml (1 hunks)
  • lmms_eval/tasks/scibench/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{yml,yaml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Prettier (via pre-commit) to format YAML and JSON files

Files:

  • lmms_eval/tasks/scibench/scibench_multishot.yaml
  • lmms_eval/tasks/scibench/scibench.yaml
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/scibench/utils.py
🪛 Ruff (0.13.1)
lmms_eval/tasks/scibench/utils.py

10-10: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


10-10: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


17-17: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


17-17: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


68-68: Do not use bare except

(E722)


84-84: Consider moving this statement to an else block

(TRY300)


85-85: Do not use bare except

(E722)


88-88: Unused function argument: unit

(ARG001)


94-94: Do not use bare except

(E722)


99-99: Do not use bare except

(E722)


106-106: String contains ambiguous (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?

(RUF001)


119-119: Local variable pre_prompt is assigned to but never used

Remove assignment to unused variable pre_prompt

(F841)


120-120: Local variable post_prompt is assigned to but never used

Remove assignment to unused variable post_prompt

(F841)

🔇 Additional comments (2)
lmms_eval/tasks/scibench/scibench.yaml (1)

4-4: Confirm split selection; using train as the evaluation split is unusual.

Switch to test if available.

-test_split: train
+test_split: test
lmms_eval/tasks/scibench/scibench_multishot.yaml (1)

8-8: Confirm target field exists — resolved.
SciBench includes a numeric ground-truth field "answer_number" (e.g., 1.5, -1.49); keeping doc_to_target: "answer_number" is correct.

@@ -0,0 +1,23 @@
dataset_path: lmms-lab/SuperGPQA

⚠️ Potential issue

Wrong dataset referenced; points to SuperGPQA instead of SciBench.

This will evaluate the wrong corpus.

-dataset_path: lmms-lab/SuperGPQA
+dataset_path: lmms-lab/SciBench
📝 Committable suggestion


Suggested change
dataset_path: lmms-lab/SuperGPQA
dataset_path: lmms-lab/SciBench
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/scibench_multishot.yaml around line 1, the
dataset_path is set to lmms-lab/SuperGPQA which references the wrong corpus;
replace this value with the correct SciBench dataset path (e.g., the proper
repository/dataset identifier for SciBench used elsewhere in the project) so the
task evaluates the SciBench corpus instead of SuperGPQA, and verify the dataset
name matches other configs and is accessible.

Comment on lines 36 to 45
def extract_boxed_answers(text):
    # Find all boxed contents
    matches = re.findall(r'boxed{([^}]*)}', text)
    for m in matches:
        # Strip spaces
        candidate = m.strip()
        # Keep only the numeric ones (int or decimal, with optional sign)
        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
            return candidate
    return None

⚠️ Potential issue

Fix boxed-answer extraction: current regex never matches \boxed{} and breaks grading.

Pattern is missing the backslash and rejects scientific notation; accuracy will be 0 for valid outputs.

Apply this diff:

-def extract_boxed_answers(text):
-    # Find all boxed contents
-    matches = re.findall(r'boxed{([^}]*)}', text)
-    for m in matches:
-        # Strip spaces
-        candidate = m.strip()
-        # Keep only the numeric ones (int or decimal, with optional sign)
-        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
-            return candidate
-    return None
+from typing import Optional
+
+def extract_boxed_answers(text: str) -> Optional[str]:
+    """Return the first numeric token inside a \boxed{...} span, if any."""
+    matches = re.findall(r'\\boxed\{([^}]*)\}', text)
+    for m in matches:
+        candidate = m.strip()
+        # Supports ints/decimals and optional scientific notation (e/E).
+        mnum = re.search(
+            r'[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?', candidate
+        )
+        if mnum:
+            return mnum.group(0)
+    return None
📝 Committable suggestion


Suggested change
def extract_boxed_answers(text):
    # Find all boxed contents
    matches = re.findall(r'boxed{([^}]*)}', text)
    for m in matches:
        # Strip spaces
        candidate = m.strip()
        # Keep only the numeric ones (int or decimal, with optional sign)
        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
            return candidate
    return None

from typing import Optional

def extract_boxed_answers(text: str) -> Optional[str]:
    """Return the first numeric token inside a \boxed{...} span, if any."""
    matches = re.findall(r'\\boxed\{([^}]*)\}', text)
    for m in matches:
        candidate = m.strip()
        # Supports ints/decimals and optional scientific notation (e/E).
        mnum = re.search(
            r'[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?', candidate
        )
        if mnum:
            return mnum.group(0)
    return None
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 36 to 45, the regex used to
find boxed answers fails because it omits the backslash in `\boxed{}` and its
numeric check rejects scientific notation; update the extraction pattern to
match LaTeX boxed macros (e.g. use a raw pattern that searches for a literal
backslash like \\boxed\{([^}]*)\}) and replace the numeric validation with a
regex that accepts integers, decimals and scientific notation (e.g. allow
optional sign, digits with optional decimal point or leading decimal, and
optional exponent like [eE][+-]?\d+); keep trimming whitespace from the captured
group and return the numeric string when it matches the improved numeric
pattern, otherwise continue or return None as before.

Comment on lines 47 to 87
def remove_not(x):
    match_number = re.compile('[\$]?\ *10\^[{]?\ *-?[0-9]+\ *[}]?\ *[\$]?')
    result=re.findall(match_number, x)
    if len(result) !=0:
        return re.split(match_number, x)[-1]
    return None

def cal_not(inputs):
    try:
        x,ab=list(inputs)
        match_number = re.compile('10\^[{]?\ *-?[0-9]+\ *[}]?')
        ab=re.findall(match_number, ab)[0]
        ab=ab[ab.find('^')+1:]
        if '{' in ab:
            ab=ab[ab.find('{')+1:]
        if '}' in ab:
            ab=ab[:ab.find('}')]
        x=x.strip()
        out=float(x)*10**float(ab)
        # print(float(x)*10**float(ab))
        return str(out)
    except:
        print('error')
    return inputs

def parse_not(inputs):
    try:
        if not inputs:
            return '',''
        if '\\times' in inputs:
            x,ab=inputs.split('\\times')
        elif '\times' in inputs:
            x,ab=inputs.split('\times')
        elif '*' in inputs:
            x,ab=inputs.split('*')
        else:
            return inputs
        return x,ab
    except:
        return '',''


🛠️ Refactor suggestion

🧩 Analysis chain

Remove or fix unused helpers and bare except blocks.

These helpers are unused here and contain except: (E722). Prefer removal to reduce surface area.

Run to confirm they’re unused before removal:

If unused, delete them; otherwise, add type hints and narrow exceptions.


🏁 Script executed:

#!/bin/bash
rg -nP '\b(remove_not|cal_not|parse_not|equiv_with_unit)\b' -g '!**/test/**' -S -C2

Length of output: 148


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "rg version: $(rg --version 2>/dev/null || true)"
rg -nP '\b(remove_not|cal_not|parse_not|equiv_with_unit)\b' -S -C2 || true
echo "---- definitions ----"
rg -nP '^def\s+(remove_not|cal_not|parse_not)\s*\(' -n -S || true
echo "---- show lmms_eval/tasks/scibench/utils.py (if present) ----"
if [ -f "lmms_eval/tasks/scibench/utils.py" ]; then
  sed -n '1,240p' lmms_eval/tasks/scibench/utils.py || true
else
  echo "lmms_eval/tasks/scibench/utils.py not found"
fi

Length of output: 8084


Remove unused helpers and replace bare except blocks

remove_not, cal_not, parse_not are only defined in lmms_eval/tasks/scibench/utils.py (≈ lines 47–87) and contain bare except: handlers — delete them to reduce surface area. If you keep them, add type hints and replace bare except with specific exceptions (e.g., ValueError/IndexError) and return/raise explicit errors instead of printing.
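
If the helpers are kept rather than deleted, a sketch of what the recommendation points at could look like the following; this is only an illustration of typed helpers with narrow exception handling, not code from the PR:

# Sketch of the recommendation only: typed helpers with narrow exception
# handling instead of bare except and print; not part of the PR.
import re
from typing import Tuple


def parse_not(inputs: str) -> Tuple[str, str]:
    """Split "a \\times 10^{b}" style scientific notation into (mantissa, rest)."""
    if not inputs:
        return "", ""
    for sep in ("\\times", "*"):
        if sep in inputs:
            mantissa, rest = inputs.split(sep, maxsplit=1)
            return mantissa.strip(), rest.strip()
    return inputs, ""


def cal_not(mantissa: str, exponent_part: str) -> str:
    """Evaluate mantissa * 10**exponent; raise ValueError on malformed input."""
    match = re.search(r"10\^\{?\s*(-?\d+)\s*\}?", exponent_part)
    if match is None:
        raise ValueError(f"no 10^ exponent found in {exponent_part!r}")
    try:
        return str(float(mantissa) * 10 ** float(match.group(1)))
    except ValueError as exc:  # non-numeric mantissa
        raise ValueError(
            f"could not evaluate {mantissa!r} * 10^{match.group(1)}"
        ) from exc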

🧰 Tools
🪛 Ruff (0.13.1)

68-68: Do not use bare except

(E722)


84-84: Consider moving this statement to an else block

(TRY300)


85-85: Do not use bare except

(E722)

🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 47–87 the helper functions
remove_not, cal_not, and parse_not are unused and contain bare except: handlers;
either delete these three functions to reduce surface area, or (if you intend to
keep them) add precise type hints for inputs/returns, replace bare except
clauses with specific exceptions (e.g., ValueError, IndexError, AttributeError)
and handle errors explicitly by returning well-typed sentinel values or raising
informative exceptions instead of printing, ensure regex lookups check for
non-empty matches before indexing, and remove any silent prints so behavior is
deterministic and testable.

Comment on lines +108 to +116
def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
    pred = result[0]
    pred = extract_boxed_answers(pred)
    if pred:
        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
        score = 1 if res_equiv else 0
    else:
        score = 0
    return {"accuracy": score}

⚠️ Potential issue

Harden result processing: handle empty results and non‑string targets safely.

Avoids IndexError and type issues when answer_number is numeric.

-def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
-    pred = result[0]
-    pred = extract_boxed_answers(pred)
-    if pred:
-        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
-        score = 1 if res_equiv else 0
-    else:
-        score = 0
-    return {"accuracy": score}
+def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
+    """Compute accuracy with 5% tolerance using the first \boxed{} number."""
+    pred_text = result[0] if result else ""
+    boxed = extract_boxed_answers(pred_text)
+    if boxed is None:
+        return {"accuracy": 0}
+    try:
+        gold = float(clean_number_string(str(doc["answer_number"])))
+        pred_val = float(clean_number_string(boxed))
+        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
+    except (TypeError, ValueError):
+        return {"accuracy": 0}
📝 Committable suggestion


Suggested change
def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
    pred = result[0]
    pred = extract_boxed_answers(pred)
    if pred:
        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
        score = 1 if res_equiv else 0
    else:
        score = 0
    return {"accuracy": score}

def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
    """Compute accuracy with 5% tolerance using the first \boxed{} number."""
    pred_text = result[0] if result else ""
    boxed = extract_boxed_answers(pred_text)
    if boxed is None:
        return {"accuracy": 0}
    try:
        gold = float(clean_number_string(str(doc["answer_number"])))
        pred_val = float(clean_number_string(boxed))
        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
    except (TypeError, ValueError):
        return {"accuracy": 0}
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 108 to 116, the function
assumes result[0] exists and doc["answer_number"] is a string; update it to
first check that result is a non-empty list and that result[0] is a string
before processing, and coerce doc["answer_number"] to a string (or handle
numeric types) before cleaning/converting; wrap the float conversions in a
try/except (catch ValueError/TypeError) and return {"accuracy": 0} on any
parsing error so no IndexError or type errors can propagate.

coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lmms_eval/tasks/medqa/utils.py (1)

92-123: Make parsing deterministic and case-insensitive; fix last-occurrence logic without NumPy

Current logic can fail to detect lowercase letters and uses randomness when no match is found, making evaluations non-reproducible. Also, the "last occurrence" computation only considers the space-delimited letter pattern (" A "). Make it case-insensitive, deterministic, and compute the last occurrence across all patterns.

-def _parse_multi_choice_response(response: str, all_choices: List[str]) -> str:
-    # Clean punctuation around the response
-    for ch in [",", ".", "!", "?", ";", ":", "'"]:
-        response = response.strip(ch)
-    response = " " + response + " "
-
-    candidates = []
-    # (A) style
-    for c in all_choices:
-        if f"({c})" in response:
-            candidates.append(c)
-
-    # plain letter surrounded by spaces
-    if len(candidates) == 0:
-        for c in all_choices:
-            if f" {c} " in response:
-                candidates.append(c)
-
-    # A., B., etc.
-    if len(candidates) == 0:
-        for c in all_choices:
-            if f"{c}." in response:
-                candidates.append(c)
-
-    if len(candidates) == 0:
-        return random.choice(all_choices)
-    if len(candidates) > 1:
-        # choose the last occurrence to mitigate explanations mentioning multiple letters
-        start_indexes = [response.rfind(f" {can} ") for can in candidates]
-        return candidates[int(np.argmax(start_indexes))]
-    return candidates[0]
+def _parse_multi_choice_response(response: str, all_choices: List[str]) -> str:
+    # Normalize casing and pad to simplify boundary searches
+    resp = f" {str(response).upper()} "
+
+    def last_pos(c: str) -> int:
+        # Consider common patterns: (A), A., A), (A, plain " A "
+        patterns = [f"({c})", f"{c}.", f"{c})", f"({c}", f" {c} "]
+        return max(resp.rfind(pat) for pat in patterns)
+
+    best_choice = None
+    best_idx = -1
+    for c in [ch.upper() for ch in all_choices]:
+        idx = last_pos(c)
+        if idx > best_idx:
+            best_idx = idx
+            best_choice = c
+
+    # Deterministic fallback if nothing matched
+    return best_choice if best_idx != -1 else all_choices[0]
🧹 Nitpick comments (6)
lmms_eval/tasks/scibench/utils.py (3)

27-33: Add docstring and None‑safe unit handling; keep lines ≤88 chars

Prevents KeyError/AttributeError and documents the public API.

 def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return f"{pre_prompt}{question}{post_prompt}"
+    """Build the single-shot prompt for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = str(doc.get("problem_text", ""))
+    unit = str(doc.get("unit") or "").strip()
+    if unit:
+        question = f"{question} The unit of the answer is {unit}."
+    return f"{pre_prompt}{question}{post_prompt}"

110-111: Add type hints and docstring for public utility

-def clean_number_string(s):
-    return s.replace(",", "").replace("−", "-").strip()
+def clean_number_string(s: str) -> str:
+    """Normalize numeric strings: strip, remove commas, normalize minus sign."""
+    return s.replace(",", "").replace("−", "-").strip()

92-107: Remove unused equiv_with_unit (lmms_eval/tasks/scibench/utils.py:92)
Function prints to stdout and uses bare excepts — delete it. If needed later, reintroduce with type hints and no prints. Verified: whole-repo search (rg / git grep / find) found only the definition at lmms_eval/tasks/scibench/utils.py:92; no callers.

lmms_eval/tasks/medqa/utils.py (3)

1-5: Drop unnecessary imports; avoid randomness dependency

random and numpy are only used in parsing; both can be removed with a deterministic parser. This also addresses S311 and improves reproducibility.

Apply this diff:

-import random
-from typing import Any, Dict, List
-
-import numpy as np
+from typing import Any, Dict, List

32-35: Avoid KeyError on pre/post prompt; provide safe defaults

Use .get with default and keep behavior stable if keys are missing.

-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    prompt = f"{question}\n{options_block}"
-    return f"{pre_prompt}{prompt}{post_prompt}"
+    pre_prompt = str(lmms_eval_specific_kwargs.get("pre_prompt", ""))
+    post_prompt = str(lmms_eval_specific_kwargs.get("post_prompt", ""))
+    prompt = f"{question}\n{options_block}" if options_block else question
+    return f"{pre_prompt}{prompt}{post_prompt}"

66-76: Add docstring to public API

Public APIs must have docstrings.

-def medqa_doc_to_choice(doc: Dict[str, Any]) -> List[str]:
-    # Detect how many choices are present and return corresponding letters
+def medqa_doc_to_choice(doc: Dict[str, Any]) -> List[str]:
+    """
+    Infer present choice letters (A–E) from the options structure.
+    """
+    # Detect how many choices are present and return corresponding letters
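
For context, the choice inference mentioned above roughly means mapping however the options arrive (a dict keyed by letters or a plain list) onto the letter set A-E. A minimal sketch under that assumption, not the PR's exact implementation:

# Minimal sketch of choice-letter inference, assuming "options" is either a
# dict keyed by letters or a list; illustrative only, not the PR's exact code.
from typing import Any, Dict, List

LETTERS = ["A", "B", "C", "D", "E"]


def doc_to_choice(doc: Dict[str, Any]) -> List[str]:
    """Return the answer letters actually present for this question."""
    options = doc.get("options")
    if isinstance(options, dict):
        return [letter for letter in LETTERS if letter in options]
    if isinstance(options, list):
        return LETTERS[: min(len(options), len(LETTERS))]
    return LETTERS  # fall back to the full A-E set
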
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 197a934 and b906383.

📒 Files selected for processing (3)
  • lmms_eval/tasks/medqa/medqa.yaml (1 hunks)
  • lmms_eval/tasks/medqa/utils.py (1 hunks)
  • lmms_eval/tasks/scibench/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/medqa/utils.py
  • lmms_eval/tasks/scibench/utils.py
**/*.{yml,yaml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Prettier (via pre-commit) to format YAML and JSON files

Files:

  • lmms_eval/tasks/medqa/medqa.yaml
🪛 Ruff (0.13.1)
lmms_eval/tasks/medqa/utils.py

117-117: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

lmms_eval/tasks/scibench/utils.py

9-9: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


9-9: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


16-16: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


16-16: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)


70-70: Do not use bare except

(E722)


87-87: Consider moving this statement to an else block

(TRY300)


88-88: Do not use bare except

(E722)


92-92: Unused function argument: unit

(ARG001)


98-98: Do not use bare except

(E722)


103-103: Do not use bare except

(E722)


111-111: String contains ambiguous (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?

(RUF001)


126-126: Local variable pre_prompt is assigned to but never used

Remove assignment to unused variable pre_prompt

(F841)


127-127: Local variable post_prompt is assigned to but never used

Remove assignment to unused variable post_prompt

(F841)

🔇 Additional comments (5)
lmms_eval/tasks/scibench/utils.py (4)

5-24: FEWSHOT_PROMPT content looks good for anchoring multi‑shot examples


48-53: Remove unused helpers and bare except: blocks

These introduce noise and lint errors (E722/F401) and aren’t used.

-def remove_not(x):
-    match_number = re.compile("[\$]?\ *10\^[{]?\ *-?[0-9]+\ *[}]?\ *[\$]?")
-    result = re.findall(match_number, x)
-    if len(result) != 0:
-        return re.split(match_number, x)[-1]
-    return None
-
-
-def cal_not(inputs):
-    try:
-        x, ab = list(inputs)
-        match_number = re.compile("10\^[{]?\ *-?[0-9]+\ *[}]?")
-        ab = re.findall(match_number, ab)[0]
-        ab = ab[ab.find("^") + 1 :]
-        if "{" in ab:
-            ab = ab[ab.find("{") + 1 :]
-        if "}" in ab:
-            ab = ab[: ab.find("}")]
-        x = x.strip()
-        out = float(x) * 10 ** float(ab)
-        # print(float(x)*10**float(ab))
-        return str(out)
-    except:
-        print("error")
-    return inputs
-
-
-def parse_not(inputs):
-    try:
-        if not inputs:
-            return "", ""
-        if "\\times" in inputs:
-            x, ab = inputs.split("\\times")
-        elif "\times" in inputs:
-            x, ab = inputs.split("\times")
-        elif "*" in inputs:
-            x, ab = inputs.split("*")
-        else:
-            return inputs
-        return x, ab
-    except:
-        return "", ""

Also applies to: 56-73, 75-90


36-45: Fix boxed‑answer extraction: missing backslash; accept scientific notation

Current regex never matches \boxed and rejects 1e-3; breaks grading.

-def extract_boxed_answers(text):
-    # Find all boxed contents
-    matches = re.findall(r"boxed{([^}]*)}", text)
-    for m in matches:
-        # Strip spaces
-        candidate = m.strip()
-        # Keep only the numeric ones (int or decimal, with optional sign)
-        if re.fullmatch(r"[-+]?\d*\.?\d+", candidate):
-            return candidate
-    return None
+def extract_boxed_answers(text: str) -> Optional[str]:
+    """Return the first numeric token inside a \boxed{...} span, if any."""
+    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
+    for m in matches:
+        candidate = m.strip()
+        num_pat = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"
+        mnum = re.search(num_pat, candidate)
+        if mnum:
+            return mnum.group(0)
+    return None

114-122: Harden result processing: empty results, typing, and 88‑char wrapping

Prevents IndexError/type errors; documents the public API.

-def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
-    pred = result[0]
-    pred = extract_boxed_answers(pred)
-    if pred:
-        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
-        score = 1 if res_equiv else 0
-    else:
-        score = 0
-    return {"accuracy": score}
+def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
+    """Compute accuracy with 5% tolerance using the first \\boxed{} number."""
+    pred_text = result[0] if result else ""
+    boxed = extract_boxed_answers(pred_text)
+    if boxed is None:
+        return {"accuracy": 0}
+    try:
+        gold = float(clean_number_string(str(doc["answer_number"])))
+        pred_val = float(clean_number_string(boxed))
+        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
+    except (TypeError, ValueError):
+        return {"accuracy": 0}
lmms_eval/tasks/medqa/medqa.yaml (1)

7-11: No change needed — !function utils.* resolves to the task-local utils.py
import_function in lmms_eval/utils.py builds module_path from os.path.dirname(loader.name), so !function utils.<fn> loads lmms_eval/tasks/medqa/utils.py (which defines the referenced functions).
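
The resolution mechanism described here can be pictured as a PyYAML constructor that imports the utils.py sitting next to the YAML file being parsed; the sketch below illustrates the idea and is not lmms_eval's actual implementation.

# Illustration of the resolution idea only (not lmms_eval's actual code):
# a PyYAML constructor that loads "utils.<fn>" from the utils.py that sits
# next to the YAML file being read (assumes that file exists).
import importlib.util
import os

import yaml


def _import_function(loader: yaml.SafeLoader, node: yaml.Node):
    module_name, function_name = loader.construct_scalar(node).split(".", 1)
    # loader.name is the path of the YAML file currently being read.
    module_path = os.path.join(os.path.dirname(loader.name), f"{module_name}.py")
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, function_name)


yaml.SafeLoader.add_constructor("!function", _import_function)

With such a constructor registered, loading medqa.yaml would hand back real Python callables for the doc_to_text, doc_to_target, doc_to_choice, and process_results bindings.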

import numpy as np


def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):

🛠️ Refactor suggestion

Add return type annotation for public API

Type hints are required. Annotate return type.

-def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):
+def medqa_doc_to_text(
+    doc: Dict[str, Any],
+    lmms_eval_specific_kwargs: Dict[str, Any],
+) -> str:
📝 Committable suggestion


Suggested change
def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):

def medqa_doc_to_text(
    doc: Dict[str, Any],
    lmms_eval_specific_kwargs: Dict[str, Any],
) -> str:
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around line 7, the function signature def
medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str,
Any]): is missing a return type annotation; update the signature to include an
explicit return type (e.g., -> str) that matches the function's actual return
value, and add any needed typing imports (from typing import Any, Dict,
Optional, Union) if you choose a more complex type.

Comment on lines +20 to +31
    options = doc.get("options")
    if isinstance(options, dict):
        # Keep only A-E in sorted letter order if present
        ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
    elif isinstance(options, list):
        letters = ["A", "B", "C", "D", "E"]
        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
    else:
        # Fallback: try to format if already string-like
        options_block = str(options) if options is not None else ""


⚠️ Potential issue

Fix potential IndexError when options is a list; wrap long lines

Enumerating beyond 5 options will raise IndexError. Use zip with fixed letters. Also wrap long joins to respect line length.

-    # Normalize options into A..E style lines
+    # Normalize options into A..E style lines
     options = doc.get("options")
     if isinstance(options, dict):
         # Keep only A-E in sorted letter order if present
         ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
-        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
+        options_block = "\n".join(
+            f"{k}. {str(options[k]).strip()}" for k in ordered_keys
+        )
     elif isinstance(options, list):
         letters = ["A", "B", "C", "D", "E"]
-        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
+        options_block = "\n".join(
+            f"{letter}. {str(opt).strip()}" for letter, opt in zip(letters, options)
+        )
     else:
         # Fallback: try to format if already string-like
         options_block = str(options) if options is not None else ""
📝 Committable suggestion


Suggested change
    options = doc.get("options")
    if isinstance(options, dict):
        # Keep only A-E in sorted letter order if present
        ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
    elif isinstance(options, list):
        letters = ["A", "B", "C", "D", "E"]
        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
    else:
        # Fallback: try to format if already string-like
        options_block = str(options) if options is not None else ""

    options = doc.get("options")
    if isinstance(options, dict):
        # Keep only A-E in sorted letter order if present
        ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
        options_block = "\n".join(
            f"{k}. {str(options[k]).strip()}" for k in ordered_keys
        )
    elif isinstance(options, list):
        letters = ["A", "B", "C", "D", "E"]
        options_block = "\n".join(
            f"{letter}. {str(opt).strip()}" for letter, opt in zip(letters, options)
        )
    else:
        # Fallback: try to format if already string-like
        options_block = str(options) if options is not None else ""
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 20 to 31, the branch that formats
options when options is a list can raise IndexError if there are more than five
options and also contains long unwrapped joins; replace the enumerate-based
indexing with a safe zip between a fixed list of letters ["A","B","C","D","E"]
and the options list so extra options are ignored without throwing, and when
constructing the options_block join, wrap or truncate long option strings (e.g.,
use a textwrap.fill or truncate each option to a max width before joining) to
keep line length reasonable.

Comment on lines +38 to +46
def medqa_doc_to_target(doc: Dict[str, Any]):
    """
    Return the ground-truth answer letter.

    MEDQA on HF commonly provides either:
    - "answer_idx": a letter like "A"/"B"/... OR
    - "answer": a full string like "C" or the option text. We prioritize letter if available.
    """
    # Prefer explicit answer letter field when present

🛠️ Refactor suggestion

Add return type; normalize answer letter casing

Guarantee uppercase to match choice letters consistently.

-def medqa_doc_to_target(doc: Dict[str, Any]):
+def medqa_doc_to_target(doc: Dict[str, Any]) -> str:
@@
-    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
-        return doc["answer_idx"].strip()
+    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
+        return doc["answer_idx"].strip().upper()
📝 Committable suggestion


Suggested change
def medqa_doc_to_target(doc: Dict[str, Any]):
    """
    Return the ground-truth answer letter.
    MEDQA on HF commonly provides either:
    - "answer_idx": a letter like "A"/"B"/... OR
    - "answer": a full string like "C" or the option text. We prioritize letter if available.
    """
    # Prefer explicit answer letter field when present

def medqa_doc_to_target(doc: Dict[str, Any]) -> str:
    """
    Return the ground-truth answer letter.
    MEDQA on HF commonly provides either:
    - "answer_idx": a letter like "A"/"B"/... OR
    - "answer": a full string like "C" or the option text. We prioritize letter if available.
    """
    # Prefer explicit answer letter field when present
    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
        return doc["answer_idx"].strip().upper()
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 38-46, add an explicit return
type for medqa_doc_to_target (str) and ensure the returned answer letter is
always uppercase: prefer doc.get("answer_idx") and if present return its
.upper(); otherwise take doc.get("answer"), strip whitespace, extract the first
non-empty character (or first token) and return it uppercased; if neither exists
return an empty string. Also ensure you handle non-string inputs by converting
to str before stripping/uppercasing.

Comment on lines +79 to +90
def medqa_process_results(doc: Dict[str, Any], result: List[str]):
    """
    Parse model output and compute accuracy against the gold letter.
    We robustly extract a single letter from the response.
    """
    response = result[0].strip()
    all_choices = medqa_doc_to_choice(doc)
    pred = _parse_multi_choice_response(response, all_choices)
    gt_ans = medqa_doc_to_target(doc)
    score = 1.0 if pred == gt_ans else 0.0
    return {"accuracy": score}


🛠️ Refactor suggestion

Type hints + guard empty result; ensure deterministic behavior

Add return type; handle empty result defensively.

-def medqa_process_results(doc: Dict[str, Any], result: List[str]):
+def medqa_process_results(doc: Dict[str, Any], result: List[str]) -> Dict[str, float]:
@@
-    response = result[0].strip()
+    response = str(result[0]).strip() if result else ""
📝 Committable suggestion


Suggested change
def medqa_process_results(doc: Dict[str, Any], result: List[str]):
    """
    Parse model output and compute accuracy against the gold letter.
    We robustly extract a single letter from the response.
    """
    response = result[0].strip()
    all_choices = medqa_doc_to_choice(doc)
    pred = _parse_multi_choice_response(response, all_choices)
    gt_ans = medqa_doc_to_target(doc)
    score = 1.0 if pred == gt_ans else 0.0
    return {"accuracy": score}

def medqa_process_results(doc: Dict[str, Any], result: List[str]) -> Dict[str, float]:
    """
    Parse model output and compute accuracy against the gold letter.
    We robustly extract a single letter from the response.
    """
    response = str(result[0]).strip() if result else ""
    all_choices = medqa_doc_to_choice(doc)
    pred = _parse_multi_choice_response(response, all_choices)
    gt_ans = medqa_doc_to_target(doc)
    score = 1.0 if pred == gt_ans else 0.0
    return {"accuracy": score}
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 79-90, the function lacks a
return type and doesn’t guard against an empty result list; add a return type
annotation (-> Dict[str, float]) and make the function defensive by checking if
not result or not result[0]. If the result is empty, return a deterministic
{"accuracy": 0.0}; otherwise proceed to compute response = result[0].strip(),
parse prediction, compare to gold, and return {"accuracy": score}.

Comment on lines +1 to +4
import re
from math import isclose
from typing import Dict, List, Tuple


⚠️ Potential issue

Silence RUF001 or replace ambiguous unicode; add Optional for types

Ruff flags the Unicode ×/− in literals. Easiest: per‑file suppression. Also prep for Optional return types.

+ # ruff: noqa: RUF001  # allow ×/− in prompt examples
 import re
 from math import isclose
-from typing import Dict, List, Tuple
+from typing import Dict, List, Optional
📝 Committable suggestion


Suggested change
import re
from math import isclose
from typing import Dict, List, Tuple
# ruff: noqa: RUF001 # allow ×/− in prompt examples
import re
from math import isclose
from typing import Dict, List, Optional
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py lines 1-4, Ruff warns about ambiguous
Unicode characters (×/−) and the comment requests preparing return types as
Optional; to fix, add a per-file Ruff suppression directive at the top (e.g., a
single-line ruff noqa comment targeting RUF001) or replace any ambiguous Unicode
multiplication/division/minus characters in literals with ASCII equivalents (x,
*, -) throughout the file, and update any function return type hints that may be
nullable to use typing.Optional[...] (import Optional if not already) so
signatures reflect possible None returns.

Comment on lines +125 to +131
def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = doc["problem_text"]
    if doc["unit"].strip():
        question = question + " The unit of the answer is " + doc["unit"] + "."
    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."

⚠️ Potential issue

Use pre/post prompts; add docstring; avoid unused vars (F841)

Also keeps lines within 88 chars.

-def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."
+def scibench_multishot_doc_to_text(
+    doc: Dict, lmms_eval_specific_kwargs: Dict
+) -> str:
+    """Build the multi-shot prompt for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = str(doc.get("problem_text", ""))
+    unit = str(doc.get("unit") or "").strip()
+    if unit:
+        question = f"{question} The unit of the answer is {unit}."
+    return (
+        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
+        "Answer: Let's think step by step."
+        f"{post_prompt}"
+    )
📝 Committable suggestion


Suggested change
def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = doc["problem_text"]
    if doc["unit"].strip():
        question = question + " The unit of the answer is " + doc["unit"] + "."
    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."

def scibench_multishot_doc_to_text(
    doc: Dict, lmms_eval_specific_kwargs: Dict
) -> str:
    """Build the multi-shot prompt for SciBench."""
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = str(doc.get("problem_text", ""))
    unit = str(doc.get("unit") or "").strip()
    if unit:
        question = f"{question} The unit of the answer is {unit}."
    return (
        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
        "Answer: Let's think step by step."
        f"{post_prompt}"
    )
🧰 Tools
🪛 Ruff (0.13.1)

126-126: Local variable pre_prompt is assigned to but never used

Remove assignment to unused variable pre_prompt

(F841)


127-127: Local variable post_prompt is assigned to but never used

Remove assignment to unused variable post_prompt

(F841)

🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 125-131, the function
currently defines pre_prompt and post_prompt but never uses them, lacks a
docstring, and has long concatenated strings; update the function to include a
short docstring describing parameters and return value, use pre_prompt before
the question and post_prompt after the answer (combine with FEWSHOT_PROMPT),
replace string concatenation with f-strings to avoid unused variable lint
(F841), and reflow string construction so no line exceeds ~88 characters.

@Luodian merged commit df477b1 into EvolvingLMMs-Lab:main Sep 25, 2025
2 checks passed