add scibench(math) task #834
Conversation
Walkthrough

Adds SciBench (single- and multi-shot) and MEDQA task configurations plus two utility modules implementing prompt construction, parsing, normalization, target extraction, and result-processing hooks wired into the evaluation configs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Runner
    participant Config as Task YAML
    participant Utils as Task Utils
    participant Model as LM
    Runner->>Config: Load task config (dataset, mappings, metrics)
    Runner->>Utils: doc_to_text(doc, lmms_eval_specific_kwargs)
    Utils-->>Runner: Prompt text
    Runner->>Model: Send prompt -> generate completion
    Model-->>Runner: Completion / answers
    Runner->>Utils: process_results(doc, [completion])
    Utils-->>Runner: {"metric_name": value}
    Runner->>Runner: Aggregate metrics (e.g., mean accuracy)
    note right of Runner: Reports final metric (higher_is_better)
```
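To make the flow concrete, here is a minimal sketch of how the per-document hooks are exercised. It is illustrative only, not the actual lmms-eval runner code; the function and parameter names are assumptions that mirror the diagram above.

```python
from typing import Callable, Dict, List


def run_one_doc(
    doc: Dict,
    doc_to_text: Callable[[Dict, Dict], str],
    generate: Callable[[str], str],
    process_results: Callable[[Dict, List[str]], Dict[str, float]],
    task_kwargs: Dict,
) -> Dict[str, float]:
    """Per-document loop mirroring the diagram above (illustrative only)."""
    prompt = doc_to_text(doc, task_kwargs)      # build the prompt from the doc
    completion = generate(prompt)               # model returns a completion string
    return process_results(doc, [completion])   # e.g. {"accuracy": 1.0}
```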
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (6)
lmms_eval/tasks/scibench/utils.py (5)
28-35: Guard optional unit and add a minimal docstring.

```diff
-def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return f"{pre_prompt}{question}{post_prompt}"
+def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
+    """Single-shot prompt builder for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = doc["problem_text"]
+    unit = str(doc.get("unit", "") or "").strip()
+    if unit:
+        question += f" The unit of the answer is {unit}."
+    return f"{pre_prompt}{question}{post_prompt}"
```
118-125: Use `pre_prompt`/`post_prompt` in the multishot prompt; add a docstring.

```diff
-def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."
+def scibench_multishot_doc_to_text(
+    doc: Dict, lmms_eval_specific_kwargs: Dict
+) -> str:
+    """Few-shot prompt builder for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = doc["problem_text"]
+    unit = str(doc.get("unit", "") or "").strip()
+    if unit:
+        question += f" The unit of the answer is {unit}."
+    return (
+        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
+        f"Answer: Let's think step by step.{post_prompt}"
+    )
```
88-103: Remove debug prints, add type annotations, and use the `unit` parameter.

```diff
-def equiv_with_unit(model_output, answer, unit):
-    model_output=model_output.replace(',', '')
-    print("Model_output: ", model_output)
-    try:
-        ans=float(answer.strip())
-        first=isclose(float(model_output.strip()), ans, rel_tol=0.05)
-    except:
-        first=False
-    try:
-        model=model_output.strip().split()[0]
-        second=isclose(float(model.strip()), ans, rel_tol=0.05)
-    except:
-        second=False
-    if first or second:
-        return True
-    return False
+def equiv_with_unit(model_output: str, answer: str, unit: str) -> bool:
+    """Compare numeric values, ignoring commas and an optional trailing unit."""
+    try:
+        ans = float(clean_number_string(answer))
+    except (TypeError, ValueError):
+        return False
+    candidates = [
+        model_output,
+        model_output.split()[0] if model_output.split() else model_output,
+    ]
+    for c in candidates:
+        try:
+            c_num = clean_number_string(c.replace(",", "").replace(unit, ""))
+            if isclose(float(c_num), ans, rel_tol=0.05):
+                return True
+        except (TypeError, ValueError):
+            continue
+    return False
```
105-107: Replace the ambiguous Unicode minus with an escape to satisfy Ruff (RUF001).

```diff
-def clean_number_string(s):
-    return s.replace(",", "").replace("−", "-").strip()
+def clean_number_string(s: str) -> str:
+    # \u2212 is the Unicode MINUS SIGN; normalize to ASCII hyphen-minus.
+    return s.replace(",", "").replace("\u2212", "-").strip()
```
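For reference, a small self-contained check of what the normalizer and the 5% relative tolerance used for grading actually do; the sample values are invented, and the function body mirrors the suggestion above.

```python
from math import isclose


def clean_number_string(s: str) -> str:
    # Mirrors the suggested utility above.
    return s.replace(",", "").replace("\u2212", "-").strip()


assert clean_number_string(" 1,234.5 ") == "1234.5"
assert float(clean_number_string("\u22120.05")) == -0.05   # Unicode minus sign
# Grading tolerance: 50.5 counts as correct against a gold value of 50.7.
assert isclose(float(clean_number_string("50.5")), 50.7, rel_tol=0.05)
```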
5-26: Replace ambiguous `×` with ASCII `x` or LaTeX `\times` in FEWSHOT_PROMPT.

Ruff flags MULTIPLICATION SIGN (RUF001); the change also improves copy/paste robustness.

```diff
-... P = (10.0 mol × 0.0821 L·atm/(mol·K) × 300 K) ÷ 4.860 L = 246.3 L·atm ÷ 4.860 L ≈ 50.7 atm. ...
+... P = (10.0 mol x 0.0821 L·atm/(mol·K) x 300 K) ÷ 4.860 L = 246.3 L·atm ÷ 4.860 L ≈ 50.7 atm. ...
-... Δμ = (8.314 J/(mol·K))(313.15K)ln(29.5/1.8). The pressure ratio 29.5/1.8 ≈ 16.39 gives ln(16.39) ≈ 2.797, so Δμ = 8.314 × 313.15 × 2.797 ≈ 7274.5 J/mol ...
+... Δμ = (8.314 J/(mol·K))(313.15K)ln(29.5/1.8). The pressure ratio 29.5/1.8 ≈ 16.39 gives ln(16.39) ≈ 2.797, so Δμ = 8.314 x 313.15 x 2.797 ≈ 7274.5 J/mol ...
-... the numerator at 45° as 1.697×10⁻² m/s² ...
+... the numerator at 45° as 1.697x10⁻² m/s² ...
```

(Apply similarly to any remaining `×` in the block.)

lmms_eval/tasks/scibench/scibench.yaml (1)
12-15: Clarify unit guidance to avoid boxed-unit leakage.

The pre-prompt bans units in the answer; `doc_to_text` also appends "The unit of the answer is …". Consider clarifying: "Do not include the unit inside the boxed number."
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- lmms_eval/tasks/scibench/scibench.yaml (1 hunks)
- lmms_eval/tasks/scibench/scibench_multishot.yaml (1 hunks)
- lmms_eval/tasks/scibench/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{yml,yaml,json}
📄 CodeRabbit inference engine (CLAUDE.md)
Use Prettier (via pre-commit) to format YAML and JSON files
Files:
- lmms_eval/tasks/scibench/scibench_multishot.yaml
- lmms_eval/tasks/scibench/scibench.yaml
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly
Files:
lmms_eval/tasks/scibench/utils.py
🪛 Ruff (0.13.1)
lmms_eval/tasks/scibench/utils.py
10-10: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
10-10: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
17-17: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
17-17: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
24-24: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
68-68: Do not use bare except
(E722)
84-84: Consider moving this statement to an else block
(TRY300)
85-85: Do not use bare except
(E722)
88-88: Unused function argument: unit
(ARG001)
94-94: Do not use bare except
(E722)
99-99: Do not use bare except
(E722)
106-106: String contains ambiguous − (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?
(RUF001)
119-119: Local variable pre_prompt is assigned to but never used
Remove assignment to unused variable pre_prompt
(F841)
120-120: Local variable post_prompt is assigned to but never used
Remove assignment to unused variable post_prompt
(F841)
🔇 Additional comments (2)
lmms_eval/tasks/scibench/scibench.yaml (1)
4-4: Confirm split selection; using `train` as the evaluation split is unusual. Switch to `test` if available.

```diff
-test_split: train
+test_split: test
```

lmms_eval/tasks/scibench/scibench_multishot.yaml (1)
8-8: Confirm target field exists — resolved.
SciBench includes a numeric ground-truth field "answer_number" (e.g., 1.5, -1.49); keeping doc_to_target: "answer_number" is correct.
On lmms_eval/tasks/scibench/scibench_multishot.yaml, line 1 of the new file (`dataset_path: lmms-lab/SuperGPQA`):
Wrong dataset referenced; points to SuperGPQA instead of SciBench.
This will evaluate the wrong corpus.
```diff
-dataset_path: lmms-lab/SuperGPQA
+dataset_path: lmms-lab/SciBench
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/scibench_multishot.yaml around line 1, the
dataset_path is set to lmms-lab/SuperGPQA which references the wrong corpus;
replace this value with the correct SciBench dataset path (e.g., the proper
repository/dataset identifier for SciBench used elsewhere in the project) so the
task evaluates the SciBench corpus instead of SuperGPQA, and verify the dataset
name matches other configs and is accessible.
lmms_eval/tasks/scibench/utils.py (lines 36–45):

```python
def extract_boxed_answers(text):
    # Find all boxed contents
    matches = re.findall(r'boxed{([^}]*)}', text)
    for m in matches:
        # Strip spaces
        candidate = m.strip()
        # Keep only the numeric ones (int or decimal, with optional sign)
        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
            return candidate
    return None
```
Fix boxed-answer extraction: current regex never matches \boxed{} and breaks grading.
Pattern is missing the backslash and rejects scientific notation; accuracy will be 0 for valid outputs.
Apply this diff:
```diff
-def extract_boxed_answers(text):
-    # Find all boxed contents
-    matches = re.findall(r'boxed{([^}]*)}', text)
-    for m in matches:
-        # Strip spaces
-        candidate = m.strip()
-        # Keep only the numeric ones (int or decimal, with optional sign)
-        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
-            return candidate
-    return None
+from typing import Optional
+
+def extract_boxed_answers(text: str) -> Optional[str]:
+    """Return the first numeric token inside a \boxed{...} span, if any."""
+    matches = re.findall(r'\\boxed\{([^}]*)\}', text)
+    for m in matches:
+        candidate = m.strip()
+        # Supports ints/decimals and optional scientific notation (e/E).
+        mnum = re.search(
+            r'[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?', candidate
+        )
+        if mnum:
+            return mnum.group(0)
+    return None
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 36 to 45, the regex used to
find boxed answers fails because it omits the backslash in `\boxed{}` and its
numeric check rejects scientific notation; update the extraction pattern to
match LaTeX boxed macros (e.g. use a raw pattern that searches for a literal
backslash like \\boxed\{([^}]*)\}) and replace the numeric validation with a
regex that accepts integers, decimals and scientific notation (e.g. allow
optional sign, digits with optional decimal point or leading decimal, and
optional exponent like [eE][+-]?\d+); keep trimming whitespace from the captured
group and return the numeric string when it matches the improved numeric
pattern, otherwise continue or return None as before.
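As a quick illustration of why the corrected pattern matters, here is a self-contained check of the fixed extractor on made-up completions; the function body mirrors the suggested diff above.

```python
import re
from typing import Optional


def extract_boxed_answers(text: str) -> Optional[str]:
    # Same logic as the suggested fix above.
    matches = re.findall(r'\\boxed\{([^}]*)\}', text)
    for m in matches:
        candidate = m.strip()
        mnum = re.search(
            r'[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?', candidate
        )
        if mnum:
            return mnum.group(0)
    return None


assert extract_boxed_answers(r"... so the pressure is \boxed{50.7} atm") == "50.7"
assert extract_boxed_answers(r"final result: \boxed{1.697e-2}") == "1.697e-2"
assert extract_boxed_answers("no boxed value here") is None
```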
lmms_eval/tasks/scibench/utils.py (lines 47–87):

```python
def remove_not(x):
    match_number = re.compile('[\$]?\ *10\^[{]?\ *-?[0-9]+\ *[}]?\ *[\$]?')
    result=re.findall(match_number, x)
    if len(result) !=0:
        return re.split(match_number, x)[-1]
    return None


def cal_not(inputs):
    try:
        x,ab=list(inputs)
        match_number = re.compile('10\^[{]?\ *-?[0-9]+\ *[}]?')
        ab=re.findall(match_number, ab)[0]
        ab=ab[ab.find('^')+1:]
        if '{' in ab:
            ab=ab[ab.find('{')+1:]
        if '}' in ab:
            ab=ab[:ab.find('}')]
        x=x.strip()
        out=float(x)*10**float(ab)
        # print(float(x)*10**float(ab))
        return str(out)
    except:
        print('error')
        return inputs


def parse_not(inputs):
    try:
        if not inputs:
            return '',''
        if '\\times' in inputs:
            x,ab=inputs.split('\\times')
        elif '\times' in inputs:
            x,ab=inputs.split('\times')
        elif '*' in inputs:
            x,ab=inputs.split('*')
        else:
            return inputs
        return x,ab
    except:
        return '',''
```
🛠️ Refactor suggestion
🧩 Analysis chain
Remove or fix unused helpers and bare except blocks.
These helpers are unused here and contain except: (E722). Prefer removal to reduce surface area.
Run to confirm they’re unused before removal:
If unused, delete them; otherwise, add type hints and narrow exceptions.
🏁 Script executed:

```bash
#!/bin/bash
rg -nP '\b(remove_not|cal_not|parse_not|equiv_with_unit)\b' -g '!**/test/**' -S -C2
```

Length of output: 148

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "rg version: $(rg --version 2>/dev/null || true)"
rg -nP '\b(remove_not|cal_not|parse_not|equiv_with_unit)\b' -S -C2 || true
echo "---- definitions ----"
rg -nP '^def\s+(remove_not|cal_not|parse_not)\s*\(' -n -S || true
echo "---- show lmms_eval/tasks/scibench/utils.py (if present) ----"
if [ -f "lmms_eval/tasks/scibench/utils.py" ]; then
  sed -n '1,240p' lmms_eval/tasks/scibench/utils.py || true
else
  echo "lmms_eval/tasks/scibench/utils.py not found"
fi
```

Length of output: 8084
Remove unused helpers and replace bare except blocks
remove_not, cal_not, parse_not are only defined in lmms_eval/tasks/scibench/utils.py (≈ lines 47–87) and contain bare except: handlers — delete them to reduce surface area. If you keep them, add type hints and replace bare except with specific exceptions (e.g., ValueError/IndexError) and return/raise explicit errors instead of printing.
🧰 Tools
🪛 Ruff (0.13.1)
68-68: Do not use bare except
(E722)
84-84: Consider moving this statement to an else block
(TRY300)
85-85: Do not use bare except
(E722)
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 47–87 the helper functions
remove_not, cal_not, and parse_not are unused and contain bare except: handlers;
either delete these three functions to reduce surface area, or (if you intend to
keep them) add precise type hints for inputs/returns, replace bare except
clauses with specific exceptions (e.g., ValueError, IndexError, AttributeError)
and handle errors explicitly by returning well-typed sentinel values or raising
informative exceptions instead of printing, ensure regex lookups check for
non-empty matches before indexing, and remove any silent prints so behavior is
deterministic and testable.
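For context before deleting them, this is roughly what the helpers do with SciBench-style scientific notation. A small, assumed usage sketch; the import path follows this PR's module layout and the sample value is invented.

```python
from lmms_eval.tasks.scibench.utils import cal_not, parse_not

parts = parse_not(r"1.5\times10^{3}")  # splits into ("1.5", "10^{3}")
print(cal_not(parts))                  # "1500.0": mantissa times 10**exponent
```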
lmms_eval/tasks/scibench/utils.py (lines 108–116):

```python
def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
    pred = result[0]
    pred = extract_boxed_answers(pred)
    if pred:
        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
        score = 1 if res_equiv else 0
    else:
        score = 0
    return {"accuracy": score}
```
Harden result processing: handle empty results and non‑string targets safely.
Avoids IndexError and type issues when answer_number is numeric.
```diff
-def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
-    pred = result[0]
-    pred = extract_boxed_answers(pred)
-    if pred:
-        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
-        score = 1 if res_equiv else 0
-    else:
-        score = 0
-    return {"accuracy": score}
+def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
+    """Compute accuracy with 5% tolerance using the first \boxed{} number."""
+    pred_text = result[0] if result else ""
+    boxed = extract_boxed_answers(pred_text)
+    if boxed is None:
+        return {"accuracy": 0}
+    try:
+        gold = float(clean_number_string(str(doc["answer_number"])))
+        pred_val = float(clean_number_string(boxed))
+        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
+    except (TypeError, ValueError):
+        return {"accuracy": 0}
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 108 to 116, the function
assumes result[0] exists and doc["answer_number"] is a string; update it to
first check that result is a non-empty list and that result[0] is a string
before processing, and coerce doc["answer_number"] to a string (or handle
numeric types) before cleaning/converting; wrap the float conversions in a
try/except (catch ValueError/TypeError) and return {"accuracy": 0} on any
parsing error so no IndexError or type errors can propagate.
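A small usage sketch of the hardened scorer, assuming the diff above is applied; the doc field name follows SciBench and the sample completions are invented.

```python
from lmms_eval.tasks.scibench.utils import scibench_process_results

doc = {"answer_number": "50.7"}
# Within the 5% relative tolerance -> scored correct.
print(scibench_process_results(doc, ["... therefore P = \\boxed{50.5} atm"]))  # {"accuracy": 1}
# No \boxed{} number in the completion -> scored 0.
print(scibench_process_results(doc, ["I am not sure."]))  # {"accuracy": 0}
# Empty result list no longer raises IndexError.
print(scibench_process_results(doc, []))  # {"accuracy": 0}
```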
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lmms_eval/tasks/medqa/utils.py (1)
92-123: Make parsing deterministic and case-insensitive; fix last-occurrence logic without NumPy.

Current logic can fail to detect lowercase letters and uses randomness when no match is found, making evaluations non-reproducible. Also, the "last occurrence" computation only considers the " space letter space " pattern. Make it case-insensitive, deterministic, and compute the last occurrence across all patterns.

```diff
-def _parse_multi_choice_response(response: str, all_choices: List[str]) -> str:
-    # Clean punctuation around the response
-    for ch in [",", ".", "!", "?", ";", ":", "'"]:
-        response = response.strip(ch)
-    response = " " + response + " "
-
-    candidates = []
-    # (A) style
-    for c in all_choices:
-        if f"({c})" in response:
-            candidates.append(c)
-
-    # plain letter surrounded by spaces
-    if len(candidates) == 0:
-        for c in all_choices:
-            if f" {c} " in response:
-                candidates.append(c)
-
-    # A., B., etc.
-    if len(candidates) == 0:
-        for c in all_choices:
-            if f"{c}." in response:
-                candidates.append(c)
-
-    if len(candidates) == 0:
-        return random.choice(all_choices)
-    if len(candidates) > 1:
-        # choose the last occurrence to mitigate explanations mentioning multiple letters
-        start_indexes = [response.rfind(f" {can} ") for can in candidates]
-        return candidates[int(np.argmax(start_indexes))]
-    return candidates[0]
+def _parse_multi_choice_response(response: str, all_choices: List[str]) -> str:
+    # Normalize casing and pad to simplify boundary searches
+    resp = f" {str(response).upper()} "
+
+    def last_pos(c: str) -> int:
+        # Consider common patterns: (A), A., A), (A, plain " A "
+        patterns = [f"({c})", f"{c}.", f"{c})", f"({c}", f" {c} "]
+        return max(resp.rfind(pat) for pat in patterns)
+
+    best_choice = None
+    best_idx = -1
+    for c in [ch.upper() for ch in all_choices]:
+        idx = last_pos(c)
+        if idx > best_idx:
+            best_idx = idx
+            best_choice = c
+
+    # Deterministic fallback if nothing matched
+    return best_choice if best_idx != -1 else all_choices[0]
```
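A quick look at how the rewritten parser would behave on typical completions, assuming the deterministic version above; the sample responses are invented.

```python
from lmms_eval.tasks.medqa.utils import _parse_multi_choice_response

choices = ["A", "B", "C", "D"]
print(_parse_multi_choice_response("The best answer is (b).", choices))       # "B" (case-insensitive)
print(_parse_multi_choice_response("A is tempting, but I pick D.", choices))  # "D" (last occurrence wins)
print(_parse_multi_choice_response("no letter mentioned", choices))           # "A" (deterministic fallback)
```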
🧹 Nitpick comments (6)
lmms_eval/tasks/scibench/utils.py (3)
27-33: Add a docstring and None-safe unit handling; keep lines ≤88 chars.

Prevents KeyError/AttributeError and documents the public API.

```diff
 def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return f"{pre_prompt}{question}{post_prompt}"
+    """Build the single-shot prompt for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = str(doc.get("problem_text", ""))
+    unit = str(doc.get("unit") or "").strip()
+    if unit:
+        question = f"{question} The unit of the answer is {unit}."
+    return f"{pre_prompt}{question}{post_prompt}"
```
110-111: Add type hints and a docstring for this public utility.

```diff
-def clean_number_string(s):
-    return s.replace(",", "").replace("−", "-").strip()
+def clean_number_string(s: str) -> str:
+    """Normalize numeric strings: strip, remove commas, normalize minus sign."""
+    return s.replace(",", "").replace("−", "-").strip()
```
92-107: Remove unused equiv_with_unit (lmms_eval/tasks/scibench/utils.py:92)
Function prints to stdout and uses bare excepts — delete it. If needed later, reintroduce it with type hints and no prints. Verified: a whole-repo search (rg / git grep / find) found only the definition at lmms_eval/tasks/scibench/utils.py:92; no callers.

lmms_eval/tasks/medqa/utils.py (3)
1-5: Drop unnecessary imports; avoid randomness dependency
`random` and `numpy` are only used in parsing; both can be removed with a deterministic parser. This also addresses S311 and improves reproducibility. Apply this diff:

```diff
-import random
-from typing import Any, Dict, List
-
-import numpy as np
+from typing import Any, Dict, List
```
32-35: Avoid KeyError on pre/post prompt; provide safe defaults.

Use .get with a default and keep behavior stable if keys are missing.

```diff
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    prompt = f"{question}\n{options_block}"
-    return f"{pre_prompt}{prompt}{post_prompt}"
+    pre_prompt = str(lmms_eval_specific_kwargs.get("pre_prompt", ""))
+    post_prompt = str(lmms_eval_specific_kwargs.get("post_prompt", ""))
+    prompt = f"{question}\n{options_block}" if options_block else question
+    return f"{pre_prompt}{prompt}{post_prompt}"
```
66-76: Add a docstring to this public API.

Public APIs must have docstrings.

```diff
-def medqa_doc_to_choice(doc: Dict[str, Any]) -> List[str]:
-    # Detect how many choices are present and return corresponding letters
+def medqa_doc_to_choice(doc: Dict[str, Any]) -> List[str]:
+    """
+    Infer present choice letters (A–E) from the options structure.
+    """
+    # Detect how many choices are present and return corresponding letters
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- lmms_eval/tasks/medqa/medqa.yaml (1 hunks)
- lmms_eval/tasks/medqa/utils.py (1 hunks)
- lmms_eval/tasks/scibench/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly
Files:
- lmms_eval/tasks/medqa/utils.py
- lmms_eval/tasks/scibench/utils.py
**/*.{yml,yaml,json}
📄 CodeRabbit inference engine (CLAUDE.md)
Use Prettier (via pre-commit) to format YAML and JSON files
Files:
lmms_eval/tasks/medqa/medqa.yaml
🪛 Ruff (0.13.1)
lmms_eval/tasks/medqa/utils.py
117-117: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
lmms_eval/tasks/scibench/utils.py
9-9: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
9-9: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
16-16: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
16-16: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
23-23: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF001)
70-70: Do not use bare except
(E722)
87-87: Consider moving this statement to an else block
(TRY300)
88-88: Do not use bare except
(E722)
92-92: Unused function argument: unit
(ARG001)
98-98: Do not use bare except
(E722)
103-103: Do not use bare except
(E722)
111-111: String contains ambiguous − (MINUS SIGN). Did you mean - (HYPHEN-MINUS)?
(RUF001)
126-126: Local variable pre_prompt is assigned to but never used
Remove assignment to unused variable pre_prompt
(F841)
127-127: Local variable post_prompt is assigned to but never used
Remove assignment to unused variable post_prompt
(F841)
🔇 Additional comments (5)
lmms_eval/tasks/scibench/utils.py (4)
5-24: FEWSHOT_PROMPT content looks good for anchoring multi‑shot examples
48-53: Remove unused helpers and bare `except:` blocks.

These introduce noise and lint errors (E722/F401) and aren't used.

```diff
-def remove_not(x):
-    match_number = re.compile("[\$]?\ *10\^[{]?\ *-?[0-9]+\ *[}]?\ *[\$]?")
-    result = re.findall(match_number, x)
-    if len(result) != 0:
-        return re.split(match_number, x)[-1]
-    return None
-
-
-def cal_not(inputs):
-    try:
-        x, ab = list(inputs)
-        match_number = re.compile("10\^[{]?\ *-?[0-9]+\ *[}]?")
-        ab = re.findall(match_number, ab)[0]
-        ab = ab[ab.find("^") + 1 :]
-        if "{" in ab:
-            ab = ab[ab.find("{") + 1 :]
-        if "}" in ab:
-            ab = ab[: ab.find("}")]
-        x = x.strip()
-        out = float(x) * 10 ** float(ab)
-        # print(float(x)*10**float(ab))
-        return str(out)
-    except:
-        print("error")
-        return inputs
-
-
-def parse_not(inputs):
-    try:
-        if not inputs:
-            return "", ""
-        if "\\times" in inputs:
-            x, ab = inputs.split("\\times")
-        elif "\times" in inputs:
-            x, ab = inputs.split("\times")
-        elif "*" in inputs:
-            x, ab = inputs.split("*")
-        else:
-            return inputs
-        return x, ab
-    except:
-        return "", ""
```

Also applies to: 56-73, 75-90
36-45: Fix boxed-answer extraction: missing backslash; accept scientific notation.

The current regex never matches \boxed and rejects 1e-3; this breaks grading.

```diff
-def extract_boxed_answers(text):
-    # Find all boxed contents
-    matches = re.findall(r"boxed{([^}]*)}", text)
-    for m in matches:
-        # Strip spaces
-        candidate = m.strip()
-        # Keep only the numeric ones (int or decimal, with optional sign)
-        if re.fullmatch(r"[-+]?\d*\.?\d+", candidate):
-            return candidate
-    return None
+def extract_boxed_answers(text: str) -> Optional[str]:
+    """Return the first numeric token inside a \boxed{...} span, if any."""
+    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
+    for m in matches:
+        candidate = m.strip()
+        num_pat = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"
+        mnum = re.search(num_pat, candidate)
+        if mnum:
+            return mnum.group(0)
+    return None
```
114-122: Harden result processing: empty results, typing, and 88-char wrapping.

Prevents IndexError/type errors; documents the public API.

```diff
-def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
-    pred = result[0]
-    pred = extract_boxed_answers(pred)
-    if pred:
-        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
-        score = 1 if res_equiv else 0
-    else:
-        score = 0
-    return {"accuracy": score}
+def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
+    """Compute accuracy with 5% tolerance using the first \\boxed{} number."""
+    pred_text = result[0] if result else ""
+    boxed = extract_boxed_answers(pred_text)
+    if boxed is None:
+        return {"accuracy": 0}
+    try:
+        gold = float(clean_number_string(str(doc["answer_number"])))
+        pred_val = float(clean_number_string(boxed))
+        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
+    except (TypeError, ValueError):
+        return {"accuracy": 0}
```

lmms_eval/tasks/medqa/medqa.yaml (1)
7-11: No change needed — `!function utils.*` resolves to the task-local utils.py.

`import_function` in lmms_eval/utils.py builds module_path from os.path.dirname(loader.name), so `!function utils.<fn>` loads lmms_eval/tasks/medqa/utils.py (which defines the referenced functions).
lmms_eval/tasks/medqa/utils.py (around line 7):

```python
import numpy as np


def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):
```
🛠️ Refactor suggestion
Add return type annotation for public API
Type hints are required. Annotate return type.
```diff
-def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):
+def medqa_doc_to_text(
+    doc: Dict[str, Any],
+    lmms_eval_specific_kwargs: Dict[str, Any],
+) -> str:
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around line 7, the function signature def
medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str,
Any]): is missing a return type annotation; update the signature to include an
explicit return type (e.g., -> str) that matches the function's actual return
value, and add any needed typing imports (from typing import Any, Dict,
Optional, Union) if you choose a more complex type.
lmms_eval/tasks/medqa/utils.py (lines 20–31):

```python
    options = doc.get("options")
    if isinstance(options, dict):
        # Keep only A-E in sorted letter order if present
        ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
    elif isinstance(options, list):
        letters = ["A", "B", "C", "D", "E"]
        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
    else:
        # Fallback: try to format if already string-like
        options_block = str(options) if options is not None else ""
```
Fix potential IndexError when options is a list; wrap long lines
Enumerating beyond 5 options will raise IndexError. Use zip with fixed letters. Also wrap long joins to respect line length.
```diff
-    # Normalize options into A..E style lines
+    # Normalize options into A..E style lines
     options = doc.get("options")
     if isinstance(options, dict):
         # Keep only A-E in sorted letter order if present
         ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
-        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
+        options_block = "\n".join(
+            f"{k}. {str(options[k]).strip()}" for k in ordered_keys
+        )
     elif isinstance(options, list):
         letters = ["A", "B", "C", "D", "E"]
-        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
+        options_block = "\n".join(
+            f"{letter}. {str(opt).strip()}" for letter, opt in zip(letters, options)
+        )
     else:
         # Fallback: try to format if already string-like
         options_block = str(options) if options is not None else ""
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 20 to 31, the branch that formats
options when options is a list can raise IndexError if there are more than five
options and also contains long unwrapped joins; replace the enumerate-based
indexing with a safe zip between a fixed list of letters ["A","B","C","D","E"]
and the options list so extra options are ignored without throwing, and when
constructing the options_block join, wrap or truncate long option strings (e.g.,
use a textwrap.fill or truncate each option to a max width before joining) to
keep line length reasonable.
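A sanity check of the zip-based formatting on toy data, assuming the suggested change is applied:

```python
letters = ["A", "B", "C", "D", "E"]
options = ["aspirin", "ibuprofen", "naproxen", "placebo", "none of the above", "extra"]

# zip stops at the shorter sequence, so the sixth entry is ignored
# instead of raising IndexError on letters[5].
options_block = "\n".join(
    f"{letter}. {str(opt).strip()}" for letter, opt in zip(letters, options)
)
print(options_block)
# A. aspirin
# B. ibuprofen
# C. naproxen
# D. placebo
# E. none of the above
```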
lmms_eval/tasks/medqa/utils.py (lines 38–46):

```python
def medqa_doc_to_target(doc: Dict[str, Any]):
    """
    Return the ground-truth answer letter.

    MEDQA on HF commonly provides either:
    - "answer_idx": a letter like "A"/"B"/... OR
    - "answer": a full string like "C" or the option text. We prioritize letter if available.
    """
    # Prefer explicit answer letter field when present
```
🛠️ Refactor suggestion
Add return type; normalize answer letter casing
Guarantee uppercase to match choice letters consistently.
```diff
-def medqa_doc_to_target(doc: Dict[str, Any]):
+def medqa_doc_to_target(doc: Dict[str, Any]) -> str:
@@
-    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
-        return doc["answer_idx"].strip()
+    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
+        return doc["answer_idx"].strip().upper()
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 38-46, add an explicit return
type for medqa_doc_to_target (str) and ensure the returned answer letter is
always uppercase: prefer doc.get("answer_idx") and if present return its
.upper(); otherwise take doc.get("answer"), strip whitespace, extract the first
non-empty character (or first token) and return it uppercased; if neither exists
return an empty string. Also ensure you handle non-string inputs by converting
to str before stripping/uppercasing.
lmms_eval/tasks/medqa/utils.py (lines 79–90):

```python
def medqa_process_results(doc: Dict[str, Any], result: List[str]):
    """
    Parse model output and compute accuracy against the gold letter.
    We robustly extract a single letter from the response.
    """
    response = result[0].strip()
    all_choices = medqa_doc_to_choice(doc)
    pred = _parse_multi_choice_response(response, all_choices)
    gt_ans = medqa_doc_to_target(doc)
    score = 1.0 if pred == gt_ans else 0.0
    return {"accuracy": score}
```
🛠️ Refactor suggestion
Type hints + guard empty result; ensure deterministic behavior
Add return type; handle empty result defensively.
```diff
-def medqa_process_results(doc: Dict[str, Any], result: List[str]):
+def medqa_process_results(doc: Dict[str, Any], result: List[str]) -> Dict[str, float]:
@@
-    response = result[0].strip()
+    response = str(result[0]).strip() if result else ""
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 79-90, the function lacks a
return type and doesn’t guard against an empty result list; add a return type
annotation (-> Dict[str, float]) and make the function defensive by checking if
not result or not result[0]. If the result is empty, return a deterministic
{"accuracy": 0.0}; otherwise proceed to compute response = result[0].strip(),
parse prediction, compare to gold, and return {"accuracy": score}.
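A brief sketch of the intended behavior once the guard and the deterministic parser are in place; field names are assumed from the MedQA dataset and the sample completion is invented.

```python
from lmms_eval.tasks.medqa.utils import medqa_process_results

doc = {
    "answer_idx": "B",
    "options": {"A": "aspirin", "B": "ibuprofen", "C": "placebo"},
}
print(medqa_process_results(doc, ["The correct choice is (B)."]))  # {"accuracy": 1.0}
print(medqa_process_results(doc, []))  # {"accuracy": 0.0} rather than an IndexError
```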
lmms_eval/tasks/scibench/utils.py (lines 1–4):

```python
import re
from math import isclose
from typing import Dict, List, Tuple
```
Silence RUF001 or replace ambiguous unicode; add Optional for types
Ruff flags the Unicode ×/− in literals. Easiest: per‑file suppression. Also prep for Optional return types.
```diff
+# ruff: noqa: RUF001  # allow ×/− in prompt examples
 import re
 from math import isclose
-from typing import Dict, List, Tuple
+from typing import Dict, List, Optional
```
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py lines 1-4, Ruff warns about ambiguous
Unicode characters (×/−) and the comment requests preparing return types as
Optional; to fix, add a per-file Ruff suppression directive at the top (e.g., a
single-line ruff noqa comment targeting RUF001) or replace any ambiguous Unicode
multiplication/division/minus characters in literals with ASCII equivalents (x,
*, -) throughout the file, and update any function return type hints that may be
nullable to use typing.Optional[...] (import Optional if not already) so
signatures reflect possible None returns.
lmms_eval/tasks/scibench/utils.py (lines 125–131):

```python
def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = doc["problem_text"]
    if doc["unit"].strip():
        question = question + " The unit of the answer is " + doc["unit"] + "."
    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."
```
Use pre/post prompts; add docstring; avoid unused vars (F841)
Also keeps lines within 88 chars.
```diff
-def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."
+def scibench_multishot_doc_to_text(
+    doc: Dict, lmms_eval_specific_kwargs: Dict
+) -> str:
+    """Build the multi-shot prompt for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = str(doc.get("problem_text", ""))
+    unit = str(doc.get("unit") or "").strip()
+    if unit:
+        question = f"{question} The unit of the answer is {unit}."
+    return (
+        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
+        "Answer: Let's think step by step."
+        f"{post_prompt}"
+    )
```
🧰 Tools
🪛 Ruff (0.13.1)
126-126: Local variable pre_prompt is assigned to but never used
Remove assignment to unused variable pre_prompt
(F841)
127-127: Local variable post_prompt is assigned to but never used
Remove assignment to unused variable post_prompt
(F841)
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 125-131, the function
currently defines pre_prompt and post_prompt but never uses them, lacks a
docstring, and has long concatenated strings; update the function to include a
short docstring describing parameters and return value, use pre_prompt before
the question and post_prompt after the answer (combine with FEWSHOT_PROMPT),
replace string concatenation with f-strings to avoid unused variable lint
(F841), and reflow string construction so no line exceeds ~88 characters.