
Conversation

@dongyh20 dongyh20 commented Jul 8, 2025

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you encounter lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Thank you for your contributions!

Summary by CodeRabbit

  • New Features
    • Introduced multiple new video-text evaluation tasks with diverse configurations, including single multiple-choice and open-ended question types, and support for audio-based evaluation.
    • Added robust utilities for video question-answering, such as prompt formatting, result aggregation, and accuracy scoring.
    • Integrated GPT-based evaluation and scoring for open-ended video tasks, supporting multiple model backends and custom prompt templates.
    • Provided extensive model-specific prompt customization for improved compatibility and evaluation accuracy across supported models.

@coderabbitai coderabbitai bot commented Jul 8, 2025

Walkthrough

A comprehensive set of files has been added to introduce and configure a new suite of video-based question-answering evaluation tasks under the "video-tt" domain. The changes include YAML configuration files for multiple task variants, utility modules for scoring, prompt construction, and result aggregation, and integration of GPT-based evaluation logic for open-ended responses.

Changes

  • lmms_eval/tasks/video-tt/_default_template.yaml: Added default YAML template for video-text tasks specifying dataset path and initialization parameters.
  • lmms_eval/tasks/video-tt/gpt_utils.py: Introduced GPT-based evaluation utilities: scoring, aggregation, prompt formatting, and error handling for video QA tasks.
  • lmms_eval/tasks/video-tt/utils.py: Added utility functions/constants for video QA: time/frame conversion, file path resolution, prompt building, and result aggregation.
  • lmms_eval/tasks/video-tt/videott_all.yaml, videott_all_audio.yaml, videott_single_mc.yaml, videott_single_mc_description.yaml: Added YAML configs for various video QA tasks specifying input/output processing, metrics, and model-specific prompt variants.
  • lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml, videott_no_leading_oe.yaml, videott_paraphrase_oe.yaml, videott_wrong_leading_oe.yaml: Added YAML configs for open-ended video QA task variants, each with tailored splits, prompts, and GPT-based scoring pipelines.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TaskConfig (YAML)
    participant Utils
    participant GPTUtils
    participant Model

    User->>TaskConfig: Selects and loads video-tt task config
    TaskConfig->>Utils: Calls doc_to_visual/doc_to_text for input prep
    TaskConfig->>Model: Sends formatted input for prediction
    Model-->>TaskConfig: Returns prediction
    TaskConfig->>GPTUtils: process_results (for open-ended tasks)
    GPTUtils->>GPT API: get_eval (question, answer, pred)
    GPT API-->>GPTUtils: Returns evaluation (yes/no, score)
    GPTUtils-->>TaskConfig: Returns processed score
    TaskConfig->>Utils: Aggregates results (aggregate_results/oe)
    Utils-->>User: Returns final metric(s)
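In code, the flow above corresponds roughly to the sketch below. Only videott_doc_to_visual, videott_doc_to_text, and gpt_utils.gpt_score_proccess are names taken from the new files (bound in the YAML via !function); the model object and its generate call are placeholders, not lmms-eval internals.

# Illustrative sketch of one open-ended video-tt evaluation step (not the harness code itself).
def evaluate_one_doc(doc, model, doc_to_visual, doc_to_text, process_results):
    visuals = doc_to_visual(doc)                             # resolve the video path(s) for this document
    prompt = doc_to_text(doc, lmms_eval_specific_kwargs={})  # build the textual prompt
    prediction = model.generate(visuals, prompt)             # placeholder call to the evaluated model
    return process_results(doc, [prediction])                # e.g. {"videott_open_ended_score": {...}}

The harness then feeds the per-document dicts to the aggregation hooks (e.g. aggregate_score / aggregate_accuracy in gpt_utils.py) to produce the final metric.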

Poem

🐇
A hop, a skip, a leap through time,
Video questions, answers in rhyme.
With YAML and code, the tasks align,
GPT checks if answers shine.
Frames and prompts, all in sync—
This rabbit’s proud, what do you think?
📹✨


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🧹 Nitpick comments (13)
lmms_eval/tasks/video-tt/_default_template.yaml (1)

1-5: Add missing newline + prefer canonical booleans

YAML-lint fails because the file is missing the terminating newline and uses capitalised booleans.
Fixing both keeps CI green and avoids diff-only churn later.

 dataset_path: lmms-lab/video-tt
 dataset_kwargs:
-  token: True
-  cache_dir: video-tt
-  video: True
+  token: true
+  cache_dir: video-tt
+  video: true
+
lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml (1)

32-32: Trim trailing whitespace

-  xcomposer2_4khd:
+  xcomposer2_4khd:
lmms_eval/tasks/video-tt/videott_all_audio.yaml (1)

32-32: Remove trailing spaces to satisfy YAML-lint

-  llava_vid:
+  llava_vid:
lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml (1)

32-32: Trailing whitespace

-  llava_vid:
+  llava_vid:
lmms_eval/tasks/video-tt/videott_single_mc.yaml (2)

2-3: Dangling indent inside commented block

YAML treats indented comments as part of the previous mapping level; keep comment indentation consistent to avoid accidental key insertion when uncommented later.

-# dataset_name: 'test_mc_new'
-  # From_YouTube: True
+# dataset_name: 'test_mc_new'
+# From_YouTube: True

33-33: Trim trailing whitespace

-    post_prompt: ""
+    post_prompt: ""
lmms_eval/tasks/video-tt/videott_single_mc_description.yaml (1)

32-32: Fix trailing spaces.

The static analysis tool detected trailing spaces on this line. Please remove them to maintain consistent formatting.

-  # qwen_vl:  
+  # qwen_vl:
lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml (1)

32-32: Fix trailing spaces.

The static analysis tool detected trailing spaces on this line. Please remove them to maintain consistent formatting.

-  # qwen_vl:  
+  # qwen_vl:
lmms_eval/tasks/video-tt/videott_all.yaml (1)

32-32: Fix trailing spaces.

The static analysis tool detected trailing spaces on this line. Please remove them to maintain consistent formatting.

-  # qwen_vl:  
+  # qwen_vl:
lmms_eval/tasks/video-tt/utils.py (2)

108-109: Simplify conditional expressions using .get() method.

The conditional expressions can be simplified using the .get() method as suggested by static analysis.

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
-    pre_promt = lmms_eval_specific_kwargs["pre_prompt"] if "pre_prompt" in lmms_eval_specific_kwargs else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
+    pre_promt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

189-189: Remove unnecessary f-string prefix.

The f-string doesn't contain any placeholders, so the f prefix is unnecessary.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}
lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (2)

32-32: Trailing whitespace flagged by yamllint
Line 32 contains stray spaces after the comment marker. While innocuous, it fails the pre-commit yamllint hook used in this repo.

-  # qwen_vl:·· 
+  # qwen_vl:

17-17: Minor docstring spelling
registed → registered for professionalism.

-# Note that the metric name can be either a registed metric function (such as the case for GQA) or a key name returned by process_results
+# Note that the metric name can be either a registered metric function (such as for GQA) or a key returned by process_results
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e24a7d8 and 77a9ed9.

📒 Files selected for processing (11)
  • lmms_eval/tasks/video-tt/_default_template.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/gpt_utils.py (1 hunks)
  • lmms_eval/tasks/video-tt/utils.py (1 hunks)
  • lmms_eval/tasks/video-tt/videott_all.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_all_audio.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_single_mc.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_single_mc_description.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lmms_eval/tasks/video-tt/utils.py (1)
lmms_eval/tasks/_task_utils/file_utils.py (1)
  • generate_submission_file (4-8)
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/video-tt/_default_template.yaml

[error] 5-5: no new line character at the end of file

(new-line-at-end-of-file)

lmms_eval/tasks/video-tt/videott_single_mc.yaml

[error] 33-33: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_all.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_single_mc_description.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_all_audio.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

🪛 Ruff (0.11.9)
lmms_eval/tasks/video-tt/utils.py

1-1: datetime imported but unused

Remove unused import: datetime

(F401)


2-2: json imported but unused

Remove unused import: json

(F401)


6-6: collections.defaultdict imported but unused

Remove unused import: collections.defaultdict

(F401)


8-8: typing.Dict imported but unused

Remove unused import

(F401)


8-8: typing.List imported but unused

Remove unused import

(F401)


8-8: typing.Optional imported but unused

Remove unused import

(F401)


8-8: typing.Union imported but unused

Remove unused import

(F401)


10-10: cv2 imported but unused

Remove unused import: cv2

(F401)


11-11: numpy imported but unused

Remove unused import: numpy

(F401)


15-15: lmms_eval.tasks._task_utils.file_utils.generate_submission_file imported but unused

Remove unused import: lmms_eval.tasks._task_utils.file_utils.generate_submission_file

(F401)


58-58: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


89-89: Local variable cache_dir is assigned to but never used

Remove assignment to unused variable cache_dir

(F841)


108-108: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


109-109: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


120-120: Do not use bare except

(E722)


123-123: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


124-124: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


189-189: f-string without any placeholders

Remove extraneous f prefix

(F541)


206-206: f-string without any placeholders

Remove extraneous f prefix

(F541)


238-238: Loop control variable k not used within loop body

(B007)


275-275: Loop control variable k not used within loop body

(B007)

lmms_eval/tasks/video-tt/gpt_utils.py

2-2: decord.VideoReader imported but unused

Remove unused import

(F401)


2-2: decord.cpu imported but unused

Remove unused import

(F401)


3-3: numpy imported but unused

Remove unused import: numpy

(F401)


5-5: sys imported but unused

Remove unused import: sys

(F401)


6-6: datetime imported but unused

Remove unused import: datetime

(F401)


7-7: json imported but unused

Remove unused import: json

(F401)


9-9: yaml imported but unused

Remove unused import: yaml

(F401)


12-12: openai imported but unused

Remove unused import: openai

(F401)


13-13: openai.OpenAI imported but unused

Remove unused import: openai.OpenAI

(F401)


15-15: Redefinition of unused ast from line 1

Remove definition: ast

(F811)


111-111: Undefined name e

(F821)


153-153: Local variable question is assigned to but never used

Remove assignment to unused variable question

(F841)


155-155: Local variable pred is assigned to but never used

Remove assignment to unused variable pred

(F841)


156-156: Local variable review is assigned to but never used

Remove assignment to unused variable review

(F841)


157-157: Local variable model_name is assigned to but never used

Remove assignment to unused variable model_name

(F841)


163-163: f-string without any placeholders

Remove extraneous f prefix

(F541)


193-193: Local variable review is assigned to but never used

Remove assignment to unused variable review

(F841)


194-194: Local variable model_name is assigned to but never used

Remove assignment to unused variable model_name

(F841)


261-261: f-string without any placeholders

Remove extraneous f prefix

(F541)

🪛 GitHub Actions: Lint
lmms_eval/tasks/video-tt/utils.py

[error] 1-1: Black formatting check failed. The file was reformatted by the black hook.


[error] 1-1: isort import sorting check failed. The file was modified by the isort hook.

lmms_eval/tasks/video-tt/gpt_utils.py

[error] 1-1: Black formatting check failed. The file was reformatted by the black hook.


[error] 1-1: isort import sorting check failed. The file was modified by the isort hook.

Comment on lines 3 to 16
task: videott_wrongly_led_oe
test_split: test_wrongly_led_oe
output_type: generate_until
doc_to_visual: !function utils.videott_doc_to_visual
doc_to_text: !function utils.videott_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 50
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
# The return value of process_results will be used by metrics
process_results: !function gpt_utils.gpt_score_proccess

⚠️ Potential issue

Task id & handler typo will break task registration

  1. task: value (videott_wrongly_led_oe) does not match the file name (videott_wrong_leading_oe.yaml).
  2. Handler name gpt_score_proccess is misspelled (double c).
    Either issue will raise an error at load time.
-task: videott_wrongly_led_oe
+task: videott_wrong_leading_oe

-process_results: !function gpt_utils.gpt_score_proccess
+process_results: !function gpt_utils.gpt_score_process
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
task: videott_wrongly_led_oe
test_split: test_wrongly_led_oe
output_type: generate_until
doc_to_visual: !function utils.videott_doc_to_visual
doc_to_text: !function utils.videott_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 50
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
# The return value of process_results will be used by metrics
process_results: !function gpt_utils.gpt_score_proccess
task: videott_wrong_leading_oe
test_split: test_wrongly_led_oe
output_type: generate_until
doc_to_visual: !function utils.videott_doc_to_visual
doc_to_text: !function utils.videott_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 50
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
# The return value of process_results will be used by metrics
process_results: !function gpt_utils.gpt_score_process
🤖 Prompt for AI Agents
In lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml lines 3 to 16, fix the
task id by changing the `task:` value from `videott_wrongly_led_oe` to
`videott_wrong_leading_oe` to match the file name. Also, correct the handler
name in `process_results` from `gpt_score_proccess` to `gpt_score_process` to
fix the typo. These changes will prevent task registration errors at load time.

Comment on lines +26 to +28
gpt4v:
  pre_prompt: ""
  post_prompt: "Answer the question with A, B, C, or D."

This comment was marked as off-topic.

Comment on lines +3 to +4
task: videott_correctly_led_oe
test_split: test_correctly_led_oe

⚠️ Potential issue

Task name/file name mismatch could break automatic discovery
The task id inside the YAML (videott_correctly_led_oe) does not match the file’s slug videott_correct_leading_oe.yaml. Loader logic in lmms-eval commonly relies on filename ↔ task-id symmetry; a mismatch risks the task being silently skipped or duplicated under two names.
Either rename the file or update the task: field so they are identical.

🤖 Prompt for AI Agents
In lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml at lines 3-4, the
task id 'videott_correctly_led_oe' does not match the filename slug
'videott_correct_leading_oe', causing potential issues with automatic task
discovery. To fix this, update the 'task:' field in the YAML to exactly match
the filename slug 'videott_correct_leading_oe' or rename the file to match the
current task id, ensuring consistency between the task id and filename.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (8)
lmms_eval/tasks/video-tt/gpt_utils.py (5)

1-15: Remove unused imports and fix import organization.

This duplicates the previous review comment about cleaning up unused imports. The static analysis correctly identifies multiple unused imports that should be removed.


111-111: Fix undefined variable reference.

This duplicates the previous review comment about the undefined variable e. The variable is referenced but not defined in this scope, causing a runtime error.


143-163: Remove commented-out code and unused variables.

This duplicates the previous review comment about removing the large block of commented-out code and unused variable assignments.


180-201: Remove commented-out code and implement GPT evaluation.

This duplicates the previous review comment about removing commented-out code. Additionally, like gpt_score_proccess, this function also returns default values without performing actual GPT evaluation.
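For reference, a minimal sketch of what a working implementation could look like, assuming the module's existing get_eval(question, answer, pred, max_tokens) helper returns a (review, model_name) tuple whose review is a Python-literal dict such as {'pred': 'yes', 'score': 4}; the data_dict keys other than "Correctness" and "score" are assumptions:

import ast

def gpt_score_process(doc, result):
    question = doc["question"]
    answer = doc["answer"]
    pred = result[0]
    try:
        # Ask the judge model for a verdict; ~100 tokens is enough for the small dict reply.
        review, model_name = get_eval(question, answer, pred, 100)
        parsed = ast.literal_eval(review)            # e.g. {'pred': 'yes', 'score': 4}
        scores = [parsed["pred"], parsed["score"]]
    except Exception as e:
        eval_logger.error(f"Error for Question ID: {doc.get('question_id', 'Unknown')}: {e}")
        scores = ["no", 0]                           # fall back to a failed review
    data_dict = {
        "question_id": doc.get("question_id"),      # assumed key, mirrors the error log above
        "Correctness": scores[0],
        "score": scores[1],
    }
    return {"videott_open_ended_score": data_dict}

Here get_eval and eval_logger are the helpers already defined at module level in gpt_utils.py.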


163-163: Remove unnecessary f-string prefix.

The f-string doesn't contain any placeholders, so the f prefix is unnecessary.

lmms_eval/tasks/video-tt/utils.py (3)

1-15: Remove unused imports.

This duplicates the previous review comment about cleaning up unused imports. The static analysis correctly identifies multiple unused imports that should be removed.


113-113: Replace hard-coded path with configurable parameter.

This duplicates the previous review comment about making the hard-coded path configurable via environment variables or parameters.
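A minimal sketch of the suggested change; the environment-variable name and the fallback path are placeholders, not the actual values in the file:

import os

def resolve_video_path(doc):
    # Let deployments override the video root via an env var (hypothetical name),
    # falling back to a placeholder for the previously hard-coded location.
    video_root = os.getenv("VIDEO_TT_VIDEO_DIR", "/path/to/video-tt/videos")
    return os.path.join(video_root, doc["video_id"] + ".mp4")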


117-117: Replace bare except with specific exception handling.

This duplicates the previous review comment about using specific exception types instead of bare except.

🧹 Nitpick comments (5)
lmms_eval/tasks/video-tt/gpt_utils.py (1)

135-135: Fix function name typo.

The function name gpt_score_proccess contains a typo; it should be gpt_score_process (extra 'c' in 'proccess').

-def gpt_score_proccess(doc, result):
+def gpt_score_process(doc, result):

Note: This change will require updating all references to this function in the YAML configuration files.

lmms_eval/tasks/video-tt/utils.py (4)

83-96: Remove dead code and clarify function purpose.

The videott_doc_to_visual_tos function contains a large block of commented-out code that should be removed. The function appears to construct URLs instead of local file paths, which should be documented.

def videott_doc_to_visual_tos(doc):
-    cache_dir = os.path.join(base_cache_dir, cache_name)
-    # import pdb;pdb.set_trace()
+    """
+    Constructs a TOS (remote) URL for video access instead of local file paths.
+    """
    video_path = doc["video_id"] + ".mp4"
    video_path = os.path.join("https://tosv.byted.org/obj/tiktok-maas-us/robustness-benchmark/", video_path)
-    # if os.path.exists(video_path):
-    #     video_path = video_path
-    # elif os.path.exists(video_path.replace("mp4", "MP4")):
-    #     video_path = video_path.replace("mp4", "MP4")
-    # elif os.path.exists(video_path.replace("mp4", "mkv")):
-    #     video_path = video_path.replace("mp4", "mkv")
-    # else:
-    #     sys.exit(f"video path:{video_path} does not exist, please check")
    return [video_path]

101-107: Simplify conditional logic using dict.get().

The conditional logic can be simplified using the dict.get() method as suggested by static analysis.

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
-    pre_promt = (
-        lmms_eval_specific_kwargs["pre_prompt"]
-        if "pre_prompt" in lmms_eval_specific_kwargs
-        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
-    )
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
+    pre_promt = lmms_eval_specific_kwargs.get(
+        "pre_prompt",
+        "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
+    )

190-190: Remove unnecessary f-string prefix.

The f-string doesn't contain any placeholders, so the f prefix is unnecessary.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}

208-208: Remove unnecessary f-string prefix.

The f-string doesn't contain any placeholders, so the f prefix is unnecessary.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77a9ed9 and 6c15459.

📒 Files selected for processing (2)
  • lmms_eval/tasks/video-tt/gpt_utils.py (1 hunks)
  • lmms_eval/tasks/video-tt/utils.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lmms_eval/tasks/video-tt/utils.py (1)
lmms_eval/tasks/_task_utils/file_utils.py (1)
  • generate_submission_file (4-8)
🪛 Ruff (0.11.9)
lmms_eval/tasks/video-tt/utils.py

1-1: datetime imported but unused

Remove unused import: datetime

(F401)


2-2: json imported but unused

Remove unused import: json

(F401)


6-6: collections.defaultdict imported but unused

Remove unused import: collections.defaultdict

(F401)


8-8: typing.Dict imported but unused

Remove unused import

(F401)


8-8: typing.List imported but unused

Remove unused import

(F401)


8-8: typing.Optional imported but unused

Remove unused import

(F401)


8-8: typing.Union imported but unused

Remove unused import

(F401)


10-10: cv2 imported but unused

Remove unused import: cv2

(F401)


11-11: numpy imported but unused

Remove unused import: numpy

(F401)


15-15: lmms_eval.tasks._task_utils.file_utils.generate_submission_file imported but unused

Remove unused import: lmms_eval.tasks._task_utils.file_utils.generate_submission_file

(F401)


56-56: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


84-84: Local variable cache_dir is assigned to but never used

Remove assignment to unused variable cache_dir

(F841)


101-101: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


103-105: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


117-117: Do not use bare except

(E722)


120-120: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


122-124: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


190-190: f-string without any placeholders

Remove extraneous f prefix

(F541)


208-208: f-string without any placeholders

Remove extraneous f prefix

(F541)


239-239: Loop control variable k not used within loop body

(B007)


274-274: Loop control variable k not used within loop body

(B007)

lmms_eval/tasks/video-tt/gpt_utils.py

2-2: datetime imported but unused

Remove unused import: datetime

(F401)


3-3: json imported but unused

Remove unused import: json

(F401)


5-5: sys imported but unused

Remove unused import: sys

(F401)


8-8: numpy imported but unused

Remove unused import: numpy

(F401)


9-9: openai imported but unused

Remove unused import: openai

(F401)


11-11: yaml imported but unused

Remove unused import: yaml

(F401)


12-12: decord.VideoReader imported but unused

Remove unused import

(F401)


12-12: decord.cpu imported but unused

Remove unused import

(F401)


14-14: openai.OpenAI imported but unused

Remove unused import: openai.OpenAI

(F401)


111-111: Undefined name e

(F821)


153-153: Local variable question is assigned to but never used

Remove assignment to unused variable question

(F841)


155-155: Local variable pred is assigned to but never used

Remove assignment to unused variable pred

(F841)


156-156: Local variable review is assigned to but never used

Remove assignment to unused variable review

(F841)


157-157: Local variable model_name is assigned to but never used

Remove assignment to unused variable model_name

(F841)


163-163: f-string without any placeholders

Remove extraneous f prefix

(F541)


193-193: Local variable review is assigned to but never used

Remove assignment to unused variable review

(F841)


194-194: Local variable model_name is assigned to but never used

Remove assignment to unused variable model_name

(F841)


261-261: f-string without any placeholders

Remove extraneous f prefix

(F541)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (4)
lmms_eval/tasks/video-tt/utils.py (2)

1-15: Clean up unused imports.

Multiple imports are unused and should be removed to improve code maintainability.


102-102: Replace bare except with specific exception handling.

Using bare except is discouraged as it can mask important errors.

lmms_eval/tasks/video-tt/gpt_utils.py (2)

1-15: Clean up unused imports.

Multiple imports are unused and should be removed to improve code maintainability.


111-111: Fix undefined variable reference.

The variable e is referenced but not defined in this scope.

🧹 Nitpick comments (6)
lmms_eval/tasks/video-tt/utils.py (4)

86-90: Simplify conditional logic using dict.get().

The conditional logic can be simplified using the get() method as suggested by static analysis.

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
-    pre_promt = (
-        lmms_eval_specific_kwargs["pre_prompt"]
-        if "pre_prompt" in lmms_eval_specific_kwargs
-        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
-    )
+    pre_promt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

105-109: Simplify conditional logic using dict.get().

Similar to the previous function, this can be simplified using the get() method.

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
-    pre_promt = (
-        lmms_eval_specific_kwargs["pre_prompt"]
-        if "pre_prompt" in lmms_eval_specific_kwargs
-        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
-    )
+    pre_promt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

177-177: Remove unnecessary f-string prefix.

The f-string has no placeholders and should be a regular string.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}

195-195: Remove unnecessary f-string prefix.

Similar to the previous function, this f-string has no placeholders.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}
lmms_eval/tasks/video-tt/gpt_utils.py (2)

163-163: Remove unnecessary f-string prefix.

The f-string has no placeholders and should be a regular string.

-    return {f"videott_open_ended_score": data_dict}
+    return {"videott_open_ended_score": data_dict}

222-222: Remove unnecessary f-string prefix.

Similar to previous instances, this f-string has no placeholders.

-    return {f"accuracy": pred == doc["answer"]}
+    return {"accuracy": pred == doc["answer"]}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c15459 and 1973d53.

📒 Files selected for processing (2)
  • lmms_eval/tasks/video-tt/gpt_utils.py (1 hunks)
  • lmms_eval/tasks/video-tt/utils.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lmms_eval/tasks/video-tt/utils.py (1)
lmms_eval/tasks/_task_utils/file_utils.py (1)
  • generate_submission_file (4-8)
🪛 Ruff (0.11.9)
lmms_eval/tasks/video-tt/gpt_utils.py

2-2: datetime imported but unused

Remove unused import: datetime

(F401)


3-3: json imported but unused

Remove unused import: json

(F401)


5-5: sys imported but unused

Remove unused import: sys

(F401)


8-8: numpy imported but unused

Remove unused import: numpy

(F401)


9-9: openai imported but unused

Remove unused import: openai

(F401)


11-11: yaml imported but unused

Remove unused import: yaml

(F401)


12-12: decord.VideoReader imported but unused

Remove unused import

(F401)


12-12: decord.cpu imported but unused

Remove unused import

(F401)


14-14: openai.OpenAI imported but unused

Remove unused import: openai.OpenAI

(F401)


111-111: Undefined name e

(F821)


157-157: Local variable model_name is assigned to but never used

Remove assignment to unused variable model_name

(F841)


163-163: f-string without any placeholders

Remove extraneous f prefix

(F541)


222-222: f-string without any placeholders

Remove extraneous f prefix

(F541)

lmms_eval/tasks/video-tt/utils.py

1-1: datetime imported but unused

Remove unused import: datetime

(F401)


2-2: json imported but unused

Remove unused import: json

(F401)


6-6: collections.defaultdict imported but unused

Remove unused import: collections.defaultdict

(F401)


8-8: typing.Dict imported but unused

Remove unused import

(F401)


8-8: typing.List imported but unused

Remove unused import

(F401)


8-8: typing.Optional imported but unused

Remove unused import

(F401)


8-8: typing.Union imported but unused

Remove unused import

(F401)


10-10: cv2 imported but unused

Remove unused import: cv2

(F401)


11-11: numpy imported but unused

Remove unused import: numpy

(F401)


15-15: lmms_eval.tasks._task_utils.file_utils.generate_submission_file imported but unused

Remove unused import: lmms_eval.tasks._task_utils.file_utils.generate_submission_file

(F401)


56-56: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


86-86: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


88-90: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


102-102: Do not use bare except

(E722)


105-105: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


107-109: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


177-177: f-string without any placeholders

Remove extraneous f prefix

(F541)


195-195: f-string without any placeholders

Remove extraneous f prefix

(F541)


226-226: Loop control variable k not used within loop body

(B007)


261-261: Loop control variable k not used within loop body

(B007)

🪛 GitHub Actions: Lint
lmms_eval/tasks/video-tt/gpt_utils.py

[error] 1-1: Black formatting check failed. The file was automatically reformatted by Black.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
lmms_eval/tasks/video-tt/gpt_utils.py (5)

1-14: Clean up unused imports to improve code maintainability.

Multiple imports are unused according to static analysis. This matches previous review feedback and should be addressed.

Remove the following unused imports:

  • datetime, json, sys (lines 2-5)
  • numpy, openai, yaml (lines 8-11)
  • VideoReader, cpu from decord (line 12)
  • OpenAI from openai (line 14)
 import ast
-import datetime
-import json
 import os
-import sys
 import time

-import numpy as np
-import openai
 import requests
-import yaml
-from decord import VideoReader, cpu
 from loguru import logger as eval_logger
-from openai import OpenAI

68-68: Fix data type inconsistency in example response.

The example shows a float score (4.8) while the instruction specifies INTEGER format, which could confuse the GPT model.

-        "For example, your response should look like this: {'pred': 'yes', 'score': 4.8}."
+        "For example, your response should look like this: {'pred': 'yes', 'score': 4}."

111-111: Fix undefined variable reference that will cause runtime error.

The variable e is referenced but not defined in this scope, causing a runtime error.

         else:  # If this was the last attempt, log and return empty
-            eval_logger.error(f"All {retries} attempts failed. Last error message: {e}")
+            eval_logger.error(f"All {retries} attempts failed.")

149-149: Increase max_tokens for meaningful GPT responses.

The max_tokens=1 is too low for GPT to generate the expected dictionary response format.

-        review, model_name = get_eval(question, answer, pred, 1)
+        review, model_name = get_eval(question, answer, pred, 100)

153-158: Remove redundant variable assignments in exception handler.

The exception block redundantly assigns variables that are not used afterward.

    except Exception as e:
        eval_logger.error(f"Error for Question ID: {doc.get('question_id', 'Unknown')}: {e}")
-        question = doc["question"]
-        answer = doc["answer"]
-        pred = result[0]
-        review = "Failed to Get a Proper Review."
-        model_name = "Failed Request"
        scores = ["no", 0]
🧹 Nitpick comments (2)
lmms_eval/tasks/video-tt/gpt_utils.py (2)

163-163: Remove unnecessary f-string prefix.

The f-string has no placeholders, making the f prefix unnecessary.

-    return {f"videott_open_ended_score": data_dict}
+    return {"videott_open_ended_score": data_dict}

223-223: Remove unnecessary f-string prefix.

The f-string has no placeholders, making the f prefix unnecessary.

-    return {f"accuracy": pred == doc["answer"]}
+    return {"accuracy": pred == doc["answer"]}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1973d53 and dc4d925.

📒 Files selected for processing (1)
  • lmms_eval/tasks/video-tt/gpt_utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
lmms_eval/tasks/video-tt/gpt_utils.py

2-2: datetime imported but unused

Remove unused import: datetime

(F401)


3-3: json imported but unused

Remove unused import: json

(F401)


5-5: sys imported but unused

Remove unused import: sys

(F401)


8-8: numpy imported but unused

Remove unused import: numpy

(F401)


9-9: openai imported but unused

Remove unused import: openai

(F401)


11-11: yaml imported but unused

Remove unused import: yaml

(F401)


12-12: decord.VideoReader imported but unused

Remove unused import

(F401)


12-12: decord.cpu imported but unused

Remove unused import

(F401)


14-14: openai.OpenAI imported but unused

Remove unused import: openai.OpenAI

(F401)


111-111: Undefined name e

(F821)


157-157: Local variable model_name is assigned to but never used

Remove assignment to unused variable model_name

(F841)


163-163: f-string without any placeholders

Remove extraneous f prefix

(F541)


223-223: f-string without any placeholders

Remove extraneous f prefix

(F541)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
lmms_eval/tasks/video-tt/utils.py (3)

1-15: Remove unused imports to clean up the code.

Multiple imports are flagged as unused by static analysis tools and should be removed to improve code maintainability.


62-62: Fix the AUDIO_PATH environment variable assignment.

The environment variable should be assigned only to AUDIO_PATH, not to both variables.
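A one-line sketch of the intended assignment: only AUDIO_PATH reads the environment variable, with an empty-string default (assumed) so the later "if not AUDIO_PATH" guard can skip subtitles cleanly.

import os

# Read the transcript root into AUDIO_PATH only; leave other path variables untouched.
AUDIO_PATH = os.getenv("AUDIO_PATH", "")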


102-102: Replace bare except with specific exception handling.

Using bare except is discouraged as it can mask important errors. Specify the expected exception types.

lmms_eval/tasks/video-tt/gpt_utils.py (1)

60-60: Fix inconsistent data type in example response.

The example shows a float score (4.8) but the instruction specifies INTEGER, which could confuse the GPT model.

🧹 Nitpick comments (9)
lmms_eval/tasks/video-tt/utils.py (6)

86-90: Simplify conditional assignments using dict.get().

Replace the conditional block with a more concise .get() method call.

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
-    pre_promt = (
-        lmms_eval_specific_kwargs["pre_prompt"]
-        if "pre_prompt" in lmms_eval_specific_kwargs
-        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
-    )
+    pre_promt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

105-109: Simplify conditional assignments using dict.get().

Similar to the previous function, use .get() method for cleaner code.

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
-    pre_promt = (
-        lmms_eval_specific_kwargs["pre_prompt"]
-        if "pre_prompt" in lmms_eval_specific_kwargs
-        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
-    )
+    pre_promt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

177-177: Remove unnecessary f-string prefix.

The f-string has no placeholders, so the f prefix is unnecessary.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}

195-195: Remove unnecessary f-string prefix.

The f-string has no placeholders, so the f prefix is unnecessary.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}

226-228: Simplify loop variable usage.

The loop variable k is not used in the loop body. Consider restructuring for clarity.

-    for k, v in category2score.items():
-        total_correct += v["correct"]
-        total_answered += v["answered"]
+    for v in category2score.values():
+        total_correct += v["correct"]
+        total_answered += v["answered"]

261-263: Simplify loop variable usage.

Similar to the above, the loop variable k is not used in the loop body.

-    for k, v in category2score.items():
-        total_correct += v["correct"]
-        total_answered += v["answered"]
+    for v in category2score.values():
+        total_correct += v["correct"]
+        total_answered += v["answered"]
lmms_eval/tasks/video-tt/gpt_utils.py (3)

78-106: Improve error handling structure and logging.

The retry logic is well-implemented, but there's a potential issue with the loop structure and return statements.

The function has two return statements at the end (lines 104 and 106) which creates unreachable code. Consider restructuring:

    for attempt in range(retries):
        try:
            response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
            response.raise_for_status()
            try:
                response_data = response.json()
            except requests.exceptions.JSONDecodeError:
                eval_logger.error(f"JSON decode error on attempt {attempt + 1}. Response text: {response.text}")
                continue
            content = response_data["choices"][0]["message"]["content"].strip()
            if content != "":
                return content, response_data["model"]
        except requests.exceptions.HTTPError as e:
            eval_logger.error(f"HTTP error on attempt {attempt + 1}: {e}")
        except requests.exceptions.RequestException as e:
            eval_logger.error(f"Request exception on attempt {attempt + 1}: {e}")
        except Exception as e:
            eval_logger.error(f"Unexpected error on attempt {attempt + 1}: {e}")

        if attempt < retries - 1:
            time.sleep(NUM_SECONDS_TO_SLEEP)

-    return "", ""
+    # All retries failed
+    eval_logger.error(f"All {retries} attempts failed.")
+    return "", ""

211-211: Remove unnecessary f-string prefix.

The f-string has no placeholders, so the f prefix is unnecessary.

-    return {f"accuracy": pred == doc["answer"]}
+    return {"accuracy": pred == doc["answer"]}

155-173: Consolidate duplicate aggregation functions.

The aggregate_score and aggregate_accuracy functions are nearly identical with only the return value differing. Consider consolidating them.

-def aggregate_score(results, args):
-    yes_count = 0
-    no_count = 0
-    total_score = 0
-
-    # Iterate over the results to count correctness and sum scores
-    for result_dict in results:
-        if result_dict["Correctness"] == "yes":
-            yes_count += 1
-        else:
-            no_count += 1
-        total_score += result_dict["score"]
-
-    # Calculate accuracy and average score
-    accuracy = yes_count / (yes_count + no_count) if (yes_count + no_count) > 0 else 0
-    average_score = total_score / len(results) if results else 0
-    eval_logger.info(f"Accuracy: {accuracy}")
-    eval_logger.info(f"Average Score: {average_score}")
-    return average_score
-
-
-def aggregate_accuracy(results, args):
-    yes_count = 0
-    no_count = 0
-    total_score = 0
-
-    # Iterate over the results to count correctness and sum scores
-    for result_dict in results:
-        if result_dict["Correctness"] == "yes":
-            yes_count += 1
-        else:
-            no_count += 1
-        total_score += result_dict["score"]
-
-    # Calculate accuracy and average score
-    accuracy = yes_count / (yes_count + no_count) if (yes_count + no_count) > 0 else 0
-    average_score = total_score / len(results) if results else 0
-    eval_logger.info(f"Accuracy: {accuracy}")
-    eval_logger.info(f"Average Score: {average_score}")
-    return accuracy * 100

+def _calculate_metrics(results):
+    """Helper function to calculate metrics from results."""
+    yes_count = sum(1 for result in results if result["Correctness"] == "yes")
+    no_count = len(results) - yes_count
+    total_score = sum(result["score"] for result in results)
+    
+    accuracy = yes_count / len(results) if results else 0
+    average_score = total_score / len(results) if results else 0
+    
+    eval_logger.info(f"Accuracy: {accuracy}")
+    eval_logger.info(f"Average Score: {average_score}")
+    
+    return accuracy, average_score
+
+def aggregate_score(results, args):
+    accuracy, average_score = _calculate_metrics(results)
+    return average_score
+
+def aggregate_accuracy(results, args):
+    accuracy, average_score = _calculate_metrics(results)
+    return accuracy * 100

Also applies to: 176-194

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc4d925 and a256442.

📒 Files selected for processing (2)
  • lmms_eval/tasks/video-tt/gpt_utils.py (1 hunks)
  • lmms_eval/tasks/video-tt/utils.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lmms_eval/tasks/video-tt/utils.py (1)
lmms_eval/tasks/_task_utils/file_utils.py (1)
  • generate_submission_file (4-8)
🪛 Ruff (0.11.9)
lmms_eval/tasks/video-tt/gpt_utils.py

211-211: f-string without any placeholders

Remove extraneous f prefix

(F541)

lmms_eval/tasks/video-tt/utils.py

1-1: datetime imported but unused

Remove unused import: datetime

(F401)


2-2: json imported but unused

Remove unused import: json

(F401)


6-6: collections.defaultdict imported but unused

Remove unused import: collections.defaultdict

(F401)


8-8: typing.Dict imported but unused

Remove unused import

(F401)


8-8: typing.List imported but unused

Remove unused import

(F401)


8-8: typing.Optional imported but unused

Remove unused import

(F401)


8-8: typing.Union imported but unused

Remove unused import

(F401)


10-10: cv2 imported but unused

Remove unused import: cv2

(F401)


11-11: numpy imported but unused

Remove unused import: numpy

(F401)


15-15: lmms_eval.tasks._task_utils.file_utils.generate_submission_file imported but unused

Remove unused import: lmms_eval.tasks._task_utils.file_utils.generate_submission_file

(F401)


56-56: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


86-86: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


88-90: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


102-102: Do not use bare except

(E722)


105-105: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


107-109: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


177-177: f-string without any placeholders

Remove extraneous f prefix

(F541)


195-195: f-string without any placeholders

Remove extraneous f prefix

(F541)


226-226: Loop control variable k not used within loop body

(B007)


261-261: Loop control variable k not used within loop body

(B007)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
lmms_eval/tasks/video-tt/utils.py (2)

1-15: Remove unused imports to clean up the code.

Many imports are not used in this file and should be removed to improve code clarity and reduce dependencies.

-import datetime
-import json
 import os
 import re
 import sys
-from collections import defaultdict
 from pathlib import Path
-from typing import Dict, List, Optional, Union

-import cv2
-import numpy as np
 import yaml
 from loguru import logger as eval_logger

-from lmms_eval.tasks._task_utils.file_utils import generate_submission_file

102-107: Fix potential undefined variable and improve error handling.

The variable audio_path is only defined within the else block but is used in the try block regardless. Also, the bare except clause should be more specific.

     if not AUDIO_PATH:
         eval_logger.warning("AUDIO_PATH environment variable not set, skipping audio subtitles")
         subtitle = ""
     else:
         audio_path = os.path.join(AUDIO_PATH, f'{doc["video_id"]}.txt')
-    try:
-        with open(audio_path) as f:
-            subtitle = f.read()
-    except:
-        subtitle = ""
+        try:
+            with open(audio_path) as f:
+                subtitle = f.read()
+        except (FileNotFoundError, IOError, OSError):
+            subtitle = ""
🧹 Nitpick comments (9)
lmms_eval/tasks/video-tt/utils.py (9)

39-48: Remove commented-out code.

This commented-out code block should be removed to keep the codebase clean, as it's replaced by the active implementation below.

-# with open(Path(__file__).parent / "_default_template_yaml", "r") as f:
-#     raw_data = f.readlines()
-#     safe_data = []
-#     for i, line in enumerate(raw_data):
-#         # remove function definition since yaml load cannot handle it
-#         if "!function" not in line:
-#             safe_data.append(line)
-
-#     config = yaml.safe_load("".join(safe_data))

56-56: Replace unused loop variable with underscore.

The loop variable i is not used within the loop body and should be replaced with _ to indicate it's intentionally unused.

-    for i, line in enumerate(raw_data):
+    for _, line in enumerate(raw_data):

86-86: Simplify conditional expressions using dict.get().

Replace the conditional blocks with more concise dict.get() calls for better readability.

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
-    pre_promt = (
-        lmms_eval_specific_kwargs["pre_prompt"]
-        if "pre_prompt" in lmms_eval_specific_kwargs
-        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
-    )
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
+    pre_promt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

Also applies to: 88-90


87-87: Fix typo in variable name.

The variable name pre_promt should be pre_prompt for consistency and correctness.

-    pre_promt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")
+    pre_prompt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

And update the usage:

-    full_prompt = pre_promt + "\n" + question + "\n" + post_prompt
+    full_prompt = pre_prompt + "\n" + question + "\n" + post_prompt

109-113: Fix typo and simplify conditional expressions.

Same issues as in the previous function - typo in variable name and can be simplified with dict.get().

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
-    pre_promt = (
-        lmms_eval_specific_kwargs["pre_prompt"]
-        if "pre_prompt" in lmms_eval_specific_kwargs
-        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
-    )
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
+    pre_prompt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

And update the usage:

-    full_prompt = subtitles_prompt + subtitle + "\n" + pre_promt + "\n" + question + "\n" + post_prompt
+    full_prompt = subtitles_prompt + subtitle + "\n" + pre_prompt + "\n" + question + "\n" + post_prompt

119-130: Remove commented-out code and documentation.

This commented-out code appears to be documentation or examples that should be removed to keep the code clean.

-# Frames + Subs
-# This video's subtitles are listed below:
-# 【subtitles】
-
-# Select the best answer to the following multiple-choice question based on the video and the subtitles. Respond with only the letter (A, B, C, or D) of the correct option.
-# 【question】
-# The best answer is:
-# Frames / Frames + Audio
-# Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
-# 【question】
-# The best answer is:

156-162: Remove unused global variables and commented code.

The matrices list is empty and unused, and the commented loop should be removed.

-matrices = []
-
-# for i in VIDEO_TYPE:
-#     for j in CATEGORIES:
-#         for k in SUB_CATEGORIES:
-#             for l in TASK_CATEGORIES:
-#                 matrices.append(f"{i}_{j}_{k}_{l}")

181-181: Remove unnecessary f-string prefix.

The f-strings don't contain any placeholders and should be regular strings.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}

Apply the same fix to line 199:

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}

Also applies to: 199-199


230-230: Replace unused loop variable with underscore.

The loop variable k is not used within the loop body and should be replaced with _ to indicate it's intentionally unused.

-    for k, v in category2score.items():
+    for _, v in category2score.items():

Apply the same fix to line 265:

-    for k, v in category2score.items():
+    for _, v in category2score.items():

Also applies to: 265-265

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a256442 and 9819e70.

📒 Files selected for processing (1)
  • lmms_eval/tasks/video-tt/utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
lmms_eval/tasks/video-tt/utils.py

1-1: datetime imported but unused

Remove unused import: datetime

(F401)


2-2: json imported but unused

Remove unused import: json

(F401)


6-6: collections.defaultdict imported but unused

Remove unused import: collections.defaultdict

(F401)


8-8: typing.Dict imported but unused

Remove unused import

(F401)


8-8: typing.List imported but unused

Remove unused import

(F401)


8-8: typing.Optional imported but unused

Remove unused import

(F401)


8-8: typing.Union imported but unused

Remove unused import

(F401)


10-10: cv2 imported but unused

Remove unused import: cv2

(F401)


11-11: numpy imported but unused

Remove unused import: numpy

(F401)


15-15: lmms_eval.tasks._task_utils.file_utils.generate_submission_file imported but unused

Remove unused import: lmms_eval.tasks._task_utils.file_utils.generate_submission_file

(F401)


56-56: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


86-86: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


88-90: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


106-106: Do not use bare except

(E722)


109-109: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


111-113: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


181-181: f-string without any placeholders

Remove extraneous f prefix

(F541)


199-199: f-string without any placeholders

Remove extraneous f prefix

(F541)


230-230: Loop control variable k not used within loop body

(B007)


265-265: Loop control variable k not used within loop body

(B007)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (5)
lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml (1)

3-4: Fix task id typo – prevents task registration

task: is set to videott_wrongly_led_oe, which mismatches both the filename and the intended dataset split. At runtime this causes the loader to register an incorrect task id (or fail if the split list is keyed by the correct id).

-task: videott_wrongly_led_oe
+task: videott_wrong_leading_oe
lmms_eval/tasks/video-tt/videott_all.yaml (1)

26-29: Duplicate: gpt4v prompt forces MC answer on open-ended task

Same observation made in earlier review – prompt should not request A/B/C/D.
Refer to prior comment.

lmms_eval/tasks/video-tt/gpt_utils.py (1)

60-60: Fix inconsistent data type in example response.

The example shows a float score (4.8) but the instruction clearly specifies INTEGER. This inconsistency could confuse the GPT model during evaluation.

-        "For example, your response should look like this: {'pred': 'yes', 'score': 4.8}."
+        "For example, your response should look like this: {'pred': 'yes', 'score': 4}."
lmms_eval/tasks/video-tt/utils.py (1)

98-98: Replace bare except with specific exception handling.

Using bare except is discouraged as it can mask important errors. Please specify the expected exception types.

-    except:
+    except (FileNotFoundError, IOError, OSError):
lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (1)

3-4: Fix task name/filename mismatch.

The task ID videott_correctly_led_oe doesn't match the filename videott_correct_leading_oe.yaml. This inconsistency could cause issues with automatic task discovery in the evaluation framework.

-task: videott_correctly_led_oe
+task: videott_correct_leading_oe
🧹 Nitpick comments (10)
lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml (1)

32-33: Remove trailing spaces to satisfy YAML-lint

YAML-lint is flagging the blank comment line (32) for trailing spaces. Strip the whitespace to keep CI green.

-  # qwen_vl:··
+  # qwen_vl:
lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml (1)

32-33: Clean trailing whitespace

Line 32 has trailing spaces that violate the repo’s lint rules.

-  # qwen_vl:··
+  # qwen_vl:
lmms_eval/tasks/video-tt/videott_all_audio.yaml (1)

32-33: Trim trailing whitespace

Line 32 is flagged by YAML-lint; remove the spaces.

lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml (1)

32-33: YAML-lint trailing spaces

Remove spaces on the comment line 32.

lmms_eval/tasks/video-tt/videott_all.yaml (1)

32-33: Strip trailing whitespace

Line 32 contains stray spaces, breaking lint.

lmms_eval/tasks/video-tt/gpt_utils.py (1)

211-211: Remove unnecessary f-string prefix.

The f-string doesn't contain any placeholders, making the f prefix unnecessary.

-    return {f"accuracy": pred == doc["answer"]}
+    return {"accuracy": pred == doc["answer"]}
lmms_eval/tasks/video-tt/utils.py (3)

173-173: Remove unnecessary f-string prefix.

The f-string doesn't contain any placeholders, making the f prefix unnecessary.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}

191-191: Remove unnecessary f-string prefix.

The f-string doesn't contain any placeholders, making the f prefix unnecessary.

-    return {f"videott_perception_score": data_dict}
+    return {"videott_perception_score": data_dict}

78-84: Consider using dict.get() for cleaner code.

The conditional logic can be simplified using the get() method.

-    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
+    post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")
-    pre_promt = (
-        lmms_eval_specific_kwargs["pre_prompt"]
-        if "pre_prompt" in lmms_eval_specific_kwargs
-        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
-    )
+    pre_promt = lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")
lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (1)

32-32: Remove trailing spaces.

Static analysis detected trailing spaces which should be removed for clean formatting.

-  # qwen_vl:  
+  # qwen_vl:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9819e70 and cd87f61.

📒 Files selected for processing (8)
  • lmms_eval/tasks/video-tt/gpt_utils.py (1 hunks)
  • lmms_eval/tasks/video-tt/utils.py (1 hunks)
  • lmms_eval/tasks/video-tt/videott_all.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_all_audio.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
lmms_eval/tasks/video-tt/gpt_utils.py

211-211: f-string without any placeholders

Remove extraneous f prefix

(F541)

lmms_eval/tasks/video-tt/utils.py

48-48: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


78-78: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


80-82: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


98-98: Do not use bare except

(E722)


101-101: Use lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:") instead of an if block

Replace with lmms_eval_specific_kwargs.get("post_prompt", "The best answer is:")

(SIM401)


103-105: Use lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.") instead of an if block

Replace with lmms_eval_specific_kwargs.get("pre_prompt", "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.")

(SIM401)


173-173: f-string without any placeholders

Remove extraneous f prefix

(F541)


191-191: f-string without any placeholders

Remove extraneous f prefix

(F541)


222-222: Loop control variable k not used within loop body

(B007)


257-257: Loop control variable k not used within loop body

(B007)

🪛 YAMLlint (1.37.1)
lmms_eval/tasks/video-tt/videott_all.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_all_audio.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

🔇 Additional comments (6)
lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml (1)

4-4: Verify split-name consistency

test_split: test_paraphrased_oe differs from the task id suffix paraphrase.
If the dataset file is actually called test_paraphrase_oe, evaluation will crash.
Please double-check the dataset naming or adjust one side accordingly.
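
A quick one-off check along these lines can confirm the split before running the benchmark (a minimal sketch; the repo id below is a placeholder for the dataset_path configured in _default_template.yaml):

from datasets import get_dataset_split_names

# Placeholder repo id -- substitute the dataset_path from _default_template.yaml.
splits = get_dataset_split_names("your-org/video-tt")
expected = "test_paraphrased_oe"
if expected not in splits:
    raise SystemExit(f"Split {expected!r} not found; available splits: {splits}")
print(f"OK: {expected!r} is a valid split.")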

lmms_eval/tasks/video-tt/videott_all_audio.yaml (1)

38-40: Same MC prompt issue for xcomposer2_4khd

The specialised prompt still forces A/B/C/D answers. Align it with the open-ended nature or switch this YAML to a multiple-choice metric.

lmms_eval/tasks/video-tt/gpt_utils.py (1)

127-151: LGTM - GPT evaluation properly implemented.

The function now correctly implements GPT-based evaluation with proper error handling and fallback values. The try-except structure ensures robust operation even when API calls fail.
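
For reference, the fallback pattern described here looks roughly like the sketch below (assuming an OpenAI-style chat-completions endpoint; the function name, payload fields, and neutral fallback value are illustrative, not the actual gpt_utils code):

import os

import requests


def get_eval_sketch(question: str, answer: str, pred: str, max_tokens: int = 64) -> str:
    # Build a minimal chat-completions payload; model name and fields are placeholders.
    payload = {
        "model": os.getenv("GPT_EVAL_MODEL_NAME", "gpt-4o-mini"),
        "messages": [{"role": "user", "content": f"Q: {question}\nGT: {answer}\nPred: {pred}"}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY', '')}"}
    try:
        resp = requests.post("https://api.openai.com/v1/chat/completions", json=payload, headers=headers, timeout=60)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
    except (requests.RequestException, KeyError, ValueError):
        # A single failed API call falls back to a neutral judgement instead of aborting the run.
        return "{'pred': 'no', 'score': 0}"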

lmms_eval/tasks/video-tt/utils.py (1)

194-227: LGTM - Robust aggregation logic implemented.

The aggregation function properly handles category-based scoring with comprehensive logging and error handling for division by zero cases.
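
As an illustration of that pattern (a hedged sketch, not the file's actual code; the category keys and result fields are assumptions):

from collections import defaultdict


def aggregate_sketch(results):
    # Accumulate per-category counts, then average with a guard against empty splits.
    category2score = defaultdict(lambda: {"correct": 0, "answered": 0})
    for result in results:
        cat = result.get("category", "overall")
        category2score[cat]["answered"] += 1
        category2score[cat]["correct"] += int(result["score"])
    total_correct = sum(v["correct"] for v in category2score.values())
    total_answered = sum(v["answered"] for v in category2score.values())
    return 100.0 * total_correct / total_answered if total_answered else 0.0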

lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (2)

16-16: LGTM - Proper function reference.

The function reference gpt_utils.gpt_score_process correctly points to the GPT evaluation function implemented in the gpt_utils module.


9-21: Well-configured evaluation parameters.

The generation parameters and metric configuration are properly set up for open-ended video evaluation with appropriate token limits and aggregation functions.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml (1)

3-4: Task id & split name still mis-spelled – task will not load

The task: and test_split: values must exactly match the dataset identifiers as well as stay consistent with the filename.
They still read videott_wrongly_led_oe / test_wrongly_led_oe, which will raise a registration error at runtime.

-task: videott_wrongly_led_oe
-test_split: test_wrongly_led_oe
+task: videott_wrong_leading_oe
+test_split: test_wrong_leading_oe
lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (1)

3-4: Filename ↔ task:/test_split mismatch will break task discovery

task: videott_correctly_led_oe and test_split: test_correctly_led_oe don’t match the filename slug videott_correct_leading_oe.
lmms-eval’s loader expects a 1-to-1 mapping; the current mismatch will cause the benchmark to be skipped or duplicated under an unexpected id.

Fix by making them identical to the filename (or renaming the file). Suggested patch:

-task: videott_correctly_led_oe
-test_split: test_correctly_led_oe
+task: videott_correct_leading_oe
+test_split: test_correct_leading_oe
🧹 Nitpick comments (4)
lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml (1)

32-32: Remove trailing whitespace to satisfy YAML-lint

-  # qwen_vl:··
+  # qwen_vl:
lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml (1)

32-32: Strip trailing spaces to satisfy YAML-lint

Line 32 has superfluous whitespace that triggers the linter.

-  # qwen_vl:  
+  # qwen_vl:
lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml (1)

10-14: Drop redundant sampling-related kwargs

temperature: 0 and do_sample: false already disable stochastic sampling, so top_p is ignored.
Cleaning it up avoids confusion and makes intent explicit.

   max_new_tokens: 50
   temperature: 0
-  top_p: 1.0
   num_beams: 1
   do_sample: false
lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (1)

32-37: YAML-lint trailing-space error

Static analysis flags trailing spaces on the commented block (line 32). Although harmless at runtime, it will fail the repo’s pre-commit hooks.

-  # qwen_vl:  
+  # qwen_vl:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd87f61 and 96032de.

📒 Files selected for processing (4)
  • lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml (1 hunks)
  • lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_no_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

lmms_eval/tasks/video-tt/videott_wrong_leading_oe.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

🔇 Additional comments (2)
lmms_eval/tasks/video-tt/videott_paraphrase_oe.yaml (1)

4-4: Double-check split identifier test_paraphrased_oe matches the dataset registry
Other files in the series use the “paraphrase” spelling. Please confirm the split name exactly matches what’s exported by the dataset loader; otherwise the task will raise a KeyError at runtime.

lmms_eval/tasks/video-tt/videott_correct_leading_oe.yaml (1)

16-16: Good catch on the typo correction

process_results now points to gpt_utils.gpt_score_process, which aligns with the actual helper function name—no further action needed here.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
cursor[bot]

This comment was marked as outdated.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

API_TYPE = os.getenv("API_TYPE", "openai")

NEXSTONE_HTTP_CHAT_URL = "https://maas.byteintl.net/service/api/v1/chat/completions"
Contributor

@dongyh20 hi, I think we should remove the private URL here?

@cursor cursor bot left a comment

Bug: Variable Definition Error and Typo

Two issues are present:

  1. NameError for audio_path: In the videott_doc_to_text_audio function, the audio_path variable is conditionally defined only when the AUDIO_PATH environment variable is set. However, it is unconditionally used in a subsequent try block. If AUDIO_PATH is not set, audio_path remains undefined, leading to a NameError.
  2. Typo in pre_promt: The variable pre_promt (used in videott_doc_to_text and videott_doc_to_text_audio functions) contains a typo and should be pre_prompt. While functionally consistent, this affects code readability and maintainability.

lmms_eval/tasks/video-tt/utils.py#L78-L107

    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
    pre_promt = (
        lmms_eval_specific_kwargs["pre_prompt"]
        if "pre_prompt" in lmms_eval_specific_kwargs
        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
    )
    full_prompt = pre_promt + "\n" + question + "\n" + post_prompt
    return full_prompt


def videott_doc_to_text_audio(doc, lmms_eval_specific_kwargs=None):
    subtitles_prompt = "This video's subtitles are listed below: \n"
    if not AUDIO_PATH:
        eval_logger.warning("AUDIO_PATH environment variable not set, skipping audio subtitles")
        subtitle = ""
    else:
        audio_path = os.path.join(AUDIO_PATH, f'{doc["video_id"]}.txt')
    try:
        with open(audio_path) as f:
            subtitle = f.read()
    except:
        subtitle = ""
    question = doc["question"] + "\n" + doc["question_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
    pre_promt = (
        lmms_eval_specific_kwargs["pre_prompt"]
        if "pre_prompt" in lmms_eval_specific_kwargs
        else "Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
    )
    full_prompt = subtitles_prompt + subtitle + "\n" + pre_promt + "\n" + question + "\n" + post_prompt



Bug: Score Type Mismatch in GPT Prompt

The GPT evaluation prompt contains contradictory instructions regarding the score data type. It explicitly states that the score should be an integer, but the provided example demonstrates a float value (e.g., 4.8). This inconsistency can confuse the model, leading to inconsistent output formats and potential parsing issues, especially since the parse_score function expects and converts the score to a float.

lmms_eval/tasks/video-tt/gpt_utils.py#L55-L60

f"Correct Answer: {answer}\n"
f"Predicted Answer: {pred}\n\n"
"Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match. "
"Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where value of 'pred' is a string of 'yes' or 'no' and value of 'score' is in INTEGER, not STRING."
"DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
"For example, your response should look like this: {'pred': 'yes', 'score': 4.8}."




Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d84ef0 and f96e2ed.

📒 Files selected for processing (1)
  • lmms_eval/tasks/video-tt/videott_all_audio.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/video-tt/videott_all_audio.yaml

[error] 32-32: trailing spaces

(trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cursor BugBot
🔇 Additional comments (2)
lmms_eval/tasks/video-tt/videott_all_audio.yaml (2)

32-34: Trailing whitespace – fails YAML linting

YAMLlint flags line 32 for trailing spaces, which can break strict CI checks.

-  # qwen_vl:··
+  # qwen_vl:



6-8: Verify helper function names exist

doc_to_visual and doc_to_text point to utils.videott_doc_to_visual and utils.videott_doc_to_text_audio.
If these were renamed or live in a sub-package (e.g. lmms_eval.tasks.video_tt.utils), YAML loading will raise ImportError.

Please run a quick search to ensure both functions are defined and exported.
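
One way to do that mechanically (a sketch, run from the repo root with the project's dependencies installed; the module is loaded by file path because the video-tt directory name contains a hyphen and cannot be imported with a dotted name):

import importlib.util

spec = importlib.util.spec_from_file_location("videott_utils", "lmms_eval/tasks/video-tt/utils.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

for name in ("videott_doc_to_visual", "videott_doc_to_text_audio"):
    assert callable(getattr(module, name, None)), f"{name} is missing from utils.py"
print("Both helpers are defined.")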

Comment on lines +38 to +40
  xcomposer2_4khd:
    pre_prompt: "[UNUSED_TOKEN_146]user\n"
    post_prompt: " Answer this question with A, B, C, or D.[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
Contributor

⚠️ Potential issue

xcomposer2_4khd prompt contradicts open-ended task – will mis-score answers

The post-prompt forces the model to reply with A/B/C/D, but this YAML is configured for videott_open_ended_score, i.e. free-form answers judged by GPT.
If left as-is, the model is likely to emit a single letter that the scorer will treat as a full answer, producing meaningless scores.

-  xcomposer2_4khd:
-    pre_prompt: "[UNUSED_TOKEN_146]user\n"
-    post_prompt: " Answer this question with A, B, C, or D.[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
+  xcomposer2_4khd:
+    pre_prompt: "[UNUSED_TOKEN_146]user\n"
+    # Keep the special tokens but drop the multiple-choice instruction.
+    post_prompt: "[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"

Either remove the MC instruction (as above) or move this model to a dedicated multiple-choice variant of the task.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  xcomposer2_4khd:
-    pre_prompt: "[UNUSED_TOKEN_146]user\n"
-    post_prompt: " Answer this question with A, B, C, or D.[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
+  xcomposer2_4khd:
+    pre_prompt: "[UNUSED_TOKEN_146]user\n"
+    # Keep the special tokens but drop the multiple-choice instruction.
+    post_prompt: "[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
🤖 Prompt for AI Agents
In lmms_eval/tasks/video-tt/videott_all_audio.yaml around lines 38 to 40, the
post_prompt for xcomposer2_4khd incorrectly instructs the model to answer with
A, B, C, or D, which conflicts with the open-ended scoring setup. To fix this,
remove the multiple-choice instruction from the post_prompt so the model can
generate free-form answers compatible with the videott_open_ended_score
evaluation.

@Luodian Luodian merged commit b9de191 into EvolvingLMMs-Lab:main Jul 9, 2025
3 checks passed