@hanoonaR commented Jun 5, 2025

This pull request adds the VideoMathQA task to lmms-eval.

VideoMathQA Overview

VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities (visuals, audio, and text) across time. The benchmark tackles the needle-in-a-multimodal-haystack problem, where key information is sparse and scattered across different modalities and moments in the video.

Evaluation

VideoMathQA supports the following evaluation strategies to comprehensively assess model performance:

  1. MCQ and Multi-Binary (MBin)

    • Tasks with `mcq` in the name use a 5-way multiple-choice format.
    • Tasks with `mbin` use a stricter pairwise binary format, where the correct answer is judged against each distractor in turn (see the scoring sketch after this list).
    • Both formats are available with and without subtitles, indicated by `_w_subtitles` in the task name.
  2. Direct Answering vs. Chain-of-Thought (CoT)

    • Each task can be evaluated under Direct or CoT prompting.
    • Tasks containing `_cot` use CoT prompting, where the model generates its reasoning before the final answer.
    • Direct-answering tasks expect the final answer only, without intermediate reasoning.
    • CoT tasks require post-processing to extract the final answer (see Post Processing, and the extraction sketch at the end of this section).
    • We maintain separate leaderboards for the Direct and CoT settings.
  3. Step-wise CoT Evaluation

    • For CoT tasks, we additionally evaluate the quality of the generated reasoning.
    • Each response is scored by comparing it against the annotated solution steps (typically 4–10 steps per question).
    • Scoring is done by a small open-source judge model (Qwen-3-4B in thinking mode), which returns a score from 0–10 and a rationale; a judging sketch follows this list.
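
For concreteness, the sketch below shows one way the multi-binary aggregation can be computed once each question has been expanded into pairwise (correct vs. distractor) trials. The record layout and function name are illustrative assumptions, not the benchmark's actual implementation.

```python
from collections import defaultdict

def multi_binary_accuracy(pairwise_results):
    """Aggregate pairwise (correct vs. distractor) trials into MBin accuracy.

    `pairwise_results` is a list of dicts, one per binary trial, e.g.
    {"question_id": "q1", "model_chose_correct": True}.
    A question counts as solved only if the model prefers the correct
    answer over *every* distractor.
    """
    per_question = defaultdict(list)
    for trial in pairwise_results:
        per_question[trial["question_id"]].append(trial["model_chose_correct"])

    solved = sum(1 for outcomes in per_question.values() if all(outcomes))
    return solved / len(per_question) if per_question else 0.0

# Example: a 5-way MCQ becomes 4 binary trials per question.
trials = [
    {"question_id": "q1", "model_chose_correct": True},
    {"question_id": "q1", "model_chose_correct": True},
    {"question_id": "q1", "model_chose_correct": True},
    {"question_id": "q1", "model_chose_correct": False},  # one miss fails q1
    {"question_id": "q2", "model_chose_correct": True},
    {"question_id": "q2", "model_chose_correct": True},
    {"question_id": "q2", "model_chose_correct": True},
    {"question_id": "q2", "model_chose_correct": True},
]
print(multi_binary_accuracy(trials))  # 0.5: q2 solved, q1 not
```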
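
Similarly, the following sketch illustrates the step-wise CoT judging flow: the annotated solution steps and the model's reasoning are sent to a local judge model, and a 0–10 score is parsed from its reply. The prompt wording and the score-parsing format are assumptions made for illustration, not the benchmark's exact judge setup; thinking mode is the default behaviour of the Qwen3 chat template.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative judge setup; the benchmark's actual prompt may differ.
JUDGE_ID = "Qwen/Qwen3-4B"

tokenizer = AutoTokenizer.from_pretrained(JUDGE_ID)
model = AutoModelForCausalLM.from_pretrained(JUDGE_ID, device_map="auto")

def judge_cot(reference_steps, model_response):
    """Score a CoT response (0-10) against annotated solution steps."""
    prompt = (
        "Compare the candidate reasoning against the reference solution "
        "steps and rate its quality from 0 to 10. Reply with a line "
        "'Score: <n>' followed by a short rationale.\n\n"
        "Reference steps:\n" + "\n".join(reference_steps) +
        "\n\nCandidate reasoning:\n" + model_response
    )
    messages = [{"role": "user", "content": prompt}]
    # Qwen3 chat templates enable thinking mode by default.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=512)
    reply = tokenizer.decode(
        output_ids[0][input_ids.shape[1]:], skip_special_tokens=True
    )

    match = re.search(r"Score:\s*(\d{1,2})", reply)
    return (int(match.group(1)) if match else None), reply
```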

Run Evaluation

As a reference, we provide a sample command to run the evaluation with the Qwen2.5-VL model:

```bash
accelerate launch --num_processes=8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=151200,min_pixels=100352,use_flash_attention_2=True,device_map=auto \
    --tasks videomathqa_mbin \
    --batch_size 1 --log_samples --log_samples_suffix qwen_2_5_vl \
    --output_path output
```

This command evaluates the Qwen2.5-VL-7B-Instruct model on VideoMathQA for multi-binary accuracy. The available VideoMathQA tasks are:

  1. videomathqa_mcq
  2. videomathqa_mcq_w_subtitles
  3. videomathqa_mcq_cot
  4. videomathqa_mcq_cot_w_subtitles
  5. videomathqa_mbin
  6. videomathqa_mbin_w_subtitles
  7. videomathqa_mbin_cot
  8. videomathqa_mbin_cot_w_subtitles

`_w_subtitles` tasks additionally use subtitles during evaluation. `_cot` tasks prompt the model to think step by step before answering the question.
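
Since the `_cot` tasks return free-form reasoning, the final choice has to be extracted before scoring. Below is a minimal extraction sketch, assuming the prompt asks the model to end with a line such as `Answer: C`; the task's actual post-processing may differ.

```python
import re

def extract_final_answer(response):
    """Pull the final MCQ choice (A-E) out of a chain-of-thought response.

    Assumes the model was asked to finish with a line like "Answer: C";
    falls back to the last standalone option letter in the text.
    """
    match = re.search(r"Answer:\s*\(?([A-E])\)?", response, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None

print(extract_final_answer("The slope is 2, so ... Answer: (B)"))  # B
```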

@Luodian merged commit d438332 into EvolvingLMMs-Lab:main on Jun 8, 2025.