@hanoonaR commented Jun 5, 2025

This pull request adds the VideoMathQA task to lmms-eval.

VideoMathQA Overview

VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities (visuals, audio, and text) across time. The benchmark tackles the needle-in-a-multimodal-haystack problem, where key information is sparse and scattered across different modalities and moments in the video.

Evaluation

VideoMathQA supports the following evaluation strategies to comprehensively assess model performance:

  1. MCQ and Multi-Binary (MBin)

    • Tasks with `mcq` in the name use a 5-way multiple-choice format.
    • Tasks with `mbin` use a stricter pairwise binary format, where the correct answer is judged against each distractor in turn (see the scoring sketch after this list).
    • Both formats are available with and without subtitles, indicated by `_w_subtitles` in the task name.
  2. Direct Answering vs. Chain-of-Thought (CoT)

    • Each task can be evaluated under Direct or CoT prompting.
    • Tasks containing `_cot` use CoT prompting, where the model generates its reasoning before the final answer.
    • Direct-answering tasks expect the final answer only, without intermediate reasoning.
    • CoT tasks require post-processing to extract the final answer (see Post Processing, and the extraction sketch at the end of this section).
    • We maintain separate leaderboards for the Direct and CoT settings.
  3. Step-wise CoT Evaluation

    • For CoT tasks, we additionally evaluate the quality of the generated reasoning.
    • Each response is scored by comparing it against the annotated solution steps (typically 4–10 steps per question).
    • Scoring is done by a small open-source judge model (Qwen-3-4B in thinking mode), which returns a score from 0–10 and a rationale; a judging sketch follows this list.
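
For concreteness, the sketch below shows one way the multi-binary aggregation can be computed once each question has been expanded into pairwise (correct vs. distractor) trials. The record layout and function name are illustrative assumptions, not the benchmark's actual implementation.

```python
from collections import defaultdict

def multi_binary_accuracy(pairwise_results):
    """Aggregate pairwise (correct vs. distractor) trials into MBin accuracy.

    `pairwise_results` is a list of dicts, one per binary trial, e.g.
    {"question_id": "q1", "model_chose_correct": True}.
    A question counts as solved only if the model prefers the correct
    answer over *every* distractor.
    """
    per_question = defaultdict(list)
    for trial in pairwise_results:
        per_question[trial["question_id"]].append(trial["model_chose_correct"])

    solved = sum(1 for outcomes in per_question.values() if all(outcomes))
    return solved / len(per_question) if per_question else 0.0

# Example: a 5-way MCQ becomes 4 binary trials per question.
trials = [
    {"question_id": "q1", "model_chose_correct": True},
    {"question_id": "q1", "model_chose_correct": True},
    {"question_id": "q1", "model_chose_correct": True},
    {"question_id": "q1", "model_chose_correct": False},  # one miss fails q1
    {"question_id": "q2", "model_chose_correct": True},
    {"question_id": "q2", "model_chose_correct": True},
    {"question_id": "q2", "model_chose_correct": True},
    {"question_id": "q2", "model_chose_correct": True},
]
print(multi_binary_accuracy(trials))  # 0.5: q2 solved, q1 not
```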
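
Similarly, the following sketch illustrates the step-wise CoT judging flow: the annotated solution steps and the model's reasoning are sent to a local judge model, and a 0–10 score is parsed from its reply. The prompt wording and the score-parsing format are assumptions made for illustration, not the benchmark's exact judge setup; thinking mode is the default behaviour of the Qwen3 chat template.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative judge setup; the benchmark's actual prompt may differ.
JUDGE_ID = "Qwen/Qwen3-4B"

tokenizer = AutoTokenizer.from_pretrained(JUDGE_ID)
model = AutoModelForCausalLM.from_pretrained(JUDGE_ID, device_map="auto")

def judge_cot(reference_steps, model_response):
    """Score a CoT response (0-10) against annotated solution steps."""
    prompt = (
        "Compare the candidate reasoning against the reference solution "
        "steps and rate its quality from 0 to 10. Reply with a line "
        "'Score: <n>' followed by a short rationale.\n\n"
        "Reference steps:\n" + "\n".join(reference_steps) +
        "\n\nCandidate reasoning:\n" + model_response
    )
    messages = [{"role": "user", "content": prompt}]
    # Qwen3 chat templates enable thinking mode by default.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=512)
    reply = tokenizer.decode(
        output_ids[0][input_ids.shape[1]:], skip_special_tokens=True
    )

    match = re.search(r"Score:\s*(\d{1,2})", reply)
    return (int(match.group(1)) if match else None), reply
```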

Run Evaluation

As a reference, we provide a sample command to run the evaluation with the Qwen2.5-VL model:

```bash
accelerate launch --num_processes=8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=151200,min_pixels=100352,use_flash_attention_2=True,device_map=auto \
    --tasks videomathqa_mbin \
    --batch_size 1 --log_samples --log_samples_suffix qwen_2_5_vl \
    --output_path output
```

This command evaluates the Qwen2.5-VL-7B-Instruct model on VideoMathQA for multi-binary accuracy. The available VideoMathQA tasks are:

  1. videomathqa_mcq
  2. videomathqa_mcq_w_subtitles
  3. videomathqa_mcq_cot
  4. videomathqa_mcq_cot_w_subtitles
  5. videomathqa_mbin
  6. videomathqa_mbin_w_subtitles
  7. videomathqa_mbin_cot
  8. videomathqa_mbin_cot_w_subtitles

`_w_subtitles` tasks additionally use subtitles during evaluation. `_cot` tasks prompt the model to think step by step before answering the question.
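
Since the `_cot` tasks return free-form reasoning, the final choice has to be extracted before scoring. Below is a minimal extraction sketch, assuming the prompt asks the model to end with a line such as `Answer: C`; the task's actual post-processing may differ.

```python
import re

def extract_final_answer(response):
    """Pull the final MCQ choice (A-E) out of a chain-of-thought response.

    Assumes the model was asked to finish with a line like "Answer: C";
    falls back to the last standalone option letter in the text.
    """
    match = re.search(r"Answer:\s*\(?([A-E])\)?", response, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None

print(extract_final_answer("The slope is 2, so ... Answer: (B)"))  # B
```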

@Luodian merged commit d438332 into EvolvingLMMs-Lab:main on Jun 8, 2025.