Adds VideoMathQA - Task Designed to Evaluate Mathematical Reasoning in Real-World Educational Videos #702
This pull request adds the VideoMathQA task to lmms-eval.
VideoMathQA Overview
VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities (visuals, audio, and text) across time. The benchmark tackles the needle-in-a-multimodal-haystack problem, where key information is sparse and spread across different modalities and moments in the video.
Resources
Evaluation
VideoMathQA supports the following evaluation strategies to comprehensively assess model performance:
MCQ and Multi-Binary (MBin)
Tasks with mcq in the name use a 5-way multiple-choice format.
Tasks with mbin in the name use a stricter binary-pairwise evaluation format (correct vs each distractor).
Tasks that use subtitles have _w_subtitles in the task name.
Direct Answering vs. Chain-of-Thought (CoT)
Tasks with _cot in the name use CoT prompting, where models generate reasoning before the final answer.
Step-wise CoT Evaluation
Run Evaluation
As a reference, we provide a sample command that runs the evaluation with the Qwen2.5-VL model.
This command starts evaluating the Qwen2.5-VL-3B model on VideoMathQA for multi-binary accuracy. The other available VideoMathQA tasks are:
w_subtitles tasks additionally use subtitles during evaluation.
cot tasks prompt the model to think step-by-step before answering the question.
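To illustrate why the multi-binary format described under "MCQ and Multi-Binary (MBin)" is stricter than 5-way MCQ, here is a minimal Python sketch. It is not the scoring code from this pull request. The function names are hypothetical, and the aggregation rule (a question counts as correct only when the model prefers the true answer over every distractor) is an assumption based on the "correct vs each distractor" description.

```python
from typing import Sequence


def mcq_correct(predicted_index: int, answer_index: int) -> bool:
    """5-way MCQ: a single prediction over all options at once."""
    return predicted_index == answer_index


def mbin_correct(pairwise_wins: Sequence[bool]) -> bool:
    """Multi-binary: one binary comparison per distractor.

    pairwise_wins[i] is True when the model chose the correct option over
    distractor i. ASSUMPTION: the question is scored as correct only if
    every pairwise comparison is won.
    """
    return len(pairwise_wins) > 0 and all(pairwise_wins)


def accuracy(flags: Sequence[bool]) -> float:
    """Fraction of questions scored as correct (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


# A 5-way question has four distractors, so MBin yields four comparisons.
all_pairs_won = mbin_correct([True, True, True, True])
one_pair_lost = mbin_correct([True, True, False, True])
```

Under this assumption, a model that loses even one of the four pairwise comparisons gets no credit for the question, which is what makes the format harder to pass by guessing than a single 5-way choice.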