Releases: EvolvingLMMs-Lab/lmms-eval
v0.5 Better Coverage of Audio Evaluations and Alignment Checks on STEM/Reasoning Benchmarks
Introduction
Key Highlights:
- Audio-First: Comprehensive audio evaluation with paralinguistic analysis
- Response Caching: Production-ready caching system for faster re-evaluation
- 5 New Models: Including audio-capable GPT-4o, LongViLA, Gemma-3
- 50+ New Benchmark Variants: Audio, vision, coding, and STEM tasks
- MCP Integration: Model Context Protocol client support
Table of Contents
- Introduction
- Major Features
- Usage Examples
- Technical Details
- Migration Guide
- Bug Fixes and Improvements
- Deprecated Features
- Contributing
- Acknowledgments
- Getting Help
Major Features
1. Response Caching System
A production-ready JSONL-based caching system that dramatically speeds up re-evaluation and reduces API costs:
Key Features:
- Per-document caching: Cached at the (task_name, doc_id) level
- Distributed-safe: Separate cache files per rank/world size
- Zero-overhead: Automatic cache hits with no code changes
- Multi-backend: Works with async OpenAI, vLLM, and custom models
Enable Caching:
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="/path/to/cache_root" # optional
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20,base_url=$OPENAI_API_BASE \
--tasks mmmu_val \
--batch_size 1 \
--output_path ./logs/
Cache Location:
- Default: ~/.cache/lmms-eval/eval_cache/<model_hash>/{task_name}_rank{rank}_world_size{world_size}.jsonl
- Each line: {"doc_id": <doc_id>, "response": <string>}
API Integration:
def generate_until(self, requests):
    # Load any previously cached responses for this rank
    self.load_cache()
    # Split requests into cache hits and requests that still need a backend call
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model / backend inference
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results
See full documentation in docs/caching.md.
2. Audio Evaluation Suite
Comprehensive audio understanding capabilities with three major benchmark families:
Step2 Audio Paralinguistic (11 tasks)
Fine-grained paralinguistic feature evaluation:
- Acoustic Features: pitch, rhythm, speed, voice_tone, voice_styles
- Speaker Attributes: age, gender, emotions
- Environmental: scene, event, vocalsound
- Semantic match metrics
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic \
--batch_size 1
VoiceBench (9 main categories, 30+ subtasks)
Comprehensive voice and speech evaluation:
- Instruction Following: ifeval, alpacaeval, advbench
- Reasoning: bbh (Big Bench Hard), commoneval
- Knowledge: mmsu (13 subject areas: biology, chemistry, physics, etc.)
- Q&A: openbookqa
- Accent Diversity: sd-qa (11 regional variants: USA, UK, India, Australia, etc.)
- Expressiveness: wildvoice
- Metrics vary by task type, including accuracy, 1-5 ratings, failure rate, LLM-based evaluation, etc.
# Full VoiceBench
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks voicebench \
--batch_size 1
# Specific accent evaluation
python -m lmms_eval \
--tasks voicebench_sd-qa_ind_n,voicebench_sd-qa_ind_s \
--batch_size 1
WenetSpeech (2 splits)
Large-scale ASR and speech evaluation:
- dev: Development set for validation
- test_meeting: Meeting domain evaluation
- MER (Mixed Error Rate) metrics
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks wenet_speech_dev,wenet_speech_test_meeting \
--batch_size 1
Audio Pipeline Features:
- HuggingFace audio dataset integration
- Unified audio message format (see the example sketch after this list)
- Multiple metric support (Accuracy, WER, GPT-4 Judge)
- Task grouping for multi-subset benchmarks
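The unified audio message format follows OpenAI-style chat messages. The sketch below shows a hypothetical audio request built with the public gpt-4o-audio-preview input schema; lmms-eval's internal message objects may differ:

# Hypothetical audio request in OpenAI-style message format, following the public
# gpt-4o-audio-preview input schema; lmms-eval's internal representation may differ.
import base64

with open("sample.wav", "rb") as f:  # illustrative local audio clip
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the speaker's emotion and speaking style."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }
]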
3. New Model Support
Five new model integrations expanding audio and vision capabilities:
| Model | Type | Key Features | Usage Example |
|---|---|---|---|
| GPT-4o Audio Preview | Audio+Text | Paralinguistic understanding, multi-turn audio | --model async_openai --model_args model_version=gpt-4o-audio-preview-2024-12-17 |
| Gemma-3 | Vision+Text | Enhanced video handling, efficient architecture | --model gemma3 --model_args pretrained=google/gemma-3-2b-vision-it |
| LLaVA-OneVision 1.5 | Vision+Text | Improved vision understanding, latest LLaVA | --model llava_onevision1_5 --model_args pretrained=lmms-lab/llava-onevision-1.5-7b |
| LongViLA-R1 | Video+Text | Long-context video, efficient video processing | --model longvila --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B |
| Thyme | Vision+Text | Reasoning-focused, enhanced image handling | --model thyme --model_args pretrained=thyme-ai/thyme-7b |
Example Usage:
# GPT-4o Audio Preview for audio tasks
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic,voicebench \
--batch_size 1
# LongViLA for video understanding
python -m lmms_eval \
--model longvila \
--model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B \
--tasks videomme,egoschema \
--batch_size 1
4. New Benchmarks
Beyond audio, v0.5 adds diverse vision and reasoning benchmarks, significantly expanding LMMS-Eval's coverage into specialized domains:
Vision & Reasoning Benchmarks
| Benchmark | Variants | Focus | Metrics |
|---|---|---|---|
| CSBench | 3 (MCQ, Assertion, Combined) | Code understanding, debugging | Accuracy |
| SciBench | 4 (Math, Physics, Chemistry, Combined) | College-level STEM | GPT-4 Judge, Accuracy |
| MedQA | 1 | Medical question answering | Accuracy |
| SuperGPQA | 1 | Graduate-level science Q&A | Accuracy |
| Lemonade | 1 | Video action recognition | Accuracy |
| CharXiv | 3 (Descriptive, Reasoning, Combined) | Scientific chart interpretation | Accuracy, GPT-4 Judge |
Example Usage:
# Code understanding
python -m lmms_eval --tasks csbench --batch_size 1
# STEM reasoning
python -m lmms_eval --tasks scibench --batch_size 1
# Chart reasoning
python -m lmms_eval --tasks charxiv --batch_size 1
Reproducibility Validation
We validated our benchmark implementations against official results using two popular language models. The table below compares lmms-eval scores with officially reported results to demonstrate reproducibility:
| Model | Task | lmms-eval | Reported | Δ | Status |
|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct | MedQA | 53.89 | 54.28 | -0.39 | ✓ |
| Qwen-2.5-7B-Instruct | SciBench | 43.86 | 42.97 | +0.89 | ✓ |
| Qwen-2.5-7B-Instruct | CSBench | 69.01 | 69.51 | -0.50 | ✓ |
| Qwen-2.5-7B-Instruct | SuperGPQA | 29.24 | 28.78 | +0.46 | ✓ |
| Llama-3.1-8B | MedQA | 64.49 | 67.01 | -2.52 | ✓ |
| Llama-3.1-8B | SciBench | 15.35 | 10.78 | +4.57 | +- |
| Llama-3.1-8B | CSBench | 62.49 | 57.87 | +4.62 | +- |
| Llama-3.1-8B | SuperGPQA | 21.94 | 19.72 | +2.22 | ✓ |
Status Legend: ✓ = Strong agreement (Δ ≤ 2.5%) | +- = Acceptable variance (2.5% < Δ ≤ 5%)
5. Model Context Protocol (MCP) Integration
Support for MCP-enabled models with tool calling:
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20,mcp_server_path=/path/to/mcp_server.py \
--tasks mmmu_val \
--batch_size 1
Features:
- Tool call parsing and execution
- Multi-step reasoning with tools
- Custom MCP server integration (see the server sketch after this list)
- See examples/chat_templates/tool_call_qwen2_5_vl.jinja for templates
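The mcp_server_path argument points the client at a user-provided MCP server script. A minimal sketch of such a server, assuming the official MCP Python SDK (FastMCP) and a purely illustrative tool:

# Hypothetical /path/to/mcp_server.py, assuming the official MCP Python SDK.
# The calculator tool is purely illustrative; expose whatever tools your tasks need.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lmms-eval-tools")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers and return their sum."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default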
6. Async OpenAI Improvements
Enhanced async API integration:
- Better rate limit handling
- Configurable retry logic with delays (see the sketch after this list)
- Improved error handling
- Batch size optimization for OpenAI-compatible endpoints
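The retry behavior referenced above follows the usual pattern for OpenAI-compatible endpoints. The sketch below is illustrative only, not lmms-eval's internal implementation:

# Illustrative retry-with-delay pattern for an OpenAI-compatible endpoint.
import asyncio
from openai import AsyncOpenAI, RateLimitError

client = AsyncOpenAI()  # reads OPENAI_API_KEY (and optionally OPENAI_BASE_URL) from the environment

async def chat_with_retry(messages, max_retries=5, delay=2.0):
    for attempt in range(max_retries):
        try:
            resp = await client.chat.completions.create(
                model="gpt-4o-2024-11-20",
                messages=messages,
                max_tokens=2048,
            )
            return resp.choices[0].message.content
        except RateLimitError:
            # back off before retrying on rate-limit errors
            await asyncio.sleep(delay * (attempt + 1))
    raise RuntimeError("exceeded max retries")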
Common Args Support:
# Now supports additional parameters
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o,temperature=0.7,top_p=0.95,max_tokens=2048 \
--tasks mmstar
Usage Examples
Audio Evaluation with Caching
# Enable caching for expensive audio API calls
export LMMS_EVAL_USE_CACHE=True
export OPENAI_API_KEY="your-key"
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic,voicebench \
--batch_size 8 \
--output_path ./audio_results/ \
--log_samples
# Second run will use cache - much faster!
Multi-Benchmark Evaluation
# Evaluate across audio, vision, and reasoning tasks
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20 \
--tasks voicebench_mmsu,csbench,scibench_math,charxiv \
--batch_size 4 \
--output_path ./multimodal_results/
Distributed Evaluation with Caching
export LMMS_EVAL_USE_CACHE=True
torchrun --nproc_per_node=8 -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks step2_audio_paralinguistic,csbench,scibench \
--batch...
v0.4.1: Tool Calling evaluation, cache API, and more models and benchmarks
Main Features
- Tool calling evaluation through MCP and OpenAI-compatible servers
- A unified cache API for resuming responses
Tool Calling Examples
We now support tool-calling evaluation for models served through an OpenAI-compatible server together with an MCP server. To start, first set up an OpenAI-compatible server through vLLM, SGLang, or any similar framework.
Then write your own MCP server that our client can connect to. An example launch command:
accelerate launch --num_processes=8 --main_process_port 12345 -m lmms_eval \
--model async_openai \
--model_args model_version=$CKPT_PATH,mcp_server_path=/path/to/mcp_server.py \
--tasks $TASK_NAME \
--batch_size 1 \
--output_path ./logs/ \
--log_samples
Cache API
To handle cases where an evaluation is terminated early, we have created a cache API so that you can resume the evaluation instead of starting a completely new one. An example of using the cache API in your generate_until:
def generate_until(self, requests):
    self.load_cache()
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model inference
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results
More information can be found in caching.md.
What's Changed
- [NEW TASK] Add video task support for LSDBench by @taintaintainu in #778
- feat: Add max_frame_num parameter to encode_video by @Luodian in #783
- fix: Fix video loading logic and add protocol for loading by @kcz358 in #788
- Fix broken references by @yaojingguo in #787
- fix(qwen2_5_vl): Ensure unique frame indices for videos with few frames by @Luodian in #789
- New tasks supported: EMMA by @Devininthelab in #790
- Add httpx_trust_env arg for openai_compatible model by @yaojingguo in #791
- [TASK] MMRefine Benchmark by @skyil7 in #793
- Fix accuracy computation for VQAv2 by @oscmansan in #794
- docs: Update installation instructions to use uv package manager by @Luodian in #799
- Remove the duplicated Aero-1-Audio content by @yaojingguo in #803
- feat: Add Charxiv, videomme long, and vllm threadpool for decoding inputs by @kcz358 in #802
- [Feature] Add GPT-4o Audio by @YichenG170 in #798
- [Feature] Add Thyme Model by @xjtupanda in #811
- [Fix] Fix tools and add mcp client by @kcz358 in #812
- [Feat] Adding cache api for model by @kcz358 in #814
- fix: Async OpenAI caching order and more common args by @kcz358 in #816
- [Feature] Support for Gemma-3 Models by @RadhaGulhane13 in #821
- feat: Add longvila-r1 and benchmarks by @kcz358 in #819
- [bugfix] fix bug in srt_api.py by @zzhbrr in #826
- fix(gemma3): use Gemma3ForConditionalGeneration to load by @Luodian in #827
- feat: add llava_onevision1_5 by @mathCrazyy in #825
- [Feature] Add VoiceBench by @YichenG170 in #809
- add script of LLaVA-OneVision1_5 by @mathCrazyy in #828
- add scibench(math) task by @KelvinDo183 in #834
New Contributors
- @taintaintainu made their first contribution in #778
- @oscmansan made their first contribution in #794
- @YichenG170 made their first contribution in #798
- @xjtupanda made their first contribution in #811
- @zzhbrr made their first contribution in #826
- @mathCrazyy made their first contribution in #825
- @KelvinDo183 made their first contribution in #834
Full Changelog: v0.4...v0.4.1
v0.4: multi-node, tp + dp parallel, unified llm-as-judge api, `doc_to_message` support
😻 LMMs-Eval upgrades to v0.4, better evals for better models.
- multi-node evals, tp+dp parallel.
- new `doc_to_messages` support for interleaved-modality inputs, fully compatible with the official OpenAI message format and suitable for evaluating more complicated tasks (see the sketch below).
- unified `llm-as-judge` API to support more versatile metric functions, with async mode for high concurrency and throughput.
- more features:
  - tool use for agentic tasks
  - programmatic API for supporting more third-party training frameworks like nanoVLM; call LMMs-Eval in your training loop to inspect your models on more tasks.
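As a rough illustration of the message-style task hook mentioned above, the following hypothetical doc_to_messages returns OpenAI-format interleaved content; the exact hook name, signature, and field names in your lmms-eval version may differ:

# Hypothetical doc_to_messages for a task's utils.py: maps a dataset doc to
# OpenAI-format interleaved messages. Treat this as a sketch, not the exact API.
def doc_to_messages(doc):
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": doc["image_url"]}},  # interleaved image
                {"type": "text", "text": doc["question"]},  # question text
            ],
        }
    ]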
This upgrade focuses on accelerating evaluation and improving consistency, addressing the needs of reasoning models with longer outputs, multiple rollouts, and scenarios where LLM-as-judge is required for general-domain tasks.
With LMMs-Eval, we are dedicated to building a frontier evaluation toolkit that accelerates the development of better multimodal models.
More at: https://github.com/EvolvingLMMs-Lab/lmms-eval
Meanwhile, we are currently building the next frontier fully open multimodal models and new supporting frameworks.
Vibe check with us: https://lmms-lab.com
What's Changed
- [Improvement] Accept chat template string in vLLM models by @VincentYCYao in #768
- [Feat] fix tasks and vllm to reproduce better results. by @Luodian in #774
- Remove the deprecated tasks related to the nonexistent lmms-lab/OlympiadBench dataset by @yaojingguo in #776
- [Feat] LMMS-Eval 0.4 by @Luodian in #721
Full Changelog: v0.3.5...v0.4
v0.3.5
What's Changed
- pip 0.3.4 by @pufanyi in #697
- [Fix] Minor fix on some warning messages by @kcz358 in #704
- [FIX] Add macro metric to task xlrs-lite by @nanocm in #700
- [Fix] Fix evaluator crash with accelerate backend when num_processes=1 by @miikatoi in #699
- [Fix] Enable the ignored API_URL in the MathVista evaluation. by @MoyusiteruIori in #705
- Adds VideoMathQA - Task Designed to Evaluate Mathematical Reasoning in Real-World Educational Videos by @hanoonaR in #702
- Update sentencepiece dependency and add new parameters to mathvista_t… by @Luodian in #716
- [fix ] Refactor Accelerator initialization by @Luodian in #717
- [Minor] typo fixed in task_guide.md by @JulyanZhu in #720
- add mmsi-bench (https://arxiv.org/abs/2505.23764) by @sihany077 in #715
- add mmvu task by @pbcong in #713
- Dev/tomato by @Devininthelab in #709
- [fix] update korean benchmark's post_prompt by @jujeongho0 in #719
- [fix] ensure synchronization not be used without distributed execution by @debugdoctor in #714
- [FIX] Resolve MMMU-test submission file generation issue by @xyyandxyy in #724
- Add CameraBench_VQA by @chancharikmitra in #725
- [vLLM] centralize VLLM_WORKER_MULTIPROC_METHOD by @kylesayrs in #728
- [fix] cli_evaluate to properly handle Namespace arguments by @Luodian in #733
- Fix three bugs in the codebase by @Luodian in #734
- [Bug] fix a bug in post processing stage of ScienceQA. by @ashun989 in #723
- fix: add max_frames_num to OpenAICompatible by @loongfeili in #740
- [Bugfix] Add min image resolution requirement for vLLM Qwen-VL models by @zch42 in #737
- Revert "Pass in the 'cache_dir' to use local cache" by @kcz358 in #741
- [New Benchmark] Add Video-TT Benchmark by @dongyh20 in #742
- Add claude GitHub actions 1752118403023 by @Luodian in #749
- [Bugfix] Fix handling of encode_video output in vllm.py so each frame’s Base64 by @LiamLian0727 in #754
- [New Benchmark] Request for supporting TimeScope by @ruili33 in #756
- Remove Claude GitHub workflows for code review by @Luodian in #757
- [fix] Fixed applying process_* twice on resAns for VQAv2 by @Avelina9X in #760
- [fix] update korean benchmark's post_prompt by @jujeongho0 in #759
- Title: Add Benchmark from "Vision-Language Models Can’t See the Obvious" (ICCV 2025) by @dunghuynhandy in #744
- [fix] vqav2 evaluation yaml by @mletrasdl in #764
- [New Task] Add support for benchmark PhyX by @wutaiqiang in #766
New Contributors
- @miikatoi made their first contribution in #699
- @MoyusiteruIori made their first contribution in #705
- @hanoonaR made their first contribution in #702
- @sihany077 made their first contribution in #715
- @debugdoctor made their first contribution in #714
- @xyyandxyy made their first contribution in #724
- @chancharikmitra made their first contribution in #725
- @loongfeili made their first contribution in #740
- @zch42 made their first contribution in #737
- @LiamLian0727 made their first contribution in #754
- @ruili33 made their first contribution in #756
- @Avelina9X made their first contribution in #760
- @dunghuynhandy made their first contribution in #744
- @mletrasdl made their first contribution in #764
- @wutaiqiang made their first contribution in #766
Full Changelog: v0.3.4...v0.3.5
v0.3.4
What's Changed
- Support VSI-Bench Evaluation by @vealocia in #511
- [Fix] Better Qwen omni and linting by @kcz358 in #647
- Fix the bug in issue #648 by @ashun989 in #649
- [New Model] Aero-1-Audio by @kcz358 in #658
- [improve]: catch import error; remove unused modules by @VincentYCYao in #650
- [Fix] fixing the video path of MVBench & adding default hf_home to percepti… by @jihanyang in #655
- Update vllm.py by @VincentYCYao in #652
- [FIX]: Fix question_for_eval key in MathVerse evaluator for Vision-Only data by @ForJadeForest in #657
- [Task] Add new benchmark: CAPability by @lntzm in #656
- Mathvision bug fixes , Reproduce Qwen2.5VL results by @RadhaGulhane13 in #660
- Fix issue with killing process in sglang by @ravi03071991 in #666
- Fixes Metadata Reading from Released PLM Checkpoints by @mmaaz60 in #665
- [fix] modify the GPT evaluation model by @jujeongho0 in #668
- [Fix] Correct rating logic for VITATECS benchmark by @erfanbsoula in #671
- Update README.md by @pufanyi in #675
- delete unused test_parse.py file by @pbcong in #676
- [fix] add reminder for interleave_visual for Qwen2.5-VL, update version control. by @Luodian in #678
- [fix] Fix task listing in CLI evaluation by updating to use 'all_tasks' instead of 'list_all_tasks' for improved clarity. by @Luodian in #687
- [Task] V*-Bench (Visual Star Benchmark) by @Luodian in #683
- support distributed executor backend - torchrun by @kaiyuyue in #680
- [Task] Add new task: XLRS-Bench-lite by @nanocm in #684
- Added direction for locally cached dataset in task_guide.md by @JulyanZhu in #691
- Pass in the 'cache_dir' to use local cache by @JulyanZhu in #690
- [FIX] Fix parameter name in qwen25vl.sh by @MasterBeeee in #693
- [TASK & FIX] add task VideoEval-Pro and fix tar file concat by @iamtonymwt in #694
New Contributors
- @vealocia made their first contribution in #511
- @ashun989 made their first contribution in #649
- @VincentYCYao made their first contribution in #650
- @jihanyang made their first contribution in #655
- @ForJadeForest made their first contribution in #657
- @lntzm made their first contribution in #656
- @RadhaGulhane13 made their first contribution in #660
- @ravi03071991 made their first contribution in #666
- @erfanbsoula made their first contribution in #671
- @kaiyuyue made their first contribution in #680
- @nanocm made their first contribution in #684
- @JulyanZhu made their first contribution in #691
- @MasterBeeee made their first contribution in #693
- @iamtonymwt made their first contribution in #694
Full Changelog: v0.3.3...v0.3.4
v0.3.3 Fix models and add model examples
What's Changed
- [Fix] Add padding_side="left" for Qwen2.5 to enable flash_attention by @robinhad in #620
- Add ability to pass options to VLLM by @robinhad in #621
- Fix Qwen by @Devininthelab in #633
- Whisper + vLLM: FLEURS Evaluation Fixes and Language Prompt Injection by @shubhra in #624
- Fix loading datasets from disk by @CLARKBENHAM in #629
- Cache stringifies where not needed by @CLARKBENHAM in #631
- openai chat.completions uses max_completion_tokens by @CLARKBENHAM in #630
- MAC decord equivalent by @CLARKBENHAM in #632
- [Task] Add support for VisualPuzzles by @yueqis in #637
- Adds PerceptionLM and PLM-VideoBench by @mmaaz60 in #638
- [Fix] Aria and LLama Vision and OpenAI compatible models by @Luodian in #641
- [Feat] Enhance Qwen model with additional parameters and improved visual handling by @Luodian in #639
- [Fix] add more model examples by @Luodian in #644
New Contributors
- @shubhra made their first contribution in #624
- @CLARKBENHAM made their first contribution in #629
- @yueqis made their first contribution in #637
- @mmaaz60 made their first contribution in #638
Full Changelog: v0.3.2...v0.3.3
v0.3.2
What's Changed
- Merging Ola by @Devininthelab in #558
- [Evaluation] Improve string processing order for better whitespace handling by @Ryoo72 in #554
- modify utils.py by @shuyansy in #555
- Contribute EgoLife model and evaluation pipeline for EgoPlan & Egothink by @choiszt in #560
- Fix Ola path for GPUs by @Devininthelab in #562
- add OCRBench v2 by @99Franklin in #570
- [WIP][Model] Whisper + vLLM by @kylesayrs in #545
- Add VideoChat-Flash and InternVideo2.5 by @leexinhao in #568
- Add LiveXiv benchmark [ICLR 2025] by @NimrodShabtay in #572
- [Add Dataset] K-MMBench, K-SEED, K-MMStar, K-DTCBench, K-LLaVA-W by @jujeongho0 in #575
- [Dataset] Support VMCBench (CVPR 25) by @yuhui-zh15 in #573
- update merge mlvu_dev and mlvu_test by @shuyansy in #582
- add mmau task by @pbcong in #585
- [Feat] Support VideoLLaMA3 by @CircleRadon in #588
- Add WorldSense by @Devininthelab in #589
- [Add] Task "Multimodal RewardBench" by @seungyeonlj in #591
- [Task] adding MME-COT by @Luodian in #593
- Add README.md for MME-CoT by @CaraJ7 in #601
- [Tasks] New tasks for Visual Reasoning Collection by @Luodian in #600
- [Enhancement] Add LLM evaluation metric and integrate GPT-4o reasoning by @Luodian in #604
- [Feat] fix MME COT, add llm as judge eval by @Luodian in #605
- Fix hard-coded max_new_tokens for qwen2_5_vl model by @robinhad in #609
- Add Omni Bench by @ngquangtrung57 in #613
- [Feat] Fix MEGA-Bench evaluator, update doc by @woodfrog in #606
- [Feat] Adding libri long by @kcz358 in #618
- Update a new model Qwen-2.5-Omni by @Devininthelab in #615
- Modify the openai api to support o1 and o3 by @wenhuchen in #614
- [Model] support VoRA model by @sty-yyj in #616
New Contributors
- @Devininthelab made their first contribution in #558
- @Ryoo72 made their first contribution in #554
- @99Franklin made their first contribution in #570
- @kylesayrs made their first contribution in #545
- @leexinhao made their first contribution in #568
- @NimrodShabtay made their first contribution in #572
- @jujeongho0 made their first contribution in #575
- @yuhui-zh15 made their first contribution in #573
- @CircleRadon made their first contribution in #588
- @seungyeonlj made their first contribution in #591
- @robinhad made their first contribution in #609
- @wenhuchen made their first contribution in #614
- @sty-yyj made their first contribution in #616
Full Changelog: v0.3.1...v0.3.2
v0.3.1
What's Changed
- BugFix: Fixed input to llama_vision processor by @Danielohayon in #431
- MixEval-X Image / Video by @pufanyi in #434
- MixEval-X Readme by @pufanyi in #444
- Fix Llama vision mentioned in #434 by @pufanyi in #447
- add task MMVet-v2 by @frankRenlf in #451
- Fix mmt output format by @ngquangtrung57 in #454
- [Fixed] metric names in NaturalBench dataset by @Baiqi-Li in #455
- [Fix] fix mia-bench evaluation by @Luodian in #456
- Fix MMVet V2 by @pufanyi in #457
- Mmvetv2 by @frankRenlf in #458
- [Fix] remove useless print statements by @pufanyi in #460
- [Fix] remove unused text processing notebook by @pufanyi in #485
- [Fix] Use no media iterator by @kcz358 in #486
- Add VL-RewardBench dataset by @TobiasLee in #484
- Delete model and cache before multigpu data gathering by @xumingze0308 in #489
- Add MEGA-Bench by @woodfrog in #496
- [WIP] style(megabench): improve code formatting and import ordering by @Luodian in #497
- Fix llama_vision chat_template and decode by @coding-famer in #498
- [Support] Support new model: Ross by @Haochen-Wang409 in #494
- [FIX] Minor errors in gemini_api.py and internvl2.py. by @skyil7 in #502
- Fix NoneType Error in flatten Function for Text-Only Tasks in LLAVA Models by @bibisbar in #501
- Fix device_map by @coding-famer in #505
- Fix custom model wrapper to enable usage of instanced model by @ErezSC42 in #508
- fix output format of airbench and vocal sound by @pbcong in #510
- Update README.md by @KairuiHu in #513
- Add covost2 zh en by @pbcong in #515
- [Fix] Fix dataset processing logic for common voice and gigaspeech by @kcz358 in #517
- Fix language in common voice by @kcz358 in #518
- [Dataset] Adding Fleurs en/cn split by @kcz358 in #516
- Change fleurs path by @kcz358 in #519
- add covost2_en_zh task by @ngquangtrung57 in #520
- [Feat] Add VITA 1.5 into lmms-eval by @kcz358 in #521
- [Fix] megabench evaluator metric type determination by @woodfrog in #523
- fix aggregation function and remove redundancies by @pbcong in #522
- Add VideoMMMU task and Support Qwen2.5-vl Model by @KairuiHu in #524
- [Add Dataset] HR-Bench (AAAI 2025) by @DreamMr in #525
- [Feat] add @maj and @pass to support sampling multiples times during evaluation by @Luodian in #526
- [Feat] add mathvision datasets by @Luodian in #527
- [Fix] of "Model llavavid not found in available models." by @zhshj0110 in #528
- Replaced incorrect variable name self._word_size to self._world_size by @priancho in #535
- Yhzhang/add charades sta by @ZhangYuanhan-AI in #536
- [Fix] of "evaluation of llava_vid on mvbench" by @zhshj0110 in #541
- [Model] add vllm compatible models by @Luodian in #544
- [Model] add openai compatible API interface by @Luodian in #546
New Contributors
- @Danielohayon made their first contribution in #431
- @frankRenlf made their first contribution in #451
- @TobiasLee made their first contribution in #484
- @xumingze0308 made their first contribution in #489
- @woodfrog made their first contribution in #496
- @coding-famer made their first contribution in #498
- @Haochen-Wang409 made their first contribution in #494
- @bibisbar made their first contribution in #501
- @ErezSC42 made their first contribution in #508
- @KairuiHu made their first contribution in #513
- @DreamMr made their first contribution in #525
- @zhshj0110 made their first contribution in #528
- @priancho made their first contribution in #535
Full Changelog: v0.3.0...v0.3.1
v0.3.0
What's Changed
- Bump version to 0.2.4 and remove unused dependencies by @pufanyi in #292
- Load package for NExT-QA evaluation by @zhijian-liu in #295
- Fix MMMU-Pro evaluation by @zhijian-liu in #296
- [Feat] LiveBench 2409 by @pufanyi in #304
- [Doc] add more detailed task guide to explain the variables in yaml configuration file by @Luodian in #303
- [fix] Invalid group in mmsearch.yaml by @skyil7 in #305
- [Fix] Fix cache_dir issue where MVBench cannot be found by @yinanhe in #306
- [Fix] LiveBench 2409 by @pufanyi in #308
- [Fix] A small fix for the LiveBench checker by @pufanyi in #310
- [Fix] Change "Basic Understanding" to "Concrete Recognition" by @pufanyi in #311
- [Feat] LLaMA-3.2-Vision by @kcz358 in #314
- [Fix] Fix extra calling in qwen_vl_api, use tempfile for tmp by @kcz358 in #312
- Fix LMMS_EVAL_PLUGINS by @zhijian-liu in #297
- [feat] changes on llava_vid model by @ZhangYuanhan-AI in #291
- Update video_decode_backend to "decord" by @ZhangYuanhan-AI in #318
- Update the prompt to be consistent with the current LiveBench design by @pufanyi in #319
- Add AI2D evaluation without masks by @zhijian-liu in #325
- add vinoground by @HanSolo9682 in #326
- Update evaluator.py to load datasets first before loading models by @LooperXX in #327
- Update llava_onevision.py to avoid erros on evaluation benchmarks with both single- and multi-image samples. by @LooperXX in #338
- Upload Tasks: CinePile by @JARVVVIS in #343
- [Update] Allow pass in max pixels and num frames in qwen2vl by @kcz358 in #346
- funqa update by @Nicous20 in #341
- Update Vinoground to make evaluation consistent with paper by @HanSolo9682 in #354
- Update mmmu_pro_standard.yaml by @zhijian-liu in #353
- Upload tasks: MovieChat-1K, VDC by @Espere-1119-Song in #342
- [Feat] Add AuroraCap, MovieChat, LLaVA-OneVision-MovieChat by @Espere-1119-Song in #358
- update docs for VDC and MovieChat by @rese1f in #359
- [WIP] feat: update to use azure api by @Luodian in #340
- Update MLVU answer parsing by @Xiuyu-Li in #364
- Add task docs for Vinoground by @HanSolo9682 in #372
- [Add Dataset] NaturalBench(NeurIPS24) by @Baiqi-Li in #371
- Update README.md by @kcz358 in #377
- fix model_specific_prompt_kwargs of VDC and MovieChat-1K by @Espere-1119-Song in #382
- Add os import to mathverse_evals.py by @spacecraft1013 in #381
- [Fix] Fix hallu bench by @kcz358 in #392
- Fix "percetion" typo (issue #396) by @Qu3tzal in #397
- Add TemporalBench by @mu-cai in #402
- [Tiny Fix] fix dataset_kwargs in lmms_eval/api/task.py by @Li-Qingyun in #404
- Add model aria & fix on LongVideoBench by @teowu in #391
- [update] NaturalBench to README by @Baiqi-Li in #406
- add model Slime and Benchmark mme_realworld_lite by @yfzhang114 in #409
- Update VDC with SGLang by @Espere-1119-Song in #411
- Add video processing logic for idefics2 by @kcz358 in #418
- update the introduction of mme-realworld by @yfzhang114 in #416
- [Task] add MIA-Bench by @Luodian in #419
- Modify typos in run_example.md by @Espere-1119-Song in #422
- [Release] lmms-eval v0.3.0 release by @kcz358 in #428
- PyPI 0.3.0 by @pufanyi in #432
New Contributors
- @ZhangYuanhan-AI made their first contribution in #291
- @HanSolo9682 made their first contribution in #326
- @LooperXX made their first contribution in #327
- @JARVVVIS made their first contribution in #343
- @Nicous20 made their first contribution in #341
- @Espere-1119-Song made their first contribution in #342
- @rese1f made their first contribution in #359
- @Xiuyu-Li made their first contribution in #364
- @Baiqi-Li made their first contribution in #371
- @spacecraft1013 made their first contribution in #381
- @Qu3tzal made their first contribution in #397
- @mu-cai made their first contribution in #402
- @Li-Qingyun made their first contribution in #404
Full Changelog: v0.2.4...v0.3.0
v0.2.4 add `generate_until_multi_round` to support interactive and multi-round evaluations; add models and fix glitches
What's Changed
- [Fix] Fix bugs in returning result dict and bring back anls metric by @kcz358 in #221
- fix: fix wrong args in wandb logger by @Luodian in #226
- [feat] Add check for existence of accelerator before waiting by @Luodian in #227
- add more language tasks and fix fewshot evaluation bugs by @Luodian in #228
- Remove unnecessary LM object removal in evaluator by @Luodian in #229
- [fix] Shallow copy issue by @pufanyi in #231
- [Minor] Fix max_new_tokens in video llava by @kcz358 in #237
- Update LMMS evaluation tasks for various subjects by @Luodian in #240
- [Fix] Fix async append result in different order issue by @kcz358 in #244
- Update the version requirement for transformers by @zhijian-liu in #235
- Add new LMMS evaluation task for wild vision benchmark by @Luodian in #247
- Add raw score to wildvision bench by @Luodian in #250
- [Fix] Strict video to be single processing by @kcz358 in #246
- Refactor wild_vision_aggregation_raw_scores to calculate average score by @Luodian in #252
- [Fix] Bring back process result pbar by @kcz358 in #251
- [Minor] Update utils.py by @YangYangGirl in #249
- Refactor distributed gathering of logged samples and metrics by @Luodian in #253
- Refactor caching module and fix serialization issue by @Luodian in #255
- [Minor] Bring back fix for metadata by @kcz358 in #258
- [Model] support minimonkey model by @white2018 in #257
- [Feat] add regression test and change saving logic related to output_path by @Luodian in #259
- [Feat] Add support for llava_hf video, better loading logic for llava_hf ckpt by @kcz358 in #260
- [Model] support cogvlm2 model by @white2018 in #261
- [Docs] Update and sort current_tasks.md by @pbcong in #262
- fix error name with infovqa task by @ZhaoyangLi-nju in #265
- [Task] Add MMT and MMT_MI (Multiple Image) Task by @ngquangtrung57 in #270
- mme-realworld by @yfzhang114 in #266
- [Model] support Qwen2 VL by @abzb1 in #268
- Support new task mmworld by @jkooy in #269
- Update current tasks.md by @pbcong in #272
- [feat] support video evaluation for qwen2-vl and add mix-evals-video2text by @Luodian in #275
- [Feat][Task] Add multi-round evaluation in llava-onevision; Add MMSearch Benchmark by @CaraJ7 in #277
- [Fix] Model name None in Task manager, mix eval model specific kwargs, claude retrying fix by @kcz358 in #278
- [Feat] Add support for evaluation of Oryx models by @dongyh20 in #276
- [Fix] Fix the error when running models caused by generate_until_multi_round by @pufanyi in #281
- [fix] Refactor GeminiAPI class to add video pooling and freeing by @pufanyi in #287
- add jmmmu by @AtsuMiyai in #286
- [Feat] Add support for evaluation of InternVideo2-Chat && Fix evaluation for mvbench by @yinanhe in #280
New Contributors
- @YangYangGirl made their first contribution in #249
- @white2018 made their first contribution in #257
- @pbcong made their first contribution in #262
- @ZhaoyangLi-nju made their first contribution in #265
- @ngquangtrung57 made their first contribution in #270
- @yfzhang114 made their first contribution in #266
- @jkooy made their first contribution in #269
- @dongyh20 made their first contribution in #276
- @yinanhe made their first contribution in #280
Full Changelog: v0.2.3...v0.2.4