
Releases: EvolvingLMMs-Lab/lmms-eval

v0.5 Better Coverage of Audio Evaluations and Alignment Check on STEM/Reasoning Benchmarks

07 Oct 05:20
8f142bc


Introduction

Key Highlights:

  • Audio-First: Comprehensive audio evaluation with paralinguistic analysis
  • Response Caching: Production-ready caching system for faster re-evaluation
  • 5 New Models: Including audio-capable GPT-4o, LongViLA, Gemma-3
  • 50+ New Benchmark Variants: Audio, vision, coding, and STEM tasks
  • MCP Integration: Model Context Protocol client support


Major Features

1. Response Caching System

A production-ready JSONL-based caching system that dramatically speeds up re-evaluation and reduces API costs:

Key Features:

  • Per-document caching: Cached at (task_name, doc_id) level
  • Distributed-safe: Separate cache files per rank/world size
  • Zero-overhead: Automatic cache hits with no code changes
  • Multi-backend: Works with async OpenAI, vLLM, and custom models

Enable Caching:

export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="/path/to/cache_root"  # optional

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20,base_url=$OPENAI_API_BASE \
  --tasks mmmu_val \
  --batch_size 1 \
  --output_path ./logs/

Cache Location:

  • Default: ~/.cache/lmms-eval/eval_cache/<model_hash>/{task_name}_rank{rank}_world_size{world_size}.jsonl
  • Each line: {"doc_id": <doc_id>, "response": <string>}
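
For quick debugging, the cache files can be inspected directly with a few lines of Python. The sketch below is an illustration, not part of lmms-eval; the helper name and paths are placeholders. It merges all per-rank cache files for a task into a single doc_id → response map:

import json
from pathlib import Path

def load_cached_responses(cache_root: str, model_hash: str, task_name: str) -> dict:
    """Merge all per-rank cache files for a task into a doc_id -> response map."""
    cache_dir = Path(cache_root).expanduser() / "eval_cache" / model_hash
    responses = {}
    for cache_file in sorted(cache_dir.glob(f"{task_name}_rank*_world_size*.jsonl")):
        with cache_file.open() as f:
            for line in f:
                record = json.loads(line)  # {"doc_id": ..., "response": ...}
                responses[record["doc_id"]] = record["response"]
    return responses

# Example (model hash is a placeholder):
# cached = load_cached_responses("~/.cache/lmms-eval", "<model_hash>", "mmmu_val")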

API Integration:

def generate_until(self, requests):
    self.load_cache()
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results

See full documentation in docs/caching.md.

2. Audio Evaluation Suite

Comprehensive audio understanding capabilities with three major benchmark families:

Step2 Audio Paralinguistic (11 tasks)

Fine-grained paralinguistic feature evaluation:

  • Acoustic Features: pitch, rhythm, speed, voice_tone, voice_styles
  • Speaker Attributes: age, gender, emotions
  • Environmental: scene, event, vocalsound
  • Semantic Match metrics
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic \
  --batch_size 1

VoiceBench (9 main categories, 30+ subtasks)

Comprehensive voice and speech evaluation:

  • Instruction Following: ifeval, alpacaeval, advbench
  • Reasoning: bbh (Big Bench Hard), commoneval
  • Knowledge: mmsu (13 subject areas: biology, chemistry, physics, etc.)
  • Q&A: openbookqa
  • Accent Diversity: sd-qa (11 regional variants: USA, UK, India, Australia, etc.)
  • Expressiveness: wildvoice
  • Metrics vary by task type, including accuracy, 1-5 ratings, failure rate, LLM-based evaluation, etc.
# Full VoiceBench
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks voicebench \
  --batch_size 1

# Specific accent evaluation
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks voicebench_sd-qa_ind_n,voicebench_sd-qa_ind_s \
  --batch_size 1

WenetSpeech (2 splits)

Large-scale ASR and speech evaluation:

  • dev: Development set for validation
  • test_meeting: Meeting domain evaluation
  • MER (Mixed Error Rate) metrics
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks wenet_speech_dev,wenet_speech_test_meeting \
  --batch_size 1
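
For reference, MER here is a mixed error rate for Mandarin-English speech. A common convention (an assumption on our part; the exact scoring code in lmms-eval may differ) treats each Chinese character and each English word as one token and computes an edit-distance error rate over the token sequences:

import re

def tokenize_mixed(text: str) -> list:
    """Treat each CJK character as a token and each non-CJK word as a token."""
    return re.findall(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+", text)

def mixed_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between token sequences divided by reference length."""
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    dp = list(range(len(hyp) + 1))  # dp[j] = distance for the previous reference prefix
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (ref[i - 1] != hyp[j - 1]))
            prev = cur
    return dp[len(hyp)] / max(len(ref), 1)

# mixed_error_rate("今天天气很好", "今天天气不错")  # -> 0.333 (2 substitutions / 6 tokens)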

Audio Pipeline Features:

  • HuggingFace audio dataset integration
  • Unified audio message format
  • Multiple metric support (Accuracy, WER, GPT-4 Judge)
  • Task grouping for multi-subset benchmarks
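
As an illustration of the unified audio message format, the sketch below builds an OpenAI-style chat message carrying a base64-encoded audio clip. The field layout follows the OpenAI audio-input convention; lmms-eval's internal schema may differ slightly, and the file path is a placeholder.

import base64

def build_audio_message(audio_path: str, prompt: str) -> dict:
    """Wrap a local audio file and a text prompt into one chat message."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # OpenAI-style audio content part; adjust "format" to match your file.
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }

# message = build_audio_message("sample.wav", "What emotion does the speaker convey?")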

3. New Model Support

Five new model integrations expanding audio and vision capabilities:

| Model | Type | Key Features | Usage Example |
| --- | --- | --- | --- |
| GPT-4o Audio Preview | Audio+Text | Paralinguistic understanding, multi-turn audio | `--model async_openai --model_args model_version=gpt-4o-audio-preview-2024-12-17` |
| Gemma-3 | Vision+Text | Enhanced video handling, efficient architecture | `--model gemma3 --model_args pretrained=google/gemma-3-2b-vision-it` |
| LLaVA-OneVision 1.5 | Vision+Text | Improved vision understanding, latest LLaVA | `--model llava_onevision1_5 --model_args pretrained=lmms-lab/llava-onevision-1.5-7b` |
| LongViLA-R1 | Video+Text | Long-context video, efficient video processing | `--model longvila --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B` |
| Thyme | Vision+Text | Reasoning-focused, enhanced image handling | `--model thyme --model_args pretrained=thyme-ai/thyme-7b` |

Example Usage:

# GPT-4o Audio Preview for audio tasks
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench \
  --batch_size 1

# LongViLA for video understanding
python -m lmms_eval \
  --model longvila \
  --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B \
  --tasks videomme,egoschema \
  --batch_size 1

4. New Benchmarks

Beyond audio, v0.5 adds diverse vision and reasoning benchmarks, significantly expanding LMMS-Eval's coverage into specialized domains:

Vision & Reasoning Benchmarks

| Benchmark | Variants | Focus | Metrics |
| --- | --- | --- | --- |
| CSBench | 3 (MCQ, Assertion, Combined) | Code understanding, debugging | Accuracy |
| SciBench | 4 (Math, Physics, Chemistry, Combined) | College-level STEM | GPT-4 Judge, Accuracy |
| MedQA | 1 | Medical question answering | Accuracy |
| SuperGPQA | 1 | Graduate-level science Q&A | Accuracy |
| Lemonade | 1 | Video action recognition | Accuracy |
| CharXiv | 3 (Descriptive, Reasoning, Combined) | Scientific chart interpretation | Accuracy, GPT-4 Judge |

Example Usage:

# Code understanding
python -m lmms_eval --tasks csbench --batch_size 1

# STEM reasoning
python -m lmms_eval --tasks scibench --batch_size 1

# Chart reasoning
python -m lmms_eval --tasks charxiv --batch_size 1

Reproducibility Validation

We validated our benchmark implementations against official results using two popular language models. The table below compares lmms-eval scores with officially reported results to demonstrate reproducibility:

| Model | Task | lmms-eval | Reported | Δ | Status |
| --- | --- | --- | --- | --- | --- |
| Qwen-2.5-7B-Instruct | MedQA | 53.89 | 54.28 | -0.39 | ✓ |
| Qwen-2.5-7B-Instruct | SciBench | 43.86 | 42.97 | +0.89 | ✓ |
| Qwen-2.5-7B-Instruct | CSBench | 69.01 | 69.51 | -0.50 | ✓ |
| Qwen-2.5-7B-Instruct | SuperGPQA | 29.24 | 28.78 | +0.46 | ✓ |
| Llama-3.1-8B | MedQA | 64.49 | 67.01 | -2.52 | ✓ |
| Llama-3.1-8B | SciBench | 15.35 | 10.78 | +4.57 | +- |
| Llama-3.1-8B | CSBench | 62.49 | 57.87 | +4.62 | +- |
| Llama-3.1-8B | SuperGPQA | 21.94 | 19.72 | +2.22 | ✓ |

Status Legend: ✓ = Strong agreement (Δ ≤ 2.5%) | +- = Acceptable variance (2.5% < Δ ≤ 5%)
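
The status labels follow directly from these thresholds; a tiny helper that reproduces the classification (illustrative only, not part of the codebase) looks like this:

def agreement_status(lmms_eval_score: float, reported_score: float) -> str:
    """Classify the reproducibility gap per the legend above (Δ in absolute points)."""
    delta = abs(lmms_eval_score - reported_score)
    if delta <= 2.5:
        return "✓ strong agreement"
    if delta <= 5.0:
        return "+- acceptable variance"
    return "outside documented bands"

# agreement_status(15.35, 10.78) -> "+- acceptable variance"  (Llama-3.1-8B on SciBench)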

5. Model Context Protocol (MCP) Integration

Support for MCP-enabled models with tool calling:

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20,mcp_server_path=/path/to/mcp_server.py \
  --tasks mmmu_val \
  --batch_size 1

Features:

  • Tool call parsing and execution
  • Multi-step reasoning with tools
  • Custom MCP server integration
  • See examples/chat_templates/tool_call_qwen2_5_vl.jinja for templates
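
For reference, a minimal MCP server that can be passed via mcp_server_path might look like the sketch below. It assumes the official mcp Python SDK (pip install mcp); the server name and the add tool are hypothetical placeholders, not tools shipped with lmms-eval.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-tools")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Placeholder tool: add two numbers."""
    return a + b

if __name__ == "__main__":
    # stdio transport so the evaluation client can spawn and talk to this script
    mcp.run(transport="stdio")

Point --model_args ...,mcp_server_path=/path/to/this_file.py at the script, as in the command above.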

6. Async OpenAI Improvements

Enhanced async API integration:

  • Better rate limit handling
  • Configurable retry logic with delays
  • Improved error handling
  • Batch size optimization for OpenAI-compatible endpoints

Common Args Support:

# Now supports additional parameters
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o,temperature=0.7,top_p=0.95,max_tokens=2048 \
  --tasks mmstar
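
The retry behavior described above can be approximated with the sketch below. It is an illustration built on the openai Python client, not the exact logic inside the async_openai model, and the retry counts and delays are arbitrary choices.

import asyncio
from openai import APIError, AsyncOpenAI, RateLimitError

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def chat_with_retries(messages, model="gpt-4o-2024-11-20",
                            max_retries=5, base_delay=2.0, **gen_kwargs):
    """Call the chat endpoint, backing off on rate limits and transient API errors."""
    for attempt in range(max_retries):
        try:
            resp = await client.chat.completions.create(
                model=model, messages=messages, **gen_kwargs)
            return resp.choices[0].message.content
        except (RateLimitError, APIError):
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * (attempt + 1))  # linear backoff between retries

# asyncio.run(chat_with_retries([{"role": "user", "content": "Hello"}], temperature=0.7))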

Usage Examples

Audio Evaluation with Caching

# Enable caching for expensive audio API calls
export LMMS_EVAL_USE_CACHE=True
export OPENAI_API_KEY="your-key"

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench \
  --batch_size 8 \
  --output_path ./audio_results/ \
  --log_samples

# Second run will use cache - much faster!

Multi-Benchmark Evaluation

# Evaluate across audio, vision, and reasoning tasks
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20 \
  --tasks voicebench_mmsu,csbench,scibench_math,charxiv \
  --batch_size 4 \
  --output_path ./multimodal_results/

Distributed Evaluation with Caching

export LMMS_EVAL_USE_CACHE=True

torchrun --nproc_per_node=8 -m lmms_eval \
  --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
  --tasks step2_audio_paralinguistic,csbench,scibench \
  --batch...

v0.4.1: Tool Calling evaluation, cache API, and more models and benchmarks

28 Sep 01:33


Main Features

  • Tool calling evaluation through MCP and OpenAI-compatible servers
  • A unified cache API for resuming responses

Tool Calling Examples

We now support tool calling evaluation for models through an OpenAI-compatible server and an MCP server. To start, you first need to set up an OpenAI-compatible server with vLLM/SGLang or any similar framework.

Then, you will need to write your own MCP server for our client to connect to. An example launch command:

accelerate launch --num_processes=8 --main_process_port 12345 -m lmms_eval \
    --model async_openai \
    --model_args model_version=$CKPT_PATH,mcp_server_path=/path/to/mcp_server.py \
    --tasks $TASK_NAME \
    --batch_size 1 \
    --output_path ./logs/ \
    --log_samples

Cache API

To handle cases where an evaluation is terminated early, we have created a cache API that lets you resume the evaluation instead of starting a completely new one. Example of using the cache API in your generate_until:

def generate_until(self, requests):
    self.load_cache()
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # your model inference
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results

More information can be found in caching.md

What's Changed

New Contributors

Full Changelog: v0.4...v0.4.1

v0.4: multi-node, tp + dp parallel, unified llm-as-judge api, `doc_to_message` support

30 Jul 04:31
b7b4b1d


😻 LMMs-Eval upgrades to v0.4, better evals for better models.

  • multi-node evals, tp+dp parallel.
  • new doc_to_message support for interleaved-modality inputs, fully compatible with the official OpenAI message format and suitable for evaluating more complicated tasks.
  • unified LLM-as-judge API to support more versatile metric functions, with async mode for high concurrency and throughput.
  • more features:
    • tool use for agentic tasks
    • programmatic API for supporting more third-party training frameworks like nanoVLM; you can now call LMMs-Eval in your training loop to inspect your models on more tasks (see the sketch below).
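
A rough sketch of the programmatic API from inside a training loop is shown below. The entry point and argument names are assumptions modeled on the lm-evaluation-harness-style simple_evaluate; check lmms_eval/evaluator.py for the actual signature before relying on it.

from lmms_eval import evaluator  # assumed import path

def eval_checkpoint(checkpoint_path: str):
    # Argument names are assumptions; verify against the installed version.
    results = evaluator.simple_evaluate(
        model="qwen2_5_vl",                          # registered model name (example)
        model_args=f"pretrained={checkpoint_path}",  # forwarded as a string (assumed)
        tasks=["mmmu_val"],
        batch_size=1,
    )
    return results["results"]  # per-task metrics (assumed layout)

# Call periodically in your training loop, e.g. every N steps:
# metrics = eval_checkpoint("/path/to/ckpt")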

This upgrade focuses on accelerating evaluation and improving consistency, addressing the needs of reasoning models with longer outputs, multiple rollouts, and scenarios where LLM-as-judge is required for general-domain tasks.

With LMMs-Eval, we are dedicated to building the frontier evaluation toolkit to accelerate the development of better multimodal models.

More at: https://github.com/EvolvingLMMs-Lab/lmms-eval

Meanwhile, we are currently building the next frontier fully open multimodal models and new supporting frameworks.

Vibe check with us: https://lmms-lab.com

What's Changed

  • [Improvement] Accept chat template string in vLLM models by @VincentYCYao in #768
  • [Feat] fix tasks and vllm to reproduce better results. by @Luodian in #774
  • Remove the deprecated tasks related to the nonexistent lmms-lab/OlympiadBench dataset by @yaojingguo in #776
  • [Feat] LMMS-Eval 0.4 by @Luodian in #721

Full Changelog: v0.3.5...v0.4

v0.3.5

21 Jul 12:30
f7a6d6b


What's Changed

New Contributors

Full Changelog: v0.3.4...v0.3.5


v0.3.4

30 May 07:06


What's Changed

New Contributors

Full Changelog: v0.3.3...v0.3.4

v0.3.3 Fix models and add model examples

20 Apr 06:26
514082e


What's Changed

New Contributors

Full Changelog: v0.3.2...v0.3.3

v0.3.2

06 Apr 12:13


What's Changed

New Contributors

Full Changelog: v0.3.1...v0.3.2

v0.3.1

22 Feb 09:15
eb2dadc


What's Changed

New Contributors

Full Changelog: v0.3.0...v0.3.1

v0.3.0

29 Nov 09:46
754640a


What's Changed

New Contributors

Full Changelog: v0.2.4...v0.3.0

v0.2.4 add `generate_until_multi_round` to support interactive and multi-round evaluations; add models and fix glitches

03 Oct 15:33
af395ae


What's Changed

  • [Fix] Fix bugs in returning result dict and bring back anls metric by @kcz358 in #221
  • fix: fix wrong args in wandb logger by @Luodian in #226
  • [feat] Add check for existence of accelerator before waiting by @Luodian in #227
  • add more language tasks and fix fewshot evaluation bugs by @Luodian in #228
  • Remove unnecessary LM object removal in evaluator by @Luodian in #229
  • [fix] Shallow copy issue by @pufanyi in #231
  • [Minor] Fix max_new_tokens in video llava by @kcz358 in #237
  • Update LMMS evaluation tasks for various subjects by @Luodian in #240
  • [Fix] Fix async append result in different order issue by @kcz358 in #244
  • Update the version requirement for transformers by @zhijian-liu in #235
  • Add new LMMS evaluation task for wild vision benchmark by @Luodian in #247
  • Add raw score to wildvision bench by @Luodian in #250
  • [Fix] Strict video to be single processing by @kcz358 in #246
  • Refactor wild_vision_aggregation_raw_scores to calculate average score by @Luodian in #252
  • [Fix] Bring back process result pbar by @kcz358 in #251
  • [Minor] Update utils.py by @YangYangGirl in #249
  • Refactor distributed gathering of logged samples and metrics by @Luodian in #253
  • Refactor caching module and fix serialization issue by @Luodian in #255
  • [Minor] Bring back fix for metadata by @kcz358 in #258
  • [Model] support minimonkey model by @white2018 in #257
  • [Feat] add regression test and change saving logic related to output_path by @Luodian in #259
  • [Feat] Add support for llava_hf video, better loading logic for llava_hf ckpt by @kcz358 in #260
  • [Model] support cogvlm2 model by @white2018 in #261
  • [Docs] Update and sort current_tasks.md by @pbcong in #262
  • fix error name with infovqa task by @ZhaoyangLi-nju in #265
  • [Task] Add MMT and MMT_MI (Multiple Image) Task by @ngquangtrung57 in #270
  • mme-realworld by @yfzhang114 in #266
  • [Model] support Qwen2 VL by @abzb1 in #268
  • Support new task mmworld by @jkooy in #269
  • Update current tasks.md by @pbcong in #272
  • [feat] support video evaluation for qwen2-vl and add mix-evals-video2text by @Luodian in #275
  • [Feat][Task] Add multi-round evaluation in llava-onevision; Add MMSearch Benchmark by @CaraJ7 in #277
  • [Fix] Model name None in Task manager, mix eval model specific kwargs, claude retrying fix by @kcz358 in #278
  • [Feat] Add support for evaluation of Oryx models by @dongyh20 in #276
  • [Fix] Fix the error when running models caused by generate_until_multi_round by @pufanyi in #281
  • [fix] Refactor GeminiAPI class to add video pooling and freeing by @pufanyi in #287
  • add jmmmu by @AtsuMiyai in #286
  • [Feat] Add support for evaluation of InternVideo2-Chat && Fix evaluation for mvbench by @yinanhe in #280

New Contributors

Full Changelog: v0.2.3...v0.2.4