LMMs-Eval v0.6

Overview

After developing LMMs-Eval for over a year, we've integrated 100+ tasks and 30+ models across images, videos, and audio. Throughout this journey, we've grown increasingly aware that evaluation itself deserves the same rigor we demand from the models we evaluate. A benchmark that cannot reliably indicate a model's true capabilities is not just unhelpful; it actively misleads research directions.

This realization drives v0.6: a re-architecture designed to make evaluation fast enough to iterate, rigorous enough to trust, and challenging enough to matter.

Building on lmms-eval's existing framework, v0.6 transforms evaluation from a one-off script into a production-ready evaluation system. This enables two critical workflows:

During training: Evaluation runs as a standalone service, decoupled from the training loop. Submit checkpoints for async evaluation without blocking GPU training.
Post-training: Rapid, comprehensive evaluation across all modalities with statistical guarantees on the results.

From the engineering side, v0.6 also ships a substantial API-throughput upgrade. With the latest API control-path updates (adaptive concurrency, refill scheduling, prefix-aware queueing, and retry/backoff decoupling), we observe about 7.5x throughput improvement on a fixed LIMIT=100 benchmark (0.3278 -> 2.4584 req/s), while preserving metric outputs for the same task/model setup.

Area	Key Features
Performance	Fully async and decoupled inference; adaptive API concurrency control; prefix-aware queueing; measured ~7.5x throughput gain on the API benchmark path
Evaluation as a Service	Async job submission without blocking GPU training; separately hosted eval service on dedicated GPUs
Statistical Rigor	Confidence intervals, clustered standard errors, baseline-anchored paired comparison
Frontier Evaluation	Long video, spatial intelligence, and agentic scenarios

1. Architecture

1.1 Evaluation Pipeline

v0.6 defines evaluation as three decoupled components:

┌─────────┐      ┌─────────┐      ┌─────────┐
│  Input  │ ───► │  Model  │ ───► │  Judge  │
└─────────┘      └─────────┘      └─────────┘

Component	Contents	Output
Input	Multimodal data + question + ground truth	`Instance` objects
Model	LMM inference (local or API)	Generated responses
Judge	Metric computation (exact match, LLM judge, etc.)	Scores

Async Pipeline with Cache

Both stages (Input -> Model and Model -> Judge) run asynchronously, with model outputs cached in between:

graph LR
    Input[Input]
    Model[Model]
    Storage[Storage]
    Judge[Judge]
    
    Input -- async --> Model
    Model -- "cache" --> Storage
    Storage -- async --> Judge

    Key["Cache key: hash(input + config + git_commit)"]
    
    Model -.-> Key

Cache System

Avoid redundant inference when the same evaluation has been run before:

Cache Key Component	Purpose
Input hash	Same dataset + question
Config hash	Same generation parameters (temp, max_tokens, etc.)
Git commit	Same model code version

Cache hit -> skip inference, reuse stored outputs. Cache miss -> run inference, store results.
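The hit/miss logic above can be sketched as follows. This is a minimal illustration, assuming the three key components are serialized to JSON and hashed with SHA-256; `make_cache_key`, `cached_generate`, and the in-memory `cache` dict are hypothetical names, not the actual lmms-eval implementation.

```python
import hashlib
import json


def make_cache_key(doc: dict, gen_kwargs: dict, git_commit: str) -> str:
    """Hypothetical helper: hash input, generation config, and code version."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"input": doc, "config": gen_kwargs, "git_commit": git_commit},
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache: dict[str, str] = {}  # stand-in for persistent storage


def cached_generate(doc, gen_kwargs, git_commit, run_inference):
    """Return the stored output on a hit; otherwise run inference and store it."""
    key = make_cache_key(doc, gen_kwargs, git_commit)
    if key not in cache:  # cache miss
        cache[key] = run_inference(doc, gen_kwargs)
    return cache[key]
```

Changing any one component (for example the git commit) produces a different key, so stale outputs are never reused across code versions.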

Benefits:

No redundant computation: identical runs return cached results instantly
Crash recovery: resume from cached outputs without re-inference
Resource separation: Model (GPU) and Judge (API) can run on different machines
Reproducibility: cache key ensures exact same conditions

Stage 1: Input -> Model

Prefix Cache Optimization

vLLM/SGLang reuse KV cache for shared prefixes. We cluster inputs by media to maximize hits:

Prefix clustering: Group questions by shared image/video
Length sorting: Similar lengths in same batch
Chunked prefill: Split long prompts into chunks to reduce TTFT and avoid OOM
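The first two steps can be sketched as a simple reordering pass. This is an illustration only: the `media_id` and `prompt` fields and the function name are assumptions, and chunked prefill is left to the serving engine.

```python
from collections import defaultdict


def order_for_prefix_cache(requests: list[dict]) -> list[dict]:
    """Group requests by shared media, then sort each group by prompt length."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for req in requests:
        groups[req["media_id"]].append(req)  # "media_id" is a hypothetical field

    ordered: list[dict] = []
    for media_id in groups:  # dicts keep the order in which media first appeared
        ordered.extend(sorted(groups[media_id], key=lambda r: len(r["prompt"])))
    return ordered


requests = [
    {"media_id": "video_1", "prompt": "What happens at the end?"},
    {"media_id": "video_2", "prompt": "Who speaks first?"},
    {"media_id": "video_1", "prompt": "How many people?"},
]
ordered = order_for_prefix_cache(requests)
print([r["media_id"] for r in ordered])  # ['video_1', 'video_1', 'video_2']
```

Requests that share a video end up adjacent, so the engine can reuse the KV cache for the shared media prefix.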

Stage 2: Model -> Judge

Persist-First Strategy: Save model outputs to disk immediately, then score asynchronously.
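A minimal sketch of the persist-first idea, assuming outputs are written as JSONL and scored with exact match. The file layout and both function names are hypothetical.

```python
import json
from pathlib import Path


def persist_outputs(path: Path, doc_ids: list[str], responses: list[str]) -> None:
    """Append model outputs to a JSONL file before any scoring happens."""
    with path.open("a", encoding="utf-8") as f:
        for doc_id, response in zip(doc_ids, responses):
            f.write(json.dumps({"doc_id": doc_id, "response": response}) + "\n")


def score_from_disk(path: Path, answers: dict[str, str]) -> float:
    """Score later (possibly on another machine) using only the saved file."""
    with path.open(encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    correct = sum(row["response"].strip() == answers[row["doc_id"]] for row in rows)
    return correct / len(rows)
```

Because scoring reads only the file, a judge failure or crash does not require re-running inference.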

1.2 Model Interface & Backends

Decoupled Design

v0.6 separates evaluation logic from model inference. Models implement a standardized LMM abstract base class:

from abc import ABC, abstractmethod

from lmms_eval.api.instance import Instance


class LMM(ABC):
    @abstractmethod
    def generate_until(self, requests: list[Instance]) -> list[str]:
        """Batched text generation."""

    @abstractmethod
    def loglikelihood(self, requests: list[Instance]) -> list[float]:
        """Token-level log probabilities for multiple-choice ranking."""

The Instance object carries:

media_list: Raw images, video paths, or audio buffers
prompt: Text prompt
gen_kwargs: Generation parameters

Supported Backends

Backend	Example
vLLM	`python -m lmms_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct`
SGLang	`python -m lmms_eval --model sglang --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct`
API Models	OpenAI, Anthropic, Groq, etc. - see API Concurrency & Throughput

Unified API Model Interfaces

v0.6 unifies API-backed model evaluation under two interfaces:

Interface	`--model` name	Backend	Recommended
Async	`async_openai`	`asyncio` + `AsyncOpenAI`	Yes
Sync	`openai`	`ThreadPoolExecutor` + `OpenAI`	Fallback

We recommend async_openai for all API-backed evaluation — it uses native async I/O and achieves significantly higher throughput.

Both resolve to chat mode by default via Model Registry V2. The simple mode (doc_to_visual + doc_to_text) is deprecated and will be removed in a future release.

Naming change in v0.6: the canonical model names have been shortened from openai_compatible / async_openai_compatible to openai / async_openai. These are the names used in filenames, registry keys, and @register_model decorators. The old names (openai_compatible, openai_compatible_chat, async_openai_compatible, async_openai_compatible_chat) continue to work as aliases via MODEL_ALIASES in __init__.py, so existing scripts are not affected.

Model Registry V2

v0.6 introduces ModelRegistryV2 — a unified model registry that replaces the previous ad-hoc import system. All model names (--model X) resolve through a single path.

How it works

Two dicts in lmms_eval/models/__init__.py declare available models:

AVAILABLE_SIMPLE_MODELS: maps model_id -> ClassName for simple (legacy) models in models/simple/
AVAILABLE_CHAT_TEMPLATE_MODELS: maps model_id -> ClassName for chat models in models/chat/

At startup, the registry merges both dicts into ModelManifest objects. Each manifest holds a model_id and up to two class paths (simple + chat). Class paths are auto-constructed: lmms_eval.models.{type}.{model_id}.{ClassName}, so the dict key must match the filename.

Resolution: chat is always preferred over simple (unless force_simple=True). This means --model openai transparently resolves to the chat implementation.

Aliasing: backward-compatible names are supported via MODEL_ALIASES in __init__.py and via ModelManifest.aliases. Old names like openai_compatible, openai_compatible_chat, async_openai_compatible, and async_openai_compatible_chat continue to work.
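The merge, resolution, and aliasing rules above can be sketched as follows. This is a simplified stand-in, not the real ModelRegistryV2: the manifest field names, `build_manifests`, `resolve`, and the class names `OpenAISimple` / `OpenAIChat` are all placeholders for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelManifest:
    """Simplified stand-in: one model_id with up to two class paths."""
    model_id: str
    simple_class_path: Optional[str] = None
    chat_class_path: Optional[str] = None


def build_manifests(simple: dict[str, str], chat: dict[str, str]) -> dict[str, ModelManifest]:
    """Merge both dicts; paths follow lmms_eval.models.{type}.{model_id}.{ClassName}."""
    manifests: dict[str, ModelManifest] = {}
    for kind, table in (("simple", simple), ("chat", chat)):
        for model_id, class_name in table.items():
            m = manifests.setdefault(model_id, ModelManifest(model_id))
            setattr(m, f"{kind}_class_path", f"lmms_eval.models.{kind}.{model_id}.{class_name}")
    return manifests


def resolve(name, manifests, aliases, force_simple=False) -> str:
    """Map an alias to its canonical id, then prefer chat unless force_simple is set."""
    m = manifests[aliases.get(name, name)]
    if force_simple and m.simple_class_path:
        return m.simple_class_path
    return m.chat_class_path or m.simple_class_path


# Class names below are placeholders, not the real lmms-eval class names.
manifests = build_manifests({"openai": "OpenAISimple"}, {"openai": "OpenAIChat"})
aliases = {"openai_compatible": "openai"}
print(resolve("openai_compatible", manifests, aliases))
# lmms_eval.models.chat.openai.OpenAIChat
```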

Simple mode deprecation: the simple model interface (doc_to_visual + doc_to_text) for API models is deprecated. New integrations should always use chat (doc_to_messages + ChatMessages). The simple implementations in models/simple/openai.py will be removed in a future release.

1.3 API Concurrency & Throughput

v0.6 adds adaptive concurrency for API-backed evaluation (async_openai, openai).

Adaptive Concurrency Control

The controller continuously adjusts in-flight request count using three online signals:

request failure rate
rate-limit hit rate (e.g., 429 / throttling)
p95 latency against a target latency budget

Execution also uses refill-style scheduling (no full-window barrier), so completed requests immediately release slots for new work.

For API models with repeated prompt prefixes, v0.6 also supports prefix-aware queueing to improve prefill-cache hit opportunities by dispatching same-prefix requests close together.

Control	Meaning
`adaptive_concurrency`	Enable/disable adaptive mode
`adaptive_min_concurrency`	Lower bound for concurrency
`adaptive_max_concurrency`	Upper bound for concurrency
`adaptive_target_latency_s`	Target p95 latency budget
`adaptive_increase_step`	Additive growth step when healthy
`adaptive_decrease_factor`	Multiplicative decrease on pressure
`adaptive_failure_threshold`	Failure-rate threshold for backoff
`retry_backoff_s`	Retry sleep interval (separate from request timeout)
`prefix_aware_queue`	Group dispatch by prefix hash
`prefix_hash_chars`	Prefix length used for hashing
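To make the additive-increase / multiplicative-decrease behavior concrete, here is a minimal sketch of a single controller update. It is not the shipped controller: it assumes the increase step is a fraction of the current limit (at least +1), treats any rate-limit hit as pressure, and omits hysteresis and minimum-sample thresholds. `AdaptiveConfig` and `next_concurrency` are hypothetical names; the defaults mirror the example commands in this section.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveConfig:
    min_concurrency: int = 1
    max_concurrency: int = 64
    target_latency_s: float = 15.0
    increase_step: float = 0.15
    decrease_factor: float = 0.75
    failure_threshold: float = 0.05


def next_concurrency(current: int, failure_rate: float, rate_limit_rate: float,
                     p95_latency_s: float, cfg: AdaptiveConfig) -> int:
    """One controller update: back off under pressure, otherwise grow slowly."""
    under_pressure = (
        failure_rate > cfg.failure_threshold
        or rate_limit_rate > 0.0
        or p95_latency_s > cfg.target_latency_s
    )
    if under_pressure:
        proposed = int(current * cfg.decrease_factor)  # multiplicative decrease
    else:
        # Assumption: the step is a fraction of the current limit, at least +1.
        proposed = current + max(1, round(current * cfg.increase_step))
    return max(cfg.min_concurrency, min(cfg.max_concurrency, proposed))


cfg = AdaptiveConfig()
print(next_concurrency(16, 0.0, 0.0, 8.0, cfg))   # healthy: 16 -> 18
print(next_concurrency(16, 0.10, 0.0, 8.0, cfg))  # failures: 16 -> 12
```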

Example (sync API backend):

python -m lmms_eval \
  --model openai \
  --model_args model_version=<model>,num_concurrent=16,adaptive_concurrency=true,adaptive_min_concurrency=1,adaptive_max_concurrency=64,adaptive_target_latency_s=15.0,adaptive_increase_step=0.15,adaptive_decrease_factor=0.75,adaptive_failure_threshold=0.05,retry_backoff_s=1.0,prefix_aware_queue=true,prefix_hash_chars=256

Example (async API backend, recommended):

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=<model>,num_cpus=16,adaptive_concurrency=true,adaptive_min_concurrency=1,adaptive_max_concurrency=64,adaptive_target_latency_s=15.0,adaptive_increase_step=0.15,adaptive_decrease_factor=0.75,adaptive_failure_threshold=0.05,retry_backoff_s=1.0,prefix_aware_queue=true,prefix_hash_chars=256

Performance Snapshot

To make the performance claim auditable, we keep a concrete benchmark trail in this repo:

Historical comparison file: logs/openrouter_molmo_throughput/throughput_comparison.csv
Latest-vs-previous comparison file: logs/openrouter_molmo_throughput/throughput_comparison_latest_vs_prev.csv

Benchmark setup used for this snapshot:

Task: mme
Limit: 100
Model backend: openai / async_openai API path
API endpoint family: OpenRouter-compatible
Model: bytedance-seed/seed-1.6-flash
Baseline control: static single concurrency (num_concurrent=1)
Latest control: adaptive + refill scheduling + prefix-aware queueing + explicit retry backoff

Result summary (requests_per_sec):

Run Type	Concurrency	RPS	Wall Time (s)	Relative to Baseline
baseline	1	0.327836	305.030740	1.00x
static	24	1.926987	51.894473	5.88x
adaptive (v1)	16	2.404706	41.585121	7.33x
adaptive (v2)	16	2.458435	40.676279	7.50x

Interpretation:

The latest API control path reaches about 7.5x throughput over baseline on the same LIMIT=100 setup.
Compared to the previous adaptive run (v1), the latest adaptive run (v2) is slightly faster (2.4047 -> 2.4584 req/s, +2.23%). This is a small delta measured in a noisy environment (shared network + provider-side scheduling), so the right takeaway is not "a new ceiling", but "less overhead and better utilization under the same constraints."
The core point: this speedup is not from changing benchmark difficulty. We keep the same task (mme), model (bytedance-seed/seed-1.6-flash), limit (100), and evaluation prompts/settings. The gain comes from changes in the API request scheduling/control path.
What adaptive (v2) means in practice:
- Refill scheduling (no window barrier): maintain a steady pool of in-flight requests and immediately dispatch new work as soon as a request completes. This reduces idle gaps and prevents the slowest request in a window from gating progress.
- Rolling controller updates: adjust concurrency based on a rolling batch of completions (failure rate, rate-limit hits, and p95 latency vs target) rather than only after fixed windows. This makes the controller more responsive and less sensitive to outliers.
- Hysteresis for stability: use separate "reduce" vs "increase" conditions (and minimum sample thresholds) to avoid oscillating on a single transient 429 or a brief latency spike.
- Retry/backoff decoupling: retry_backoff_s is explicitly separate from request timeout, so retries don't sleep for long timeouts and tie up worker slots.
- Prefix-aware queueing (when enabled): reorder dispatch by prefix hash so same-prefix requests are sent close together, improving prefill-cache hit opportunities on providers that support prefix caching. (Some routing layers may dilute this benefit; the mechanism is still safe.)
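Refill scheduling in particular is easy to illustrate with `asyncio`. The sketch below is an assumption about the general shape of such a loop, not the code used in the API backends; `run_with_refill` and `fake_request` are hypothetical names.

```python
import asyncio


async def run_with_refill(items, worker, concurrency: int):
    """Keep up to `concurrency` requests in flight; refill a slot as soon as one finishes."""
    results = [None] * len(items)
    pending = set()
    queue = iter(enumerate(items))

    async def run_one(index, item):
        results[index] = await worker(item)

    exhausted = False
    while True:
        # Top up the pool without waiting for the whole window to finish.
        while not exhausted and len(pending) < concurrency:
            try:
                index, item = next(queue)
            except StopIteration:
                exhausted = True
                break
            pending.add(asyncio.create_task(run_one(index, item)))
        if not pending:
            break
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            task.result()  # re-raise worker exceptions instead of dropping them
    return results


async def fake_request(x):
    await asyncio.sleep(0.01 * (x % 3))  # stand-in for an API call
    return x * 2


print(asyncio.run(run_with_refill(list(range(6)), fake_request, concurrency=2)))
# [0, 2, 4, 6, 8, 10]
```

Results are written by index, so output order matches input order even though requests complete out of order.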

1.4 Data Layer

Storage Format

Format	Random Access	Media Handling	Use in v0.6
JSON/Files	O(N) scan	External files	Legacy
Parquet	Row-group decompress	Binary blobs	Primary format

Parquet: Stores task metadata (questions, answers, splits). It supports projection pushdown, so a reader loads only the columns it needs. It is also the format Hugging Face Datasets writes when a dataset is pushed to the Hub.

TODO: Optimize for high-throughput multimodal access (e.g., Lance or similar columnar storage for images/videos).

1.5 Evaluation as a Service

To integrate evaluation into training workflows, v0.6 provides a disaggregated HTTP service architecture. Implementation: lmms-engine#127

┌─────────────────┐          ┌─────────────────┐           ┌─────────────────┐
│  Training Loop  │ ──POST──▶│   Eval Server   │ ──queue──▶│   Job Worker    │
│   (any host)    │◀──poll── │   (FastAPI)     │◀──result──│   (GPU node)    │
└─────────────────┘          └─────────────────┘           └─────────────────┘

Key benefit: Training continues while evaluation runs asynchronously on separate resources.

Server API

Endpoint	Method	Description
`/evaluate`	POST	Submit evaluation job (model, tasks, config)
`/jobs/{job_id}`	GET	Query job status and results
`/queue`	GET	View pending/running/completed jobs
`/tasks`	GET	List available evaluation tasks
`/models`	GET	List supported model backends

Client Usage

from lmms_eval.entrypoints import EvalClient

client = EvalClient("http://eval-server:8000")

# Submit evaluation (non-blocking)
job = client.evaluate(
    model="qwen2_5_vl",
    tasks=["mmmu_val", "mme"],
    model_args={"pretrained": "Qwen/Qwen2.5-VL-7B-Instruct"},
)

# Continue training...
# Later, retrieve results
result = client.wait_for_job(job["job_id"])

The server uses a JobScheduler that queues requests and processes them sequentially, ensuring proper GPU resource management without conflicts.
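The queue-then-process-sequentially behavior can be sketched with a toy scheduler. This is not the real JobScheduler: the class name, job fields, and status strings are assumptions chosen for illustration.

```python
import uuid
from collections import deque


class SequentialScheduler:
    """Toy stand-in for a job scheduler: FIFO queue, one job at a time."""

    def __init__(self, run_job):
        self._run_job = run_job
        self._pending = deque()
        self.jobs = {}

    def submit(self, request: dict) -> str:
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"status": "queued", "request": request, "result": None}
        self._pending.append(job_id)
        return job_id  # returned immediately; the caller polls later

    def run_next(self) -> bool:
        """Process the oldest queued job; return False when the queue is empty."""
        if not self._pending:
            return False
        job = self.jobs[self._pending.popleft()]
        job["status"] = "running"
        try:
            job["result"] = self._run_job(job["request"])
            job["status"] = "completed"
        except Exception as exc:  # record the failure instead of killing the worker
            job["result"] = repr(exc)
            job["status"] = "failed"
        return True


scheduler = SequentialScheduler(lambda req: {"tasks": req["tasks"]})
job_id = scheduler.submit({"model": "qwen2_5_vl", "tasks": ["mme"]})
scheduler.run_next()
print(scheduler.jobs[job_id]["status"])  # completed
```

Running one job at a time means two evaluations never compete for the same GPUs, at the cost of queueing delay when many checkpoints are submitted at once.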

2. Statistical Analysis

2.1 Why Statistical Analysis?

Current leaderboards rank models by mean accuracy without uncertainty quantification. As Anthropic's research demonstrates, this is fundamentally flawed:

Problem	Example	Consequence
Scores are estimates, not truth	85% on 1000 questions ≠ 85% true capability	False confidence in rankings
Small differences are noise	85.2% vs 85.5% on 1000 questions is well inside either score's ±2.2-point 95% CI	Wasted effort chasing noise
Correlated questions inflate precision	10 questions per video ≠ 10 independent samples	Underestimated uncertainty

The fix: Treat evaluation as a sampling experiment. Report confidence intervals. Use clustered standard errors. Compare models with paired tests.

What v0.6 Adds

Feature	Purpose
Standard Error (SE)	Quantify uncertainty of single model score
Confidence Intervals	Report score ± margin, not point estimate
Clustered SE	Correct for correlated questions (same video/image)
Paired Comparison	Detect small differences by removing question difficulty variance
Model Stability	Measure inherent variance under standard settings

2.2 Standard Error Estimation

Independent Samples

For binary metrics (pass/fail), SE simplifies to:

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

where $p$ = accuracy, $n$ = number of questions.

Key insight: SE ∝ $1/\sqrt{n}$. To halve uncertainty -> 4× more questions.

Output format: score ± 1.96 × SE (95% CI)
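As a worked example of the formula above, the following sketch computes the SE and 95% CI for 85% accuracy on 1000 questions. The function name `binomial_ci` is illustrative only.

```python
import math


def binomial_ci(p: float, n: int, z: float = 1.96):
    """Standard error and normal-approximation CI for accuracy p on n questions."""
    se = math.sqrt(p * (1 - p) / n)
    return se, (p - z * se, p + z * se)


se, (low, high) = binomial_ci(0.85, 1000)
print(f"85.0% ± {100 * 1.96 * se:.1f} pts  (SE = {100 * se:.2f} pts)")
# 85.0% ± 2.2 pts  (SE = 1.13 pts)
```

Quadrupling the benchmark to 4000 questions halves the SE, which is the $1/\sqrt{n}$ scaling noted above.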

Clustered Samples

Problem: Multiple questions per video/image are correlated, not independent.

Solution: Specify cluster_key -> system applies cluster-robust SE correction.

task: videomme
cluster_key: video_id

Clustered SE can be 3× larger than naive estimates.
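A minimal sketch of a cluster-robust SE for the mean score, assuming the basic estimator without a small-sample correction: residuals are summed within each cluster before squaring, i.e. $SE_{clustered} = \frac{1}{n}\sqrt{\sum_c \left(\sum_{i \in c} (s_i - \bar{s})\right)^2}$. The function names are illustrative, and the exact correction lmms-eval applies may differ.

```python
import math
from collections import defaultdict


def naive_se(scores):
    """SE of the mean treating every question as independent."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / n) / math.sqrt(n)


def clustered_se(scores, cluster_ids):
    """Cluster-robust SE of the mean: sum residuals within each cluster, then square."""
    n = len(scores)
    mean = sum(scores) / n
    residual_sums = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        residual_sums[c] += s - mean
    return math.sqrt(sum(r ** 2 for r in residual_sums.values())) / n


# Toy data: 4 videos, 3 questions each, answers perfectly correlated within a video.
scores = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
videos = ["v1"] * 3 + ["v2"] * 3 + ["v3"] * 3 + ["v4"] * 3
print(round(naive_se(scores), 3), round(clustered_se(scores, videos), 3))
# 0.144 0.25
```

When every question is its own cluster, the two estimates coincide; the gap grows with within-cluster correlation.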

2.3 Model Comparison

Problem: Checking if confidence intervals overlap is low-power.

Solution: Paired test: compute the per-question difference $d_i = s_{A,i} - s_{B,i}$, where $s_{A,i}$ and $s_{B,i}$ are the two models' scores on question $i$, then test whether mean($d$) ≠ 0.

Why: Removes question difficulty variance (dominant noise), isolates model difference signal.
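The paired test can be sketched with the standard library alone. This version assumes a two-sided z-test on the per-question differences (reasonable for large n) and independent questions; with clustered questions the SE would need the cluster-robust form. `paired_z_test` is a hypothetical name and the data are synthetic.

```python
import math


def paired_z_test(scores_a, scores_b):
    """Mean per-question difference, its SE, and a two-sided normal p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)
    if se == 0:
        raise ValueError("all per-question differences are identical")
    z = mean_diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return mean_diff, se, p_value


# Synthetic example: 200 questions, both right on 150, only A right on 20,
# only B right on 5, both wrong on 25.
scores_a = [1] * 150 + [1] * 20 + [0] * 5 + [0] * 25
scores_b = [1] * 150 + [0] * 20 + [1] * 5 + [0] * 25
mean_diff, se, p_value = paired_z_test(scores_a, scores_b)
print(f"mean_diff={mean_diff:+.3f}, se={se:.4f}, p={p_value:.4f}")
```

Questions both models answer identically contribute zero difference, which is why the paired SE is much smaller than the SE of either score alone.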

Baseline-Anchored Evaluation

A practical application of paired comparison: anchor evaluations to a standard baseline model (e.g., Gemini 3.0 Pro).

Approach	Report	Limitation
Absolute score	"Our model: 78.3%"	Meaningless without context
Leaderboard rank	"#3 on MMMU"	Rank doesn't quantify gap
Paired difference	"+2.1% vs Gemini 3.0 Pro (p<0.01)"	Statistically grounded claim

Benefits:

Reproducible claims: "We beat baseline X by Y%" is verifiable
Training signal: Track improvement over baseline across checkpoints
Publication-ready: Statistical significance replaces hand-waving

# Example: Compare your model against Gemini 3.0 Pro baseline
results = paired_comparison(
    model_a="your_model",
    model_b="gemini-3.0-pro",  # Standard baseline
    tasks=["mmmu_val", "mathvista"],
)
# Output: mean_diff=+2.1%, CI=[+0.8%, +3.4%], p=0.003

2.4 Power Analysis

Purpose: Minimum sample size to detect a given effect (e.g., 2% improvement).

Rule of thumb: Reliable benchmarks need $n > 1000$ questions.
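A rough sample-size calculation, assuming a two-sided paired z-test and a known standard deviation of per-question differences (`sd_diff`, which in practice would be estimated from a pilot run). The function name and the example values are assumptions for illustration.

```python
import math
from statistics import NormalDist


def required_questions(effect: float, sd_diff: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Questions needed for a two-sided paired z-test to detect `effect`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sd_diff / effect) ** 2)


# Example: detect a 2-point gain when per-question differences have sd 0.35.
n_needed = required_questions(effect=0.02, sd_diff=0.35)
print(n_needed)
```

The required n scales with $1/\text{effect}^2$, so detecting a 1-point gain needs roughly four times as many questions as a 2-point gain.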

2.5 Model Stability Measurement

Why Measure Variance?

Two models with 80% accuracy can behave differently:

Model A: Answers consistently (same questions right/wrong each run)
Model B: Answers randomly (different results each run)

Model A is more reliable. Model B's accuracy is "luck."

Goal: Measure model's inherent stability under standard settings (temp=0.7).

Law of Total Variance

$$Var(Score) = \underbrace{Var_{within}}_{\text{Model instability}} + \underbrace{Var_{between}}_{\text{Question difficulty}}$$

The first term measures model stability — lower is better.

Protocol

Run N samples per question (temp=0.7), report:

Metric	Meaning
Expected Accuracy (EA)	Mean accuracy across all N samples
Consensus Accuracy (CA)	Accuracy after majority vote
Internal Variance (IV)	Model instability — lower is better
Consistency Rate (CR)	% questions with same answer across N runs
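The four metrics can be computed from N sampled answers per question. The sketch below makes two assumptions that are not confirmed by this document: IV is taken as the mean within-question variance of the 0/1 correctness score, and CR counts questions whose N answers are all identical. `stability_metrics` is a hypothetical name.

```python
from collections import Counter


def stability_metrics(samples: list[list[str]], gold: list[str]) -> dict:
    """samples[q] holds the N sampled answers for question q."""
    ea_terms, iv_terms, consensus_hits, consistent = [], [], 0, 0
    for answers, truth in zip(samples, gold):
        p = sum(a == truth for a in answers) / len(answers)  # per-question accuracy
        ea_terms.append(p)
        iv_terms.append(p * (1 - p))  # within-question variance of a 0/1 score
        majority, _ = Counter(answers).most_common(1)[0]
        consensus_hits += majority == truth
        consistent += len(set(answers)) == 1
    q = len(gold)
    return {
        "EA": sum(ea_terms) / q,
        "CA": consensus_hits / q,
        "IV": sum(iv_terms) / q,
        "CR": consistent / q,
    }


samples = [["A", "A", "A"], ["B", "C", "B"], ["D", "D", "D"]]
gold = ["A", "B", "C"]
metrics = stability_metrics(samples, gold)
print({k: round(v, 3) for k, v in metrics.items()})
```

In this toy data the third question is answered consistently but incorrectly, which corresponds to zero internal variance with a wrong answer.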

Example Output

┌─────────────┬─────┬─────┬───────┐
│ Model       │ EA  │ CA  │  IV   │
├─────────────┼─────┼─────┼───────┤
│ Model A     │ 80% │ 82% │ 0.05  │  ← Stable
│ Model B     │ 80% │ 81% │ 0.15  │  ← Unstable
└─────────────┴─────┴─────┴───────┘

Same expected accuracy, but Model A's internal variance is 3× lower (0.05 vs 0.15).

Question-Level Diagnostics

Pattern	Possible Cause
High IV across all models	Ambiguous question
High IV for one model	Model-specific weakness
Zero IV, always wrong	Confidently wrong knowledge

3. Evaluating Multimodal Models in 2026 (TODO)

Note: This section outlines planned evaluation features. Implementation is in progress.

3.1 More features are expected

Static image QA benchmarks are saturating. Building multimodal systems requires setting more challenging tasks and evaluating them in more realistic scenarios:

Capability	Challenge	Current Gap
Long video understanding	10min+ videos, 1000+ frames	Most benchmarks use <128 frames
High motion	Object motion at 30fps	Sparse sampling loses fine-grained actions
Spatial reasoning	3D world understanding	2D perception ≠ physical grounding
Agentic interaction	Multi-step task execution and feedback	Static QA can't measure planning/tool use well

Key insight: These capabilities require in-environment evaluation, where the model must interact with simulators, receive feedback, and adapt. Static input-output pairs cannot capture this.

3.2 Long Video & High Frame Rate

The Scale Problem

Scenario	Frames	Tokens (est.)	Challenge
1min video @ 1fps	60	~60K	Fits context
10min video @ 1fps	600	~600K	Exceeds most context windows
1min video @ 30fps	1800	~1.8M	Memory explosion
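The token estimates in the table follow from simple arithmetic. The sketch below assumes roughly 1,000 visual tokens per frame, which is the ratio implied by the table (60 frames ≈ 60K tokens); real models vary widely with resolution and token compression.

```python
TOKENS_PER_FRAME = 1_000  # assumption implied by the table, not a measured value


def estimate_visual_tokens(duration_s: float, fps: float,
                           tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Frames sampled over the clip times the per-frame token cost."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame


for label, duration_s, fps in [
    ("1 min @ 1 fps", 60, 1),
    ("10 min @ 1 fps", 600, 1),
    ("1 min @ 30 fps", 60, 30),
]:
    print(label, estimate_visual_tokens(duration_s, fps))
```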

Streaming metrics:

Event detection latency: Time from event occurrence to model detection
Memory efficiency: Performance vs. KV cache size
Graceful degradation: Accuracy when forced to evict old context

3.3 Spatial Intelligence

Spatial intelligence benchmarks are needed to evaluate the model's ability to reason in real-world scenarios.

3.4 Agentic Evaluation in Simulators

Interaction-based simulators are needed to evaluate agentic capabilities, instead of relying on static benchmarks.

4. Migration Notes

4.1 OpenAI Judge Model Deprecation

Many tasks in lmms-eval use OpenAI models as LLM judges for scoring (e.g., evaluating free-form answers, computing GPT-based metrics). These judge model names are hardcoded as defaults in individual task utils.py files and YAML configs.

As OpenAI deprecates older model versions, users may need to switch to newer models. Below is the official deprecation timeline and our recommended migration mapping.

Official Deprecation Timeline

Source: OpenAI Deprecations (last checked: 2026-02-18)

GPT-3.5 series

Model	Shutdown Date	Status	Recommended Replacement
`gpt-3.5-turbo-0301`	2024-09-13	Already shut down	`gpt-3.5-turbo`
`gpt-3.5-turbo-0613`	2024-09-13	Already shut down	`gpt-3.5-turbo`
`gpt-3.5-turbo-16k-0613`	2024-09-13	Already shut down	`gpt-3.5-turbo`
`gpt-3.5-turbo-instruct`	2026-09-28	Deprecated	`gpt-5-mini` or `gpt-4.1-mini`
`gpt-3.5-turbo-1106`	2026-09-28	Deprecated	`gpt-5-mini` or `gpt-4.1-mini`

GPT-4 series

Model	Shutdown Date	Status	Recommended Replacement
`gpt-4-vision-preview`	2024-12-06	Already shut down	`gpt-4o`
`gpt-4-1106-vision-preview`	2024-12-06	Already shut down	`gpt-4o`
`gpt-4-32k` / `gpt-4-32k-0613`	2025-06-06	Already shut down	`gpt-4o`
`gpt-4-32k-0314`	2025-06-06	Already shut down	`gpt-4o`
`gpt-4-0314`	2026-03-26	Deprecated	`gpt-5` or `gpt-4.1`
`gpt-4-1106-preview`	2026-03-26	Deprecated	`gpt-5` or `gpt-4.1`
`gpt-4-0125-preview`	2026-03-26	Deprecated	`gpt-5` or `gpt-4.1`
`gpt-4.5-preview`	2025-07-14	Already shut down	`gpt-4.1`

GPT-4o series (audio/realtime variants)

Model	Shutdown Date	Status	Recommended Replacement
`gpt-4o-audio-preview-2024-10-01`	2025-10-10	Already shut down	`gpt-audio`
`chatgpt-4o-latest`	2026-02-17	Already shut down	`gpt-5.1-chat-latest`
`gpt-4o-audio-preview`	2026-03-24	Deprecated	`gpt-audio`
`gpt-4o-mini-audio-preview`	2026-03-24	Deprecated	`gpt-audio-mini`
`gpt-4o-realtime-preview`	2026-03-24	Deprecated	`gpt-realtime`
`gpt-4o-mini-realtime-preview`	2026-03-24	Deprecated	`gpt-realtime-mini`

Note: gpt-4o and gpt-4o-mini (the base chat models) do not have announced deprecation dates as of 2026-02-18. They remain current models.

Recommended Migration Mapping for lmms-eval

When a judge model used by a task reaches its shutdown date, override it with a replacement:

Current Default in Code	Recommended Replacement	Notes
`gpt-4o` / `gpt-4o-2024-*`	`gpt-5-mini`	Cost-efficient GPT-5 variant, direct successor
`gpt-4o-mini`	`gpt-5-nano`	Cheapest GPT-5 variant
`gpt-3.5-turbo` / `gpt-3.5-turbo-*`	`gpt-5-nano`	Legacy model, cheapest replacement
`gpt-4o-audio-preview`	`gpt-audio`	Dedicated audio model (shutdown 2026-03-24)
`gpt-4.1` / `gpt-4.1-mini`	`gpt-5-mini` / `gpt-5-nano`	GPT-5 series supersedes 4.1

Why we keep the old defaults in code: Task configs retain their original model references to preserve reproducibility of published results. Changing defaults would silently alter scoring behavior for existing benchmarks.

How to override: Most tasks read the judge model from environment variables. Set MODEL_VERSION (or the task-specific variable) to use a newer model:

export MODEL_VERSION="gpt-5-mini"
python -m lmms_eval --model qwen2_5_vl --tasks mme --limit 5

Some tasks use different environment variable names (e.g., BABYVISION_MODEL_NAME, VIESCORE_MODEL_NAME, JUDGE_MODEL_NAME). Check the relevant task's utils.py for the exact variable name.

References

Core: Statistical Evaluation Framework

A Statistical Approach to Model Evaluations. Miller, Anthropic, 2024.

The theoretical foundation for Section 2. Key contributions:

Treating evaluations as sampling experiments
Standard errors and confidence intervals for LLM benchmarks
Clustered standard errors for correlated questions
Paired difference analysis for model comparison

Paper: Adding Error Bars to Evals (arXiv)

Evaluation Frameworks

LMMs-Eval — Base framework
FlagEvalMM — Multimodal evaluation

Data Formats

Lance Format — Columnar storage for ML

Benchmarks

LLM Agent Evaluation Survey
StreamingBench — Video streaming
VSI-Bench — Spatial intelligence
