Ablate-to-Validate:
Are Vision-Language Models Really Using Visual Reasoning Tokens?

Tianyi Zhang*  ·  Mahtab Bigverdi*  ·  Ranjay Krishna
*Equal contribution
University of Washington

Abstract

Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support “visual reasoning.” Improved task accuracy alone, however, does not show that models actually use these tokens for reasoning. Gains may instead arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized. We instantiate this principle as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT measures whether performance depends on the information carried by latent tokens rather than their mere presence. It (1) probes span-position bias through zero and random replacement, (2) disentangles token-budget from token-diversity effects through first-repeat and count-matched variants, and (3) evaluates information content using oracle (ground-truth) token injection together with distribution-matched random baselines. As a controlled testbed, we study relative depth reasoning, where continuous depth embeddings can be inserted at explicit span positions under a fixed token budget. We train LLaVA and Qwen2.5-VL models both to predict and to consume these tokens, and show that TRT applies across heterogeneous latent-token model backbones. Concretely, we cover two trained backbones (LLaVA-13B, Qwen2.5-VL-3B) with continuous and discrete depth spans across three frozen visual encoders (SigLIP2, CLIP, DINOv2) and multiple token budgets, and additionally apply TRT to three off-the-shelf visual reasoning systems (Mirage, Mull-Tokens, CoVT) evaluated on BLINK, VSP, and CV-Bench.
Our results show that accuracy gains can be a misleading proxy for latent-token reasoning: across multiple model backbones, types of continuous visual tokens, and compute budgets, VLMs retain most of the improvement even when latent-token content is corrupted or replaced, revealing a persistent gap between “having a latent channel” and actually using it as an information bottleneck. By separating true content utilization from alternative explanations, TRT provides a simple and standardized way to evaluate visual reasoning tokens in vision-language models, and we recommend reporting such diagnostics as standard practice.

Method Overview

Figure: Ablate-to-Validate method overview.
Pipeline walkthrough

1. VLMs + Visual Reasoning Tokens

A vision-language model answers visual reasoning questions while producing an intermediate visual reasoning token span.

2. Token Replacement Test

Keep the prompt and image fixed, then replace only the visual reasoning token content to isolate what the model actually consumes.

3. Utilization Diagnosis

If content matters, corruption should hurt and oracle tokens should help. If performance stays stable under both, the tokens may be acting as anchors or as extra token budget rather than carrying information.
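
The three steps above can be sketched as a small set of replacement operators applied to a latent span. This is a minimal illustration rather than the authors' implementation: the `Span` type, the function names, and the use of a standard normal for random replacement are all assumptions made for the example.

```python
import random
from typing import List

# Hypothetical representation: a span is K latent vectors of dimension D.
Span = List[List[float]]

def identity(span: Span) -> Span:
    """Baseline: keep the model's own latent span unchanged."""
    return [list(v) for v in span]

def zero(span: Span) -> Span:
    """Replace every latent vector with zeros (same K, same D)."""
    return [[0.0] * len(v) for v in span]

def random_replace(span: Span, rng: random.Random) -> Span:
    """Replace every entry with standard-normal noise (an assumed noise model)."""
    return [[rng.gauss(0.0, 1.0) for _ in v] for v in span]

def first_repeat(span: Span) -> Span:
    """Tile the first latent vector K times: same budget, no diversity."""
    return [list(span[0]) for _ in span]

def oracle(span: Span, gt_span: Span) -> Span:
    """Inject ground-truth latents, keeping the token budget fixed."""
    if len(gt_span) != len(span):
        raise ValueError("oracle span must match the original token budget")
    return [list(v) for v in gt_span]
```

Every operator returns a span with the same number of vectors as its input, so the token budget and span position stay fixed and only the content changes.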

Datasets

The training and evaluation data are released publicly on the Hugging Face Hub.

Training — relative-depth QA

ADE20K relative-depth question–answer pairs used to train the LLaVA and Qwen2.5-VL depth checkpoints.

🤗 agianbig/mixed_depth

Evaluation — HardBLINK

HardBLINK 3/4/5-point relative-depth questions and images used for the Token Replacement Test evaluations.

🤗 agianbig/hardblink_eval

Key Results

Continuous latent spans are weakly content-sensitive

For continuous depth spans, random and first-repeat replacements produce limited changes, while oracle depth tokens provide little or no headroom.

LLaVA oracle: 74.46 → 74.73
Qwen oracle: 68.55 → 67.74

Discrete spans behave like bottlenecks

Discrete depth tokens show stronger causal dependence: oracle replacement helps, while random and constant replacements degrade accuracy.

Qwen oracle: 71.24 → 80.64
Qwen random: 71.24 → 51.34

| Setting | Identity | Oracle | Random | First-repeat / Constant |
|---|---|---|---|---|
| LLaVA continuous (SigLIP2, K=64) | 74.46 | 74.73 | 72.58 | 74.46 |
| Qwen continuous (DINOv2, K=4) | 68.55 | 67.74 | 67.20 | 68.01 |
| LLaVA discrete | 77.69 | 78.76 | 68.82 | 73.66 |
| Qwen discrete | 71.24 | 80.64 | 51.34 | 58.87 |
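
The within-setting deltas are the quantity of interest, so it can help to compute them directly. The sketch below copies the accuracies from the table above; the dictionary layout and the `deltas_vs_identity` helper are illustrative names, not part of any released code.

```python
# Accuracies copied from the results table; keys are illustrative.
RESULTS = {
    "LLaVA continuous (SigLIP2, K=64)": {
        "identity": 74.46, "oracle": 74.73, "random": 72.58, "first_repeat_or_constant": 74.46},
    "Qwen continuous (DINOv2, K=4)": {
        "identity": 68.55, "oracle": 67.74, "random": 67.20, "first_repeat_or_constant": 68.01},
    "LLaVA discrete": {
        "identity": 77.69, "oracle": 78.76, "random": 68.82, "first_repeat_or_constant": 73.66},
    "Qwen discrete": {
        "identity": 71.24, "oracle": 80.64, "random": 51.34, "first_repeat_or_constant": 58.87},
}

def deltas_vs_identity(row):
    """Accuracy change of each replacement relative to the identity run."""
    return {
        name: round(acc - row["identity"], 2)
        for name, acc in row.items()
        if name != "identity"
    }

for setting, row in RESULTS.items():
    print(setting, deltas_vs_identity(row))
```

For the Qwen discrete row this gives +9.40 for oracle and -19.90 for random, while every delta in the two continuous rows stays within 2 points of identity.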

Beyond Depth — Off-the-Shelf VLMs

TRT is not specific to our depth testbed. We also apply it to three off-the-shelf “visual reasoning” methods (Mirage, Mull-Tokens, and CoVT), each on its own primary benchmark.

Numbers are not comparable across cards. Different tasks, different baselines. Read each card on its own; the within-method delta is the signal.

Mirage

Backbone: Qwen2.5-VL-7B
Training: 2-stage, latent + text → text-only
Eval: VSP-SP (Spatial Planning)

Identity 76.25 1st-repeat 75.75 Random 74.75 Zero 51.50

Zero replacement degrades SP by about 25 points (76.25 → 51.50); random and first-repeat sit near identity.

Mull-Tokens

Backbone: Qwen2.5-VL-7B
Training: 2-stage SFT, anchored → free-form
Eval: BLINK (mean acc)

Identity 63.71 1st-repeat 63.71 Random 63.86 Zero 63.43

First-repeat matches baseline; random latents induce only small changes, consistent with token presence dominating diversity.

CoVT

Backbone: LLaVA-13B
Training: 4-stage distillation curriculum
Eval: CV-Bench-2D (overall)

Identity 76.26 1st-repeat 76.26 Random 21.39 Zero 58.97

Pure random collapses performance; distribution-matched random largely recovers it, consistent with reliance on distributional statistics over fine-grained content.
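
The difference between pure random and distribution-matched random replacement can be illustrated as follows. The page does not spell out how the matched baseline is constructed, so this sketch assumes the simplest version: independent Gaussians per dimension, fitted to the mean and standard deviation of a set of reference latents. Both function names are hypothetical.

```python
import random
import statistics
from typing import List

def pure_random(k: int, dim: int, rng: random.Random) -> List[List[float]]:
    """K vectors of standard-normal noise, ignoring the latents' own statistics."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def distribution_matched_random(
    reference: List[List[float]], k: int, rng: random.Random
) -> List[List[float]]:
    """K vectors sampled per dimension from a Gaussian fitted to `reference`.

    Assumption: matching only the per-dimension mean and standard deviation.
    """
    dims = list(zip(*reference))                   # transpose to one tuple per dimension
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(k)]
```

If a model tolerates the matched samples but not the pure noise, the failure under pure noise is more plausibly an out-of-distribution effect than evidence that the model reads fine-grained content.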

Three TRT Signatures

When any of these signatures shows up in a latent-token method, the gain is unlikely to be driven by token content: each is consistent with accuracy gains that arise without the model reading the tokens.

1. Corruption tolerated

Zero / random replacement preserves the gain. Even corrupted content doesn't punish the model.

2. Diversity unused

First-repeat suffices. A single vector tiled K times keeps performance up, so diversity isn't doing the work.

3. Content unused

Oracle (GT) injection adds no headroom. Even a perfect signal isn't being read.
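
The three signatures can be turned into a simple check over a row of TRT accuracies. This is a hedged sketch: the `tol` threshold of 2 accuracy points is an arbitrary assumption, and comparing against the identity run (rather than against a no-latent baseline) is a simplification.

```python
def trt_signatures(acc, tol=2.0):
    """Flag the three TRT signatures from one row of accuracies.

    `acc` needs "identity", "first_repeat", "oracle", and at least one of
    "zero" / "random". `tol` (accuracy points) is an assumed threshold.
    """
    corrupted = [acc[k] for k in ("zero", "random") if k in acc]
    return {
        # 1. Corruption tolerated: no corrupted run falls more than tol below identity.
        "corruption_tolerated": bool(corrupted)
        and all(acc["identity"] - a <= tol for a in corrupted),
        # 2. Diversity unused: tiling one vector costs at most tol points.
        "diversity_unused": acc["identity"] - acc["first_repeat"] <= tol,
        # 3. Content unused: ground-truth latents add at most tol points.
        "content_unused": acc["oracle"] - acc["identity"] <= tol,
    }

# Rows from the Key Results table ("First-repeat / Constant" mapped to "first_repeat").
qwen_continuous = {"identity": 68.55, "oracle": 67.74, "random": 67.20, "first_repeat": 68.01}
qwen_discrete = {"identity": 71.24, "oracle": 80.64, "random": 51.34, "first_repeat": 58.87}

print(trt_signatures(qwen_continuous))
print(trt_signatures(qwen_discrete))
```

With this threshold, the Qwen continuous row triggers all three signatures and the Qwen discrete row triggers none, matching the qualitative reading in Key Results.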

Takeaway

The Token Replacement Test (TRT) is a task-agnostic, intervention-based diagnostic for latent visual reasoning in vision-language models. By replacing the visual reasoning span while holding prompt, image, token budget, and decoding fixed, TRT exposes whether a model truly uses the content of its visual reasoning tokens. Across our depth testbed and three off-the-shelf methods (Mirage, Mull-Tokens, CoVT), visual reasoning tokens are often not causally used at inference. Accuracy gains alone are not evidence of utilization, and TRT should accompany any claim about latent visual reasoning tokens.
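
A TRT run reduces to one loop: keep each example fixed, swap only the span, and compare accuracies across conditions. The sketch below uses two toy "models" to show what content-sensitive and content-insensitive behavior look like. The `run_trt` function, the example fields, and both toy predictors are hypothetical stand-ins, not the released evaluation code.

```python
from typing import Callable, Dict, List

Span = List[List[float]]

def run_trt(examples: List[dict],
            predict: Callable[[dict, Span], str],
            replacements: Dict[str, Callable[[Span], Span]]) -> Dict[str, float]:
    """Accuracy (%) per replacement; prompt and image are never modified."""
    accuracies = {}
    for name, replace in replacements.items():
        correct = sum(
            predict(ex, replace(ex["span"])) == ex["answer"] for ex in examples
        )
        accuracies[name] = 100.0 * correct / len(examples)
    return accuracies

# Toy data: one scalar "depth" latent per point; the larger value is closer.
examples = [
    {"prompt": "Which point is closer?", "image": "img0",
     "span": [[0.9], [0.1]], "answer": "A"},
    {"prompt": "Which point is closer?", "image": "img1",
     "span": [[0.2], [0.8]], "answer": "B"},
]

def reads_span(ex: dict, span: Span) -> str:
    """Toy model whose answer depends on the latent content."""
    return "A" if span[0][0] > span[1][0] else "B"

def ignores_span(ex: dict, span: Span) -> str:
    """Toy model that answers from the image alone and never reads the span."""
    return {"img0": "A", "img1": "B"}[ex["image"]]

replacements = {
    "identity": lambda s: s,
    "zero": lambda s: [[0.0] * len(v) for v in s],
}

print(run_trt(examples, reads_span, replacements))
print(run_trt(examples, ignores_span, replacements))
```

The content-sensitive toy model loses accuracy under zero replacement, while the content-insensitive one is unaffected; the latter is the pattern TRT is designed to expose.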

Report TRT alongside accuracy. It is a simple, inference-time diagnostic: run it before claiming your model uses its visual reasoning tokens.

Citation

If you find this work useful, please cite:

@article{zhang2026ablate,
  title         = {Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?},
  author        = {Zhang, Tianyi and Bigverdi, Mahtab and Krishna, Ranjay},
  year          = {2026},
  eprint        = {2605.21642},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}