1. VLMs + Visual Reasoning Tokens
A vision-language model answers visual reasoning questions while producing an intermediate visual reasoning token span.
Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support “visual reasoning.” Improved task accuracy alone, however, does not show that models actually use these tokens for reasoning. Gains may instead arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized. We instantiate this principle as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT measures whether performance depends on the information carried by latent tokens rather than their mere presence. It (1) probes span-position bias through zero and random replacement, (2) disentangles token-budget from token-diversity effects through first-repeat and count-matched variants, and (3) evaluates information content using oracle or ground-truth token injection together with distribution-matched random baselines. As a controlled testbed, we study relative depth reasoning, where continuous depth embeddings can be inserted at explicit span positions under a fixed token budget. We train LLaVA and Qwen2.5-VL models both to predict and to consume these tokens, and show that TRT applies across heterogeneous latent-token model backbones. Concretely, we cover two trained backbones (LLaVA-13B, Qwen2.5-VL-3B) with continuous and discrete depth spans across three frozen visual encoders (SigLIP2, CLIP, DINOv2) and multiple token budgets, and additionally apply TRT to three off-the-shelf visual reasoning systems (Mirage, Mull-Tokens, CoVT) evaluated on BLINK, VSP, and CV-Bench.
Our results show that accuracy gains can be a misleading proxy for latent-token reasoning: across multiple model backbones, types of continuous visual tokens, and compute budgets, VLMs retain most of the improvement even when latent-token content is corrupted or replaced, revealing a persistent gap between “having a latent channel” and actually using it as an information bottleneck. By separating true content utilization from alternative explanations, TRT provides a simple and standardized way to evaluate visual reasoning tokens in vision-language models, and we recommend reporting such diagnostics as standard practice.
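The replacement operations named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `replace_span`, the mode names, and the choice to match per-dimension mean and standard deviation for the distribution-matched baseline are all assumptions made for illustration, and the count-matched variant is not covered. The span is assumed to be a `(K, D)` array of latent embeddings, and every mode preserves `K` so the token budget stays fixed.

```python
import numpy as np


def replace_span(span, mode, rng, oracle=None):
    """Return a replacement for a (K, D) latent span; the shape is always preserved.

    All mode names are hypothetical labels for the ablations described in the text.
    """
    k, d = span.shape
    if mode == "identity":
        # Keep the model's own predicted tokens (the reference condition).
        return span.copy()
    if mode == "zero":
        # Remove all content but keep the span positions occupied.
        return np.zeros_like(span)
    if mode == "random":
        # Pure random: standard normal noise that ignores the span's statistics.
        return rng.standard_normal((k, d)).astype(span.dtype)
    if mode == "matched_random":
        # Assumed form of "distribution-matched": noise with the span's
        # per-dimension mean and standard deviation.
        mu = span.mean(axis=0, keepdims=True)
        sigma = span.std(axis=0, keepdims=True)
        return (mu + sigma * rng.standard_normal((k, d))).astype(span.dtype)
    if mode == "first_repeat":
        # Tile the first vector K times: same budget, no diversity.
        return np.tile(span[:1], (k, 1))
    if mode == "oracle":
        # Inject ground-truth tokens supplied by the caller.
        if oracle is None or oracle.shape != span.shape:
            raise ValueError("oracle must be provided with the same shape as span")
        return oracle.copy()
    raise ValueError(f"unknown mode: {mode}")


rng = np.random.default_rng(0)
span = rng.standard_normal((4, 8)).astype(np.float32)
for mode in ["identity", "zero", "random", "matched_random", "first_repeat"]:
    print(mode, replace_span(span, mode, rng).shape)
```

Keeping every replacement the same shape as the original span is what lets a comparison across modes isolate content from mere token presence.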
The training and evaluation data are released publicly on the Hugging Face Hub.
ADE20K relative-depth question–answer pairs used to train the LLaVA and Qwen2.5-VL depth checkpoints.
🤗 agianbig/mixed_depth
HardBLINK 3/4/5-point relative-depth questions and images used for the Token Replacement Test evaluations.
🤗 agianbig/hardblink_eval
For continuous depth spans, random and first-repeat replacements produce limited changes, while oracle depth tokens provide little or no headroom.
LLaVA oracle: 74.46 → 74.73
Qwen oracle: 68.55 → 67.74
Discrete depth tokens show stronger causal dependence: oracle replacement helps, while random and constant replacements degrade accuracy.
Qwen oracle: 71.24 → 80.64
Qwen random: 71.24 → 51.34

| Setting | Identity | Oracle | Random | First-repeat / Constant |
|---|---|---|---|---|
| LLaVA continuous (SigLIP2, K=64) | 74.46 | 74.73 | 72.58 | 74.46 |
| Qwen continuous (DINOv2, K=4) | 68.55 | 67.74 | 67.20 | 68.01 |
| LLaVA discrete | 77.69 | 78.76 | 68.82 | 73.66 |
| Qwen discrete | 71.24 | 80.64 | 51.34 | 58.87 |
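
The within-setting deltas are easier to compare when computed against the identity column. The sketch below does only that arithmetic on the numbers in the table; the dictionary layout and the `deltas` helper are illustrative names, not part of any released code.

```python
# Accuracies copied from the table above, in column order:
# (identity, oracle, random, first-repeat / constant).
RESULTS = {
    "LLaVA continuous (SigLIP2, K=64)": (74.46, 74.73, 72.58, 74.46),
    "Qwen continuous (DINOv2, K=4)": (68.55, 67.74, 67.20, 68.01),
    "LLaVA discrete": (77.69, 78.76, 68.82, 73.66),
    "Qwen discrete": (71.24, 80.64, 51.34, 58.87),
}


def deltas(row):
    """Change in accuracy relative to identity for each replacement."""
    identity, oracle, random_, repeat = row
    return {
        "oracle": round(oracle - identity, 2),
        "random": round(random_ - identity, 2),
        "repeat_or_constant": round(repeat - identity, 2),
    }


for name, row in RESULTS.items():
    print(f"{name}: {deltas(row)}")
```

Read this way, the continuous rows move by less than two points under every replacement, while the Qwen discrete row gains 9.40 points with oracle tokens and loses 19.90 with random ones.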
TRT is not specific to our depth testbed. We also apply it to three off-the-shelf “visual reasoning” methods — Mirage, Mull-Tokens, and CoVT — each on its own primary benchmark. Each card stands on its own; the within-method delta is the signal.
Zero moderately degrades SP; random and first-repeat sit near identity.
First-repeat matches baseline; random latents induce only small changes, consistent with token presence dominating diversity.
Pure random collapses performance; distribution-matched random largely recovers it, consistent with reliance on distributional statistics over fine-grained content.
When any of the following signs shows up in a latent-token method, the gain is unlikely to be driven by token content: each is consistent with accuracy gains that arise without the model reading the tokens' content.
zero / random replacement preserves the gain. Even corrupted content doesn't punish the model.
first-repeat suffices. A single vector tiled K times keeps performance up, so diversity isn't doing the work.
oracle (GT) injection adds no headroom. Even a perfect signal isn't being read.
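These three signs can be turned into a rough automated check. The sketch below is only one possible reading: the function name `warning_signs` and the tolerance `tol` (in accuracy points) are assumptions, the threshold is arbitrary, and the drop from identity is used as a proxy for "preserves the gain" because no no-latent baseline is given here.

```python
def warning_signs(identity, oracle, random_, first_repeat, tol=1.0):
    """List the warning signs triggered by one set of accuracies.

    tol is a hypothetical tolerance in accuracy points; pick it to reflect
    the run-to-run noise of the benchmark being used.
    """
    signs = []
    if identity - random_ <= tol:
        signs.append("random replacement preserves the gain")
    if identity - first_repeat <= tol:
        signs.append("first-repeat suffices")
    if oracle - identity <= tol:
        signs.append("oracle injection adds no headroom")
    return signs


# Accuracies taken from the depth-testbed results reported on this page.
print(warning_signs(74.46, 74.73, 72.58, 74.46))  # LLaVA continuous
print(warning_signs(71.24, 80.64, 51.34, 58.87))  # Qwen discrete
```

With this tolerance, the LLaVA continuous setting triggers the first-repeat and oracle signs, while the Qwen discrete setting triggers none. A looser tolerance would flag more settings, so the threshold should be reported alongside the result.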
The Token Replacement Test (TRT) is a task-agnostic, intervention-based diagnostic for latent visual reasoning in vision-language models. By replacing the visual reasoning span while holding the prompt, image, token budget, and decoding fixed, TRT exposes whether a model truly uses the content of its visual reasoning tokens. Across our depth testbed and three off-the-shelf methods (Mirage, Mull-Tokens, CoVT), visual reasoning tokens are often not causally used at inference. Accuracy gains alone are not evidence of utilization, and TRT should accompany any claim about latent visual reasoning tokens.
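The "hold everything fixed except the span" protocol can be sketched as a small evaluation loop. Everything here is hypothetical scaffolding: `run_trt`, the example dictionary keys, and the two toy `answer_fn` stand-ins are invented for illustration, and a real run would call the actual model with fixed decoding settings in place of the toy functions.

```python
import numpy as np


def run_trt(examples, answer_fn, replacers):
    """Accuracy (%) per replacement mode; only the latent span changes."""
    accuracy = {}
    for mode, replace in replacers.items():
        correct = 0
        for ex in examples:
            new_span = replace(ex["span"])
            # The token budget is held fixed: same number of tokens, same width.
            assert new_span.shape == ex["span"].shape
            pred = answer_fn(ex["prompt"], ex["image"], new_span)
            correct += int(pred == ex["answer"])
        accuracy[mode] = 100.0 * correct / len(examples)
    return accuracy


# Toy data: the span encodes the answer in its sign.
examples = [
    {"prompt": "Which point is closer?", "image": None,
     "span": np.full((4, 8), sign, dtype=np.float32), "answer": label}
    for sign, label in [(1.0, "A"), (-1.0, "B"), (1.0, "A"), (-1.0, "B")]
]
replacers = {
    "identity": lambda s: s.copy(),
    "zero": lambda s: np.zeros_like(s),
}


def reads_span(prompt, image, span):
    # Toy model that depends on span content.
    return "A" if span.mean() > 0 else "B"


def ignores_span(prompt, image, span):
    # Toy model that never looks at the span.
    return "A"


print(run_trt(examples, reads_span, replacers))
print(run_trt(examples, ignores_span, replacers))
```

The toy model that reads the span loses accuracy under zero replacement, while the one that ignores it scores the same in both modes, which is the pattern TRT is designed to surface.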
Report TRT alongside accuracy. A simple, inference-time diagnostic. Run it before claiming your model uses its visual reasoning tokens.
If you find this work useful, please cite:
@article{zhang2026ablate,
title = {Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?},
author = {Zhang, Tianyi and Bigverdi, Mahtab and Krishna, Ranjay},
year = {2026},
eprint = {2605.21642},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}