Preprint · 2026

Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models

Yanming Zhang1,∗, Yihan Bian1,∗, Jingyuan Qi2, Yuguang Yao3, Lifu Huang4, Tianyi Zhou5
1University of Maryland  ·  2Virginia Tech  ·  3Intuit  ·  4UC Davis  ·  5MBZUAI
∗ Equal contribution

TL;DR — Reasoning by editing, not regenerating. Reflective Masking turns a Mask Diffusion Model into a multi-turn reviser: it erases uncertain tokens, regenerates only what is needed, and remembers previous attempts.

Abstract

Recent diffusion language models — such as Google's DiffusionGemma — show that text generation need not be left-to-right: a model can refine a whole canvas using bidirectional context. We ask a complementary question: can existing Mask Diffusion Models (MDMs) be taught to reason by revising their own previous outputs? We propose Reflective Masking (RM), a lightweight post-training method that turns masking into a model-driven decision — keep reliable tokens, re-mask uncertain ones, and reveal better replacements — making an MDM a multi-turn reviser rather than a one-shot decoder. To support multi-turn correction we add History Reference, a parameter-free memory that exposes the denoising trajectory to the model. Unlike a large pretrained diffusion LM, RM needs no architectural changes and no online rollouts, and drops into existing MDMs across Sudoku, text reasoning, and image editing — enabling sparse, iterative self-revision.

CoT thinks by continuing. RM thinks by revising.

A diffusion-native analogue of chain-of-thought reflection.

Side-by-side: AR Reasoning vs. Reflective Masking Reasoning

AR reasoning / reflection | Reflective Masking in MDMs
Generates thoughts left-to-right | Revises a full canvas bidirectionally
Corrects mistakes by appending more text or regenerating | Corrects mistakes by re-masking only unreliable tokens
Past mistakes remain in context | Wrong tokens can be erased from the current state
Test-time scaling = longer traces / more samples | Test-time scaling = more rounds of selective revision
Memory is textual context | Memory is History Reference over denoising states

Results

Reasoning through explicit revision


Three task families, from instruction-rich image editing to open-ended text reasoning. Reflective Masking consistently beats masking-based baselines, and History Reference helps most where the model must explore on its own — all trained in about 5 hours on 2×H100.

Sudoku — structured error correction

A tiny from-scratch MDM (0.81M params) recovers 9×9 boards with 4–20 corrupted cells by iterative re-masking. History Reference (HR) sharply cuts repeated mistakes and rule conflicts; adding History Embedding Rotation (HER) tops every metric.

Reflective masking on Sudoku. Two real revision trajectories: the model re-masks cells it is unsure about (amber) and re-predicts them, turning wrong digits (red) into the correct solution until the board is valid and the error count reaches 0.
Variant | Exact Accuracy (% ↑) | Valid Rate (% ↑) | Replay Mistake (% ↓) | Conflict Cells (/board ↓)
RM (no History Reference) | 82.4 | 86.6 | 0.57 | 0.578
RM + HR | 91.4 (↑9.0) | 91.8 (↑5.2) | 0.07 (↓0.50) | 0.300 (↓0.278)
RM + HR + decay | 89.4 (↑7.0) | 89.6 (↑3.0) | 0.07 (↓0.50) | 0.362 (↓0.216)
Ours: RM + HR + decay + HER | 93.4 (↑11.0) | 93.6 (↑7.0) | 0.03 (↓0.54) | 0.236 (↓0.342)

Quantitative results on Sudoku revision. Δ is the change versus the RM (no History Reference) baseline; bold marks the best value per column.
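The rule-based columns of the table can be illustrated with a small checker. This is a minimal sketch under a stated assumption: a "conflict cell" is taken to be any cell whose digit also appears elsewhere in its row, column, or 3×3 box, and a board counts as valid when it has no such cells. The page does not spell out the exact metric definitions, so the function names `conflict_cells` and `is_valid` and the definitions themselves are illustrative, not the authors' evaluation code.

```python
def conflict_cells(board):
    """Count cells whose digit repeats in the same row, column, or 3x3 box.

    Assumed definition for illustration; the source does not define the metric.
    """
    count = 0
    for r in range(9):
        for c in range(9):
            digit = board[r][c]
            row = [board[r][j] for j in range(9) if j != c]
            col = [board[i][c] for i in range(9) if i != r]
            br, bc = 3 * (r // 3), 3 * (c // 3)
            box = [
                board[i][j]
                for i in range(br, br + 3)
                for j in range(bc, bc + 3)
                if (i, j) != (r, c)
            ]
            if digit in row or digit in col or digit in box:
                count += 1
    return count


def is_valid(board):
    """A fully filled board is valid when every digit is 1-9 and nothing conflicts."""
    in_range = all(1 <= board[r][c] <= 9 for r in range(9) for c in range(9))
    return in_range and conflict_cells(board) == 0


# A standard shifted-pattern solution, used here only as test data.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Corrupt one cell by copying its right-hand neighbour's digit.
corrupted = [row[:] for row in solved]
corrupted[0][0] = corrupted[0][1]

print(conflict_cells(solved), is_valid(solved))        # 0 True
print(conflict_cells(corrupted), is_valid(corrupted))  # 3 False
```

A single wrong digit produces three conflict cells here (the corrupted cell plus its duplicate in the row and in the column), which is why a per-board conflict count can be larger than the number of wrong cells.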

Relation to DiffusionGemma (Google). DiffusionGemma independently validates reasoning-by-revision on Sudoku: per its model card, exact-solve rises from 18% one-shot → 89.5% purely by revising over steps, and from 1.5% → 89.5% after fine-tuning a large pretrained model for 4,000 steps. Reflective Masking reaches an even higher 93.4% exact accuracy with a 0.81M-parameter MDM trained from scratch — orders of magnitude smaller than DiffusionGemma's fine-tuned backbone — and extends the same revision mechanism beyond text to image editing, a modality DiffusionGemma does not support.

DiffusionGemma: Google, “DiffusionGemma: 4× faster text generation” (2026); Sudoku numbers from the fine-tuned model card (Unsloth).


Image editing — localized, instruction-guided revision

With a 7B multimodal backbone (Lumina-DiMOO), Reflective Masking localizes the edit and changes only that region, leaving the rest of the image untouched — outperforming masking-based baselines.

Qualitative image editing. The model predicts where to edit and revises only that region, keeping the rest intact while following the instruction more faithfully than baselines. In the last row the input sits under its instruction and each method shows the edited result (top) over the predicted depth/edit region (bottom).

Text reasoning — self-correction with no answer hints

On open-ended math and code (LLaDA backbone), the model re-masks uncertain intermediate tokens and revises them as context resolves — beating both the base model and vanilla SFT.

A diffusion-native analogue of reflection. Chain-of-thought and reflection let autoregressive models reason by writing more — appending a new trace and carrying every past mistake along in the context. Reflective Masking gives MDMs the complementary move: reason by revising. Rather than append a fresh reasoning trace, the model edits its previous state, re-masking only the tokens it now doubts so wrong steps are erased instead of accumulated.

Reflective masking in action (text reasoning case study). Across denoising steps the model re-masks uncertain tokens (M), fixes the chain-of-thought, and then revises the final answer, turning an intermediate wrong answer into the correct one with no answer hints.
Benchmark | Category | LLaDA (%) | Vanilla SFT (%) | Ours (RM) (%) | Δ
MATH500 | Math | 19.4 | 22.4 | 24.8 | ↑2.4
MBPP | Code | 28.0 | 30.6 | 39.4 | ↑8.8
ARC-Challenge | MCQA | 73.7 | 81.3 | 86.1 | ↑4.8

Performance across benchmarks. Δ is the improvement over Vanilla SFT; bold marks the best value per row.

Minerva MATH | Algebra | Count. & Prob. | Geometry | Interm. Alg. | Num Theory | Prealgebra | Precalc | Aggregate
Vanilla SFT (%) | 28.90 | 17.72 | 20.67 | 13.07 | 17.41 | 36.28 | 14.10 | 22.62
Ours RM (%) | 29.49 | 18.35 | 20.67 | 14.40 | 21.67 | 38.00 | 16.67 | 24.10
Δ (%) | ↑0.59 | ↑0.63 | 0 | ↑1.33 | ↑4.26 | ↑1.72 | ↑2.57 | ↑1.48

Per-subject breakdown on Minerva MATH; Reflective Masking improves on six of the seven subjects and ties Vanilla SFT on Geometry.


Method

Reflective Masking & History Reference

Each position takes one of three actions per step: Reveal a confident prediction, Reflectively Mask an uncertain one for another try, or Reserve it. Masking becomes a model-driven decision, so the model can revisit and fix its earlier predictions across turns.
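The three per-position actions can be sketched as a single revision step. In the method the decision is made by the model itself; the sketch below substitutes fixed confidence thresholds purely for illustration. The names `choose_actions`, `apply_actions`, `reveal_at`, and `remask_below`, the `MASK` sentinel, and the rule that Reveal applies to masked positions while Reflective Masking applies to already-revealed ones are all assumptions, not the authors' implementation.

```python
from enum import Enum


class Action(Enum):
    REVEAL = "reveal"    # commit a confident prediction at a masked position
    REMASK = "remask"    # reflectively mask a revealed token for another try
    RESERVE = "reserve"  # leave the position as it is


MASK = None  # hypothetical sentinel for a masked position


def choose_actions(state, confidence, reveal_at=0.9, remask_below=0.5):
    """Pick one action per position.

    Thresholds stand in for the learned, model-driven decision (assumption).
    """
    actions = []
    for token, conf in zip(state, confidence):
        if token is MASK:
            actions.append(Action.REVEAL if conf >= reveal_at else Action.RESERVE)
        else:
            actions.append(Action.REMASK if conf < remask_below else Action.RESERVE)
    return actions


def apply_actions(state, predictions, actions):
    """Produce the next state from the current state and the chosen actions."""
    next_state = []
    for token, pred, action in zip(state, predictions, actions):
        if action is Action.REVEAL:
            next_state.append(pred)
        elif action is Action.REMASK:
            next_state.append(MASK)
        else:
            next_state.append(token)
    return next_state


state = [5, MASK, 3, MASK]
confidence = [0.95, 0.92, 0.20, 0.40]
predictions = [5, 7, 8, 1]

actions = choose_actions(state, confidence)
next_state = apply_actions(state, predictions, actions)
print(next_state)  # [5, 7, None, None]
```

In this toy step the confident masked position is revealed, the doubtful revealed digit is sent back to the mask state, and the remaining positions are reserved. Repeating the step gives the multi-turn revision loop.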

History Reference (HR) accumulates per-step states at the embedding level, giving the model access to its own trajectory (what it predicted and what it has already revised) with no extra parameters and without lengthening the attention sequence.
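One way to picture an embedding-level, parameter-free memory is a running sum of past state embeddings that is added to the current embedding at each position. The sketch below assumes exactly that: a decayed sum, reusing the existing embedding table, with the rotation used by History Embedding Rotation left out because its details are not given here. The function name `history_reference`, the `decay` parameter, and the toy embedding table are hypothetical.

```python
def history_reference(trajectory, current_state, table, decay=0.5):
    """Combine current embeddings with a decayed sum of past state embeddings.

    Illustrative assumption only: no new parameters are introduced (the same
    embedding table is reused) and the output has one vector per position,
    so the attention sequence is not lengthened.
    """
    dim = len(next(iter(table.values())))
    history = [[0.0] * dim for _ in current_state]

    # Fold each earlier denoising state into the per-position history.
    for past_state in trajectory:
        for pos, token in enumerate(past_state):
            emb = table[token]
            history[pos] = [decay * h + e for h, e in zip(history[pos], emb)]

    # Model input = current word embedding + accumulated history.
    return [
        [e + h for e, h in zip(table[token], history[pos])]
        for pos, token in enumerate(current_state)
    ]


# Toy 2-d embedding table; token 0 plays the role of the mask token.
table = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}

trajectory = [[0, 1], [2, 1]]  # two earlier states over two positions
current_state = [2, 2]         # position 1 was revised from token 1 to token 2

inputs = history_reference(trajectory, current_state, table)
print(inputs)  # [[0.0, 2.0], [1.5, 1.0]]
```

Position 1 carries a trace of its earlier, since-revised prediction in the first coordinate, while the input still has exactly one vector per position.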

Method overview. At each denoising step every position is Revealed, Reflectively Masked, or Reserved. Word-token embeddings (WTE) are combined with the accumulated history via History Embedding Rotation (HER) to form the History Reference fed to the MDM, so each revision sees the full denoising trajectory.

Training

Activating Reflective Masking, offline

RM is taught offline, with no online rollouts. From a clean target we sample a mask, take one MDM forward pass, and draw plausible wrong tokens to build a pseudo-trajectory that matches the model's own distribution. Three per-token losses then teach when to commit (Reveal), when to re-mask (RM), and when to leave a token alone (Keep).

Offline data pipeline. Sample noisy positions and a mask, run one no-grad MDM forward pass, and sample wrong tokens to build the pseudo-trajectory; the per-position history (correct / mask / wrong) feeds History Reference, supervised by a Reveal + RM + Keep loss.
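The data construction described above can be sketched as follows. This is a minimal illustration under several assumptions: `toy_forward` is a hypothetical stand-in for the single no-grad MDM forward pass and simply returns a uniform distribution, the ratios `mask_ratio` and `wrong_ratio` are invented, both masked and noisy positions are hidden from the forward pass, and the labels `reveal`, `remask`, and `keep` mark which of the three losses would apply at each position. None of these names come from the authors' code.

```python
import random

MASK = "<mask>"


def toy_forward(masked_input):
    """Hypothetical stand-in for one no-grad MDM forward pass.

    Returns, for each position, a dict mapping candidate tokens to probabilities.
    """
    vocab = ["1", "2", "3", "4"]
    return [{tok: 1.0 / len(vocab) for tok in vocab} for _ in masked_input]


def sample_wrong(dist, gold, rng):
    """Draw a token from the model's distribution, excluding the gold token."""
    candidates = [(tok, p) for tok, p in dist.items() if tok != gold and p > 0]
    tokens, weights = zip(*candidates)
    return rng.choices(tokens, weights=weights, k=1)[0]


def build_example(target, forward, rng, mask_ratio=0.3, wrong_ratio=0.2):
    """Build one pseudo-trajectory state plus per-position loss labels."""
    # 1. Assign each position a role: masked, noisy (wrong), or correct.
    roles = []
    for _ in target:
        r = rng.random()
        if r < mask_ratio:
            roles.append("mask")
        elif r < mask_ratio + wrong_ratio:
            roles.append("wrong")
        else:
            roles.append("correct")

    # 2. One forward pass with masked and noisy positions hidden (assumption).
    masked_input = [
        tok if role == "correct" else MASK for tok, role in zip(target, roles)
    ]
    dists = forward(masked_input)

    # 3. Fill noisy positions with sampled wrong tokens and attach labels.
    state, labels = [], []
    for gold, role, dist in zip(target, roles, dists):
        if role == "mask":
            state.append(MASK)
            labels.append("reveal")   # learn to commit the gold token
        elif role == "wrong":
            state.append(sample_wrong(dist, gold, rng))
            labels.append("remask")   # learn to re-mask the wrong token
        else:
            state.append(gold)
            labels.append("keep")     # learn to leave the token alone
    return state, labels


target = list("12341234")
state, labels = build_example(target, toy_forward, random.Random(0))
for gold, tok, label in zip(target, state, labels):
    print(gold, tok, label)
```

Because the wrong tokens are drawn from the forward pass rather than uniformly at random, the corrupted states resemble mistakes the model would plausibly make, which is the stated purpose of the pseudo-trajectory.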

Citation

BibTeX

@misc{zhang2026multiturn,
  title         = {Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models},
  author        = {Zhang, Yanming and Bian, Yihan and Qi, Jingyuan and Yao, Yuguang and Huang, Lifu and Zhou, Tianyi},
  year          = {2026},
  eprint        = {2606.16700},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2606.16700}
}