Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models

TL;DR — Reasoning by editing, not regenerating. Reflective Masking turns a Mask Diffusion Model into a multi-turn reviser: it erases uncertain tokens, regenerates only what is needed, and remembers previous attempts.

Abstract

Recent diffusion language models — such as Google's DiffusionGemma — show that text generation need not be left-to-right: a model can refine a whole canvas using bidirectional context. We ask a complementary question: can existing Mask Diffusion Models (MDMs) be taught to reason by revising their own previous outputs? We propose Reflective Masking (RM), a lightweight post-training method that turns masking into a model-driven decision — keep reliable tokens, re-mask uncertain ones, and reveal better replacements — making an MDM a multi-turn reviser rather than a one-shot decoder. To support multi-turn correction we add History Reference, a parameter-free memory that exposes the denoising trajectory to the model. Unlike a large pretrained diffusion LM, RM needs no architectural changes and no online rollouts, and drops into existing MDMs across Sudoku, text reasoning, and image editing — enabling sparse, iterative self-revision.

1. Re-masking is the self-correction MDMs were missing. MDMs can edit in place but never choose to, so they lock in early mistakes. RM makes masking a model-driven decision (keep reliable tokens, re-mask uncertain ones, reveal better replacements), so the model fixes its own errors instead of carrying them forward.
2. A lightweight post-training recipe with no new architecture. RM is activated by a scalable offline data pipeline (no online rollouts) and drops into existing MDMs unchanged; it is validated on text, Sudoku, and image editing.
3. History Reference: a memory of past attempts, for free. A parameter-free mechanism that carries the denoising trajectory forward, so the model remembers what it already tried and stops repeating the same error.

CoT thinks by continuing. RM thinks by revising.

A diffusion-native analogue of chain-of-thought reflection.

Side-by-side: AR Reasoning vs. Reflective Masking Reasoning

AR reasoning / reflection	Reflective Masking in MDMs
Generates thoughts left-to-right	Revises a full canvas bidirectionally
Corrects mistakes by appending more text or regenerating	Corrects mistakes by re-masking only unreliable tokens
Past mistakes remain in context	Wrong tokens can be erased from the current state
Test-time scaling = longer traces / more samples	Test-time scaling = more rounds of selective revision
Memory is textual context	Memory is History Reference over denoising states

Results

Reasoning through explicit revision

Three task families, from instruction-rich image editing to open-ended text reasoning. Reflective Masking consistently beats masking-based baselines, and History Reference helps most where the model must explore on its own — all trained in about 5 hours on 2×H100.

Sudoku — structured error correction

A tiny from-scratch MDM (0.81M params) recovers 9×9 boards with 4–20 corrupted cells by iterative re-masking. History Reference (HR) sharply cuts repeated mistakes and rule conflicts; adding History Embedding Rotation (HER) tops every metric.

Reflective masking on Sudoku. In two real revision trajectories, the model re-masks cells it is unsure about (amber) and re-predicts them, turning wrong digits (red) into correct ones until the board is valid and the error count reaches 0.

Variant	Exact Accuracy % ↑	Valid Rate % ↑	Replay Mistake % ↓	Conflict Cells /board ↓
RM (no History Reference)	82.4	86.6	0.57	0.578
RM + HR	91.4 (↑9.0)	91.8 (↑5.2)	0.07 (↓0.50)	0.300 (↓0.278)
RM + HR + decay	89.4 (↑7.0)	89.6 (↑3.0)	0.07 (↓0.50)	0.362 (↓0.216)
Ours (RM + HR + decay + HER)	93.4 (↑11.0)	93.6 (↑7.0)	0.03 (↓0.54)	0.236 (↓0.342)

Quantitative results on Sudoku revision. Δ is the change versus the RM (no History Reference) baseline; the full method in the last row is best in every column.
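
The Valid Rate and Conflict Cells columns can be made concrete with a small checker. This is a minimal sketch under an assumed definition: a conflict cell is a filled cell whose digit repeats in its row, column, or 3×3 box, and a board is valid when it is fully filled with no conflicts. The page does not spell out its exact metric definitions, and the function names `conflict_cells` and `is_valid` are hypothetical.

```python
import numpy as np


def conflict_cells(board) -> int:
    """Count filled cells whose digit repeats in their row, column, or 3x3 box.

    Assumed definition for illustration; 0 denotes an empty (masked) cell.
    """
    board = np.asarray(board)
    if board.shape != (9, 9):
        raise ValueError("expected a 9x9 board")
    count = 0
    for r in range(9):
        for c in range(9):
            d = board[r, c]
            if d == 0:
                continue  # empty cells cannot conflict
            br, bc = 3 * (r // 3), 3 * (c // 3)
            row_dup = np.count_nonzero(board[r, :] == d) > 1
            col_dup = np.count_nonzero(board[:, c] == d) > 1
            box_dup = np.count_nonzero(board[br:br + 3, bc:bc + 3] == d) > 1
            if row_dup or col_dup or box_dup:
                count += 1
    return count


def is_valid(board) -> bool:
    """A board is valid when every cell is filled and no cell conflicts."""
    board = np.asarray(board)
    return bool((board > 0).all()) and conflict_cells(board) == 0


# A known-valid board built from the standard shifting pattern.
solved = np.array([[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
                   for r in range(9)])

# Corrupt one cell by copying its right-hand neighbour's digit.
corrupted = solved.copy()
corrupted[0, 0] = corrupted[0, 1]
```

Averaging `conflict_cells` over a test set would give a per-board figure comparable in spirit to the last column, and the mean of `is_valid` a valid rate.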

Relation to DiffusionGemma (Google). DiffusionGemma independently validates reasoning-by-revision on Sudoku: per its model card, exact-solve rises from 18% one-shot → 89.5% purely by revising over steps, and from 1.5% → 89.5% after fine-tuning a large pretrained model for 4,000 steps. Reflective Masking reaches an even higher 93.4% exact accuracy with a 0.81M-parameter MDM trained from scratch — orders of magnitude smaller than DiffusionGemma's fine-tuned backbone — and extends the same revision mechanism beyond text to image editing, a modality DiffusionGemma does not support.

DiffusionGemma: Google, “DiffusionGemma: 4× faster text generation” (2026); Sudoku numbers from the fine-tuned model card (Unsloth).

Image editing — localized, instruction-guided revision

With a 7B multimodal backbone (Lumina-DiMOO), Reflective Masking localizes the edit and changes only that region, leaving the rest of the image untouched — outperforming masking-based baselines.

Qualitative image editing. The model predicts where to edit and revises only that region, keeping the rest intact while following the instruction more faithfully than baselines. In the last row the input sits under its instruction and each method shows the edited result (top) over the predicted depth/edit region (bottom).

Text reasoning — self-correction with no answer hints

On open-ended math and code (LLaDA backbone), the model re-masks uncertain intermediate tokens and revises them as context resolves — beating both the base model and vanilla SFT.

A diffusion-native analogue of reflection. Chain-of-thought and reflection let autoregressive models reason by writing more — appending a new trace and carrying every past mistake along in the context. Reflective Masking gives MDMs the complementary move: reason by revising. Rather than append a fresh reasoning trace, the model edits its previous state, re-masking only the tokens it now doubts so wrong steps are erased instead of accumulated.

Reflective masking in action. Across denoising steps, the model re-masks uncertain tokens (M), fixes the chain-of-thought, then revises the final answer, correcting an intermediate wrong answer into the correct final one without external help.

Benchmark	Category	LLaDA %	Vanilla SFT %	Ours (RM) %	Δ
MATH500	Math	19.4	22.4	24.8	↑2.4
MBPP	Code	28.0	30.6	39.4	↑8.8
ARC-Challenge	MCQA	73.7	81.3	86.1	↑4.8

Performance across benchmarks. Δ is the improvement of RM over Vanilla SFT in percentage points; RM is best in every row.

Minerva MATH	Algebra	Count. & Prob.	Geometry	Interm. Alg.	Num Theory	Prealgebra	Precalc	Aggregate
Vanilla SFT (%)	28.90	17.72	20.67	13.07	17.41	36.28	14.10	22.62
Ours RM (%)	29.49	18.35	20.67	14.40	21.67	38.00	16.67	24.10
Δ (%)	↑0.59	↑0.63	0	↑1.33	↑4.26	↑1.72	↑2.57	↑1.48

Per-subject breakdown on Minerva MATH; Reflective Masking improves six of the seven subjects and ties on Geometry.

Method

Reflective Masking & History Reference

Each position takes one of three actions per step: Reveal a confident prediction, Reflectively Mask an uncertain one for another try, or Reserve it. Masking becomes a model-driven decision, so the model can revisit and fix its earlier predictions across turns.
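
One revision step with these three actions can be sketched as follows. This is an illustration, not the authors' implementation. It assumes the model emits one predicted token id per position and that the vocabulary includes the mask token, so predicting the mask at a visible position means "re-mask". The `MASK` id, the function name `apply_actions`, and the choice to leave a visible token unchanged when the model proposes a different non-mask token are all assumptions.

```python
import numpy as np

MASK = 0  # hypothetical id of the mask token


def apply_actions(state, pred):
    """Apply one Reveal / Reflectively Mask / Reserve step.

    state: (L,) current token ids, with MASK at hidden positions.
    pred:  (L,) token ids predicted by the model, which may include MASK.
    Returns the next state and a per-position action label.
    """
    state = np.asarray(state)
    pred = np.asarray(pred)
    is_masked = state == MASK

    reveal = is_masked & (pred != MASK)   # hidden position, confident prediction
    remask = ~is_masked & (pred == MASK)  # visible token the model now doubts

    new_state = state.copy()
    new_state[reveal] = pred[reveal]
    new_state[remask] = MASK
    # Everything else is reserved: left exactly as it was (assumption).

    actions = np.where(reveal, "reveal", np.where(remask, "remask", "reserve"))
    return new_state, actions


state = np.array([5, MASK, 7, MASK])
pred = np.array([5, 3, MASK, MASK])
new_state, actions = apply_actions(state, pred)
```

Calling `apply_actions` repeatedly, feeding each `new_state` back to the model, gives the multi-turn behaviour: a position re-masked in one round can be revealed with a different token in the next.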

History Reference (HR) accumulates per-step states at the embedding level, giving the model access to its own trajectory (what it predicted and what it already revised) with no extra parameters and no longer attention sequences.
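
To make "accumulating at the embedding level" concrete, here is a minimal sketch of one way such a memory could work. The combination rule is an assumption: a decayed running sum of past-state embeddings is added to the current embeddings. The page does not give the exact formula, and History Embedding Rotation is omitted entirely. The function name `history_reference_inputs` and the `decay` parameter are hypothetical. What the sketch does show is why the idea costs nothing extra: it reuses the existing embedding table and the input keeps its (L, D) shape.

```python
import numpy as np


def history_reference_inputs(states, embed, decay=0.5):
    """Build model inputs that carry a summary of earlier denoising states.

    states: list of (L,) integer arrays, one per denoising step.
    embed:  (V, D) token embedding table (the model's own; nothing new).
    Returns one (L, D) input per step.
    """
    seq_len = len(states[0])
    history = np.zeros((seq_len, embed.shape[1]))
    inputs = []
    for s in states:
        current = embed[np.asarray(s)]       # (L, D) embeddings of this state
        inputs.append(current + history)     # same shape as without history
        history = decay * history + current  # fold this state into the memory
    return inputs


embed = np.eye(3)  # toy table: 3 tokens, 3 dimensions
states = [np.array([0, 1]), np.array([2, 1])]
inputs = history_reference_inputs(states, embed, decay=0.5)
```

In the toy run, position 0 changed from token 0 to token 2, and its second-step input still carries a trace of token 0, which is the kind of signal that could discourage repeating an earlier attempt.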

Method overview. At each denoising step every position is Revealed, Reflectively Masked, or Reserved. Word-token embeddings (WTE) combine with accumulated history via History Embedding Rotation (HER) to form the History Reference fed to the MDM, so each revision sees the full denoising trajectory.

Training

Activating Reflective Masking, offline

RM is taught offline, with no online rollouts. From a clean target we sample a mask, take one MDM forward pass, and draw plausible wrong tokens to build a pseudo-trajectory that matches the model's own distribution. Three per-token losses then teach when to commit, when to re-mask, and when to leave a token alone:

Reveal: masked token → correct token
Re-mask: wrong visible token → MASK
Keep: correct visible token → itself
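
The three supervision rules above can be sketched as a function that turns a clean target into one training example. This is a sketch under assumptions, not the authors' pipeline: the wrong tokens are passed in by the caller, whereas the method described here draws them from an MDM forward pass. The `MASK` id and the name `build_example` are hypothetical.

```python
import numpy as np

MASK = 0  # hypothetical id of the mask token


def build_example(clean, mask_pos, wrong_tokens):
    """Build (inputs, targets, kinds) for one offline training example.

    clean:        (L,) ground-truth token ids.
    mask_pos:     (L,) boolean array, True where the input is masked.
    wrong_tokens: dict {position: token id} of wrong visible tokens; in the
                  described method these come from the model's own predictions.
    """
    clean = np.asarray(clean)
    mask_pos = np.asarray(mask_pos, dtype=bool)

    inputs = clean.copy()
    targets = clean.copy()                        # default: Keep (token -> itself)
    kinds = np.full(clean.shape, "keep", dtype="<U6")

    inputs[mask_pos] = MASK                       # Reveal: MASK -> correct token
    kinds[mask_pos] = "reveal"

    for pos, tok in wrong_tokens.items():
        if mask_pos[pos] or tok == clean[pos]:
            continue                              # not a wrong *visible* token
        inputs[pos] = tok                         # Re-mask: wrong token -> MASK
        targets[pos] = MASK
        kinds[pos] = "remask"
    return inputs, targets, kinds


inputs, targets, kinds = build_example(
    clean=[4, 5, 6, 7],
    mask_pos=[True, False, False, False],
    wrong_tokens={2: 9},
)
```

A per-token cross-entropy between the model's output at each position and `targets` would then cover all three losses, with `kinds` available for weighting or reporting them separately.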

Offline data pipeline. Sample noisy positions and a mask, run one no-grad MDM forward pass, and sample wrong tokens to build the pseudo-trajectory; per-position history (correct / mask / wrong) feeds History Reference, and training is supervised by a Reveal + RM + Keep loss.

Citation

BibTeX

@misc{zhang2026multiturn,
  title         = {Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models},
  author        = {Zhang, Yanming and Bian, Yihan and Qi, Jingyuan and Yao, Yuguang and Huang, Lifu and Zhou, Tianyi},
  year          = {2026},
  eprint        = {2606.16700},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2606.16700}
}