CVPR 2026 Highlight

Cubic Discrete Diffusion

Discrete Visual Generation on High-Dimensional Representation Tokens

1University of Hong Kong   2ByteDance Seed   3Carnegie Mellon University   4Nanjing University   5The Chinese University of Hong Kong
Project lead   *Corresponding author

Can we generate high-dimensional semantic representations discretely, just like language models generate text?

Generating high-dimensional semantic representations has long been a pursuit for visual generation, yet discrete methods—the paradigm shared with language models—remain stuck with low-dimensional tokens. CubiD breaks this barrier with fine-grained cubic masking across the h×w×d tensor, directly modeling dependencies across both spatial and dimensional axes in 768-dim representation space, while the discretized tokens preserve their original understanding capabilities.

CubiD Generated Samples

Generated samples from CubiD. Class-conditional generation on ImageNet 256×256 using 768-dimensional discrete representation tokens from DINOv2-B.

Abstract

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges.

We present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation—any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at T regardless of feature dimensionality, where T ≪ h×w×d. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks.

Comparison of discrete generation approaches

Comparison of discrete visual generation approaches. (a) In low dimensions, autoregressive requires h×w steps; discrete diffusion parallelizes in T<h×w steps. (b) In high dimensions, autoregressive becomes intractable (h×w×d steps), and standard discrete diffusion cannot model intra-position dependencies. CubiD performs fine-grained masking across the entire 3D tensor—any dimension at any position—enabling generation in T ≪ h×w×d steps while capturing both spatial and dimensional correlations.
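The step-count gap the figure describes is easy to make concrete. The grid numbers below follow the ImageNet-256 setup (16×16 token positions, 768-dim tokens); the value of T is an illustrative sampling budget, not a number from the paper:

```python
# Step counts for a 16×16 token grid with 768-dim representation tokens.
h, w, d = 16, 16, 768
T = 64  # illustrative parallel-decoding budget; the actual T is a design choice

ar_low = h * w          # low-dim autoregressive: one step per spatial position
ar_high = h * w * d     # high-dim autoregressive: one step per tensor entry
cubid = T               # CubiD: fixed budget, independent of d

print(ar_low, ar_high, cubid)  # 256 196608 64
```

Even a generous budget of a few hundred steps is three orders of magnitude below the 196,608 sequential steps a naive high-dimensional autoregressive model would need.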

How CubiD Works

CubiD follows the discrete diffusion paradigm by treating generation as iterative denoising of masked tokens. Unlike traditional methods that mask entire spatial positions, CubiD performs fine-grained masking at the dimension level—treating the h×w×d tensor as a unified modeling space where any subset of dimensions at any position can be masked and predicted from the remaining visible context. This enables the model to capture rich dependencies both within and across spatial locations through bidirectional attention and standard cross-entropy loss.

CubiD Method Overview

Overview of Cubic Discrete Diffusion. (a) A frozen representation encoder extracts continuous tokens, which are discretized through dimension-wise quantization into h×w×d discrete tokens. (b) During training, we randomly mask tokens across both spatial and dimensional axes. The transformer learns to predict masked tokens from unmasked context, capturing the rich dependency structure across the entire tensor.
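The fine-grained masking described above can be sketched in a few lines, assuming a per-entry Bernoulli mask and an extra MASK token id. Grid size, per-dimension vocabulary size, and the uniform masking rule are illustrative choices, not the paper's exact noising schedule:

```python
import numpy as np

def cubic_mask(tokens, mask_ratio, mask_id, seed=None):
    """Mask a random subset of entries anywhere in the h×w×d token tensor.

    Unlike spatial masking, which drops entire positions, any dimension at
    any position can be masked independently (the "cubic" masking in CubiD).
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(tokens.shape) < mask_ratio  # independent per-entry mask
    masked = np.where(mask, mask_id, tokens)      # replace masked entries
    return masked, mask

# Example: 16×16 positions, 768 dims, a hypothetical vocabulary of 256 codes
# per dimension, with code 256 reserved as the MASK token.
tokens = np.random.default_rng(0).integers(0, 256, size=(16, 16, 768))
masked, mask = cubic_mask(tokens, mask_ratio=0.5, mask_id=256, seed=1)
```

During training, the transformer would then predict the original codes at the masked entries from the visible ones with a standard cross-entropy loss.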

At inference, CubiD starts from a fully masked tensor and progressively unmasks tokens through iterative refinement, a coarse-to-fine process that establishes global structure first and then refines local details. Crucially, this takes only hundreds of steps even though the tensor contains 196,608 (16×16×768) tokens, decoupling generation cost from feature dimensionality.

CubiD Inference Process

Generation process. From fully masked (0%) to complete image (100%). At each step, the model predicts all masked tokens in parallel and unmasks a subset following a cosine schedule. The visualization reveals a natural coarse-to-fine progression.
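The unmasking loop in the caption can be sketched as follows. Here `toy_predict` stands in for the trained transformer, and the confidence-based selection rule and all shapes are illustrative assumptions; the cosine schedule mirrors the one described above:

```python
import numpy as np

def cosine_schedule(t, T):
    """Fraction of entries still masked after step t (cosine decay 1 → 0)."""
    return np.cos(0.5 * np.pi * t / T)

def generate(predict, shape, T, mask_id):
    """Iterative unmasking: fully masked tensor → complete token tensor.

    `predict` returns (predicted_codes, confidences) for every entry. Each
    step keeps already-revealed entries fixed and reveals the most confident
    masked predictions, so the cost is T steps regardless of h*w*d.
    """
    tokens = np.full(shape, mask_id)
    n_total = tokens.size
    for t in range(1, T + 1):
        pred, conf = predict(tokens)                        # parallel prediction
        conf = np.where(tokens == mask_id, conf, -np.inf)   # rank only masked entries
        n_keep_masked = int(n_total * cosine_schedule(t, T))
        n_reveal = (tokens == mask_id).sum() - n_keep_masked
        if n_reveal > 0:
            idx = np.argsort(conf, axis=None)[-n_reveal:]   # most confident masked
            flat = tokens.reshape(-1)                       # view into tokens
            flat[idx] = pred.reshape(-1)[idx]
    return tokens

# Toy run on a small tensor with random "predictions".
rng = np.random.default_rng(0)
def toy_predict(tokens):
    return rng.integers(0, 256, tokens.shape), rng.random(tokens.shape)

out = generate(toy_predict, shape=(4, 4, 8), T=8, mask_id=256)
```

Because `cosine_schedule(T, T)` reaches zero, the final step reveals every remaining masked entry, leaving a complete token tensor.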

One Token, Two Purposes

To validate whether discrete tokens maintain the understanding capabilities of continuous representations, we evaluate on multimodal understanding tasks using the LLaVA framework. We compare three variants: original continuous SigLIP2 features, vector quantization (VQ), and our dimension-wise quantization (DQ).

Understanding Performance on LLaVA Benchmarks

Tokenizer            Type            GQA    TextVQA   POPE   MME
SigLIP2              Continuous      63.2   59.6      85.4   1484
SigLIP2-VQ           Discrete (VQ)   54.9   45.6      81.2   1189
SigLIP2-DQ (Ours)    Discrete (DQ)   63.1   59.8      85.0   1480

Traditional vector quantization methods that work well in low dimensions (8-32) fail at 768 dimensions due to the curse of dimensionality. Dimension-wise quantization treats each dimension independently, making discretization tractable at high dimensionality. As shown above, dimension-wise quantized features achieve nearly identical performance to continuous features on all benchmarks, while VQ suffers substantial degradation. This confirms that properly discretized high-dimensional tokens preserve semantic quality for understanding tasks, establishing them as viable unified representations for both understanding and generation.
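To illustrate why dimension-wise quantization stays tractable, here is a minimal sketch that quantizes each dimension against its own 1-D scalar codebook. The quantile-based codebook construction and the sizes used are assumptions for illustration, not the paper's actual tokenizer:

```python
import numpy as np

def dimension_wise_quantize(x, codebooks):
    """Quantize each feature dimension independently against a 1-D codebook.

    x: (n, d) continuous features; codebooks: (d, k) scalar code values.
    Returns (n, d) integer codes. Per-dimension scalar lookup sidesteps the
    curse of dimensionality that breaks joint VQ at d=768.
    """
    # nearest scalar code per dimension: argmin over |x[i, j] - codebooks[j, :]|
    return np.abs(x[:, :, None] - codebooks[None, :, :]).argmin(axis=2)

def dequantize(codes, codebooks):
    """Map integer codes back to scalar code values, dimension by dimension."""
    d = codebooks.shape[0]
    return codebooks[np.arange(d)[None, :], codes]

# Illustrative setup: 100 features of 16 dims, k=8 quantile levels per dim.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))
codebooks = np.quantile(x, np.linspace(0.05, 0.95, 8), axis=0).T  # (16, 8)
codes = dimension_wise_quantize(x, codebooks)
x_hat = dequantize(codes, codebooks)
```

Because every dimension only needs a small scalar codebook, the total code space grows as k per dimension rather than requiring a single joint codebook over all 768 dimensions at once.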

Generation Quality

We compare CubiD with existing discrete generation methods on ImageNet 256×256 class-conditional generation. CubiD directly generates native high-dimensional representation tokens (768-dim), while all other methods operate in compressed latent spaces of 8 to 128 dimensions.

ImageNet 256×256 Class-Conditional Generation

Method          Latent Dim   #Params   gFID↓   gFID↓ (w/ cfg)

Discrete Diffusion
MaskGIT         16           227M      6.18    4.02
Token-Critic    16           368M      4.69    –
DPC             16           454M      4.45    –

Discrete Autoregressive
ViT-VQGAN       32           1.7B      4.17    –
LlamaGen-XXL    8            1.4B      14.6    –
VAR             32           2.0B      2.16    1.97
VFMTok-XXL      12           1.4B      2.38    1.95
VFMTok-3B       12           3.1B      2.34    2.04

High-Dimensional Tokens (Ours)
CubiD-L         768          946M      2.38    2.37
CubiD-XL        768          1.4B      2.06    2.04
CubiD-XXL       768          3.7B      2.02    1.88

Notably, representation tokens show reduced dependency on classifier-free guidance—even without guidance, CubiD-XXL achieves 2.02 gFID, outperforming most existing methods. CubiD also demonstrates consistent improvement with scale, from 2.38 (L) to 2.06 (XL) to 2.02 (XXL) without guidance, exhibiting strong scaling behavior from 900M to 3.7B parameters.

BibTeX

@article{wang2026cubid,
  title={Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens},
  author={Wang, Yuqing and Ma, Chuofan and Lin, Zhijie and Teng, Yao and Yu, Lijun and Wang, Shuai and Han, Jiaming and Feng, Jiashi and Jiang, Yi and Liu, Xihui},
  journal={arXiv preprint arXiv:2603.19232},
  year={2026}
}