Representation Forcing for Bottleneck-Free Unified Multimodal Models

Yuqing Wang1,2 †, Zhijie Lin2 ‡, Ceyuan Yang2, Yang Zhao2, Fei Xiao2, Hao He3,2 †, Qi Zhao2, Zihan Ding2, Fuyun Wang3,2 †, Shuai Wang4,2 †, Youliang Zhang5,2 †, Haoqi Fan2, Xihui Liu1 ✉
1University of Hong Kong, 2ByteDance Seed, 3The Chinese University of Hong Kong, 4Nanjing University, 5Tsinghua University
† Work done during an internship at ByteDance Seed. ‡ Project lead. ✉ Corresponding author.

What lies between language and pixels?

Representation.

We propose to ground visual representation in the decoder of a unified model. The decoder learns to predict it from text, just as the encoder reads it from images. Understanding and generation then meet in a single representation space, learned end to end, with no frozen VAE in between.

We call this Representation Forcing: the encoder's representations force the decoder to learn the same visual structure, and the decoder's predictions, in turn, force the pixels to follow it.

[Figure: Generation gallery at 1024 × 1024]

Text-to-image generation results at 1024 × 1024 resolution from our pixel-space unified model with Representation Forcing.

Abstract

Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels.

In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space.

We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.

Main Contributions

  1. Representation Forcing. A simple technique that closes the pixel-space generation gap in unified multimodal models by training the decoder to autoregressively predict visual representations as intermediate tokens, without any pretrained VAE.
  2. Benefits both generation and understanding. Our pixel-space model with RF matches VAE-based unified models on generation and outperforms its VAE-based counterpart on understanding, suggesting pixel-space generation is more compatible with unified multimodal modeling.
  3. A step toward fully end-to-end UMMs. Perception and generation share a single, end-to-end-learned representation space, rather than coordinating across separately pretrained components.

Architectural Comparison

[Figure: Architectural comparison, panels (a) to (c)]

(a) Prevailing UMMs rely on a frozen VAE encoder and decoder for image generation, creating a structural bottleneck. (b) Naively removing the VAE and generating directly in pixel space eliminates this bottleneck but loses structural guidance, leading to a quality gap. (c) Representation Forcing closes this gap by training the decoder to autoregressively predict visual representations (Rep head) before pixel generation. These representations match features from the model's own understanding encoder and remain in context within the shared transformer, providing structural guidance for pixel-space diffusion without any external latent space.

Method

Training Pipeline of Representation Forcing

[Figure: RF training pipeline, left and right panels]

Left: The decoder processes a unified sequence of text tokens (T), representation tokens (R), and pixel patches (P) within a shared transformer. Text and representation tokens are predicted autoregressively under next-token prediction (ℒ_LM and ℒ_Rep), while pixel patches are generated via bidirectional diffusion from noise (ℒ_FM). The image encoder provides continuous visual features to the transformer for understanding tasks.

Right: For generation training, an EMA copy of the image encoder extracts features from the ground-truth image, which are discretized via online quantization into representation tokens. These tokens provide both training targets for ℒ_Rep and teacher-forcing inputs at R positions.

At inference, the right panel is bypassed entirely: the decoder predicts representation tokens from the text prompt alone, and these tokens guide pixel-space diffusion.
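The mixed causal and bidirectional attention pattern described above can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation: the function name `build_attention_mask`, the sequence layout [T | R | P], and the choice to let every pixel patch see the full T/R prefix and all other patches are assumptions made here for clarity.

```python
from typing import List


def build_attention_mask(n_text: int, n_rep: int, n_pix: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: [text tokens | representation tokens | pixel patches].
    """
    n_ar = n_text + n_rep      # autoregressive prefix (T and R tokens)
    n = n_ar + n_pix
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_ar:
                # T and R tokens are causal, so next-token prediction stays valid.
                mask[i][j] = j <= i
            else:
                # P patches see the whole T/R prefix and every other patch.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # 2 text tokens, 2 representation tokens, 3 pixel patches.
    for row in build_attention_mask(2, 2, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions the T/R rows form a lower-triangular block, while the P rows are fully visible, which is how the representation tokens can stay in context during pixel diffusion.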

Key Design Choices

  • Representations from the understanding encoder. Rather than relying on an external latent space, RF derives intermediate representations from the model's own jointly trained understanding encoder. Features are discretized via online vector quantization into representation tokens, using an EMA copy of the encoder for stable targets and Sinkhorn-Knopp balancing to prevent codebook collapse.
  • Discrete representation tokens. We discretize the visual representations so they can be predicted under the same next-token prediction objective as text tokens — unifying language modeling and representation prediction within a single autoregressive stream.
  • Mixture-of-Transformers backbone. All tokens share self-attention layers but are routed to three modality-specific feed-forward experts: understanding, representation prediction, and pixel generation.
  • End-to-end objective: ℒ = ℒ_LM + ℒ_FM + ℒ_Rep. ℒ_LM and ℒ_Rep are cross-entropy losses on text and representation tokens; ℒ_FM is a flow-matching loss with x-prediction on pixels.
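The page names Sinkhorn-Knopp balancing for the online quantizer but does not spell out the formulation. The sketch below shows the standard alternating column and row normalization commonly used for balanced cluster assignment; it is an illustration under stated assumptions, not the authors' code. The function name `sinkhorn_knopp`, the temperature `eps`, the iteration count `n_iters`, and the assumption that `scores` are feature-to-code similarities in [-1, 1] are all hypothetical.

```python
import math
from typing import List


def sinkhorn_knopp(scores: List[List[float]], eps: float = 0.05,
                   n_iters: int = 3) -> List[List[float]]:
    """Turn a (samples x codes) similarity matrix into balanced soft assignments.

    Each returned row sums to 1, and total usage per code is pushed toward
    uniform. Assumes scores lie in [-1, 1] so the exponentials do not underflow.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for stability
    q = [[math.exp((s - top) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every code receives the same total mass (1 / k).
        for j in range(k):
            col_sum = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] /= col_sum * k
        # Row step: every sample carries the same total mass (1 / n).
        for i in range(n):
            row_sum = sum(q[i])
            q[i] = [v / (row_sum * n) for v in q[i]]
    # Rescale so that each row is a probability distribution over codes.
    return [[v * n for v in row] for row in q]


if __name__ == "__main__":
    # All four samples prefer code 0, the situation that leads to collapse.
    scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
    q = sinkhorn_knopp(scores)
    codes = [max(range(len(row)), key=row.__getitem__) for row in q]
    print(codes)  # -> [0, 0, 1, 1]
```

A plain argmax over `scores` would send every sample to code 0; the balanced assignment splits them evenly, which is the collapse-prevention effect the design choice above refers to.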

Experiments

Image Generation

On GenEval and DPG-Bench, our pixel-space model with RF (RF-Pixel) matches state-of-the-art VAE-based unified multimodal models — without any separately pretrained generative module.

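GenEval columns: Single Obj. through GenEval Overall. DPG column: DPG-Bench Overall. Higher is better; "-" marks a value that is not reported.

Generation Only

| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | GenEval Overall↑ | DPG Overall↑ |
|---|---|---|---|---|---|---|---|---|
| PixArt-α | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | 71.11 |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 |
| DALL-E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 | - |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 |
| DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | 83.50 |
| SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 84.08 |
| FLUX.1-dev | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 | 0.82 | 84.00 |
| Seedream 3.0 | 0.99 | 0.96 | 0.91 | 0.93 | 0.47 | 0.80 | 0.84 | 88.27 |
| Z-Image-Turbo | 1.00 | 0.95 | 0.77 | 0.89 | 0.65 | 0.68 | 0.82 | 84.86 |
| Qwen-Image | 0.99 | 0.92 | 0.89 | 0.88 | 0.76 | 0.77 | 0.87 | 88.32 |

Unified Models

| Model | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | GenEval Overall↑ | DPG Overall↑ |
|---|---|---|---|---|---|---|---|---|
| Chameleon | - | - | - | - | - | - | 0.39 | - |
| LWM | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 | 0.47 | - |
| SEED-X | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 | - |
| TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 | 73.38 |
| ILLUME | 0.99 | 0.86 | 0.45 | 0.71 | 0.39 | 0.28 | 0.61 | - |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | - |
| Transfusion | - | - | - | - | - | - | 0.63 | - |
| Emu3 | 0.99 | 0.81 | 0.42 | 0.80 | 0.49 | 0.45 | 0.66 | 81.60 |
| Show-o | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 | - |
| Show-o2 | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | 86.14 |
| Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 84.19 |
| MetaQuery-XL | - | - | - | - | - | - | 0.80 | 82.05 |
| BLIP3-o | - | - | - | - | - | - | 0.84 | 81.60 |
| UniWorld-V1 | 0.98 | 0.93 | 0.81 | 0.89 | 0.74 | 0.71 | 0.84 | 81.38 |
| OmniGen2 | 0.99 | 0.96 | 0.74 | 0.98 | 0.71 | 0.75 | 0.86 | 83.57 |
| BAGEL | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 | 85.07 |
| BAGEL† | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 | - |
| RF-Pixel (ours) | 0.99 | 0.93 | 0.84 | 0.89 | 0.74 | 0.66 | 0.84 | 84.15 |
| RF-Pixel (ours)† | 0.98 | 0.95 | 0.88 | 0.87 | 0.92 | 0.70 | 0.88 | - |

† with LLM rewriter.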
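As a quick sanity check on the table, the GenEval Overall column can be recomputed from the six sub-scores. The sketch below assumes Overall is the unweighted mean of the six task scores, which the reported numbers are consistent with; the helper name `geneval_overall` is illustrative, and the two rows are the two RF-Pixel entries in the order they appear in the table.

```python
from statistics import mean

# Six GenEval sub-scores in table order:
# Single Obj., Two Obj., Counting, Colors, Position, Color Attri.
RF_PIXEL_ROWS = {
    "first RF-Pixel row": ([0.99, 0.93, 0.84, 0.89, 0.74, 0.66], 0.84),
    "second RF-Pixel row": ([0.98, 0.95, 0.88, 0.87, 0.92, 0.70], 0.88),
}


def geneval_overall(sub_scores):
    """Unweighted mean of the sub-scores (assumed definition of Overall)."""
    return mean(sub_scores)


for name, (sub_scores, reported) in RF_PIXEL_ROWS.items():
    computed = geneval_overall(sub_scores)
    print(f"{name}: computed {computed:.3f}, reported {reported:.2f}")
```

Both recomputed means round to the reported Overall values, so the reconstructed column boundaries for these rows are consistent with the averaging assumption.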

Image Understanding

Two takeaways: (i) RF improves general visual understanding under both generation pathways, with gains on all five general benchmarks for Pixel and on four of five for VAE (MMMU drops by 1.4); (ii) Pixel + RF outperforms VAE + RF on six of the eight benchmarks. Removing the external VAE lets understanding and generation share a single representation space more tightly, indicating pixel-space generation is more compatible with unified multimodal modeling.

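General Visual Understanding: MMMU, HalluBench, MME*, BLINK, RealWorldQA. Document & Diagram: AI2D, DocVQA, ChartQA.

| Method | MMMU | HalluBench | MME* | BLINK | RealWorldQA | AI2D | DocVQA | ChartQA |
|---|---|---|---|---|---|---|---|---|
| VLM-only | 56.2 | 65.0 | 79.7 | 56.2 | 65.8 | 90.3 | 89.3 | 86.0 |
| VAE | 51.0 | 55.7 | 71.3 | 52.2 | 65.2 | 90.7 | 90.0 | 78.8 |
| VAE + RF | 49.6 (−1.4) | 61.3 (+5.6) | 79.3 (+8.0) | 52.9 (+0.7) | 66.6 (+1.4) | 87.8 (−2.9) | 88.3 (−1.7) | 80.5 (+1.7) |
| Pixel | 49.9 | 63.7 | 76.6 | 49.4 | 63.1 | 85.8 | 90.0 | 81.7 |
| Pixel + RF | 54.2 (+4.3) | 64.8 (+1.1) | 80.2 (+3.6) | 53.0 (+3.6) | 65.8 (+2.7) | 90.3 (+4.5) | 88.0 (−2.0) | 81.3 (−0.4) |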

MME* reports average accuracy across all perception and cognition questions. Deltas are vs. the corresponding non-RF baseline.
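The deltas and the head-to-head comparison between the two RF variants can be reproduced directly from the reported scores. The snippet below is a small arithmetic check using only numbers from the table; the helper names `deltas` and `wins` are illustrative.

```python
BENCHMARKS = ["MMMU", "HalluBench", "MME*", "BLINK", "RealWorldQA",
              "AI2D", "DocVQA", "ChartQA"]

SCORES = {
    "VAE":        [51.0, 55.7, 71.3, 52.2, 65.2, 90.7, 90.0, 78.8],
    "VAE + RF":   [49.6, 61.3, 79.3, 52.9, 66.6, 87.8, 88.3, 80.5],
    "Pixel":      [49.9, 63.7, 76.6, 49.4, 63.1, 85.8, 90.0, 81.7],
    "Pixel + RF": [54.2, 64.8, 80.2, 53.0, 65.8, 90.3, 88.0, 81.3],
}


def deltas(with_rf: str, baseline: str):
    """Per-benchmark change of the RF variant over its non-RF baseline."""
    return [round(a - b, 1) for a, b in zip(SCORES[with_rf], SCORES[baseline])]


def wins(a: str, b: str):
    """Benchmarks on which method a scores strictly higher than method b."""
    return [name for name, x, y in zip(BENCHMARKS, SCORES[a], SCORES[b]) if x > y]


print(deltas("VAE + RF", "VAE"))
print(deltas("Pixel + RF", "Pixel"))
print(wins("Pixel + RF", "VAE + RF"))
```

Pixel + RF scores higher than VAE + RF on six of the eight benchmarks; the exceptions are RealWorldQA and DocVQA.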

Ablation: Effect of Representation Forcing

[Figure: Qualitative comparison with and without RF]

Qualitative comparison of pixel-space generation with and without RF. Without RF, the model produces images with poor structure — distorted object shapes and incoherent compositions. With RF, the model generates more coherent structures by first predicting high-level visual representations before pixel rendering. Quantitatively, RF lifts pixel-space GenEval from 0.25 to 0.76, matching the VAE-based counterpart at 0.77.

BibTeX

@article{wang2026representation,
  title={Representation Forcing for Bottleneck-Free Unified Multimodal Models},
  author={Wang, Yuqing and Lin, Zhijie and Yang, Ceyuan and Zhao, Yang and Xiao, Fei and He, Hao and Zhao, Qi and Ding, Zihan and Wang, Fuyun and Wang, Shuai and Zhang, Youliang and Fan, Haoqi and Liu, Xihui},
  journal={arXiv preprint arXiv:2604.21921},
  year={2026}
}