TokenBridge: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

¹University of Hong Kong, ²ByteDance Seed, ³École Polytechnique, ⁴Peking University

Token Representation Dilemma

Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. However, a fundamental dilemma exists in token representation:

Discrete Tokens

  • 😊 Enable straightforward modeling with standard cross-entropy loss
  • 😢 Suffer from information loss due to quantization during training
  • 😢 Face tokenizer training instability issues
  • 😢 Limited vocabulary size restricts representational capacity

Continuous Tokens

  • 😊 Better preserve rich visual details
  • 😊 Avoid quantization bottlenecks during training
  • 😢 Require complex distribution modeling (diffusion or GMM)
  • 😢 Complicate the generation pipeline with specialized components

How do we bridge this gap?

  1. We decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from pretrained continuous representations, enabling seamless conversion between token types.
  2. Our approach bridges the quality gap between discrete and continuous methods, achieving continuous-level visual quality while maintaining the modeling simplicity of discrete approaches, harnessing the strengths of both.

Comparison of Different AR Approaches

Method Comparison

(a) Traditional discrete tokenization incorporates quantization during training, resulting in tokenizer training instability and limited vocabulary size that restricts representational capacity. (b) Hybrid continuous AR models preserve rich visual information but need complex distribution modeling (diffusion or GMM) beyond standard categorical prediction. (c) Our approach bridges these paradigms by applying post-training quantization to pretrained continuous features, maintaining the high representational capacity of continuous tokens while enabling simple autoregressive modeling.

TokenBridge Teaser

TokenBridge combines the representational capacity of continuous tokens with the modeling simplicity of discrete approaches for high-quality visual generation.

Abstract

Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens.

To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling.

Main Contributions

  • A novel paradigm that bridges continuous and discrete token representations, achieving continuous-level visual quality with the standard autoregressive cross-entropy loss.
  • A training-free quantization approach that transforms pretrained VAE features into discrete tokens without the optimization instabilities of conventional discrete tokenizers.
  • An efficient dimension-wise autoregressive prediction mechanism that handles exponentially large token spaces.

Method

Post-Training Quantization

Post-Training Quantization Process

Illustration of our post-training quantization process. The top row shows the pretrained continuous VAE tokenizer, mapping an input image to continuous latent features and reconstructing it through the decoder. Our post-training quantization process (middle) transforms these continuous features into discrete tokens by independently quantizing each channel dimension. The bottom-left shows how our approach preserves the original Gaussian-like distribution (purple curve) in discretized form (purple histogram). The right portion demonstrates the de-quantization process that maps indices back to continuous values for decoding.
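As a concrete illustration of this round trip, below is a minimal sketch of per-dimension post-training quantization and de-quantization on a pretrained VAE latent. It assumes uniform bins over a clipped symmetric range; the paper's actual bin placement (adapted to the Gaussian-like channel statistics) and bin count may differ.

```python
import numpy as np

def quantize(z, B=64, clip=3.0):
    """Independently map each channel of a pretrained VAE latent z to one of B
    discrete indices. Assumption: uniform bins over [-clip, clip]; the bin
    placement used in the paper may differ."""
    z = np.clip(z, -clip, clip)
    edges = np.linspace(-clip, clip, B + 1)              # B bins per dimension
    idx = np.clip(np.digitize(z, edges) - 1, 0, B - 1)
    return idx                                            # same shape as z, values in [0, B)

def dequantize(idx, B=64, clip=3.0):
    """Map discrete indices back to continuous values (bin centers) so the
    frozen VAE decoder can consume them unchanged."""
    edges = np.linspace(-clip, clip, B + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[idx]

# Round trip: continuous latent -> discrete tokens -> continuous latent.
z = np.random.randn(16, 16, 16)        # e.g. a 16x16 grid of 16-dim latents (shapes illustrative)
tokens = quantize(z)                    # discrete tokens for autoregressive modeling
z_hat = dequantize(tokens)              # fed to the pretrained VAE decoder
print(np.abs(z - z_hat).max())          # small: quantization barely perturbs the latent
```

Because the tokenizer is never retrained, the decoder sees de-quantized values that stay close to the original continuous features, which is why reconstruction quality tracks the continuous VAE.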

Efficient Large-Vocabulary Token Modeling

Dimension-wise Autoregressive Prediction

Our autoregressive generation process. At the spatial level, our model autoregressively generates tokens conditioned on previously generated positions. For each spatial location, we apply dimension-wise sequential prediction to efficiently handle the large token space. This approach decomposes the modeling of each token into a series of smaller classification problems while preserving essential inter-dimensional dependencies.
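The sketch below shows one way such a dimension-wise prediction head could look: instead of one softmax over B^D joint codes, the D channel indices are decoded sequentially, each as a B-way classification conditioned on the spatial context and the dimensions decoded so far. The GRU-based head, hidden sizes, and sampling details here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DimWiseARHead(nn.Module):
    """Sketch of dimension-wise autoregressive prediction for one spatial position.
    Assumption: a small GRU head; the paper's lightweight prediction mechanism may differ."""
    def __init__(self, ctx_dim=768, D=16, B=64, hidden=256):
        super().__init__()
        self.D, self.B = D, B
        self.embed = nn.Embedding(B, hidden)            # embeds previously decoded dimensions
        self.start = nn.Parameter(torch.zeros(hidden))  # "begin" input for dimension 0
        self.rnn = nn.GRUCell(hidden, hidden)
        self.init_h = nn.Linear(ctx_dim, hidden)        # condition on the spatial AR context
        self.cls = nn.Linear(hidden, B)                 # B-way classifier per dimension

    @torch.no_grad()
    def sample(self, ctx):
        """ctx: (N, ctx_dim) context features from the spatial backbone.
        Returns (N, D) discrete indices for one token."""
        h = self.init_h(ctx)
        inp = self.start.expand(ctx.size(0), -1)
        out = []
        for _ in range(self.D):
            h = self.rnn(inp, h)
            logits = self.cls(h)                        # (N, B): one small classification problem
            idx = torch.multinomial(logits.softmax(-1), 1).squeeze(-1)
            out.append(idx)
            inp = self.embed(idx)                       # feed the decoded dimension to the next step
        return torch.stack(out, dim=1)

head = DimWiseARHead()
tokens = head.sample(torch.randn(4, 768))               # (4, 16) indices, each in [0, 64)
```

Decomposing the token this way keeps every classifier at vocabulary size B while the effective joint vocabulary B^D is never materialized, which is what makes the exponentially large token space tractable.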

Experiments

Main Results

Quantitative Comparison

Comparison of visual generation methods on ImageNet 256×256. Our model achieves comparable performance to the best continuous token approach (MAR) while using standard categorical prediction in autoregressive modeling.

Properties of Our Tokenizer

Reconstruction Quality Comparison

Reconstruction quality of typical continuous and discrete tokenizers. Our method achieves reconstruction quality comparable to continuous VAE, preserving more fine details than traditional discrete tokenizers, especially in text and facial features.

Different Granularity Reconstruction Results

Different quantization granularity reconstruction results. Visual comparison showing reconstructions at varying quantization levels. While global structure remains preserved across all quantization levels, finer quantization (higher B values) better maintains details in textures and edges.

Properties of Our Generator

Token Prediction Strategy Comparison

Token Prediction Strategy. Comparison of dimension-wise token prediction approaches. Top: Parallel prediction produces blurry, inconsistent images. Bottom: Our autoregressive approach sequentially predicts token dimensions, generating coherent, high-quality images. This highlights that token dimensions are interdependent and cannot be predicted independently.

Confidence-guided Generation

Generation guided by token confidence. Our discrete token approach enables confidence-guided generation, producing clean foreground objects against simple backgrounds by prioritizing high-confidence tokens. This provides an advantage over continuous tokens, which lack explicit token-level confidence scores.
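To make the idea of token-level confidence concrete, here is a minimal sketch of how a confidence score can be read off the categorical predictions and used to bias generation toward confident tokens. The threshold rule below is an illustrative assumption; the exact guidance scheme used in the paper may differ.

```python
import torch

def confidence_guided_pick(logits, threshold=0.5):
    """logits: (N, B) categorical predictions for N token positions.
    Discrete tokens expose an explicit confidence (the max softmax probability),
    which continuous tokens lack. Sketch: take the greedy token where the model
    is confident, sample otherwise."""
    probs = logits.softmax(-1)
    conf, greedy = probs.max(-1)                        # per-token confidence and argmax index
    sampled = torch.multinomial(probs, 1).squeeze(-1)
    return torch.where(conf >= threshold, greedy, sampled), conf

tokens, conf = confidence_guided_pick(torch.randn(256, 64))
print(conf.mean())   # average token confidence over, e.g., a 16x16 grid
```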

More Visualization Results

ImageNet Generation Results

Class-conditional generation results on ImageNet. Our approach achieves high-quality generation with fine details and realistic textures across diverse object categories.

BibTeX

coming soon~

Acknowledgment

The authors are grateful to Tianhong Li for helpful discussions on MAR and to Yi Jiang, Prof. Difan Zou, and Yujin Han for valuable feedback on the early version of this work.