Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. However, a fundamental dilemma exists in token representation:
Discrete Tokens
- 😊 Enable straightforward modeling with standard cross-entropy loss
- 😢 Suffer from information loss due to quantization during training
- 😢 Face instability when training the tokenizer
- 😢 Limited vocabulary size restricts representational capacity
Continuous Tokens
- 😊 Better preserve rich visual details
- 😊 Avoid quantization bottlenecks during training
- 😢 Require complex distribution modeling (e.g., diffusion heads or Gaussian mixture models; see the sketch after this list)
- 😢 Complicate the generation pipeline with specialized components
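To make the "complex distribution modeling" point concrete, here is a minimal, hypothetical PyTorch sketch of the kind of per-token Gaussian-mixture (GMM) head a continuous-token autoregressive model typically needs: instead of a single cross-entropy over a vocabulary, the network must predict mixture weights, means, and scales and be trained with a mixture log-likelihood. The class name, dimensions, and mixture count are illustrative assumptions, not the configuration of any specific model.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMHead(nn.Module):
    """Predicts a diagonal-Gaussian mixture over one continuous token (illustrative sketch)."""
    def __init__(self, hidden_dim, token_dim, num_mixtures=8):
        super().__init__()
        self.num_mixtures = num_mixtures
        self.token_dim = token_dim
        # Per mixture component: one weight logit, a mean vector, and a log-scale vector.
        self.proj = nn.Linear(hidden_dim, num_mixtures * (1 + 2 * token_dim))

    def loss(self, h, target):
        # h: (B, hidden_dim) transformer output; target: (B, token_dim) continuous token.
        B = h.shape[0]
        params = self.proj(h).view(B, self.num_mixtures, 1 + 2 * self.token_dim)
        logit_pi = params[..., 0]                        # (B, K) mixture logits
        mean = params[..., 1:1 + self.token_dim]         # (B, K, D) component means
        log_std = params[..., 1 + self.token_dim:]       # (B, K, D) component log-scales
        # Log-likelihood of the target under each diagonal-Gaussian component.
        comp_ll = (-0.5 * ((target[:, None] - mean) / log_std.exp()) ** 2
                   - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)  # (B, K)
        log_pi = F.log_softmax(logit_pi, dim=-1)
        # Negative log-likelihood of the mixture, averaged over the batch.
        return -torch.logsumexp(log_pi + comp_ll, dim=-1).mean()
```

Compared with a plain classification layer and cross-entropy, this head adds extra parameters, a more delicate loss, and a sampling procedure at generation time, which is exactly the pipeline complexity discrete tokens avoid.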
How do we bridge this gap?
- We decouple discretization from tokenizer training: a post-training quantization step directly derives discrete tokens from pretrained continuous representations, enabling seamless conversion between the two token types (a minimal sketch follows this list).
- Our approach bridges the quality gap between discrete and continuous methods: it achieves continuous-level visual quality while retaining the modeling simplicity of discrete tokens, harnessing the strengths of both.
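The sketch below illustrates the post-training quantization idea in its simplest form, assuming a pretrained continuous tokenizer whose latents are roughly bounded after normalization. It uses a uniform per-channel scalar quantizer purely as an example; the bin count, latent range, and function names are illustrative assumptions rather than the exact scheme used here. The point is the round trip: continuous latents become discrete indices that a standard cross-entropy autoregressive model can predict, and those indices map back to continuous values for the frozen decoder.

```python
import torch

def quantize(z, num_bins=64, z_min=-1.0, z_max=1.0):
    """Map continuous latents z (B, C, H, W) to discrete indices on a uniform grid."""
    z = z.clamp(z_min, z_max)
    # Each scalar latent value becomes an integer bin index in [0, num_bins - 1].
    return ((z - z_min) / (z_max - z_min) * (num_bins - 1)).round().long()

def dequantize(idx, num_bins=64, z_min=-1.0, z_max=1.0):
    """Map discrete indices back to continuous values on the quantization grid."""
    return idx.float() / (num_bins - 1) * (z_max - z_min) + z_min

# Round-trip example: discrete tokens for the AR model, continuous values for the decoder.
z = torch.randn(2, 16, 16, 16).tanh()   # stand-in for pretrained continuous latents
tokens = quantize(z)                     # discrete tokens, usable with cross-entropy
z_rec = dequantize(tokens)               # continuous values fed to the frozen decoder
print((z - z_rec).abs().max())           # small quantization error per element
```

Because the quantizer is applied after the continuous tokenizer is trained, the tokenizer itself never sees quantization during optimization, which is what sidesteps the information loss and instability issues listed above.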