Friday, November 7, 2025

Adapting Self-Supervised Representations as a Latent Space for Efficient Generation


We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
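
The abstract names two training signals: a flow matching term for the decoder and a cosine-similarity term that keeps the fine-tuned token close to the original SSL token. The sketch below shows one way these could be combined. It is illustrative only and not the authors' implementation: the linear noise-to-image path, the weight `cos_weight`, and all function names (`predict_velocity`, `reptok_style_loss`, and so on) are assumptions, and plain Python lists stand in for tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_regularizer(adapted_token, frozen_token):
    """0 when the fine-tuned token points the same way as the frozen SSL token."""
    return 1.0 - cosine_similarity(adapted_token, frozen_token)


def flow_matching_loss(predict_velocity, image, noise, t, token):
    """Mean squared error between predicted and target velocity.

    Assumes a linear path x_t = (1 - t) * noise + t * image, whose
    velocity is the constant (image - noise). This is a common choice,
    not one confirmed by the abstract.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, image)]
    target = [x - n for n, x in zip(noise, image)]
    pred = predict_velocity(x_t, t, token)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def reptok_style_loss(predict_velocity, image, noise, t,
                      adapted_token, frozen_token, cos_weight=1.0):
    """Flow matching term plus weighted cosine regularizer (hypothetical weighting)."""
    fm = flow_matching_loss(predict_velocity, image, noise, t, adapted_token)
    reg = cosine_regularizer(adapted_token, frozen_token)
    return fm + cos_weight * reg


# Toy usage: a "decoder" that always predicts zero velocity.
def zero_decoder(x_t, t, token):
    return [0.0 for _ in x_t]


loss = reptok_style_loss(
    zero_decoder,
    image=[1.0, 2.0],
    noise=[0.0, 0.0],
    t=0.5,
    adapted_token=[1.0, 0.0],
    frozen_token=[0.0, 1.0],
    cos_weight=0.5,
)
```

In a real setup the decoder would be a neural network conditioned on the single latent token, and gradients from both terms would flow into the fine-tuned token embedding while the rest of the SSL encoder stays frozen.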
