Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
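The hybrid tokenizer described above can be sketched in a few lines: one shared encoder output is routed through a continuous adapter (for understanding) and a discrete, codebook-lookup adapter (for generation). This is a minimal illustrative sketch with toy dimensions and random weights, not the paper's actual architecture; all names and sizes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the paper's values).
D_ENC, D_LLM, CODEBOOK_SIZE = 8, 4, 16

# Shared vision encoder: stand-in random projection over patch features.
W_enc = rng.normal(size=(D_ENC, D_ENC))

def encode(patches):
    """Single shared encoder; its output feeds both adapters."""
    return patches @ W_enc

# Continuous adapter: projects encoder features into the LLM embedding space
# (image-to-text understanding path).
W_cont = rng.normal(size=(D_ENC, D_LLM))

def continuous_adapter(feats):
    return feats @ W_cont

# Discrete adapter: nearest-neighbor lookup in a learned codebook
# (text-to-image generation path), yielding token ids the LLM can predict.
codebook = rng.normal(size=(CODEBOOK_SIZE, D_ENC))

def discrete_adapter(feats):
    dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)

patches = rng.normal(size=(5, D_ENC))      # 5 image patches
feats = encode(patches)
embeddings = continuous_adapter(feats)     # (5, D_LLM) continuous embeddings
token_ids = discrete_adapter(feats)        # (5,) discrete token ids
print(embeddings.shape, token_ids.shape)
```

Because both adapters read the same encoder features, the two token streams live in a common semantic space, which is the property the abstract credits with reducing the understanding/generation conflict.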
- † Meta
- ** Work done while at Apple
