One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing

May 21, 2026

Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post hoc.

ByteDance's research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing, including multi-turn consistency editing across both modalities.

This all-in-one capability is a significant milestone. While standard unified architectures typically stop at basic image understanding and text-to-image generation, Lance is among the few to natively bridge the entire image-video ecosystem across both understanding and generation tasks.

How the Architecture Works

The architecture is based on two principles: unified context modeling and decoupled capability pathways.

For unified context, Lance converts all inputs (text, images, and videos) into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All these heterogeneous token types (text, semantic visual, and latent visual) live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
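To make the mixed attention pattern concrete, here is a minimal sketch of a mask builder. It assumes one plausible reading of the description: every token can see everything before it, and tokens inside the same visual block can additionally see each other in both directions. The function name `build_attention_mask`, the segment labels, and the exact masking rule are illustrative assumptions, not Lance's actual implementation.

```python
from typing import List, Tuple

# A segment is (kind, length); kind is "text" or "visual" (hypothetical labels).
Segment = Tuple[str, int]


def build_attention_mask(segments: List[Segment]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed rule: causal attention everywhere, plus full bidirectional
    attention among tokens that belong to the same visual segment.
    """
    seg_of_token: List[int] = []
    kinds: List[str] = []
    for seg_idx, (kind, length) in enumerate(segments):
        seg_of_token.extend([seg_idx] * length)
        kinds.extend([kind] * length)

    n = len(seg_of_token)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_visual_block = (
                kinds[i] == "visual" and seg_of_token[i] == seg_of_token[j]
            )
            mask[i][j] = causal or same_visual_block
    return mask


# Two text tokens, a three-token visual block, then one more text token.
mask = build_attention_mask([("text", 2), ("visual", 3), ("text", 1)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the visual block shows up as a fully filled square on the diagonal, while text rows stay lower-triangular.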

For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLMUND) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLMGEN) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence: they share context but do not compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
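The two-objective setup can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the flow matching path is taken to be a straight line from noise to clean latents (the common rectified-flow form), and the weight names `w_und` and `w_gen` are hypothetical. The article does not give Lance's exact formulation or weight values.

```python
import math
from typing import Sequence


def next_token_loss(logits: Sequence[float], target: int) -> float:
    """Cross-entropy for a single next-token prediction, from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def interpolate(noise: Sequence[float], clean: Sequence[float], t: float) -> list:
    """Noisy latent on an assumed straight path: noise at t=0, clean at t=1."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]


def flow_matching_loss(
    pred_velocity: Sequence[float],
    noise: Sequence[float],
    clean: Sequence[float],
) -> float:
    """Mean squared error against the straight-path velocity (clean - noise)."""
    target = [c - n for c, n in zip(clean, noise)]
    return sum((p - v) ** 2 for p, v in zip(pred_velocity, target)) / len(target)


def combined_loss(ntp: float, fm: float, w_und: float = 1.0, w_gen: float = 1.0) -> float:
    """Weighted sum of the understanding and generation objectives."""
    return w_und * ntp + w_gen * fm


ntp = next_token_loss([2.0, 0.5, -1.0], target=0)
fm = flow_matching_loss([0.9, 2.1], noise=[0.0, 1.0], clean=[1.0, 3.0])
print(combined_loss(ntp, fm, w_und=1.0, w_gen=0.5))
```

Because each expert only receives gradients from its own objective's tokens in this picture, the weights act as a dial on how much each pathway drives the shared context.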

Modality-Aware Rotary Positional Encoding (MaPE)

Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone, so it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.

Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video.

Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95: consistent degradation across generation, editing, and understanding.
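The offset idea can be illustrated with plain (t, h, w) position indices, before any rotary embedding is applied. This is a minimal sketch, not the paper's code: the helper names and the offset value of 1000 are assumptions, and the only requirement built in here is that the offset exceeds the number of frames in any single group.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) index fed to a 3D rotary encoding
GroupShape = Tuple[int, int, int]  # (frames, height, width) of one visual group


def grid_positions(frames: int, height: int, width: int) -> List[Position]:
    """Plain spatiotemporal layout: every group starts again at t = 0."""
    return [
        (t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def offset_positions(
    groups: List[GroupShape], temporal_offset: int = 1000
) -> List[List[Position]]:
    """Shift each group's temporal index by group_index * temporal_offset.

    Spatial coordinates are left untouched, and frame order inside a group
    is preserved because the whole group moves by the same amount.
    """
    shifted = []
    for group_index, (frames, height, width) in enumerate(groups):
        shift = group_index * temporal_offset
        shifted.append(
            [(t + shift, h, w) for (t, h, w) in grid_positions(frames, height, width)]
        )
    return shifted


# A single conditioning image followed by a two-frame target clip.
groups = [(1, 2, 2), (2, 2, 2)]
without_offset = offset_positions(groups, temporal_offset=0)
with_offset = offset_positions(groups)
print(set(without_offset[0]) & set(without_offset[1]))  # colliding positions
print(set(with_offset[0]) & set(with_offset[1]))        # empty set
```

With a zero offset, the image and the first video frame land on identical positions; with the shift applied, the two groups no longer overlap.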

Training: Four Stages, One Unified Framework

Lance is trained through four sequential stages, each building on the last.

Pre-Training (PT) lays the foundation using roughly 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.

Continual Training (CT) expands the task space by introducing interleaved multi-task data (editing samples, subject-driven generation samples, and multimodal understanding data) across roughly 300B tokens. A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.

Supervised Fine-Tuning (SFT) tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.

Everything fits within a maximum training budget of 128 GPUs.
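A progressive mixture schedule of this kind can be sketched as a simple interpolation between a starting and an ending set of sampling weights. The task names, the numbers, and the linear ramp below are purely illustrative assumptions; the article does not state Lance's actual proportions or schedule shape.

```python
from typing import Dict


def mixture_at(
    progress: float, start: Dict[str, float], end: Dict[str, float]
) -> Dict[str, float]:
    """Linearly interpolate sampling weights and renormalize them to sum to 1.

    progress is the fraction of the stage completed, clamped to [0, 1].
    """
    p = min(max(progress, 0.0), 1.0)
    raw = {task: (1.0 - p) * start[task] + p * end[task] for task in start}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}


# Hypothetical weights: editing starts rare and becomes more frequent.
start = {"generation": 0.6, "editing": 0.1, "understanding": 0.3}
end = {"generation": 0.4, "editing": 0.3, "understanding": 0.3}

for progress in (0.0, 0.5, 1.0):
    print(progress, mixture_at(progress, start, end))
```

A data loader would then sample the task for each batch according to the weights returned for the current training progress.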
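The core of GRPO is that each sample's advantage is computed relative to the other samples generated for the same prompt, so no separate value network is needed. The sketch below shows that normalization step with a stand-in text reward. The `text_reward` function is a hypothetical string-similarity score, and the recognized strings are assumed to come from an OCR system run on each generated image; no real PaddleOCR call is made here, and the use of population standard deviation is a choice of this sketch.

```python
import difflib
import statistics
from typing import List, Sequence


def text_reward(target: str, recognized: str) -> float:
    """Hypothetical reward: similarity between requested and recognized text."""
    return difflib.SequenceMatcher(None, target.lower(), recognized.lower()).ratio()


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three samples for one prompt asking for a sign that reads "OPEN".
# The strings stand in for OCR output on each generated image.
recognized = ["OPEN", "OPFN", "XXXX"]
rewards = [text_reward("OPEN", text) for text in recognized]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Samples that render the text more accurately than their group average receive positive advantages and are reinforced; the rest are pushed down.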

Results

Image Generation. On GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling, though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval; Show-o2 (7B) scores 0.76. Lance matches the top unified model score at 3B activated parameters.

Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.

Video Understanding. On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters, which is notable given that it is simultaneously trained for generation and editing.

Key Takeaways

Lance is a native unified multimodal model with 3B activated parameters that handles image and video understanding, generation, and editing within a single jointly trained framework.
A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) decouples the understanding and generation pathways while keeping them in a shared interleaved multimodal context.
Lance achieves 0.90 on GenEval and 85.11 on VBench, the highest Total Score among unified models, and was trained within a maximum budget of 128 GPUs.
On MVBench, Lance scores 62.0, the highest among unified models, outperforming Show-o2 (7B) at 55.7 while also supporting generation and editing.
Lance is open-source under Apache 2.0, with weights available on Hugging Face.
