Video-conditioned sound and speech generation tasks, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS), are conventionally addressed separately, with limited exploration of unifying them within a single framework. Existing attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow employs a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases when introducing conditions. Accordingly, VSSFlow leverages these inductive biases to handle the different condition representations effectively: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from end-to-end joint learning of sound and speech generation without additional training-stage designs. Detailed analysis attributes this to the learned general audio prior shared between the tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.
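To make the condition aggregation idea concrete, below is a minimal, hypothetical PyTorch sketch of one transformer block that injects transcript embeddings via self-attention (by concatenating them with the noisy audio latents) and video features via cross-attention. All module names, dimensions, and the exact ordering of operations are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConditionAggregationBlock(nn.Module):
    """Sketch of one flow-matching transformer block with two condition paths:
    - transcript tokens (deterministic) join the sequence and are handled by self-attention;
    - video features (ambiguous) are attended to via cross-attention.
    Dimensions and layer layout are assumptions for illustration only.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, audio_latents, transcript_emb, video_emb):
        # Self-attention over the concatenation of noisy audio latents and
        # transcript embeddings (the more deterministic condition).
        t_audio = audio_latents.size(1)
        x = torch.cat([audio_latents, transcript_emb], dim=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]

        # Cross-attention from the joint sequence to video features
        # (the more ambiguous condition).
        h = self.norm2(x)
        x = x + self.cross_attn(h, video_emb, video_emb, need_weights=False)[0]

        # Position-wise feed-forward; keep only the audio positions so the
        # updated latents continue through the flow-matching network.
        x = x + self.ff(self.norm3(x))
        return x[:, :t_audio]

# Example shapes (assumed): batch of 2, 250 audio latent frames,
# 40 transcript tokens, 100 video frames, hidden size 512.
block = ConditionAggregationBlock()
out = block(torch.randn(2, 250, 512), torch.randn(2, 40, 512), torch.randn(2, 100, 512))
print(out.shape)  # torch.Size([2, 250, 512])
```

The split reflects the inductive-bias argument in the abstract: concatenation plus self-attention lets transcript tokens align tightly with the audio sequence, while cross-attention lets the model softly query video features whose correspondence to the audio is less deterministic.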
