Tokenization in video models, usually via patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video length from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video length. TrajTok features a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM), with especially strong performance on long-video reasoning.
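To illustrate the core idea of decoupling token count from video length, here is a minimal toy sketch (not the paper's method): pixels are clustered jointly over space and time, a crude stand-in for TrajTok's learned implicit clustering, and one pooled token is produced per space-time cluster. The function name `trajtok_sketch` and the plain k-means clustering are illustrative assumptions; the actual model learns the clustering end-to-end with the downstream objective.

```python
import numpy as np

def trajtok_sketch(video, num_tokens=8, iters=10, seed=0):
    """Toy trajectory-style tokenizer: cluster pixels jointly over space
    AND time, then pool features per cluster. The token count is set by
    the cluster budget (num_tokens), not by the number of frames.
    NOTE: plain k-means here is an illustrative stand-in for the learned
    implicit clustering described in the abstract."""
    T, H, W, C = video.shape
    # Per-pixel features: normalized (t, y, x) coordinates plus color.
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([t / T, y / H, x / W], axis=-1).reshape(-1, 3)
    feats = np.concatenate([coords, video.reshape(-1, C)], axis=-1)
    # K-means over space-time pixel features.
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), num_tokens, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((feats[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(num_tokens):
            if (assign == k).any():
                centers[k] = feats[assign == k].mean(axis=0)
    # One token per space-time cluster ("trajectory"): pooled color feature.
    tokens = np.stack([
        feats[assign == k, 3:].mean(axis=0) if (assign == k).any() else np.zeros(C)
        for k in range(num_tokens)
    ])
    return tokens  # shape: (num_tokens, C), independent of T

# A 4-frame and a 32-frame clip yield the same number of tokens.
short_clip = trajtok_sketch(np.random.default_rng(1).random((4, 8, 8, 3)))
long_clip = trajtok_sketch(np.random.default_rng(1).random((32, 8, 8, 3)))
```

Both calls return an `(8, 3)` token array: the token budget stays constant while the input length varies by 8x, which is the decoupling property the abstract attributes to trajectory-based tokenization.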
- † University of Washington
- ‡ Allen Institute for Artificial Intelligence (AI2)
- § Woven by Toyota, Inc.
