Wednesday, January 21, 2026

Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flow Prediction Framework that Enables Improved Robot Control and Video Generation


The Salesforce AI research team presents FOFPred, a language-driven future optical flow prediction framework that connects large vision language models with diffusion transformers for dense flow forecasting in control and video generation settings. FOFPred takes one or more images and a natural language instruction such as 'moving the bottle from right to left' and predicts four future optical flow frames that describe how every pixel is expected to move over time.

https://arxiv.org/pdf/2601.10781

Future optical flow as a motion representation

Optical flow is the apparent per-pixel displacement between two frames. FOFPred focuses on future optical flow, which means predicting dense displacement fields for future frames given only current observations and text, without access to future images at inference.

Future optical flow is a compact, motion-only representation. It removes static appearance and retains only pixel-level motion, so it is well suited as an intermediate state for robot control policies and as a conditioning signal for video diffusion models. Compared to predicting future RGB frames, it reduces the complexity of the output distribution and avoids modeling textures and high-frequency details that are not required for motion planning.

To plug into existing latent diffusion infrastructure, the research team encodes optical flow as RGB images. They map flow magnitude and direction from polar form into HSV channels, then convert to RGB. The scaling of each channel is tuned so that consecutive flow frames are visually smooth and resemble animated graphics. A standard Flux.1 variational autoencoder then encodes and decodes these flow images.
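The sketch below shows one common way to perform this kind of HSV encoding with OpenCV. It is a minimal illustration, not the paper's exact implementation, and the channel scaling constant `max_mag` is an assumption standing in for the tuned scaling the authors describe.

```python
import numpy as np
import cv2


def flow_to_rgb(flow: np.ndarray, max_mag: float = 20.0) -> np.ndarray:
    """Encode a dense flow field of shape (H, W, 2) as an RGB image via HSV."""
    dx = flow[..., 0].astype(np.float32)
    dy = flow[..., 1].astype(np.float32)
    mag, ang = cv2.cartToPolar(dx, dy)                       # polar form: magnitude and angle (radians)
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)   # hue encodes direction (OpenCV hue range 0-179)
    hsv[..., 1] = 255                                        # full saturation
    hsv[..., 2] = np.clip(mag / max_mag * 255, 0, 255).astype(np.uint8)  # value encodes magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```

Once in RGB form, the flow frames can be passed through the same VAE used for ordinary images.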

Unified VLM-Diffusion backbone

FOFPred uses a unified architecture that combines a frozen vision language model, a frozen VAE, and a trainable diffusion transformer. The pipeline is:

  • Qwen2.5-VL is used as the vision language encoder to jointly encode the caption and visual inputs.
  • The Flux.1 VAE encodes the input images and the training optical flow targets into latent tensors.
  • An OmniGen-style diffusion transformer, DiT, takes projected visual and textual features as conditional inputs and generates latent future flow sequences.

Only the DiT and the small MLP projectors are trained. The Qwen2.5-VL and Flux.1 weights stay frozen, which lets the model reuse image editing pretraining and multimodal reasoning capacity from prior work. Temporal modeling is added by extending the RoPE positional encoding and attention blocks from two-dimensional spatial positions to full spatio-temporal positions across input and output frame sequences. This provides full spatio-temporal attention without adding extra parameters, so the DiT can reuse OmniGen image pretraining directly.
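As an illustration only, a spatio-temporal position grid of this kind can be built as in the minimal sketch below; how the DiT splits its rotary dimensions across the time, height, and width axes follows the OmniGen and Flux conventions and is not reproduced here.

```python
import torch


def spatiotemporal_position_ids(num_frames: int, height: int, width: int) -> torch.Tensor:
    """Return (num_frames * height * width, 3) integer ids, one (t, h, w) triple per token."""
    t = torch.arange(num_frames)
    h = torch.arange(height)
    w = torch.arange(width)
    tt, hh, ww = torch.meshgrid(t, h, w, indexing="ij")
    return torch.stack([tt, hh, ww], dim=-1).reshape(-1, 3)
```

Each axis then receives its own slice of the rotary embedding, so the existing attention blocks can attend jointly over space and time without any new parameters.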


Training on noisy web videos with relative optical flow

The core model is trained on web-scale human activity videos with paired captions. The research team uses the Something-Something V2 dataset and the EgoDex egocentric manipulation dataset to obtain around 500,000 video-caption pairs.

Training uses an end-to-end flow matching objective in latent space. Future optical flow sequences are first computed offline, then encoded by the VAE and used as targets in a flow matching diffusion loss for the DiT. During training the method also applies classifier-free guidance on both text and visual conditions and masks some frames and viewpoints to improve robustness.
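The flow matching objective itself is standard rectified-flow style training; the minimal latent-space sketch below illustrates it under simplifying assumptions, with `dit` as a placeholder for the diffusion transformer and the conditioning tensors standing in for the projected Qwen2.5-VL features.

```python
import torch
import torch.nn.functional as F


def flow_matching_loss(dit, z_target, cond_visual, cond_text):
    """z_target: VAE latents of the future optical flow frames, shape (B, ...)."""
    noise = torch.randn_like(z_target)
    t = torch.rand(z_target.shape[0], device=z_target.device)     # diffusion time in [0, 1]
    t_exp = t.view(-1, *([1] * (z_target.dim() - 1)))
    z_t = (1 - t_exp) * noise + t_exp * z_target                  # linear interpolation path
    v_target = z_target - noise                                   # constant velocity along that path
    v_pred = dit(z_t, t, cond_visual, cond_text)                  # model predicts the velocity
    return F.mse_loss(v_pred, v_target)
```

Classifier-free guidance is typically obtained by randomly dropping the text or visual conditions during training, so both conditional and unconditional predictions are available at inference.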

A key contribution is the relative optical flow computation used to build clean training targets from noisy egocentric videos. For each frame pair the method:

  1. Computes dense optical flow with an off-the-shelf estimator.
  2. Estimates camera motion via homography using deep features.
  3. Uses projective geometry to subtract camera motion and obtain object-centric relative flow vectors.
  4. Filters frame pairs by keeping those where the top k% of flow magnitudes exceed a threshold, which focuses training on segments with meaningful motion.

These steps are run offline at lower resolution for efficiency, then recomputed at the original resolution for the final targets. The ablation study shows that static frame targets or raw flow without camera-motion removal hurt downstream performance, while the disentangled relative flow targets give the best results.
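A simplified version of the camera-motion subtraction step is sketched below. The keypoint matches and the dense flow input are placeholders for the deep-feature matcher and off-the-shelf flow estimator the method relies on; this is an illustration of the geometry, not the paper's code.

```python
import numpy as np
import cv2


def relative_flow(flow: np.ndarray, pts_prev: np.ndarray, pts_next: np.ndarray) -> np.ndarray:
    """flow: (H, W, 2) dense flow from frame t to t+1.
    pts_prev, pts_next: matched keypoints (N, 2) used to estimate camera motion."""
    H_mat, _ = cv2.findHomography(pts_prev, pts_next, cv2.RANSAC, 3.0)
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(np.float32)
    warped = cv2.perspectiveTransform(grid, H_mat).reshape(h, w, 2)  # where static pixels move under camera motion
    camera_flow = warped - np.stack([xs, ys], axis=-1)               # camera-induced displacement
    return flow - camera_flow                                        # object-centric relative flow
```

Frame pairs whose remaining relative flow is too small are then filtered out, as described in the list above.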


Language-driven robot manipulation

The first downstream use case is robot control. FOFPred is finetuned on robot video-caption data to predict future optical flow from both fixed and wrist-mounted cameras. On top of FOFPred, the research team attaches a diffusion policy network that takes predicted flow, text, and robot state, and outputs continuous actions. This setup follows prior diffusion policy work but uses future optical flow instead of predicted RGB frames as the core representation.
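The interface of such a policy head can be pictured roughly as in the minimal sketch below, with assumed feature dimensions and a deliberately simple MLP denoiser; the actual network follows prior diffusion policy designs rather than this toy layout.

```python
import torch
import torch.nn as nn


class FlowConditionedPolicy(nn.Module):
    """Denoises a short chunk of actions conditioned on flow, text, and robot state."""

    def __init__(self, flow_dim, text_dim, state_dim, action_dim, horizon=16, hidden=512):
        super().__init__()
        self.cond = nn.Linear(flow_dim + text_dim + state_dim, hidden)
        self.denoiser = nn.Sequential(
            nn.Linear(action_dim * horizon + hidden + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim * horizon),
        )

    def forward(self, noisy_actions, timestep, flow_feat, text_feat, state):
        """noisy_actions: (B, horizon * action_dim); timestep: (B,)."""
        c = self.cond(torch.cat([flow_feat, text_feat, state], dim=-1))
        x = torch.cat([noisy_actions, c, timestep[:, None]], dim=-1)
        return self.denoiser(x)
```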

On the CALVIN ABCD benchmark, which evaluates long-horizon zero-shot chains of five language-specified manipulation tasks, FOFPred reaches an average chain length of 4.48. VPP reaches 4.33 and DreamVLA reaches 4.44 under the same protocol. FOFPred also attains a Task 5 success rate of 78.7%, the best among reported methods. In a low-data setting with 10% of the CALVIN demonstrations, FOFPred still reaches an average length of 3.43, higher than VPP's 3.25.

On RoboTwin 2.0, a dual-arm manipulation benchmark with five tasks that require both arms, FOFPred attains an average success rate of 68.6%. The VPP baseline reaches 61.8% under identical training settings. FOFPred improves success on every task in the subset.


Motion-aware text-to-video generation

The second downstream task is motion control in text-to-video generation. The research team builds a two-stage pipeline by connecting FOFPred with the Go-with-the-Flow video diffusion model. FOFPred takes an initial frame and a language description of motion, predicts a sequence of future flow frames, and interpolates them into a dense flow field. Go-with-the-Flow then uses this flow field and the initial frame to synthesize the final video, enforcing the described motion pattern.
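In pseudocode, the two stages chain together as sketched below; the function names (`predict_future_flow`, `interpolate_flow`, `go_with_the_flow_generate`) are hypothetical placeholders for the FOFPred and Go-with-the-Flow components, not real APIs.

```python
def motion_controlled_video(initial_frame, instruction, num_video_frames=16):
    # Stage 1: FOFPred predicts a short sequence of future optical flow frames
    # from a single image and a language description of the desired motion.
    flow_frames = predict_future_flow(initial_frame, instruction)  # e.g. 4 flow frames

    # Densify in time so every generated video frame has a flow field to follow.
    dense_flow = interpolate_flow(flow_frames, target_length=num_video_frames)

    # Stage 2: the video diffusion model synthesizes frames from the initial
    # image while being conditioned on the dense flow field.
    return go_with_the_flow_generate(initial_frame, dense_flow)
```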

On the motion-heavy Something-Something V2 benchmark, the FOFPred plus Go-with-the-Flow pipeline improves over the CogVideoX baseline under identical conditions. The method reaches SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and motion fidelity 0.662, which are consistently better than CogVideoX. Importantly, FOFPred uses only language and a single frame at inference, whereas several controllable video baselines require hand or object masks or trajectories as extra inputs.


Key Takeaways

  1. FOFPred reframes motion prediction as language-driven future optical flow, predicting four dense optical flow frames from current images and a text instruction, which provides a compact, motion-only representation for downstream tasks.
  2. The model uses a unified VLM-Diffusion backbone, with Qwen2.5-VL as a frozen vision language encoder, the Flux.1 VAE as a frozen latent encoder for images and flow, and an OmniGen-style DiT as the only trained component, with spatio-temporal RoPE-based attention.
  3. Training relies on large-scale web and egocentric video from Something-Something V2 and EgoDex, and builds relative optical flow targets by estimating ego-motion via homography, subtracting camera motion, and filtering for high-motion segments, which significantly improves downstream performance.
  4. In robot manipulation, FOFPred acts as a motion backbone for a diffusion policy head and achieves state-of-the-art or better results on CALVIN ABCD and RoboTwin 2.0, including a 4.48 average task chain length on CALVIN and 68.6% average success on RoboTwin, outperforming VPP and DreamVLA variants.
  5. For text-to-video generation, connecting FOFPred to Go-with-the-Flow yields better SSv2 metrics than CogVideoX, with higher SSIM and PSNR, lower FVD and KVD, and improved motion fidelity, while requiring only language and a single frame at inference, making FOFPred a reusable motion controller for both robotics and video synthesis pipelines.

Check out the Paper, Model and Repo. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. You can also join us on Telegram.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
