By generating smooth and plausible transitions between two image frames, video inbetweening serves as a crucial tool for video editing and long video synthesis. Traditional methods often struggle to handle complex, large-scale motions. While recent advancements in video generation produce high-quality results, they frequently lack precise control over the details of intermediate frames, which can result in outputs that deviate from creative intent. To address these challenges, we introduce MotionBridge, a unified video inbetweening framework offering flexible controls, including trajectory strokes, keyframes, masks, guide pixels, and text. Learning such diverse, multimodal controls in a unified framework poses significant challenges. To tackle this, we design two specialized generators to faithfully extract control signals and dual-branch embedders to resolve ambiguities in feature encoding. Additionally, we propose a curriculum training strategy to progressively learn and integrate these various controls.
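As a rough illustration of the dual-branch embedding idea, the sketch below shows how each control modality could be encoded by two lightweight branches whose features are fused before conditioning the video backbone. The class and branch names (`DualBranchEmbedder`, the `content_branch`/`motion_branch` split) are illustrative assumptions, not the released architecture.

```python
# Hypothetical sketch of dual-branch control embedding; names and layer choices
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn

class DualBranchEmbedder(nn.Module):
    """Encodes one control signal (e.g. a trajectory map or a mask video)
    with two branches, then fuses their features."""
    def __init__(self, in_channels: int, dim: int = 320):
        super().__init__()
        self.content_branch = nn.Conv3d(in_channels, dim, kernel_size=3, padding=1)
        self.motion_branch = nn.Conv3d(in_channels, dim, kernel_size=3, padding=1)
        self.fuse = nn.Conv3d(2 * dim, dim, kernel_size=1)

    def forward(self, control: torch.Tensor) -> torch.Tensor:
        # control: (batch, channels, frames, height, width)
        feats = torch.cat([self.content_branch(control),
                           self.motion_branch(control)], dim=1)
        return self.fuse(feats)

# Each control type gets its own embedder; the fused features would then be
# injected into the video generator's latent stream.
embedders = nn.ModuleDict({
    "trajectory": DualBranchEmbedder(in_channels=2),  # flow-like (dx, dy) maps
    "mask": DualBranchEmbedder(in_channels=2),        # static + dynamic masks
})
```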
We present results using motion trajectories and masks as inputs to control the inbetweening between the provided input frames.
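For intuition, here is a minimal sketch, assumed rather than taken from the paper's pipeline, of how a user trajectory stroke could be rasterized into a dense per-frame displacement map suitable as a conditioning signal; the function name and array layout are hypothetical.

```python
# Hypothetical rasterization of a trajectory stroke into per-frame displacements.
import numpy as np

def rasterize_stroke(points, num_frames, height, width):
    """points: list of (x, y) positions sampled along one stroke.
    Returns a (num_frames, 2, height, width) array storing, at each stroke
    location, the displacement toward the next resampled point."""
    traj = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    # Resample the stroke so it spans the requested number of frames.
    idx = np.linspace(0, len(points) - 1, num_frames).astype(int)
    sampled = [points[i] for i in idx]
    for t in range(num_frames - 1):
        x0, y0 = sampled[t]
        x1, y1 = sampled[t + 1]
        traj[t, 0, int(y0), int(x0)] = x1 - x0  # horizontal displacement
        traj[t, 1, int(y0), int(x0)] = y1 - y0  # vertical displacement
    return traj

stroke = [(20, 60), (40, 58), (80, 50), (120, 45)]
trajectory_map = rasterize_stroke(stroke, num_frames=16, height=128, width=128)
```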
Using masks allows for greater control over generation. In these examples, the red mask indicates static regions, while the blue mask highlights areas of dynamic motion. The static mask ensures that the specified regions remain stable while still adhering to the constraints of the last frame, whereas dynamic masks define areas of movement, providing more precise control over motion.
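A small sketch of how such an annotation could be split into the two masks, assuming the strokes are supplied as a single RGB image with red marking static regions and blue marking dynamic ones; the color thresholds are illustrative.

```python
# Split a red/blue annotation image into static and dynamic masks (assumed input format).
import numpy as np

def split_masks(annotation: np.ndarray):
    """annotation: (H, W, 3) uint8 RGB image.
    Returns boolean static and dynamic masks."""
    r, g, b = annotation[..., 0], annotation[..., 1], annotation[..., 2]
    static_mask = (r > 200) & (g < 80) & (b < 80)    # red: keep these pixels fixed
    dynamic_mask = (b > 200) & (r < 80) & (g < 80)   # blue: allow motion here
    return static_mask, dynamic_mask
```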
Our model enables seamless looping video generation. With identical first and last input frames, it creates smooth loops that follow the given motion trajectories, ideal for digital art, virtual environments, and background animations.
Basic Text-to-Video (T2V) models rely solely on text for control, which often falls short for detailed or large-scale motions, leading to undesired outputs. We demonstrate how our framework refines T2V outputs, providing enhanced control.
Given two input frames of the same object captured from different camera positions, we automatically extract motion via optical flow to generate the in-between frames.
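One way to realize this automatic motion extraction is sketched below using OpenCV's Farneback dense optical flow; the paper's actual flow estimator is not specified here, and the grid sampling of sparse trajectories is an illustrative choice.

```python
# Estimate dense flow between the two endpoint frames and sample sparse
# trajectories on a grid (a sketch, not the paper's exact pipeline).
import cv2
import numpy as np

frame_a = cv2.imread("frame_first.png", cv2.IMREAD_GRAYSCALE)  # replace with real paths
frame_b = cv2.imread("frame_last.png", cv2.IMREAD_GRAYSCALE)

# Dense flow from the first frame to the last frame.
flow = cv2.calcOpticalFlowFarneback(frame_a, frame_b, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)

# Sample start/end point pairs on a coarse grid; these can serve as the
# trajectory control for inbetweening.
step = 32
ys, xs = np.mgrid[step // 2:frame_a.shape[0]:step,
                  step // 2:frame_a.shape[1]:step]
trajectories = [((x, y), (x + flow[y, x, 0], y + flow[y, x, 1]))
                for y, x in zip(ys.ravel(), xs.ravel())]
```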
We show additional results using trajectories and masks as inputs to control the inbetweening between the given input frames.
Example results using guide pixels for control: specific pixels are placed in target regions of the final frames.
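As an assumed illustration of the guide-pixel control, the snippet below builds a sparse color map for the final frame together with a validity mask marking which locations are constrained; the input format of (x, y, RGB) tuples is a hypothetical choice, not the released interface.

```python
# Build a sparse guide-pixel map for the final frame (hypothetical input format).
import numpy as np

def build_guide_pixel_map(guides, height, width):
    """guides: list of (x, y, (r, g, b)) entries for the final frame."""
    guide_map = np.zeros((height, width, 3), dtype=np.float32)
    valid = np.zeros((height, width), dtype=bool)
    for x, y, rgb in guides:
        guide_map[y, x] = np.asarray(rgb, dtype=np.float32) / 255.0  # pinned color
        valid[y, x] = True                                           # constrained location
    return guide_map, valid

guide_map, valid = build_guide_pixel_map(
    [(64, 40, (255, 0, 0)), (90, 70, (0, 128, 255))], height=128, width=128)
```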
Our model also animates single images by generating plausible motion from input trajectories, enabling applications such as bringing static images to life and enhancing storytelling.
We compare our model to baseline methods, demonstrating its enhanced control and its ability to generate coherent motion.