Tracking every pixel: motion estimation with OmniMotion
Research in computer vision keeps expanding what is possible in video editing and generation, and one of the new tools presented at the International Conference on Computer Vision in Paris is OmniMotion, described in the paper "Tracking Everything Everywhere All at Once." Developed by Cornell researchers, it is an optimization method for estimating motion in video footage, with the potential to reshape video editing and AI-driven generative content creation.

Traditionally, motion estimation has followed one of two main approaches: sparse feature tracking or dense optical flow. Neither can fully model motion over long time intervals while keeping track of every pixel in the video. Methods that attempt to close this gap are typically limited in their temporal and spatial context, which leads to error accumulation over long trajectories and inconsistent motion estimates. Developing dense, long-range trajectory tracking therefore remains an open problem, with three main requirements:
- motion tracking over long time intervals
- motion tracking even through occlusion events
- ensuring consistency in space and time
OmniMotion is a new optimization method designed to estimate dense, long-range motion in video sequences more accurately. Unlike previous algorithms that operate in limited time windows, OmniMotion provides a complete and globally consistent representation of motion, so every pixel in a video can be tracked throughout the entire footage, opening the door to new possibilities for exploring and creating video content. The method handles difficult cases such as tracking through occlusions and modeling various combinations of camera and object motion. Evaluations reported in the paper show that it outperforms prior methods both quantitatively and qualitatively.
As shown in the motion illustration above, OmniMotion estimates full-length motion trajectories for every pixel in every frame of a video. Sparse trajectories of foreground objects are shown for clarity, but OmniMotion computes motion trajectories for all pixels. The method delivers accurate, consistent motion over long ranges, even for fast-moving objects, and reliably tracks objects through occlusions, as shown in the examples with the dog and the swing.
In OmniMotion, the canonical volume G is a 3D atlas of the video's content. It is paired with a coordinate-based network Fθ, in the style of NeRF, that maps each canonical 3D coordinate to a density σ and a color c.
The density identifies surfaces in the scene and indicates whether objects are occluded, while the color is used to compute a photometric loss during optimization. The canonical 3D volume thus plays a central role in capturing and analyzing the motion dynamics of a scene, as sketched below.
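As a rough illustration, the coordinate network can be thought of as a small network that maps a canonical 3D point to (σ, c). The layer sizes and activations below are illustrative assumptions rather than the released implementation (which uses a GaborNet, discussed further down); this is a minimal sketch of the idea.

```python
import torch
import torch.nn as nn

class CanonicalField(nn.Module):
    """Sketch of a NeRF-style field: canonical 3D coordinate -> (density, color)."""
    def __init__(self, hidden: int = 256, depth: int = 4):
        super().__init__()
        layers, dim = [], 3
        for _ in range(depth):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.density_head = nn.Linear(hidden, 1)   # sigma >= 0
        self.color_head = nn.Linear(hidden, 3)     # RGB in [0, 1]

    def forward(self, x_canonical: torch.Tensor):
        h = self.trunk(x_canonical)                 # (N, hidden)
        sigma = torch.relu(self.density_head(h))    # (N, 1) density
        color = torch.sigmoid(self.color_head(h))   # (N, 3) color
        return sigma, color
```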
OmniMotion also uses 3D bijections, which define a continuous one-to-one mapping between 3D points in each frame's local coordinates and the canonical 3D coordinate system. These bijections enforce motion consistency by guaranteeing that corresponding 3D points in different frames originate from the same canonical point.
To represent complex real-world motion, the bijections are implemented as invertible neural networks (INNs), which provide expressive yet exactly invertible mappings. This allows OmniMotion to accurately capture and track motion across frames while keeping the representation globally consistent; a sketch of one invertible building block follows below.
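Below is a minimal sketch of one RealNVP-style affine coupling layer, the kind of invertible building block such networks are composed of. Conditioning on a per-frame latent code `psi` and the particular coordinate split are assumptions made for illustration; they are not copied from the released implementation.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible coupling layer: (x, y) stay fixed, z is scaled and shifted."""
    def __init__(self, latent_dim: int = 128, hidden: int = 256):
        super().__init__()
        # Predict scale and shift for z from (x, y) and the per-frame latent code.
        self.net = nn.Sequential(
            nn.Linear(2 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, p: torch.Tensor, psi: torch.Tensor) -> torch.Tensor:
        """Map a local 3D point p = (x, y, z) toward canonical space."""
        xy, z = p[..., :2], p[..., 2:]
        s, t = self.net(torch.cat([xy, psi], dim=-1)).chunk(2, dim=-1)
        return torch.cat([xy, z * torch.exp(s) + t], dim=-1)

    def inverse(self, q: torch.Tensor, psi: torch.Tensor) -> torch.Tensor:
        """Exact inverse: recover the local point from its canonical image."""
        xy, z = q[..., :2], q[..., 2:]
        s, t = self.net(torch.cat([xy, psi], dim=-1)).chunk(2, dim=-1)
        return torch.cat([xy, (z - t) * torch.exp(-s)], dim=-1)
```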
In the OmniMotion implementation, the mapping network is built from six affine coupling layers, making it invertible by construction. The latent code for each frame is computed by a 2-layer network with 256 channels, and the code itself is 128-dimensional. The canonical representation is implemented with a GaborNet architecture with 3 layers and 512 channels. Pixel coordinates are normalized to the range [-1, 1], a local 3D space is defined for each frame, and the corresponding canonical locations are initialized to lie within a unit sphere. For numerical stability during training, a contraction operation adapted from mip-NeRF 360 is applied, sketched below.
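For reference, the scene contraction from mip-NeRF 360 maps unbounded coordinates into a ball of radius 2 so that far-away points stay numerically well-behaved. The formula follows the mip-NeRF 360 paper; whether OmniMotion applies it in exactly this form is an assumption.

```python
import torch

def contract(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """mip-NeRF 360 contraction: x if ||x|| <= 1, else (2 - 1/||x||) * x / ||x||."""
    norm = x.norm(dim=-1, keepdim=True).clamp(min=eps)
    contracted = (2.0 - 1.0 / norm) * (x / norm)
    return torch.where(norm <= 1.0, x, contracted)
```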
The representation is optimized per video sequence with the Adam optimizer for 200,000 iterations. Each training batch includes 256 pairs of matches sampled from 8 image pairs, for a total of 1024 matches, and 32 points are sampled along each ray using stratified sampling (see the sketch below). This design is key to OmniMotion's strong performance on the harder aspects of motion estimation in video.
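Stratified sampling splits the sampling range along a ray into equal bins and draws one depth uniformly from each bin, covering the ray evenly while preserving some randomness. The helper below is a hypothetical sketch; the function name, variable names, and near/far bounds are assumptions.

```python
import torch

def stratified_depths(near: float, far: float, n_samples: int = 32,
                      n_rays: int = 1) -> torch.Tensor:
    """Draw one depth per bin along each ray; returns (n_rays, n_samples)."""
    edges = torch.linspace(near, far, n_samples + 1)   # bin boundaries
    lower, upper = edges[:-1], edges[1:]               # (n_samples,)
    u = torch.rand(n_rays, n_samples)                  # uniform jitter within each bin
    return lower + (upper - lower) * u
```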
A useful byproduct of OmniMotion is its ability to render pseudo-depth maps from the optimized quasi-3D representation. These convey the relative depth of objects in the scene and their ordering. Below is an illustration of the pseudo-depth visualization: nearby objects appear in blue and distant objects in red, which makes the depth ordering of the different parts of the scene easy to read.
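One plausible way such a pseudo-depth value can be obtained is standard volume rendering: densities sampled along a ray are turned into compositing weights, and the expected depth is the weighted sum of the sample depths. The function below is a generic sketch of that recipe, offered as an assumption rather than the authors' exact procedure.

```python
import torch

def render_pseudo_depth(sigma: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
    """sigma, depths: (n_rays, n_samples) -> expected depth per ray."""
    deltas = depths[:, 1:] - depths[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # opacity per sample
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                     # compositing weights
    return (weights * depths).sum(dim=-1)                       # expected depth per ray
```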
As with many motion estimation methods, OmniMotion has its limitations. It struggles with rapid, highly non-rigid motion and with thin structures in the scene. In such cases, pairwise correspondence methods may not supply enough reliable matches, which reduces the accuracy of the globally estimated motion. Work on these challenges continues, and OmniMotion remains a notable step forward for video motion analysis.
Try out the demo here. Technical details are available on GitHub.