Comparing Correspondences: Video Prediction with
Correspondence-wise Losses
Daniel Geng
Max Hamilton
Andrew Owens
[Paper]
[GitHub]

Image losses typically compare pixels to pixels or patches to patches at the same absolute location. Instead, what happens if we compare pixels or patches to their corresponding pixels or patches? We propose correspondence-wise losses that do just this, and compare them against traditional pixel-wise and patch-wise losses.


Abstract

Image prediction methods often struggle on tasks that require changing the positions of objects, such as video prediction, producing blurry images that average over the many positions that objects might occupy. In this paper, we propose a simple change to existing image similarity metrics that makes them more robust to positional errors: we match the images using optical flow, then measure the visual similarity of corresponding pixels. This change leads to crisper and more perceptually accurate predictions, and does not require modifications to the image prediction network. We apply our method to a variety of video prediction tasks, where it obtains strong performance with simple network architectures, and to the closely related task of video interpolation.
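As a rough sketch of the idea (not the exact implementation from the paper, which may differ in warping direction, flow estimator, and gradient handling), a correspondence-wise L1 loss could be written as follows, assuming a dense flow field precomputed by an off-the-shelf estimator such as RAFT:

import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` using a dense flow field of shape (B, 2, H, W),
    where the two channels are (dx, dy) displacements in pixels."""
    B, _, H, W = image.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=image.device, dtype=image.dtype),
        torch.arange(W, device=image.device, dtype=image.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

def corrwise_l1(prediction, target, flow_pred_to_target):
    """Compare corresponding pixels: align the target to the prediction via
    optical flow, then measure L1 between corresponding pixels."""
    target_aligned = warp(target, flow_pred_to_target)
    return (prediction - target_aligned).abs().mean()

Because the alignment step absorbs positional error, the photometric term only has to account for differences in appearance, which is what makes the loss more forgiving of small spatial shifts.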



Toy Experiment

As a way to understand correspondence-wise losses in a simple scenario, we create a toy dataset by starting with a static background and then sampling a car's horizontal position uniformly about the center of the image. In this way we can simulate positional uncertainty. We can then ask: for networks trained under various losses, what is the optimal prediction under this uncertainty?
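A minimal sketch of how such a toy dataset could be generated (the sprite-pasting details here are illustrative, not the dataset code from the paper):

import numpy as np

def make_toy_sample(background, car, max_offset=32, rng=None):
    """Paste a car sprite onto a static background at a uniformly sampled
    horizontal offset about the image center, simulating positional uncertainty."""
    rng = rng or np.random.default_rng()
    H, W, _ = background.shape
    h, w, _ = car.shape
    dx = int(rng.integers(-max_offset, max_offset + 1))  # uniform horizontal jitter
    x0 = (W - w) // 2 + dx                               # car's left edge
    y0 = (H - h) // 2                                    # vertically centered
    frame = background.copy()
    frame[y0:y0 + h, x0:x0 + w] = car
    return frame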


Figure: a sample from the toy dataset, and the optimal predictions under MSE, L1, L1 + Perceptual, and Corr-wise L1 losses.

Under the L1 and MSE losses, a network produces poor images (analytically, the optimal predictions are the per-pixel median and mean, respectively). Perhaps more surprisingly, the perceptual loss also performs badly. A correspondence-wise loss, on the other hand, is able to faithfully reproduce the car.
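The parenthetical follows from a standard fact about per-pixel losses: when positional uncertainty makes a pixel's intensity $X$ random across the dataset, the constant prediction that minimizes the expected loss at that pixel is

\[
\hat{c}_{\mathrm{MSE}} = \arg\min_{c} \, \mathbb{E}\big[(X - c)^2\big] = \mathbb{E}[X],
\qquad
\hat{c}_{L_1} = \arg\min_{c} \, \mathbb{E}\big[\lvert X - c \rvert\big] = \mathrm{median}(X),
\]

which is why these losses average (or take the median) over the many positions the car might occupy, producing blur rather than a single crisp car.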



Ablations

We train video prediction networks using both pixel-wise and correspondence-wise losses. All else being equal, we find that training with a correspondence-wise loss produces higher-quality images, especially when there is a large amount of spatial uncertainty.

Below is an example in which there is a sudden camera movement in the context frames. This leads to a poor prediction from the network trained with a pixel-wise L1 loss, which the corr-wise L1 trained network improves upon:


Context 1 Context 2 Context 3 Pixel-wise L1 Prediction Corr-wise L1 Prediction


A network trained with a perceptual + L1 loss does better, but still produces artifacts. Turning the perceptual + L1 loss into a correspondence-wise loss produces a much more appealing prediction:


Context 1 Context 2 Context 3 Pixel-wise L1 + Perceptual Prediction Corr-wise L1 + Perceptual Prediction


Training Visualization


A visualization of training with a correspondence-wise loss on the toy experiment. Notice how content is generated first, and then the positions of objects are refined.



Paper and Supplementary Material

For more details and experiments, check out our paper:

Geng, Hamilton, Owens.
Comparing Correspondences.
CVPR 2022.
(hosted on arXiv)

[Bibtex]



This template was originally made by Phillip Isola and Richard Zhang.