Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

University of Michigan
ICLR 2024
Correspondence to: dgeng@umich.edu

Motion Guidance is a method for motion-based image editing. Given an image to edit and a flow field indicating where each pixel should go, we produce a new image with the desired motion.


Our method is zero-shot, supports motions such as rotations, translations, stretches, scaling, shrinking, homographies, and general deformations, and works on both generated and real images.


(If you are interested in using the interactive flow visualizations from this page for your own project, a minimum working example is provided here.)

Examples

"a photo of a cat"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"an apple on a wooden table"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

[real image]

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a painting of a sunflower"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a teapot floating in water"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a photo of a laptop"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a photo of a topiary"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

[real image]

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

[real image]

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

See below for more examples.

Method

To achieve motion-based editing, we propose using guidance [1] during sampling from a diffusion model. At each time step of the reverse process, we perturb the noisy estimate in the direction that minimizes a loss function. As the loss, we use the difference between the desired motion and the current motion of the noisy sample with respect to the original image, as estimated by an off-the-shelf (differentiable) optical flow network [2]. Effectively, we find a sample that is likely under the diffusion model while attaining a low loss.


In order to achieve good results, we find we also need to use a few tricks, including color regularization, reconstruction guidance [3] [4], occlusion masking, and edit masking. Please see the paper for additional details.
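
To make the guidance step concrete, below is a minimal sketch of one guided sampling iteration. It is an illustration under several assumptions, not the paper's implementation: a pixel-space noise-prediction network callable as unet(x_t, t), a DDIM-style scheduler exposing alphas_cumprod and step(...) (as in diffusers), and torchvision's differentiable RAFT as the flow estimator. The guidance scale is a hypothetical knob, the regularization tricks above are omitted, and for a latent diffusion model the one-step prediction would be decoded to pixels before computing the flow.

import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large

# Off-the-shelf differentiable flow estimator. Weights are frozen, but gradients
# can still flow to the image inputs.
raft = raft_large(weights="DEFAULT").eval()
for p in raft.parameters():
    p.requires_grad_(False)

def flow_loss(src_img, pred_img, target_flow):
    # RAFT expects images in [-1, 1] and returns a list of iteratively refined
    # flows; the last entry is the final estimate.
    flows = raft(src_img, pred_img)
    return F.l1_loss(flows[-1], target_flow)

def guided_step(x_t, t, src_img, target_flow, unet, scheduler, guidance_scale=100.0):
    x_t = x_t.detach().requires_grad_(True)

    # Noise prediction and the one-step estimate of the clean image x0.
    eps = unet(x_t, t)
    alpha_bar = scheduler.alphas_cumprod[t]
    x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()

    # Gradient of the flow loss with respect to the noisy sample, obtained by
    # backpropagating through the flow network and the one-step approximation.
    loss = flow_loss(src_img, x0_hat.clamp(-1, 1), target_flow)
    grad = torch.autograd.grad(loss, x_t)[0]

    # Classifier-guidance-style update of the noise prediction, followed by an
    # ordinary scheduler step.
    eps_guided = eps + guidance_scale * (1 - alpha_bar).sqrt() * grad
    return scheduler.step(eps_guided, t, x_t.detach()).prev_sample

The essential ingredient is that the flow estimator is differentiable with respect to its inputs, so the guidance gradient can reach the noisy sample even though the flow network's weights stay fixed.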



A GUI for Flow Construction

All optical flows used in this paper (with the exception of the motion transfer results) are generated by composing elementary flows and masking them with segmentation masks from SAM. These elementary flows consist of translations, rotations, scalings, and more complex deformations. We show examples of how these flows can be created using a simple GUI below:

Segmenting out the topiary tree, constructing a translation optical flow field, and then applying our motion editing method. (Not real-time)
Segmenting out the apple, constructing a shrinking optical flow field, and then applying our motion editing method. (Not real-time)

Constructing elementary flows with the GUI is fairly straightforward, using a click-and-drag interface to specify translations, rotations, scaling, and stretching. More complex deformations can be constructed by composing or interpolating these flows; a small code sketch of this construction follows the list below:

Flow color wheel for reference. Color represents flow direction, and brightness represents magnitude.
Translations: These can be defined by clicking and dragging a vector.
Rotations: The initial click defines the center of rotation, and dragging further away increases the angle of rotation.
Scaling: The initial click defines the center of scaling. Dragging outside the circle indicates magnifying; dragging inside the circle indicates shrinking.
Stretching: Stretches with respect to a line defined by the first click. The notch denotes the boundary between squeezing and stretching.
Interpolated Stretching: We can interpolate between stretches and squeezes, yielding a continuous and complex deformation, as seen in the topiary example.
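
As a concrete illustration of this composition, the sketch below builds an elementary rotation flow in NumPy and restricts it to an object mask (for example, a SAM segmentation). The helper names and the hard-coded mask are hypothetical, not the GUI's actual code.

import numpy as np

def rotation_flow(h, w, center, angle_deg):
    # Flow that rotates pixels about `center` by `angle_deg` degrees (clockwise
    # in image coordinates). Returns an (h, w, 2) array of per-pixel (dx, dy).
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    x, y = xs - center[0], ys - center[1]
    theta = np.deg2rad(angle_deg)
    dx = (np.cos(theta) * x - np.sin(theta) * y) - x   # destination minus source
    dy = (np.sin(theta) * x + np.cos(theta) * y) - y
    return np.stack([dx, dy], axis=-1)

def masked_flow(flow, mask):
    # Zero out the flow outside the object mask (e.g., a SAM segmentation).
    return flow * mask[..., None]

# Example: rotate a (placeholder) segmented region about the image center by 30 degrees.
h, w = 512, 512
mask = np.zeros((h, w), dtype=np.float32)
mask[128:384, 128:384] = 1.0                 # stand-in for a real SAM mask
flow = masked_flow(rotation_flow(h, w, center=(w / 2, h / 2), angle_deg=30), mask)

Translations, scalings, and stretches can be built the same way, and more complex deformations follow by summing or interpolating such fields before masking.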

Various Motions – Same Source Image

Below, we show various translations, scalings, and stretches to the same source image.

"a teapot floating in water"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a teapot floating in water"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a teapot floating in water"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a teapot floating in water"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a teapot floating in water"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a teapot floating in water"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a teapot floating in water"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

Motion Transfer

In some cases, we can extract motion from a video and apply that motion to images. For example, below we extract the motion from a video of the Earth spinning and apply it to various real animal images. The extracted flow is not perfectly accurate and does not perfectly align with the source image, but because we optimize a soft objective, our method is able to produce reasonable results.
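
For reference, one way to extract such a flow with an off-the-shelf estimator is sketched below using torchvision's pretrained RAFT; the file names are placeholders, and this is not necessarily the exact pipeline used for these results.

import torch
from torchvision.io import read_image
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()            # converts frames to float in [-1, 1]

# Placeholder file names; frame dimensions should be divisible by 8 for RAFT.
frame1 = read_image("earth_frame_1.png").unsqueeze(0)
frame2 = read_image("earth_frame_2.png").unsqueeze(0)
img1, img2 = preprocess(frame1, frame2)

with torch.no_grad():
    flow = raft(img1, img2)[-1]              # (1, 2, H, W): per-pixel (dx, dy)

The resulting flow can then be passed directly as the target flow for motion guidance.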

[Figure: two frames from the spinning-Earth video (Frame 1, Frame 2) and the optical flow estimated between them.]

[Interactive figures: the extracted flow applied to three real animal images, each shown as target flow, source image, and motion-edited result.]

Additional Examples

"a photo of a cute humanoid robot on a solid background"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"an aerial photo of a river"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a photo of a modern house"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a photo of a lion"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a photo of a lion"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

[real image]

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

[real image]

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

[real image]

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a painting of a lone tree"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

"a painting of a lone tree"

Target Flow

(Hover over me)

Source Image

(Hover over me)

Motion Edited

Limitations

Our method suffers from several limitations that would benefit from further research. Below, we show examples of failure cases. (a) Because we use an off-the-shelf optical flow network, some flow prompts, such as a vertical flip, are severely out-of-distribution and fail. (b) Because we optimize a soft objective, seeking a sample that is both likely under the diffusion model and low in guidance energy, we sometimes see loss of identity in our generations. (c) The one-step approximation is sometimes unstable and can diverge catastrophically. Additionally, we inherit the limitations of diffusion models and of universal guidance [4], such as slow sampling speeds.

Related Works

DragGAN enables drag-based editing of images using pretrained GANs. Users select a point on an image, and indicate where it should move to.


Inspired by DragGAN, Drag Diffusion and Dragon Diffusion port the drag-based editing capabilities of DragGAN to more versatile diffusion models.


Other works have proposed guidance on a variety of objectives, including an LPIPS loss, "readout heads", the internal features of the diffusion network itself, and segmentation, detection, and facial recognition networks.

References

[1] Dhariwal, Nichol, “Diffusion Models Beat GANs on Image Synthesis”, NeurIPS, 2021.

[2] Teed, Deng, “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow”, ECCV, 2020.

[3] Ho et al., “Video Diffusion Models”, arXiv, 2022.

[4] Bansal et al., “Universal Guidance for Diffusion Models”, ICLR, 2024.

Real Image Attribution

BibTeX

@article{geng2024motion,
  author    = {Geng, Daniel and Owens, Andrew},
  title     = {Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators},
  journal   = {International Conference on Learning Representations},
  year      = {2024},
}