Motion Guidance is a method for motion-based image editing. Given an image to edit and a flow field indicating where each pixel should go, we produce a new image exhibiting the desired motion.
Our method is zero-shot and supports motions such as rotations, translations, stretches, scalings, shrinks, homographies, and general deformations. It works on both generated and real images.
(If you are interested in using the interactive flow visualizations from this page for your own project, a minimum working example is provided here.)
"a photo of a cat"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"an apple on a wooden table"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
[real image]
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a painting of a sunflower"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a teapot floating in water"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a photo of a laptop"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a photo of a topiary"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
[real image]
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
[real image]
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
To achieve motion-based editing, we propose using guidance [1] during sampling from a diffusion model. At each step of the reverse process, we perturb the noisy estimate in the direction that minimizes a loss function. As the loss, we use the difference between the desired motion and the current motion of the noisy sample with respect to the original image, as estimated by an off-the-shelf, differentiable optical flow network [2]. Effectively, we find a sample that is likely under the diffusion model while also attaining a low loss.
To achieve good results, we find we also need a few tricks, including color regularization, reconstruction guidance [3, 4], occlusion masking, and edit masking. Please see the paper for additional details.
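To make the mechanism concrete, below is a minimal PyTorch sketch of the guidance gradient. The interfaces are assumptions, not the paper's exact code: `denoiser(x_t, t)` is taken to return a one-step estimate of the clean image, `flow_net(a, b)` to return a differentiable flow field from `a` to `b` (e.g. RAFT [2]), and the `color_weight` and `edit_mask` arguments are rough stand-ins for the paper's color regularization and edit/occlusion masking.

```python
import torch
import torch.nn.functional as F

def flow_guidance_grad(x_t, t, src, target_flow, denoiser, flow_net,
                       edit_mask=None, color_weight=100.0):
    """Gradient of a motion-guidance loss with respect to the noisy sample x_t.

    A sketch under assumed interfaces (see text), not the official implementation.
    Shapes: images are (B, 3, H, W); flows are (B, 2, H, W); masks are (B, 1, H, W).
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)  # one-step approximation of the clean image

    # Flow loss: the motion from the source image to the current estimate
    # should match the user-specified target flow.
    est_flow = flow_net(src, x0_hat)
    flow_loss = (est_flow - target_flow).abs().sum(dim=1, keepdim=True)

    # Color regularization (rough version): the generated image, sampled at
    # p + f(p), should keep the source image's color at p.
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=src.device),
                            torch.arange(W, device=src.device), indexing="ij")
    coords = torch.stack([xs, ys]).float() + target_flow         # (B, 2, H, W)
    grid = torch.stack([2 * coords[:, 0] / (W - 1) - 1,           # normalize to
                        2 * coords[:, 1] / (H - 1) - 1], dim=-1)  # [-1, 1]
    warped = F.grid_sample(x0_hat, grid, align_corners=True)
    color_loss = (warped - src).abs().sum(dim=1, keepdim=True)

    loss = flow_loss + color_weight * color_loss
    if edit_mask is not None:  # restrict the objective to the edited region
        loss = loss * edit_mask
    return torch.autograd.grad(loss.mean(), x_t)[0]
```

At each reverse step, this gradient would be scaled by a guidance weight and subtracted from the sampler's update, in the style of classifier guidance [1]; reconstruction guidance [3, 4] and occlusion handling are omitted here for brevity.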
All optical flows used in this paper (with the exception of the Motion Transfer results) are generated by composing elementary flows together and masking with segmentation masks from SAM. These elementary flows consist of translations, rotations, scaling, and more complex deformations. We show examples of how these flows can be created using a simple UI below.
Constructing elementary flows with the GUI is fairly straightforward, using a click-and-drag interface to specify translations, rotations, scalings, and stretches. More complex deformations can be constructed by composing or interpolating these flows.
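As a rough programmatic illustration of the same idea (this is not the UI's actual code), the sketch below builds elementary flow fields in NumPy and restricts them to an object using a segmentation mask; summing displacements and then masking approximates composition well enough for moderate motions. The mask here is a hard-coded placeholder standing in for a SAM mask.

```python
import numpy as np

def coordinate_grid(h, w):
    """Pixel coordinates of shape (h, w, 2), ordered as (x, y)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([xs, ys], axis=-1).astype(np.float32)

def translation_flow(h, w, dx, dy):
    """Every pixel is displaced by (dx, dy)."""
    return np.broadcast_to(np.array([dx, dy], np.float32), (h, w, 2)).copy()

def rotation_flow(h, w, center, degrees):
    """Each pixel moves to its position rotated by `degrees` about `center`."""
    p = coordinate_grid(h, w) - np.asarray(center, np.float32)
    th = np.deg2rad(degrees)
    rot = np.array([[np.cos(th), -np.sin(th)],
                    [np.sin(th),  np.cos(th)]], np.float32)
    return p @ rot.T - p  # displacement = rotated position - original position

def scaling_flow(h, w, center, factor):
    """Each pixel moves radially away from (factor > 1) or toward `center`."""
    return (factor - 1.0) * (coordinate_grid(h, w) - np.asarray(center, np.float32))

def masked_flow(flow, mask):
    """Zero the flow outside a segmentation mask (e.g. from SAM)."""
    return flow * mask[..., None].astype(np.float32)

# Example: rotate the masked object by 30 degrees about its centroid,
# then nudge it 20 pixels to the right.
h, w = 512, 512
mask = np.zeros((h, w), dtype=bool)
mask[180:330, 200:320] = True  # placeholder for a SAM segmentation mask
center = coordinate_grid(h, w)[mask].mean(axis=0)
flow = masked_flow(rotation_flow(h, w, center, 30) +
                   translation_flow(h, w, 20, 0), mask)
```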
Below, we show various translations, scalings, and stretches applied to the same source image.
"a teapot floating in water"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a teapot floating in water"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a teapot floating in water"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a teapot floating in water"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a teapot floating in water"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a teapot floating in water"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a teapot floating in water"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
In some cases, we can extract motion from video and apply that motion to images. For example, below we extract the motion from a video of the Earth spinning and apply it to various real animal images. The extracted flow is not perfectly accurate and does not perfectly align with the source image, but because we optimize a soft objective, our method is able to produce reasonable results.
[Figure: two frames of the spinning-Earth video and the optical flow estimated between them.]
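For reference, here is a minimal sketch of the flow-extraction step using torchvision's pretrained RAFT [2]. The frame tensors are placeholders for two video frames, and the exact preprocessing used for the results above may differ.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

# Load a pretrained RAFT model and its matching preprocessing transform.
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()

# Placeholders: in practice these would be two frames of the spinning-Earth
# video as uint8 tensors of shape (3, H, W), with H and W divisible by 8.
frame1 = torch.zeros(3, 512, 512, dtype=torch.uint8)
frame2 = torch.zeros(3, 512, 512, dtype=torch.uint8)

img1, img2 = preprocess(frame1.unsqueeze(0), frame2.unsqueeze(0))
with torch.no_grad():
    # RAFT returns a list of iteratively refined flow fields;
    # the last entry is the final estimate, of shape (1, 2, H, W).
    flow = model(img1, img2)[-1]
```

This estimated flow (masked and resized as needed) then serves as the target flow for guidance.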
[Interactive examples: the extracted flow applied to three real animal images (target flow, source image, and motion-edited result for each).]
"a photo of a cute humanoid robot on a solid background"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"an aerial photo of a river"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a photo of a modern house"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a photo of a lion"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a photo of a lion"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
[real image]
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
[real image]
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
[real image]
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a painting of a lone tree"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
"a painting of a lone tree"
Target Flow
(Hover over me)
Source Image
(Hover over me)
Motion Edited
Our method suffers from various limitations that would benefit from further research. Below, we show examples of failure cases. (a) Because we use an off-the-shelf optical flow network, some flow prompts, such as a vertical flip, are severely out-of-distribution for that network, and the edit fails. (b) Because we optimize a soft objective, seeking a sample that is both likely under the diffusion model and low in guidance energy, we sometimes see loss of identity in our generations. (c) The one-step approximation is sometimes unstable and can diverge catastrophically. Additionally, we inherit the limitations of diffusion models and Universal Guided Diffusion [4], such as slow sampling speeds.
DragGAN enables drag-based editing of images using pretrained GANs. Users select a point on an image, and indicate where it should move to.
Inspired by DragGAN, Drag Diffusion and Dragon Diffusion port the drag-based editing capabilities of DragGAN to more versatile diffusion models.
Related works have proposed guidance on various objectives, including an LPIPS loss, "readout heads", the internal features of the diffusion network itself, and segmentation, detection, and facial recognition networks.
[1] Dhariwal, Nichol, “Diffusion Models Beat GANs on Image Synthesis”, NeurIPS, 2021.
[2] Teed, Deng, “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow”, ECCV, 2020.
[3] Ho et al., “Video Diffusion Models”, arXiv, June 2022.
[4] Bansal et al., “Universal Guidance for Diffusion Models”, ICLR, 2024.
@article{geng2024motion,
author = {Geng, Daniel and Owens, Andrew},
title = {Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators},
journal = {International Conference on Learning Representations},
year = {2024},
}