Visual Anagrams: Generating Multi-View
Optical Illusions with Diffusion Models

Daniel Geng, Inbum Park, Andrew Owens

University of Michigan

Correspondence to: dgeng@umich.edu

CVPR 2024 (Oral)

tl;dr: We use pretrained diffusion models
to make optical illusions

Overview

We present a simple, zero-shot method to generate multi-view optical illusions. These are images that look like one thing, but change appearance or identity when transformed. We show in theory and practice that our method supports a broad range of transformations including rotations, flips, color inversions, skews, jigsaw rearrangements, random permutations, and multiple views. We show some examples below.

Jigsaw Permutations

We can make images that change appearance when broken up into puzzle pieces and rearranged. These are jigsaw puzzles with multiple solutions!

Flips and 180° Rotations

These illusions change appearance when flipped, or rotated 180 degrees.

90° Rotations

And these illusions change appearance when rotated 90 degrees.

Color Inversions

We can also make images that change when their colors are inverted.

Miscellaneous Transformations

Here we show miscellaneous illusions. These include skewing (a reference to a Hans Holbein painting!) and what we call "inner circle rotations."

Random Patch Permutations

We can also rearrange patches. Surprisingly, increasing the number of patches to \(64 \times 64\) still gives recognizable results, albeit of lower quality.
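
As a rough illustration of what a patch-permutation view and its inverse could look like, here is a minimal NumPy sketch. The function name `make_patch_view`, the fixed random seed, and the `(H, W, C)` array layout are assumptions made for this example, not details taken from the released code.

```python
import numpy as np

def make_patch_view(height, width, patch, seed=0):
    """Return (view, inverse) functions that shuffle non-overlapping square patches."""
    rows, cols = height // patch, width // patch
    rng = np.random.default_rng(seed)
    perm = rng.permutation(rows * cols)   # where each output patch is taken from
    inv_perm = np.argsort(perm)           # permutation that undoes `perm`

    def to_patches(img):
        # (H, W, C) -> (rows * cols, patch, patch, C)
        c = img.shape[-1]
        x = img.reshape(rows, patch, cols, patch, c).swapaxes(1, 2)
        return x.reshape(rows * cols, patch, patch, c)

    def from_patches(p):
        # (rows * cols, patch, patch, C) -> (H, W, C)
        c = p.shape[-1]
        x = p.reshape(rows, cols, patch, patch, c).swapaxes(1, 2)
        return x.reshape(rows * patch, cols * patch, c)

    def view(img):
        return from_patches(to_patches(img)[perm])

    def inverse(img):
        return from_patches(to_patches(img)[inv_perm])

    return view, inverse

# Usage: shuffle an 8x8 image in 2x2 patches, then undo the shuffle.
img = np.random.default_rng(1).standard_normal((8, 8, 3))
view, inverse = make_patch_view(8, 8, patch=2)
shuffled = view(img)
restored = inverse(shuffled)
```

Because the view only moves pixels around, it keeps every pixel value and is exactly invertible, which is the property the method relies on.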

Three Views

Finally, we can do more than just two views, although it is much harder to get good results. Here are three-view illusions with a series of rotations.

Four Views

And here is a four-view illusion. It was tremendously difficult to get this to work, and this was the only half-decent one we found!

Method

Our method is conceptually simple. We take an off-the-shelf diffusion model and use it to estimate the noise in different views or transformations, \(v_i\), of an image. The noise estimates are then aligned by applying the inverse view, \(v_i^{-1}\), and averaged together. This averaged noise estimate is then used to take a diffusion step.
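
The averaging step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `averaged_noise_estimate` and `toy_eps_model` are hypothetical names, the toy model is a stand-in for a pretrained noise predictor, and any conditioning inputs to the model as well as the sampler that consumes the averaged estimate are omitted.

```python
import numpy as np

def averaged_noise_estimate(x_t, t, views, inverses, eps_model):
    """Estimate noise in each view, align with the inverse view, and average."""
    aligned = []
    for view, inverse in zip(views, inverses):
        eps_view = eps_model(view(x_t), t)   # noise estimate for the transformed image
        aligned.append(inverse(eps_view))    # map the estimate back to the original frame
    return np.mean(aligned, axis=0)

# Two views: the identity and a 180-degree rotation (which is its own inverse).
def identity(x):
    return x

def rot180(x):
    return np.rot90(x, 2, axes=(0, 1))

def toy_eps_model(x, t):
    # Hypothetical stand-in for a pretrained noise predictor; NOT a real diffusion model.
    return 0.1 * x

rng = np.random.default_rng(0)
x_t = rng.standard_normal((8, 8, 3))
eps_hat = averaged_noise_estimate(
    x_t, t=10,
    views=[identity, rot180],
    inverses=[identity, rot180],
    eps_model=toy_eps_model,
)
```

In an actual sampler, `eps_hat` would take the place of the single noise estimate in whatever update rule the diffusion model normally uses.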

Conditions on Views

We find that not every view function works with the above method. Of course, \(v_i\) must be invertible, but we discuss two additional constraints.

Linearity

A diffusion model is trained to estimate the noise in noisy data \(\mathbf{x}_t\) conditioned on time step \(t\). The noisy data \(\mathbf{x}_t\) is expected to have the form \[\mathbf{x}_t = w_t^{\text{signal}}\underbrace{\mathbf{x}_0}_{\text{signal}} + w_t^{\text{noise}}\underbrace{\epsilon\vphantom{\mathbf{x}_0}}_{\text{noise}}.\] That is, \(\mathbf{x}_t\) is a weighted sum of pure signal \(\mathbf{x}_0\) and pure noise \(\epsilon\), with weights \(w_t^{\text{signal}}\) and \(w_t^{\text{noise}}\). Therefore, our view \(v\) must preserve this weighting between signal and noise. This can be achieved by making \(v\) linear, in which case we can represent it by a square matrix \(\mathbf{A}\). By linearity, \[\begin{aligned} v(\mathbf{x}_t) &= \mathbf{A}(w_t^{\text{signal}} \mathbf{x}_0+w_t^{\text{noise}} \epsilon)\\[7pt] &= w_t^{\text{signal}} \underbrace{\mathbf{A}\mathbf{x}_0}_{\text{new signal}} + w_t^{\text{noise}} \underbrace{\mathbf{A}\epsilon}_{\text{new noise}}. \end{aligned}\] Effectively, \(v\) acts on the signal and the noise independently, and combines the results with the correct weighting.
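
A quick numerical check can make the linearity condition concrete. The sketch below compares a 90° rotation (linear) against squaring pixel values (nonlinear); the weights `0.6` and `0.8` and the random arrays standing in for signal and noise are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8, 3))   # stand-in for the signal
eps = rng.standard_normal((8, 8, 3))           # stand-in for the noise
w_signal, w_noise = 0.6, 0.8                   # illustrative weights (assumed values)

def rot90_view(x):
    # Linear view: only moves pixels around.
    return np.rot90(x, 1, axes=(0, 1))

def square_view(x):
    # Nonlinear counterexample: squares every pixel value.
    return x ** 2

x_t = w_signal * x0 + w_noise * eps

# Does the view preserve the signal/noise weighting?
linear_ok = np.allclose(
    rot90_view(x_t), w_signal * rot90_view(x0) + w_noise * rot90_view(eps)
)
nonlinear_ok = np.allclose(
    square_view(x_t), w_signal * square_view(x0) + w_noise * square_view(eps)
)
print(linear_ok, nonlinear_ok)
```

The rotation passes the check, while squaring introduces a cross term between signal and noise and so breaks the expected form of \(\mathbf{x}_t\).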

Statistical Consistency

Diffusion models are trained with the assumption that the noise is drawn i.i.d. from a standard normal distribution. Therefore we must ensure that the transformed noise also follows these statistics. That is, we need \[\mathbf{A}\epsilon \sim \mathcal{N}(0, I).\] For linear transformations, this is equivalent to the condition that \(\mathbf{A}\) is orthogonal. Intuitively, orthogonal matrices respect the spherical symmetry of the standard multivariate Gaussian distribution.

Therefore, for a transformation to work with our method, it is sufficient for it to be orthogonal.
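
The link between the Gaussian condition and orthogonality can be spelled out in one line. This is a standard fact about Gaussian vectors rather than anything specific to this method: a linear map of a zero-mean Gaussian is again a zero-mean Gaussian, with covariance \[\operatorname{Cov}(\mathbf{A}\epsilon) = \mathbf{A}\,\mathbb{E}[\epsilon\epsilon^\top]\,\mathbf{A}^\top = \mathbf{A} I \mathbf{A}^\top = \mathbf{A}\mathbf{A}^\top.\] So \(\mathbf{A}\epsilon \sim \mathcal{N}(0, \mathbf{A}\mathbf{A}^\top)\), and this matches \(\mathcal{N}(0, I)\) exactly when \(\mathbf{A}\mathbf{A}^\top = I\), which for a square matrix is the definition of orthogonality.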

Orthogonal Transformations

Most orthogonal transformations on images are meaningless, visually. For example, we transform the image below with a randomly sampled orthogonal matrix.

However, permutation matrices are a subset of orthogonal matrices, and are quite interpretable. They are just rearrangements of pixels in an image. This is where the idea of a visual anagram comes from. The majority of illusions here can be interpreted this way, as specific rearrangements of pixels: rotations, flips, skews, "inner rotations," jigsaw rearrangements, and patch permutations. Finally, color inversions are not permutations, but are orthogonal as they are a negation of pixel values.
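
The claims in this paragraph are easy to verify numerically on a tiny flattened image. The sketch below is only an illustration: the helper `is_orthogonal`, the 16-pixel size, and the 2x scaling counterexample are choices made for this example, and treating color inversion as plain negation assumes zero-centered pixel values.

```python
import numpy as np

def is_orthogonal(a, tol=1e-8):
    """True if a @ a.T equals the identity up to a small tolerance."""
    return np.allclose(a @ a.T, np.eye(a.shape[0]), atol=tol)

n = 16  # number of pixels in a flattened 4x4 single-channel image
rng = np.random.default_rng(0)

perm = rng.permutation(n)
P = np.eye(n)[perm]        # permutation matrix: (P @ x)[i] == x[perm[i]]
negate = -np.eye(n)        # color inversion as negation (assumes zero-centered pixels)
scale = 2.0 * np.eye(n)    # 2x scaling: linear but not orthogonal

x = rng.standard_normal(n)
print(is_orthogonal(P), is_orthogonal(negate), is_orthogonal(scale))
```

The permutation and the negation both pass, while scaling fails because it changes the variance of the noise.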

BibTeX

@InProceedings{geng2024visualanagrams,
  title     = {Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models},
  author    = {Geng, Daniel and Park, Inbum and Owens, Andrew},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
  url       = {https://arxiv.org/abs/2311.17919},
}
