Optical Illusions with Diffusion Models


We present a simple, zero-shot method to generate *multi-view optical illusions*. These are images that look like one thing but change appearance or identity when transformed. We show in theory and practice that our method supports a broad range of transformations, including rotations, flips, color inversions, skews, jigsaw rearrangements, random permutations, and multiple views. We show some examples below.

These illusions change appearance when flipped or rotated 180 degrees.

Our method is conceptually simple. We take an off-the-shelf diffusion model and use it to estimate the noise in different views or transformations, \(v_i\), of an image. The noise estimates are then aligned by applying the inverse view, \(v_i^{-1}\), and averaged together. This averaged noise estimate is then used to take a diffusion step.
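
The averaging step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `toy_eps_model` is a hypothetical stand-in for a real diffusion model's noise predictor, and the names `combined_noise_estimate`, `views`, and `prompts` are chosen only for this example.

```python
import numpy as np

def identity(x):
    return x

def rot180(x):
    # A 180-degree rotation is its own inverse.
    return np.rot90(x, 2)

def toy_eps_model(x, t, prompt):
    # Hypothetical stand-in for a diffusion model's noise predictor.
    # A real model would be a neural network conditioned on t and the prompt.
    return x * (1.0 + 0.1 * len(prompt)) / (1.0 + t)

def combined_noise_estimate(x_t, t, views, prompts, eps_model):
    """Estimate the noise in each view, undo the view, and average."""
    aligned = []
    for (view, inverse_view), prompt in zip(views, prompts):
        eps_in_view = eps_model(view(x_t), t, prompt)  # estimate in view v_i
        aligned.append(inverse_view(eps_in_view))      # align with v_i^{-1}
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((8, 8))                 # toy noisy "image"
views = [(identity, identity), (rot180, rot180)]  # pairs (v_i, v_i^{-1})
prompts = ["a rabbit", "a duck"]                  # one prompt per view

eps_hat = combined_noise_estimate(x_t, 0.5, views, prompts, toy_eps_model)
```

In an actual sampler, `eps_hat` would take the place of the single noise estimate in whatever update rule the diffusion model uses.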

We find that not every view function works with the above method. Of course, \(v_i\) must be invertible, but we discuss two additional constraints.

A diffusion model is trained to estimate the noise in noisy data \(\mathbf{x}_t\) conditioned on time step \(t\). The noisy data \(\mathbf{x}_t\) is expected to have the form \[\mathbf{x}_t = w_t^{\text{signal}}\underbrace{\mathbf{x}_0}_{\text{signal}} + w_t^{\text{noise}}\underbrace{\epsilon\vphantom{\mathbf{x}_0}}_{\text{noise}}.\] That is, \(\mathbf{x}_t\) is a weighted sum of pure signal \(\mathbf{x}_0\) and pure noise \(\epsilon\), with weights \(w_t^{\text{signal}}\) and \(w_t^{\text{noise}}\) respectively. Therefore, our view \(v\) must maintain this weighting between signal and noise. This can be achieved by making \(v\) linear, which we represent by the square matrix \(\mathbf{A}\). By linearity, \[\begin{aligned} v(\mathbf{x}_t) &= \mathbf{A}(w_t^{\text{signal}} \mathbf{x}_0+w_t^{\text{noise}} \epsilon)\\[7pt] &= w_t^{\text{signal}} \underbrace{\mathbf{A}\mathbf{x}_0}_{\text{new signal}} + w_t^{\text{noise}} \underbrace{\mathbf{A}\epsilon}_{\text{new noise}}. \end{aligned}\] Effectively, \(v\) acts on the signal and the noise independently, and combines the results with the correct weighting.
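
The linearity argument above can be checked numerically. The sketch below is illustrative only: the weights `w_signal` and `w_noise` are arbitrary values picked for the example, and elementwise squaring is used as a simple nonlinear view that fails to preserve the weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                          # number of pixels in a flattened toy image
x0 = rng.standard_normal(n)     # "signal"
eps = rng.standard_normal(n)    # "noise"
w_signal, w_noise = 0.8, 0.6    # illustrative weights, chosen arbitrarily

x_t = w_signal * x0 + w_noise * eps

# Any square matrix defines a linear view.
A = rng.standard_normal((n, n))
lhs = A @ x_t
rhs = w_signal * (A @ x0) + w_noise * (A @ eps)
linear_ok = np.allclose(lhs, rhs)

# A nonlinear view (elementwise squaring) does not split this way.
def square(x):
    return x ** 2

nonlinear_ok = np.allclose(
    square(x_t), w_signal * square(x0) + w_noise * square(eps)
)
```

Here `linear_ok` holds for any matrix `A`, while the nonlinear view mixes signal and noise through a cross term, so the same decomposition fails.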

Diffusion models are trained with the assumption that the noise is drawn iid from a standard normal. Therefore we must ensure that the transformed noise also follows these statistics. That is, we need \[\mathbf{A}\epsilon \sim \mathcal{N}(0, I).\] For linear transformations, this is equivalent to the condition that \(\mathbf{A}\) is orthogonal. Intuitively, orthogonal matrices respect the spherical symmetry of the standard multivariate Gaussian distribution.
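
To see why orthogonality is the relevant condition, note that if \(\epsilon \sim \mathcal{N}(0, I)\) then \(\mathbf{A}\epsilon \sim \mathcal{N}(0, \mathbf{A}\mathbf{A}^\top)\), so the covariance stays equal to \(I\) exactly when \(\mathbf{A}\mathbf{A}^\top = I\). The following sketch illustrates this with NumPy; the orthogonal matrix `Q` and the scaling matrix `S` are arbitrary examples chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# An orthogonal matrix, obtained from the QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
# A linear but non-orthogonal view: scale every pixel by 2.
S = 2.0 * np.eye(n)

# If eps ~ N(0, I), then A @ eps has covariance A @ A.T.
cov_Q = Q @ Q.T   # identity: noise statistics are preserved
cov_S = S @ S.T   # 4 * identity: noise variance is inflated

# Empirical check using many noise samples (one sample per column).
eps = rng.standard_normal((n, 200_000))
emp_cov_Q = np.cov(Q @ eps)
emp_cov_S = np.cov(S @ eps)
```

The scaled noise has four times the variance the model was trained on, which is why a linear view that is not orthogonal can still break the method.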

Therefore, for a transformation to work with our method, it is **sufficient for it to be orthogonal.**

Most orthogonal transformations of images are visually meaningless. For example, we transform the image below with a randomly sampled orthogonal matrix.

However, **permutation matrices are a subset of orthogonal matrices,** and are quite interpretable. They are just rearrangements of pixels in an image. This is where the idea of a **visual anagram** comes from. The majority of illusions here can be interpreted this way, as specific rearrangements of pixels: rotations, flips, skews, "inner rotations," jigsaw rearrangements, and patch permutations. Finally, color inversions are not permutations, but are orthogonal as they are a negation of pixel values.
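
As a small illustration of the points above, the sketch below builds the permutation matrix for a vertical flip of a flattened toy image and checks that it is orthogonal. It also checks that negation is orthogonal, under the assumption (not stated above) that pixel values are centered at zero, for example in \([-1, 1]\), so that color inversion is exactly multiplication by \(-I\).

```python
import numpy as np

h, w = 3, 4
n = h * w
image = np.arange(n, dtype=float).reshape(h, w)

# Permutation matrix for a vertical flip of the flattened image:
# output pixel i takes its value from input pixel source[i].
source = np.flipud(np.arange(n).reshape(h, w)).ravel()
P = np.eye(n)[source]

flipped = (P @ image.ravel()).reshape(h, w)
matches_flip = np.array_equal(flipped, np.flipud(image))
P_is_orthogonal = np.allclose(P @ P.T, np.eye(n))

# Color inversion as negation (assumes zero-centered pixel values).
# The matrix -I is orthogonal but is not a permutation.
N = -np.eye(n)
N_is_orthogonal = np.allclose(N @ N.T, np.eye(n))
```

The same construction applies to any other pixel rearrangement: only the index array `source` changes.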

There are several other great works in this area:

Diffusion Illusions, by Ryan Burgert *et al.*, which produces multi-view illusions, along with other visual effects, through score distillation sampling.

This colab notebook by Matthew Tancik, which introduces a similar idea to ours. We improve upon it significantly in terms of quality of illusions, range of transformations, and theoretical analysis.

Recent work by a pseudonymous artist, Ugleh, uses a Stable Diffusion model finetuned for generating QR codes to produce images whose global structure subtly matches a given template image.

Factorized Diffusion, a follow-up work to Visual Anagrams that makes many different types of "hybrid" illusions, including hybrid images with three different contents, partially resolving an open problem from the original hybrid images paper (see section 2.3).

Images that Sound, which creates spectrograms that also look like images using a similar technique, but across modalities.

```
@InProceedings{geng2024visualanagrams,
title = {Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models},
author = {Geng, Daniel and Park, Inbum and Owens, Andrew},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024},
url = {https://arxiv.org/abs/2311.17919},
}
```