Diffusion models update noisy images, \(\mathbf{x}_t\), to less
noisy images, \(\mathbf{x}_{t-1}\), with an \(\texttt{update}(\cdot,\cdot)\) function.
Commonly used update functions, such as those of DDPM and DDIM,
are linear combinations of the noisy image, \(\mathbf{x}_t\), and
the noise estimate, \(\epsilon_\theta\).
That is, these updates can be written as
\[ \begin{aligned} \mathbf{x}_{t-1} &= \texttt{update}(\mathbf{x}_t, \epsilon_\theta) \\ &=\omega_t \mathbf{x}_t + \gamma_t \epsilon_\theta \end{aligned} \]
where \(\omega_t\) and \(\gamma_t\) are determined by the variance schedule and the scheduler. Then, given a decomposition \( \mathbf{x} = \sum_i f_i(\mathbf{x}) \),
the update rule can be decomposed into a sum of updates on components:
\[ \begin{aligned} \mathbf{x}_{t-1} &= \texttt{update}(\mathbf{x}_t, \epsilon_\theta) \\ &= \texttt{update}\left( \sum_i f_i(\mathbf{x}_t), \sum_i f_i(\epsilon_\theta) \right) \\ &= \sum_i \texttt{update}(f_i(\mathbf{x}_t), f_i(\epsilon_\theta)) \end{aligned} \]
where the last equality is by linearity of \( \texttt{update}(\cdot,\cdot) \).
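To make this concrete, here is a minimal NumPy sketch, not the paper's implementation, checking that a linear update of the form \(\omega_t \mathbf{x}_t + \gamma_t \epsilon_\theta\) distributes over a linear decomposition. The mean/residual split and the values of \(\omega_t\) and \(\gamma_t\) are illustrative assumptions.

```python
import numpy as np

# Toy decomposition: x = f_mean(x) + f_residual(x). Both maps are linear in x.
def f_mean(x):
    return np.full_like(x, x.mean())        # per-image mean, broadcast to the image shape

def f_residual(x):
    return x - x.mean()                     # everything except the mean

def update(x_t, eps, w_t=0.98, g_t=-0.05):  # w_t, g_t: placeholder schedule values
    return w_t * x_t + g_t * eps            # generic linear update x_{t-1} = w_t x_t + g_t eps

rng = np.random.default_rng(0)
x_t = rng.normal(size=(8, 8))
eps = rng.normal(size=(8, 8))

direct = update(x_t, eps)                   # update the whole image at once
componentwise = (update(f_mean(x_t), f_mean(eps))
                 + update(f_residual(x_t), f_residual(eps)))

print(np.allclose(direct, componentwise))   # True: the update distributes over components
```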
Our method can be understood as conditioning each of these
components on a different text prompt. Written explicitly,
for text prompts \( y_i \) our method is
\[ \begin{aligned} \mathbf{x}_{t-1} = \sum_i \texttt{update}(f_i(\mathbf{x}_t), f_i(\epsilon_\theta(\mathbf{x}_t, y_i, t))). \end{aligned} \]
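As a hedged sketch of what one denoising step of this rule might look like in code (assuming a hypothetical `noise_model(x_t, prompt, t)` standing in for \(\epsilon_\theta(\mathbf{x}_t, y_i, t)\), a list of linear component maps `components` whose outputs sum to the input, and scheduler-supplied `w_t`, `g_t`):

```python
import numpy as np

def method_step(x_t, prompts, components, noise_model, t, w_t, g_t):
    """One step of x_{t-1} = sum_i update(f_i(x_t), f_i(eps_theta(x_t, y_i, t)))."""
    x_prev = np.zeros_like(x_t)
    for f_i, y_i in zip(components, prompts):
        eps_i = noise_model(x_t, y_i, t)               # noise estimate conditioned on prompt y_i
        x_prev += w_t * f_i(x_t) + g_t * f_i(eps_i)    # contributes only the i-th component
    return x_prev
```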
Moreover, if the \( f_i \)'s are linear then we have
\[ \begin{aligned}
f_i(\mathbf{x}_{t-1}) &= f_i(\texttt{update}(\mathbf{x}_t, \epsilon_\theta)) \\
&= f_i(\omega_t\mathbf{x}_t + \gamma_t\epsilon_\theta) \\
&= \omega_t f_i(\mathbf{x}_t) + \gamma_t f_i(\epsilon_\theta) \\
&= \texttt{update}(f_i(\mathbf{x}_t), f_i(\epsilon_\theta)),
\end{aligned} \]
meaning that updating the \(i\)th component of
\(\mathbf{x}_t\) with the \(i\)th component of \(\epsilon_\theta\) affects only
the \(i\)th component of \(\mathbf{x}_{t-1}\).
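This identity can be checked numerically for any linear \(f_i\); the sketch below reuses the mean-projection component from the earlier example and illustrative values of \(\omega_t\) and \(\gamma_t\).

```python
import numpy as np

w_t, g_t = 0.98, -0.05                          # illustrative schedule values
f_i = lambda x: np.full_like(x, x.mean())       # a linear component map

rng = np.random.default_rng(1)
x_t, eps = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

x_prev = w_t * x_t + g_t * eps                  # full update
lhs = f_i(x_prev)                               # i-th component of x_{t-1}
rhs = w_t * f_i(x_t) + g_t * f_i(eps)           # update applied to the i-th components

print(np.allclose(lhs, rhs))                    # True: components do not mix
```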