Variational Autoencoders are not autoencoders

When VAEs are trained with powerful decoders, the model can learn to ‘ignore the latent variable’. This isn’t something an autoencoder should do. In this post we’ll take a look at why this happens and why this represents a shortcoming of the name Variational Autoencoder rather than anything else.

Variational Autoencoders (VAEs) are popular for many reasons, one of which is that they provide a way to featurise data. One of their ‘failure modes’ is that if a powerful decoder is used, training can result in good scores for the objective function we optimise, yet the learned representation is completely useless in that all data points x are encoded as the prior distribution, so the latent representation z contains no information about x.

The name Variational Autoencoder throws a lot of people off when trying to understand why this happens — an autoencoder compresses observed high-dimensional data into a low-dimensional representation, so surely VAEs should always result in a good compression? In fact, this behaviour is not a failure mode of VAEs per se, but rather represents a failure mode of the name VAE!

In this post, we’ll look at what VAEs are actually trained to do — not what they sound like they ought to do — and see that this ‘pathological’ behaviour entirely makes sense. We’ll see that VAEs are a particular way to train Latent Variable Models, and that fundamentally their encoders are introduced as a mathematical trick to allow approximation of an intractable quantity. The nature of this trick is such that when powerful decoders are used, ignoring the latent variable is encouraged.

VAEs and autoencoders

An autoencoder is a type of model in which we compress data by mapping it to a low-dimensional space and back. Autoencoder objectives take one of the following equivalent forms:

\begin{aligned}\min: \quad \text{Objective} &= \text{Reconstruction Error} + \text{Regulariser} \\ \max: \quad \text{Objective} &= \text{Reconstruction Quality} - \text{Regulariser} \end{aligned}

The objective of a VAE (the variational lower bound, also known as the Evidence Lower BOund or ELBO and introduced in this previous post) looks somewhat like the second of these, hence giving rise to the name Variational Autoencoder. Averaging the following over x \sim p_{\text{data}} gives the full objective to be maximised:

\displaystyle \mathcal{L}({\theta}, {\phi}, x) = \underbrace{\mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z)}_{\text{(i)}} - \underbrace{\text{KL}[q_{\phi}(z|x) || p(z)]}_{\text{(ii)}} \leq \log p_{\theta}(x)\quad (*)

Many papers and tutorials introducing VAEs will explicitly describe (i) as the ‘reconstruction’ loss and (ii) as the ‘regulariser’. However, despite appearances VAEs are not in their heart-of-hearts autoencoders: we’ll describe this in detail in the next section, but it’s of critical importance to stress that, rather than maximising a regularised reconstruction quality, the fundamental goal of a VAE is to maximise the log-likelihood \log p_\theta(x) .

This is not possible to do directly, but by introducing the approximate posterior q_\phi(z|x) we can get a tractable lower bound of the desired objective, giving us the VAE objective. The variational lower bound is precisely what its name suggests – a lower bound on the log-likelihood, not a ‘regularised reconstruction cost’. A failure to recognise this distinction has caused confusion to many.
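To make terms (i) and (ii) concrete, here is a minimal sketch for a toy one-dimensional model in which both terms and the log-likelihood have closed forms. The model (standard normal prior, linear Gaussian decoder, Gaussian encoder) and the names `elbo_terms` and `log_marginal` are assumptions made for illustration; they are not taken from any particular VAE implementation.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def elbo_terms(x, w, b, s2, m, v):
    """Closed-form terms (i) and (ii) of the ELBO for an assumed toy model:
         prior    p(z)   = N(0, 1)
         decoder  p(x|z) = N(w*z + b, s2)
         encoder  q(z|x) = N(m, v)   (m and v already evaluated at this x)
    """
    # (i): E_{z~q} log p(x|z), using E_q[(x - w z - b)^2] = (x - w m - b)^2 + w^2 v
    term_i = (-0.5 * math.log(2.0 * math.pi * s2)
              - ((x - w * m - b) ** 2 + w ** 2 * v) / (2.0 * s2))
    # (ii): KL[N(m, v) || N(0, 1)]
    term_ii = 0.5 * (v + m ** 2 - 1.0 - math.log(v))
    return term_i, term_ii

def log_marginal(x, w, b, s2):
    """log p(x) for the toy model: integrating out z gives N(b, w^2 + s2)."""
    return log_normal_pdf(x, b, w ** 2 + s2)

x, w, b, s2 = 1.3, 0.8, 0.1, 0.5
term_i, term_ii = elbo_terms(x, w, b, s2, m=0.4, v=0.7)
elbo = term_i - term_ii
print("ELBO:", elbo, "log p(x):", log_marginal(x, w, b, s2))
```

Whatever encoder parameters `m` and `v` we plug in, the printed ELBO should never exceed the printed log-likelihood, which is the inequality in (*).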

Latent Variable Models

A Latent Variable Model (LVM) is a way to specify complex distributions over high dimensional spaces by composing simple distributions, and VAEs provide one way to train such models. An LVM is specified by fixing a prior p(z) and a parameterised family of conditional distributions {p_{\theta}(x|z)}. The latter is referred to interchangeably in the literature as the decoder or the generator.

For a fixed {{\theta}}, we get a distribution {p_{\theta}(x) = \int p_{\theta}(x|z) p(z) dz} over the data-space. Training an LVM requires (a) picking a divergence between {p_{\theta}(x)} and the true data distribution {p_{\text{data}}(x)}; (b) choosing {{\theta}} to minimise this.

Hang on a second – in VAEs, we maximise a lower bound on the log-likelihood, not minimise a divergence, right? In fact, it turns out that if we choose the following KL as our divergence,

\displaystyle \text{KL}[p_{\text{data}} || p_{\theta}] = \mathbb{E}_{x\sim p_{\text{data}}} \log p_{\text{data}}(x) - \mathbb{E}_{x\sim p_{\text{data}}} \log p_{\theta}(x),

then since the first expectation doesn’t depend on {{\theta}}, minimising the divergence is equivalent to maximising the second expectation, which is the expected log-likelihood of the data under the model.
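As a quick sanity check of this equivalence, the sketch below uses a made-up three-outcome data distribution and a few hand-picked candidate models (all numbers are illustrative assumptions). It checks the decomposition of the KL and that ranking models by KL is the same as ranking them by expected log-likelihood.

```python
import numpy as np

# Assumed toy data distribution over three outcomes
p_data = np.array([0.5, 0.3, 0.2])

def expected_log_lik(p_model):
    """E_{x ~ p_data} log p_model(x)."""
    return float(np.sum(p_data * np.log(p_model)))

def kl(p, q):
    """KL[p || q] for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

# E_{x ~ p_data} log p_data(x): the term that does not depend on the model
const_term = float(np.sum(p_data * np.log(p_data)))

# Hypothetical candidate models standing in for different values of theta
candidates = {
    "uniform": np.array([1 / 3, 1 / 3, 1 / 3]),
    "close": np.array([0.45, 0.35, 0.20]),
    "exact": p_data.copy(),
}

for name, p_model in candidates.items():
    # KL[p_data || p_theta] = E log p_data - E log p_theta
    assert np.isclose(kl(p_data, p_model), const_term - expected_log_lik(p_model))

by_kl = sorted(candidates, key=lambda n: kl(p_data, candidates[n]))
by_ll = sorted(candidates, key=lambda n: -expected_log_lik(candidates[n]))
print(by_kl, by_ll)
```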

Since the {\text{KL}} is a divergence, we have that {\text{KL}[p_{\text{data}} || p_{\theta}] \geq 0} with equality if and only if {p_{\theta} = p_{\text{data}}}. This means that the maximum possible value of {\mathbb{E}_{x\sim p_{\text{data}}} \log p_{\theta}(x)} occurs when {p_{\theta} = p_{\text{data}}}, at which point {\mathbb{E}_{x\sim p_{\text{data}}} \log p_{\theta}(x) = \mathbb{E}_{x\sim p_{\text{data}}} \log p_{\text{data}}(x)}. Because the variational lower bound never exceeds the log-likelihood, this is also the largest value the VAE objective can attain: any parameters that reach it are a global optimum of the VAE objective.

Although {p(z)} and {p_{\theta}(x|z)} are usually chosen to be simple and easy to evaluate, {p_{\theta}(x) = \int p_{\theta}(x|z) p(z) dz} is generally difficult to evaluate since it involves computing an integral. {\log p_{\theta}(x)} can’t easily be evaluated, but the variational lower bound of this quantity, \mathcal{L}(\theta, \phi, x), can be. This involves introducing a new family of conditional distributions {q_{\phi}(z|x)} which we call the approximate posterior. Provided we have made sensible choices about the family of distributions {q_{\phi}(z|x)}, \mathcal{L}(\theta, \phi, x) will be simple to evaluate (and differentiate through) but the price we pay is the gap between the true posterior {p_{\theta}(z|x)} and the approximate posterior {q_{\phi}(z|x)}. This is derived in more detail here.

\log p_{\theta}(x) = \mathcal{L}({\theta}, {\phi}, x) + \text{KL}[q_{\phi}(z|x) || p_{\theta}(z|x)] \geq \mathcal{L}({\theta}, {\phi}, x) \quad (**)
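Identity (**) is easy to verify numerically when the latent variable is discrete, because every quantity can then be computed exactly. The following sketch uses an assumed two-state latent and three-outcome observation with made-up probabilities; it is only meant to show the decomposition at work.

```python
import numpy as np

# Assumed toy LVM: z in {0, 1}, x in {0, 1, 2}
prior = np.array([0.6, 0.4])              # p(z)
decoder = np.array([[0.7, 0.2, 0.1],      # p(x | z = 0)
                    [0.1, 0.3, 0.6]])     # p(x | z = 1)
x = 2                                     # the observed data point

joint = prior * decoder[:, x]             # p(x, z) for each z
log_px = float(np.log(joint.sum()))       # log p(x), tractable only because z is tiny
posterior = joint / joint.sum()           # true posterior p(z | x)

def kl(q, p):
    """KL[q || p] for discrete distributions with full support."""
    return float(np.sum(q * np.log(q / p)))

def elbo(q):
    """E_q log p(x|z) - KL[q || p(z)] for an encoder output q = q(z | x)."""
    return float(np.sum(q * np.log(decoder[:, x]))) - kl(q, prior)

for q in (np.array([0.5, 0.5]), np.array([0.9, 0.1]), posterior):
    gap = kl(q, posterior)
    # log p(x) = ELBO + KL[q(z|x) || p(z|x)], so the ELBO is a lower bound
    assert np.isclose(log_px, elbo(q) + gap)
    print(q, "ELBO:", elbo(q), "gap:", gap)
```

The gap vanishes only for the last choice of `q`, the true posterior, which is the only case in which the bound is tight.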

While it is indeed tempting to look at the definition of \mathcal{L}(\theta, \phi, x) in Equation (*) and think ‘reconstruction + regulariser’ as many people do, it’s important to remember that the encoder {q_{\phi}(z|x)} was only introduced as a trick: we’re actually trying to train an LVM and the thing we want to maximise is {\log p_{\theta}(x)}. {q_{\phi}(z|x)} doesn’t actually have anything to do with this term beyond a bit of mathematical gymnastics that gives us an easily computable approximation to {\log p_{\theta}(x)}.

Powerful decoders

For our purposes, we will define a decoder — i.e. family of conditional distributions {p_{\theta}(x|z)} — to be powerful with respect to {p_{\text{data}}} if there exists a {{\theta}^*} such that {p_{\theta^*}(x|z) = p_{\text{data}}(x)} for all z and x. This is a property of both the family of decoders as well as the data itself. In words, a decoder is powerful if it is possible to perfectly describe the data distribution without using the latent variable.

When people talk about powerful decoders and ‘ignoring the latent variables’, they are often referring to a case in which {p_{\text{data}}} is the distribution of a complex dataset of images, and the decoder is a very expressive auto-regressive architecture (e.g. PixelCNN).

However, this also happens in much simpler cases: suppose that {p_{\text{data}}} is Gaussian {\mathcal{N}(\mu_{\text{data}}, \Sigma_{\text{data}})} and that we use a Gaussian decoder {p_{\theta}(x|z) = \mathcal{N}(\mu_{\theta}(z), \Sigma_{\theta}(z))}, where {\mu_{\theta}} and {\Sigma_{\theta}} are parameterised by neural networks. In this case, the decoder is also powerful with respect to {p_{\text{data}}}, provided that the neural networks are capable of modelling the constant functions {\mu_{\theta}(z) = \mu_{\text{data}}} and {\Sigma_{\theta}(z) = \Sigma_{\text{data}}}.

As a brief aside, suppose we use a Gaussian decoder, but with non-Gaussian {p_{\text{data}}}. The decoder can be made more expressive by adding more layers to the network, but it will not be possible to make the decoder powerful with respect to {p_{\text{data}}} by only adding more and more layers: doing so would require conditional distributions that are more expressive than Gaussians.

It’s quite easy to prove using (**) that ‘ignoring the latent variable’ in VAEs with decoders that are powerful with respect to the data is actually optimal behaviour.

Claim: Suppose that (i) there exists {{\theta^*}} such that {p_{\theta^*}(x|z) = p_{\text{data}}(x)} for all x and z, and (ii) there exists {{\phi^*}} such that {q_{\phi^*}(z|x) = p(z)} for all x and z. Then {({\theta^*}, {\phi^*})} is a globally optimal solution to the VAE objective.


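Proof: If {p_{\theta^*}(x|z) = p_{\text{data}}(x)} for all z, then {p_{\theta^*}(x) = \int p_{\theta^*}(x|z) p(z) dz = p_{\text{data}}(x)}, so by Bayes’ rule {p_{\theta^*}(z|x) = p_{\theta^*}(x|z) p(z) / p_{\theta^*}(x) = p(z)}. Since {q_{\phi^*}(z|x) = p(z)} as well, {\text{KL}[q_{\phi^*}(z|x) || p_{\theta^*}(z|x)] = 0} and so the variational lower bound in Equation (**) is tight. That is,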

\begin{aligned}\log p_{\theta^*}(x) &= \mathcal{L}({\theta^*}, {\phi^*}, x) \\&= \mathbb{E}_{z \sim q_{\phi^*}(z|x)} [ \log p_{\theta^*}(x|z) ] - \text{KL}[q_{\phi^*}(z|x) || p(z)] \\&= \log p_{\text{data}}(x)\end{aligned}

Thus the objective of the VAE is at its global optimum. \Box
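The claim can be checked numerically in the Gaussian example from earlier. The sketch below assumes a one-dimensional {p_{\text{data}}} with made-up mean and variance, a decoder whose mean and variance functions are constant in z, and a Monte Carlo estimate of term (i); the function names are hypothetical. With the encoder equal to the prior the estimate equals {\log p_{\text{data}}(x)} for any samples, whereas an encoder that tries to be informative only pays the KL penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_data, var_data = 2.0, 1.5   # assumed parameters of a 1-D Gaussian p_data

def log_normal_pdf(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

# A decoder that ignores z: constant mean and variance functions
def decoder_mean(z):
    return np.full_like(z, mu_data)

def decoder_var(z):
    return np.full_like(z, var_data)

def kl_to_standard_normal(m, v):
    """KL[N(m, v) || N(0, 1)]."""
    return 0.5 * (v + m ** 2 - 1.0 - np.log(v))

def elbo_estimate(x, q_mean, q_var, n_samples=1000):
    """Monte Carlo estimate of (i) minus the closed-form KL (ii)."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n_samples)
    recon = log_normal_pdf(x, decoder_mean(z), decoder_var(z)).mean()
    return recon - kl_to_standard_normal(q_mean, q_var)

x = 0.7
target = log_normal_pdf(x, mu_data, var_data)             # log p_data(x)
collapsed = elbo_estimate(x, q_mean=0.0, q_var=1.0)       # encoder = prior
informative = elbo_estimate(x, q_mean=1.0, q_var=0.2)     # encoder != prior
print(target, collapsed, informative)
```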

What if {p_{\theta}(x) = p_{\text{data}}(x)} but {p_{\theta}(x|z)} isn’t independent of {z}?

If we have powerful decoders, it may well be that there is a setting of the parameters {{\theta^+}} such that {p_{\theta^+}(x) = p_{\text{data}}(x)} and for which {p_{\theta^+}(x|z)} does actually depend on {z}. In this case, for any {{\phi}} we have

\begin{aligned}\mathcal{L}({\theta^*}, {\phi^*}, x) &= \log p_{\text{data}}(x) \\&= \log p_{\theta^+}(x) \\&= \mathcal{L}({\theta^+}, {\phi}, x) + \text{KL}[q_{\phi}(z|x) || p_{\theta^+}(z|x)] \\&\geq \mathcal{L}({\theta^+}, {\phi}, x)\end{aligned}

and so {\mathcal{L}({\theta^+}, {\phi}, x)} will be strictly worse than the global optimum for any {{\phi}} for which {\text{KL}[q_{\phi}(z|x) || p_{\theta^+}(z|x)] > 0}. If {p_{\theta^+}(x|z)} depends on {z}, the posterior distribution {p_{\theta^+}(z|x)} is likely to be complex. Since {q_{\phi}(z|x)} must by design be a reasonably simple family of distributions, it is unlikely that there exists a {{\phi}} such that {\text{KL}[q_{\phi}(z|x) || p_{\theta^+}(z|x)] = 0} for all {x}, and hence it is likely that for any {{\phi}},

\mathcal{L}({\theta^+}, {\phi}, x) < \mathcal{L}({\theta^*}, {\phi^*}, x)

which is to say that the VAE will prefer the solution {{\theta^*}}, which ignores the latent variable, over {{\theta^+}}, even though both satisfy {p_{\theta}(x) = p_{\text{data}}(x)}.

Put differently, and subject to some caveats about the richness of the family of distributions q_\phi(z|x): if there is an optimal solution which ignores the latent code, it is probably the unique optimal solution.
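Here is a minimal sketch of this effect, under assumptions chosen to make the gap visible. We take {p_{\text{data}} = \mathcal{N}(0, 1)}, a decoder {p(x|z) = \mathcal{N}(az, 1 - a^2)} (which has marginal {\mathcal{N}(0, 1)} for every {|a| < 1}), and a deliberately restricted encoder family {q(z|x) = \mathcal{N}(m, 1)} with fixed unit variance. The true posterior has variance {1 - a^2}, so this family can only match it when {a = 0}. If the encoder were allowed a free variance, the gap would close in this particular toy model, which is exactly the caveat about the richness of q.

```python
import numpy as np

def elbo(x, a, m):
    """Closed-form ELBO for the assumed toy model:
         prior p(z) = N(0, 1), decoder p(x|z) = N(a*z, 1 - a^2),
         encoder q(z|x) = N(m, 1) with the variance fixed to 1.
    """
    dec_var = 1.0 - a ** 2
    # (i): E_q log p(x|z), using E_q[(x - a z)^2] = (x - a m)^2 + a^2
    term_i = (-0.5 * np.log(2.0 * np.pi * dec_var)
              - ((x - a * m) ** 2 + a ** 2) / (2.0 * dec_var))
    # (ii): KL[N(m, 1) || N(0, 1)]
    term_ii = 0.5 * m ** 2
    return term_i - term_ii

def best_elbo(x, a):
    """Best ELBO the restricted encoder can reach, by grid search over m."""
    ms = np.linspace(-3.0, 3.0, 6001)
    return float(np.max(elbo(x, a, ms)))

x = 0.9
log_p_data = -0.5 * (np.log(2.0 * np.pi) + x ** 2)   # log N(x; 0, 1)
results = {a: best_elbo(x, a) for a in (0.0, 0.5, 0.9)}
for a, value in results.items():
    print(f"a = {a}: best ELBO = {value:.4f}, log p_data(x) = {log_p_data:.4f}")
```

All three decoders model the data perfectly, but only the one that ignores z ({a = 0}) attains {\log p_{\text{data}}(x)}; the more the decoder relies on z, the further the best achievable bound falls.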

Summary

If you are still in the mindset that VAEs are autoencoders with objectives of the form ‘reconstruction + regulariser’, the above proof that ignoring the latent variable is optimal when using powerful decoders might be unsatisfying. But remember, VAEs are not autoencoders! They are first and foremost ways to train LVMs. The objective of the VAE is a lower bound on {\log p_{\theta}(x)}. The encoder {q_{\phi}(z|x)} is introduced only as a mathematical trick to get a lower bound of {\log p_{\theta}(x)} that is computationally tractable. This bound is tight when the latent variables are ignored and the encoder returns the prior, so if it is possible to capture the data distribution (i.e. {p_{\theta}(x) = p_{\text{data}}(x)}) without using the latent variables, this will be preferred by the VAE.


I’m grateful to Jamie Townsend and Diego Fioravanti for helpful discussions leading to the writing of this post, and to Sebastian Weichwald, Alessandro Ialongo, Niki Kilbertus and Mateo Rojas-Carulla for proofreading it.