Variational Autoencoders are not autoencoders

When VAEs are trained with powerful decoders, the model can learn to ‘ignore the latent variable’. This isn’t something an autoencoder should do. In this post we’ll take a look at why this happens and why this represents a shortcoming of the name Variational Autoencoder rather than anything else.

Variational Autoencoders (VAEs) are popular for many reasons, one of which is that they provide a way to featurise data. One of their ‘failure modes’ is that if a powerful decoder is used, training can result in good scores for the objective function we optimise, yet the learned representation is completely useless in that all data points x are encoded as the prior distribution, so the latent representation z contains no information about x.

The name Variational Autoencoder throws a lot of people off when trying to understand why this happens — an autoencoder compresses observed high-dimensional data into a low-dimensional representation, so surely VAEs should always result in a good compression? In fact, this behaviour is not a failure mode of VAEs per se, but rather represents a failure mode of the name VAE!

In this post, we’ll look at what VAEs are actually trained to do — not what they sound like they ought to do — and see that this ‘pathological’ behaviour entirely makes sense. We’ll see that VAEs are a particular way to train Latent Variable Models, and that fundamentally their encoders are introduced as a mathematical trick to allow approximation of an intractable quantity. The nature of this trick is such that when powerful decoders are used, ignoring the latent variable is encouraged.

VAEs and autoencoders

An autoencoder is a type of model in which we compress data by mapping to a low dimensional space and back. Autoencoder objectives take one of the following two equivalent forms:

\begin{aligned}\min: \quad \text{Objective} &= \text{Reconstruction Error} + \text{Regulariser} \\ \max: \quad \text{Objective} &= \text{Reconstruction Quality} - \text{Regulariser} \end{aligned}

The objective of a VAE (the variational lower bound, also known as the Evidence Lower BOund or ELBO and introduced in this previous post) looks somewhat like the second of these, hence giving rise to the name Variational Autoencoder. Averaging the following over x \sim p_{\text{data}} gives the full objective to be maximised:

\displaystyle \mathcal{L}({\theta}, {\phi}, x) = \underbrace{\mathbb{E}_{z \sim q_{\phi}(z|x)}\log p_{\theta}(x|z)}_{\text{(i)}} - \underbrace{\text{KL}[q_{\phi}(z|x) || p(z)]}_{\text{(ii)}} \leq \log p_{\theta}(x)\quad (*)

Many papers and tutorials introducing VAEs will explicitly describe (i) as the ‘reconstruction’ loss and (ii) as the ‘regulariser’. However, despite appearances VAEs are not in their heart-of-hearts autoencoders: we’ll describe this in detail in the next section, but it’s of critical importance to stress that, rather than maximising a regularised reconstruction quality, the fundamental goal of a VAE is to maximise the log-likelihood \log p_\theta(x) .

This is not possible to do directly, but by introducing the approximate posterior q_\phi(z|x) we can get a tractable lower bound of the desired objective, giving us the VAE objective. The variational lower bound is precisely what its name suggests – a lower bound on the log-likelihood, not a ‘regularised reconstruction cost’. A failure to recognise this distinction has caused confusion to many.

Latent Variable Models

A Latent Variable Model (LVM) is a way to specify complex distributions over high dimensional spaces by composing simple distributions, and VAEs provide one way to train such models. An LVM is specified by fixing a prior p(z) and parameterised family of conditional distributions {p_{\theta}(x|z)}, the latter of which is also called the decoder or generator interchangeably in the literature.

For a fixed {{\theta}}, we get a distribution {p_{\theta}(x) = \int p_{\theta}(x|z) p(z) dz} over the data-space. Training an LVM requires (a) picking a divergence between {p_{\theta}(x)} and the true data distribution {p_{\text{data}}(x)}; (b) choosing {{\theta}} to minimise this.
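To make these definitions concrete, here is a minimal sketch in which the latent variable takes only two values, so the integral becomes a sum. The prior weights, the Gaussian conditionals and the function names are illustrative assumptions, not part of any particular VAE.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Simple prior p(z) over a discrete latent z in {0, 1} (made-up weights).
prior = [0.3, 0.7]

# theta collects the parameters of the simple conditionals p_theta(x|z).
theta = {"means": [-2.0, 1.0], "stds": [0.5, 1.0]}

def marginal_density(x, theta):
    """p_theta(x) = sum_z p_theta(x|z) p(z), the discrete analogue of the integral."""
    return sum(
        pz * normal_pdf(x, theta["means"][z], theta["stds"][z])
        for z, pz in enumerate(prior)
    )

print(marginal_density(0.0, theta))
```

Each ingredient is a simple distribution, but the resulting marginal is bimodal, which is the sense in which composing simple distributions yields a more complex one.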

Hang on a second – in VAEs, we maximise a lower bound on the log-likelihood, not minimise a divergence, right? In fact, it turns out that if we choose the following KL as our divergence,

\displaystyle \text{KL}[p_{\text{data}} || p_{\theta}] = \mathbb{E}_{x\sim p_{\text{data}}} \log p_{\text{data}}(x) - \mathbb{E}_{x\sim p_{\text{data}}} \log p_{\theta}(x),

then since the left expectation doesn’t depend on {{\theta}}, minimising the divergence is equivalent to maximising the right expectation, which happens to be the log-likelihood.

Since the {\text{KL}} is a divergence, we have that {\text{KL}[p_{\text{data}} || p_{\theta}] \geq 0} with equality if and only if {p_{\theta} = p_{\text{data}}}. This means that the maximum possible value of {\mathbb{E}_{x\sim p_{\text{data}}} \log p_{\theta}(x)} occurs when {p_{\theta} = p_{\text{data}}}, at which point {\mathbb{E}_{x\sim p_{\text{data}}} \log p_{\theta}(x) = \mathbb{E}_{x\sim p_{\text{data}}} \log p_{\text{data}}(x)}. Because the VAE objective is a lower bound on the log-likelihood, it can never exceed this value, so any solution that attains it is a global optimum of the VAE objective.
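The relationship between the KL divergence and the expected log-likelihood can be checked numerically. The sketch below uses small discrete distributions whose values are made up for illustration; the function names are likewise hypothetical.

```python
import math

# Hypothetical data distribution over three outcomes.
p_data = [0.5, 0.3, 0.2]

def expected_log_lik(p_model):
    """E_{x ~ p_data} log p_model(x)."""
    return sum(pd * math.log(pm) for pd, pm in zip(p_data, p_model))

def kl(p, q):
    """KL[p || q] for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "close": [0.45, 0.35, 0.20],
    "exact": p_data,
}

# E log p_data(x): the term that does not depend on the model.
neg_entropy = expected_log_lik(p_data)

for name, p_model in candidates.items():
    # KL[p_data || p_model] = E log p_data(x) - E log p_model(x)
    print(name, kl(p_data, p_model), neg_entropy - expected_log_lik(p_model))
```

The two printed columns agree for every candidate, and the expected log-likelihood is largest (with zero KL) for the candidate equal to p_data.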

Although {p(z)} and {p_{\theta}(x|z)} are usually chosen to be simple and easy to evaluate, {p_{\theta}(x) = \int p_{\theta}(x|z) p(z) dz} is generally difficult to evaluate since it involves computing an integral. {\log p_{\theta}(x)} can’t easily be evaluated, but the variational lower bound of this quantity, \mathcal{L}(\theta, \phi, x), can be. This involves introducing a new family of conditional distributions {q_{\phi}(z|x)} which we call the approximate posterior. Provided we have made sensible choices about the family of distributions {q_{\phi}(z|x)}, \mathcal{L}(\theta, \phi, x) will be simple to evaluate (and differentiate through) but the price we pay is the gap between the true posterior {p_{\theta}(z|x)} and the approximate posterior {q_{\phi}(z|x)}. This is derived in more detail here.

\log p_{\theta}(x) = \mathcal{L}({\theta}, {\phi}, x) + \text{KL}[q_{\phi}(z|x) || p_{\theta}(z|x)] \geq \mathcal{L}({\theta}, {\phi}, x) \quad (**)

While it is indeed tempting to look at the definition of \mathcal{L}(\theta, \phi, x) in Equation (*) and think ‘reconstruction + regulariser’ as many people do, it’s important to remember that the encoder {q_{\phi}(z|x)} was only introduced as a trick: we’re actually trying to train an LVM and the thing we want to maximise is {\log p_{\theta}(x)}. {q_{\phi}(z|x)} doesn’t actually have anything to do with this term beyond a bit of mathematical gymnastics that gives us an easily computable approximation to {\log p_{\theta}(x)}.
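Equation (**) can be verified directly on a toy model in which everything is discrete and the marginal is a finite sum. The numbers and names below are assumptions chosen purely for illustration.

```python
import math

# Toy LVM: z in {0, 1, 2}, x in {0, 1}. All probabilities are made up.
prior = [0.2, 0.5, 0.3]                           # p(z)
decoder = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]    # decoder[z][x] = p_theta(x|z)

def log_marginal(x):
    """log p_theta(x) = log sum_z p_theta(x|z) p(z)."""
    return math.log(sum(pz * decoder[z][x] for z, pz in enumerate(prior)))

def posterior(x):
    """True posterior p_theta(z|x), computed with Bayes' rule."""
    joint = [pz * decoder[z][x] for z, pz in enumerate(prior)]
    total = sum(joint)
    return [j / total for j in joint]

def kl(p, q):
    """KL[p || q] for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def elbo(x, q):
    """Equation (*): E_{z~q} log p_theta(x|z) - KL[q || p(z)]."""
    reconstruction = sum(qz * math.log(decoder[z][x]) for z, qz in enumerate(q))
    return reconstruction - kl(q, prior)

x = 1
q = [0.1, 0.3, 0.6]           # an arbitrary approximate posterior q_phi(z|x)
gap = kl(q, posterior(x))     # KL[q_phi(z|x) || p_theta(z|x)]
print(log_marginal(x), elbo(x, q) + gap)
```

Swapping in a different q changes both elbo(x, q) and gap, but never their sum, which is the sense in which the encoder has nothing to do with the value of the log-likelihood itself.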

Powerful decoders

For our purposes, we will define a decoder (i.e. a family of conditional distributions {p_{\theta}(x|z)}) to be powerful with respect to {p_{\text{data}}} if there exists a {{\theta}^*} such that {p_{\theta^*}(x|z) = p_{\text{data}}(x)} for all z and x. This is a property of both the family of decoders and the data itself. In words, a decoder is powerful if it is possible to perfectly describe the data distribution without using the latent variable.

When people talk about powerful decoders and ‘ignoring the latent variables’, they are often referring to a case in which {p_{\text{data}}} is a complex dataset of images, and the decoder is a very expressive auto-regressive architecture (e.g. PixelCNN).

However, this also happens in much simpler cases: suppose that {p_{\text{data}}} is Gaussian {\mathcal{N}(\mu_{\text{data}}, \Sigma_{\text{data}})} and that we use a Gaussian decoder {p_{\theta}(x|z) = \mathcal{N}(\mu_{\theta}(z), \Sigma_{\theta}(z))}, where {\mu_{\theta}} and {\Sigma_{\theta}} are parameterised by neural networks. In this case, the decoder is powerful with respect to {p_{\text{data}}}, provided that the neural networks are capable of modelling the constant functions {\mu_{\theta}(z) = \mu_{\text{data}}} and {\Sigma_{\theta}(z) = \Sigma_{\text{data}}}.

As a brief aside, suppose we use a Gaussian decoder, but with non-Gaussian {p_{\text{data}}}. The decoder can be made more expressive by adding more layers to the network, but it will not be possible to make the decoder powerful with respect to {p_{\text{data}}} by only adding more and more layers: doing so would require conditional distributions that are more expressive than Gaussians.

It’s quite easy to prove using (**) that ‘ignoring the latent variable’ in VAEs with decoders that are powerful with respect to the data is actually optimal behaviour.

Claim: Suppose that (i) there exists {{\theta^*}} such that {p_{\theta^*}(x|z) = p_{\text{data}}(x)} for all x and z, and (ii) there exists {{\phi^*}} such that {q_{\phi^*}(z|x) = p(z)} for all x and z. Then {({\theta^*}, {\phi^*})} is a globally optimal solution to the VAE objective.


Proof: If {p_{\theta^*}(x|z) = p_{\text{data}}(x)} for all z, then {p_{\theta^*}(x) = p_{\text{data}}(x)} and, by Bayes’ rule, {p_{\theta^*}(z|x) = p(z)}. Since {q_{\phi^*}(z|x) = p(z)} as well, {\text{KL}[q_{\phi^*}(z|x) || p_{\theta^*}(z|x)] = 0} and so the variational lower bound in Equation (**) is tight. That is,

\begin{aligned}\log p_{\theta^*}(x) &= \mathcal{L}({\theta^*}, {\phi^*}, x) \\&= \mathbb{E}_{z \sim q_{\phi^*}(z|x)} [ \log p_{\theta^*}(x|z) ] - \text{KL}[q_{\phi^*}(z|x) || p(z)] \\&= \log p_{\text{data}}(x)\end{aligned}

Thus the objective of the VAE is at its global optimum. \Box
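The claim can be sanity-checked on a small discrete example. This is only a sketch: the distributions, the alternative encoder and decoder, and all names are made up for illustration.

```python
import math

p_data = [0.7, 0.3]        # illustrative data distribution over x in {0, 1}
prior = [0.2, 0.5, 0.3]    # p(z) over z in {0, 1, 2}

def kl(p, q):
    """KL[p || q] for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def elbo(x, q, decoder):
    """E_{z~q} log p(x|z) - KL[q || p(z)], with decoder[z][x] = p(x|z)."""
    reconstruction = sum(qz * math.log(decoder[z][x]) for z, qz in enumerate(q))
    return reconstruction - kl(q, prior)

def average_elbo(encoder, decoder):
    """ELBO averaged over x ~ p_data; encoder(x) returns q(z|x) as a list."""
    return sum(px * elbo(x, encoder(x), decoder) for x, px in enumerate(p_data))

# theta*: every row of the decoder equals p_data, so z is ignored.
decoder_star = [p_data, p_data, p_data]

def encoder_star(x):
    """phi*: return the prior whatever x is."""
    return prior

# An alternative pair that does use z (numbers chosen arbitrarily).
decoder_other = [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]]

def encoder_other(x):
    return [0.4, 0.4, 0.2] if x == 0 else [0.1, 0.3, 0.6]

upper_bound = sum(px * math.log(px) for px in p_data)   # E log p_data(x)
print(upper_bound, average_elbo(encoder_star, decoder_star),
      average_elbo(encoder_other, decoder_other))
```

The pair that ignores the latent variable reaches the upper bound exactly, while the arbitrary alternative falls below it.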

What if {p_{\theta}(x) = p_{\text{data}}(x)} but {p_{\theta}(x|z)} isn’t independent of {z}?

If we have powerful decoders, it may well be that there is a setting of the parameters {{\theta^+}} such that {p_{\theta^+}(x) = p_{\text{data}}(x)} and for which {p_{\theta^+}(x|z)} does actually depend on {z}. In this case, for any {{\phi}} we have

\begin{aligned}\mathcal{L}({\theta^*}, {\phi^*}, x) &= \log p_{\text{data}}(x) \\&= \log p_{\theta^+}(x) \\&= \mathcal{L}({\theta^+}, {\phi}, x) + \text{KL}[q_{\phi}(z|x) || p_{\theta^+}(z|x)] \\&\geq \mathcal{L}({\theta^+}, {\phi}, x)\end{aligned}

and so {\mathcal{L}({\theta^+}, {\phi}, x)} will be strictly worse than the global optimum for any {{\phi}} for which {\text{KL}[q_{\phi}(z|x) || p_{\theta^+}(z|x)] > 0}. If {p_{\theta^+}(x|z)} depends on {z}, the posterior distribution {p_{\theta^+}(z|x)} is likely to be complex. Since {q_{\phi}(z|x)} must by design be a reasonably simple family of distributions, it is unlikely that there exists a {{\phi}} such that {\text{KL}[q_{\phi}(z|x) || p_{\theta^+}(z|x)] = 0} for all {x}, and hence it is likely that for any {{\phi}},

\mathcal{L}({\theta^+}, {\phi}, x) < \mathcal{L}({\theta^*}, {\phi^*}, x)

which is to say that the solution {p_{\theta^*}(x) = p_{\text{data}}(x)} will be preferred by the VAE over {p_{\theta^+}(x) = p_{\text{data}}(x)}.

Put differently, and subject to some caveats about the richness of the family of distributions q_\phi(z|x): if there is an optimal solution which ignores the latent code, it is probably the unique optimal solution.
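A toy illustration of this preference, under loudly stated assumptions: everything is discrete, the decoder called decoder_plus plays the role of {\theta^+}, and the encoder family is artificially restricted to a few values as a crude stand-in for a limited variational family. None of these numbers come from a real model.

```python
import math

prior = [0.5, 0.5]      # p(z), z in {0, 1}
p_data = [0.5, 0.5]     # data distribution over x in {0, 1}

# theta+: the decoder uses z (x copies z with probability 0.9),
# yet its marginal over x is still exactly p_data.
decoder_plus = [[0.9, 0.1], [0.1, 0.9]]

def kl(p, q):
    """KL[p || q] for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def elbo(x, q, decoder):
    """E_{z~q} log p(x|z) - KL[q || p(z)], with decoder[z][x] = p(x|z)."""
    reconstruction = sum(qz * math.log(decoder[z][x]) for z, qz in enumerate(q))
    return reconstruction - kl(q, prior)

def posterior_plus(x):
    """True posterior p_{theta+}(z|x) via Bayes' rule."""
    joint = [pz * decoder_plus[z][x] for z, pz in enumerate(prior)]
    total = sum(joint)
    return [j / total for j in joint]

# Assumed restriction: q(z=1|x) may only take values in [0.3, 0.7].
restricted_family = [[1 - p, p] for p in (0.3, 0.4, 0.5, 0.6, 0.7)]

x = 1
optimum = math.log(p_data[x])     # value attained by (theta*, phi*)
best_restricted = max(elbo(x, q, decoder_plus) for q in restricted_family)
unrestricted = elbo(x, posterior_plus(x), decoder_plus)
print(optimum, best_restricted, unrestricted)
```

With the restricted family the best achievable bound is strictly below the optimum, whereas an encoder rich enough to match the true posterior would tie with it; this is the caveat about the richness of the encoder family.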

Summary

If you are still in the mindset that VAEs are autoencoders with objectives of the form ‘reconstruction + regulariser’, the above proof that ignoring the latent variable is optimal when using powerful decoders might be unsatisfying. But remember, VAEs are not autoencoders! They are first and foremost ways to train LVMs. The objective of the VAE is a lower bound on {\log p_{\theta}(x)}. The encoder {q_{\phi}(z|x)} is introduced only as a mathematical trick to get a lower bound of {\log p_{\theta}(x)} that is computationally tractable. This bound is exact when the latent variables are ignored, so if it is possible to capture the data distribution — i.e. {p_{\theta}(x) = p_{\text{data}}(x)} — without using the latent variables, this will be preferred by the VAE.


I’m grateful to Jamie Townsend and Diego Fioravanti for helpful discussions leading to the writing of this post, and to Sebastian Weichwald, Alessandro Ialongo, Niki Kilbertus and Mateo Rojas-Carulla for proofreading it.

Deriving the variational lower bound

Basic properties of the variational lower bound, a.k.a. ELBO (evidence lower bound).

Often in probabilistic modelling, we are interested in maximising the probability of some observed data given the model, by tuning the model parameters \theta to maximise \prod_i  p_\theta(x_i) where x_i are the observed data. The fact that we are maximising the product of the p_\theta(x_i) corresponds to an assumption that each x_i is drawn i.i.d. from the true data distribution p_{\text{data}}(x).

In practice, it’s mathematically and computationally much more convenient to consider the logarithm of the product, so that our objective to maximise with respect to \theta is:

\sum_i \log p_\theta(x_i)

In the rest of this post we’ll simplify things by just considering \log p_\theta(x) for a single data point x.

Latent Variable Models

In a Latent Variable Model (LVM), as is the case for Variational Autoencoders, our model distribution is obtained by combining a simple distribution p(z) with a parametrised family of conditional distributions p_\theta(x|z), so that our objective can be written

\log p_\theta(x) = \log \left( \int p_\theta(x|z) p(z) dz \right).

Although p(z) and p_\theta(x|z) will generally be simple by choice, it may be impossible to compute \log p_\theta(x) analytically due to the need to solve the integral inside the logarithm. In many practical situations (e.g. anything involving neural networks), we’d not only like to be able to evaluate \log p_\theta(x) but also differentiate it with respect to \theta if we are to fit the model.

Variational Inference

The magic of variational inference hinges on the following two key observations.

First, we can choose any distribution q(z), multiply the inside of the integral by \frac{q(z)}{q(z)} and rearrange without changing its value. (This has a strong connection to Importance Sampling, see below.) Thus we can rewrite our objective as

\log p_\theta(x) = \log \left( \int p_\theta(x|z) \frac{p(z)}{q(z)} q(z) dz \right).

Second, since \log is concave and the integral can be written as an expectation, we can use Jensen’s inequality to swap the \log and \mathbb{E}. This results in a (variational) lower bound consisting of terms we can evaluate, provided we have chosen p_\theta(x|z), p(z) and q(z) suitably:

\begin{aligned} \log p_\theta(x) &= \log \left( \mathbb{E}_{q(z)} p_\theta(x|z) \frac{p(z)}{q(z)} \right) \\&\geq \mathbb{E}_{q(z)} \left[ \log p_\theta(x|z) + \log p(z) - \log q(z) \right] \\&=\mathbb{E}_{q(z)} \left[ \log p_\theta(x|z) \right] - \text{KL}\left[q(z) || p(z) \right] \end{aligned}

Recall that the above inequality holds for any q(z). Since we are probably interested in fitting the model to multiple data points, we can substitute q(z) with q_\phi(z|x), which depends on x and a parameter \phi. This is the notation you’ll often see in the literature (e.g. the original VAE paper, equation (3)):

\begin{aligned} \log p_\theta(x) &\geq \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] - \text{KL}\left[q_\phi(z|x) || p(z) \right] =: \mathcal{L}(x, \theta, \phi) \end{aligned}

Note that the terms variational lower bound, evidence lower bound and ELBO are used interchangeably in the literature.

How tight is the variational lower bound?

By properties of the logarithm and one application of Bayes’ rule, it’s straightforward to calculate the tightness of this bound.

\begin{aligned} &\log p_\theta(x) - \mathcal{L}(x, \theta, \phi)\\&=\log p_\theta(x) - \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] + \text{KL}\left[q_\phi(z|x) || p(z) \right] \\&= \mathbb{E}_{q_\phi(z|x)} \left[\log p_\theta(x) - \log p_\theta(x|z) + \log q_\phi(z|x) - \log p(z) \right] \\&= \mathbb{E}_{q_\phi(z|x)} \left[\log p_\theta(x) - \log \frac{p_\theta(z|x) {p_\theta(x)}}{p(z)} + \log q_\phi(z|x) - \log p(z) \right] \\&= \mathbb{E}_{q_\phi(z|x)} \left[\log q_\phi(z|x) - \log p_\theta(z|x) \right] \\&= \text{KL}\left[q_\phi(z|x) || p_\theta(z|x) \right]\end{aligned}

Summary

Writing the above equations in a slightly compressed form, we have

\log p_\theta(x) = \mathcal{L}(x, \theta, \phi) + \text{KL}\left[q_\phi(z|x) || p_\theta(z|x)\right] \geq \mathcal{L}(x, \theta, \phi)

To repeat this in words: the Jensen gap of the variational lower bound is the KL divergence between q_\phi(z|x) and the true posterior p_\theta(z|x). For a fixed \theta, maximising \mathcal{L}(x, \theta, \phi) with respect to \phi is equivalent to minimising \text{KL}\left[q_\phi(z|x) || p_\theta(z|x)\right]. This is why q_\phi(z|x) is often called the approximate posterior.
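For a model where every term has a closed form, the identity can be checked numerically. The sketch below assumes a one-dimensional linear-Gaussian model (prior N(0, 1), decoder N(a z, s2), Gaussian q); this choice and all parameter values are illustrative assumptions rather than anything specific to VAEs.

```python
import math

a, s2 = 2.0, 0.5    # decoder p_theta(x|z) = N(a*z, s2); prior p(z) = N(0, 1)

def log_marginal(x):
    """log p_theta(x), where p_theta(x) = N(0, a^2 + s2) for this model."""
    var = a * a + s2
    return -0.5 * math.log(2 * math.pi * var) - x * x / (2 * var)

def elbo(x, m, v):
    """ELBO for q(z|x) = N(m, v): E_q log p_theta(x|z) - KL[q || p(z)]."""
    reconstruction = (-0.5 * math.log(2 * math.pi * s2)
                      - ((x - a * m) ** 2 + a * a * v) / (2 * s2))
    kl_to_prior = 0.5 * (v + m * m - 1.0 - math.log(v))
    return reconstruction - kl_to_prior

def posterior(x):
    """Mean and variance of the exact Gaussian posterior p_theta(z|x)."""
    v_post = 1.0 / (1.0 + a * a / s2)
    m_post = v_post * a * x / s2
    return m_post, v_post

def kl_gauss(m0, v0, m1, v1):
    """KL[N(m0, v0) || N(m1, v1)] for one-dimensional Gaussians."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

x = 1.3
m, v = 0.2, 0.7     # an arbitrary, deliberately poor approximate posterior
gap = log_marginal(x) - elbo(x, m, v)
print(gap, kl_gauss(m, v, *posterior(x)))
```

The two printed numbers coincide, and the gap vanishes when (m, v) is set to the exact posterior mean and variance.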


Bonus material: connection to Importance Sampling

In principle, you could think about trying to numerically approximate the integral \int p_\theta(x|z) p(z) dz by Monte Carlo sampling: draw a bunch of samples z_1, \ldots, z_k \sim p(z) and estimate the integral as

\int p_\theta(x|z) p(z) dz = \mathbb{E}_{p(z)}p_\theta(x|z) \approx \frac{1}{k}\sum_{i=1}^k p_\theta(x|z_i) .

Of course, this probably wouldn’t help for fitting the model, as performing Monte Carlo integration as part of an inner optimisation loop would be painfully slow. But there’s a second reason that this is a sub-optimal course of action.

p_\theta(x|z) is the probability of the particular data point x given z. Let’s suppose that for each z, p_\theta(x|z) only puts a significant amount of probability mass on a small set of x, and that this set differs as we vary z. (Note: this will be the case with Gaussian decoders with concentrated covariances for most non-trivial datasets.) Then for a fixed x, p_\theta(x|z) will be very small for most values of z and massive for a tiny set of values. In other words, our estimator \frac{1}{k}\sum_{i=1}^k p_\theta(x|z_i) will have extremely large variance.
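The variance problem is easy to observe in simulation. The sketch below assumes a one-dimensional model with prior N(0, 1) and a concentrated Gaussian decoder N(a z, s2), for which the true marginal is known; the model, the parameter values and the seed are all illustrative choices.

```python
import math
import random
import statistics

a, s2 = 1.0, 0.01    # concentrated decoder p_theta(x|z) = N(a*z, s2); prior N(0, 1)

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def true_marginal(x):
    """p_theta(x) = N(0, a^2 + s2), available in closed form for this model."""
    return normal_pdf(x, 0.0, a * a + s2)

def naive_estimate(x, k, rng):
    """(1/k) sum_i p_theta(x|z_i) with z_1, ..., z_k drawn from the prior."""
    return sum(normal_pdf(x, a * rng.gauss(0.0, 1.0), s2) for _ in range(k)) / k

rng = random.Random(0)
x, k = 2.0, 10
estimates = [naive_estimate(x, k, rng) for _ in range(1000)]

# Spread of the k-sample estimator relative to the quantity being estimated.
relative_std = statistics.pstdev(estimates) / true_marginal(x)
print(true_marginal(x), statistics.mean(estimates), relative_std)
```

Most individual estimates are close to zero and a few are far too large, so although the estimator is unbiased, its standard deviation exceeds the value it is trying to estimate.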

We can improve things by using a trick called Importance Sampling, which really amounts to the observation that for any distribution q(z), multiplying the integrand by \frac{q(z)}{q(z)} and rearranging doesn’t change the value of the integral.

\begin{aligned}\int p_\theta(x|z) p(z) dz &= \int p_\theta(x|z) \frac{p(z)}{q(z)} q(z) dz \\ &= \mathbb{E}_{q(z)}p_\theta(x|z) \frac{p(z)}{q(z)} \\&\approx \frac{1}{k}\sum_{i=1}^k p_\theta(x|z_i)\frac{p(z_i)}{q(z_i)} \qquad z_1, \ldots, z_k \sim q(z) \end{aligned}

The idea here is that if q(z) is chosen to put more mass on values of z for which p_\theta(x|z) is large, the importance sampling estimator will have lower variance than the naive one. In fact, if we could choose q(z) = p_\theta(z|x), the posterior distribution over z, our estimator would have variance zero! This means it would be possible to perfectly estimate the integral with only one sample. To see this, observe that by Bayes’ rule,

\begin{aligned} p_\theta(x|z)\frac{p(z)}{p_\theta(z|x)} &= p_\theta(x|z)\frac{p(z)p_\theta(x)}{p_\theta(x|z)p(z)} \\ &=p_\theta(x)\end{aligned}

So regardless of which z\sim p_\theta(z|x) we would draw, our one-sample Monte Carlo estimator would give the correct answer. Unfortunately, calculating p_\theta(z|x) itself requires knowing the value of p_\theta(x), so this insight doesn’t give us a trick to quickly calculate p_\theta(x)! It does, however, give us a connection to the Jensen gap of the variational bound. Since p_\theta(x|z)\frac{p(z)}{p_\theta(z|x)} is constant in z, \mathbb{E}_{p_\theta(z|x)}p_\theta(x|z) \frac{p(z)}{p_\theta(z|x)} is the expectation of a constant function and thus

\log p_\theta(x) = \log\left(\mathbb{E}_{p_\theta(z|x)}p_\theta(x|z) \frac{p(z)}{p_\theta(z|x)}\right) = \mathbb{E}_{p_\theta(z|x)} \log\left( p_\theta(x|z) \frac{p(z)}{p_\theta(z|x)}\right)

The right hand side is the variational lower bound with q_\phi(z|x) = p_\theta(z|x). This equation says that this bound is tight when the approximate posterior is equal to the true posterior, which we already learned above.
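The zero-variance property can be seen numerically in a model whose posterior is known exactly. The sketch below assumes a one-dimensional linear-Gaussian model (prior N(0, 1), decoder N(a z, s2)); the model, parameter values, sample counts and seed are illustrative assumptions.

```python
import math
import random
import statistics

a, s2 = 1.0, 0.01    # decoder p_theta(x|z) = N(a*z, s2); prior p(z) = N(0, 1)

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def importance_weight(x, z, q_mean, q_var):
    """p_theta(x|z) p(z) / q(z) for a Gaussian proposal q = N(q_mean, q_var)."""
    return (normal_pdf(x, a * z, s2) * normal_pdf(z, 0.0, 1.0)
            / normal_pdf(z, q_mean, q_var))

x = 2.0
true_marginal = normal_pdf(x, 0.0, a * a + s2)    # p_theta(x) in closed form

# Exact posterior p_theta(z|x) of this linear-Gaussian model.
post_var = 1.0 / (1.0 + a * a / s2)
post_mean = post_var * a * x / s2

rng = random.Random(0)

# Proposal = posterior: every single-sample estimate equals p_theta(x).
posterior_weights = [
    importance_weight(x, rng.gauss(post_mean, math.sqrt(post_var)), post_mean, post_var)
    for _ in range(200)
]

# Proposal = prior: this reduces to the naive Monte Carlo estimator.
prior_weights = [
    importance_weight(x, rng.gauss(0.0, 1.0), 0.0, 1.0)
    for _ in range(200)
]

print(true_marginal, posterior_weights[:3])
print(statistics.pstdev(posterior_weights), statistics.pstdev(prior_weights))
```

Up to floating-point rounding, each weight drawn under the posterior proposal is the same number, namely the true marginal, while the weights drawn under the prior vary over many orders of magnitude.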