When VAEs are trained with powerful decoders, the model can learn to ‘ignore the latent variable’. This isn’t something an autoencoder should do. In this post we’ll take a look at why this happens and why this represents a shortcoming of the name Variational Autoencoder rather than anything else.
Variational Autoencoders (VAEs) are popular for many reasons, one of which is that they provide a way to featurise data. One of their ‘failure modes’ is that if a powerful decoder is used, training can result in good scores for the objective function we optimise, yet the learned representation is completely useless: all data points are encoded as the prior distribution, so the latent representation $z$ contains no information about the data $x$.
The name Variational Autoencoder throws a lot of people off when trying to understand why this happens — an autoencoder compresses observed high-dimensional data into a low-dimensional representation, so surely VAEs should always result in a good compression? In fact, this behaviour is not a failure mode of VAEs per se, but rather represents a failure mode of the name VAE!
In this post, we’ll look at what VAEs are actually trained to do — not what they sound like they ought to do — and see that this ‘pathological’ behaviour entirely makes sense. We’ll see that VAEs are a particular way to train Latent Variable Models, and that fundamentally their encoders are introduced as a mathematical trick to allow approximation of an intractable quantity. The nature of this trick is such that when powerful decoders are used, ignoring the latent variable is encouraged.
VAEs and autoencoders
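An autoencoder is a type of model in which we compress data by mapping to a low dimensional space and back. Autoencoder objectives take one of the following two equivalent forms:
$$\text{minimise:}\quad \text{reconstruction loss} + \text{regulariser}$$
$$\text{maximise:}\quad \text{reconstruction quality} - \text{regulariser}$$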
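The objective of a VAE (the variational lower bound, also known as the Evidence Lower BOund or ELBO and introduced in this previous post) looks somewhat like the second of these, hence giving rise to the name Variational Autoencoder. Averaging the following over data points $x$ gives the full objective to be maximised:
$$\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]}_{\text{(i)}} - \underbrace{\mathrm{KL}\left[q_\phi(z|x) \,\|\, p(z)\right]}_{\text{(ii)}}$$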
Many papers and tutorials introducing VAEs will explicitly describe (i) as the ‘reconstruction’ loss and (ii) as the ‘regulariser’. However, despite appearances VAEs are not in their heart-of-hearts autoencoders: we’ll describe this in detail in the next section, but it’s of critical importance to stress that, rather than maximising a regularised reconstruction quality, the fundamental goal of a VAE is to maximise the log-likelihood $\log p_\theta(x)$ of the data under the model.
This is not possible to do directly, but by introducing the approximate posterior $q_\phi(z|x)$ we can get a tractable lower bound of the desired objective, giving us the VAE objective. The variational lower bound is precisely what its name suggests: a lower bound on the log-likelihood, not a ‘regularised reconstruction cost’. A failure to recognise this distinction has confused many people.
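To make the two terms of the objective concrete, here is a minimal NumPy sketch on a toy model that is not from the post: a one-dimensional latent with prior $N(0, 1)$, a linear Gaussian decoder $N(x; wz + b, \sigma^2)$ and a Gaussian encoder $N(z; m, s^2)$. All names (`elbo_terms`, `log_marginal`, `w`, `b`, `sigma2`, `m`, `s2`) are hypothetical and chosen for illustration. In this toy case both terms and the exact log-likelihood have closed forms, so we can check directly that the objective never exceeds $\log p_\theta(x)$.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log-density of a 1D Gaussian N(x; mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo_terms(x, w, b, sigma2, m, s2):
    """Closed-form terms (i) and (ii) for the toy linear Gaussian model."""
    # (i) E_q[log p(x|z)], using E_q[(x - b - w z)^2] = (x - b - w m)^2 + w^2 s2
    recon = -0.5 * np.log(2 * np.pi * sigma2) \
        - ((x - b - w * m) ** 2 + w ** 2 * s2) / (2 * sigma2)
    # (ii) KL[N(m, s2) || N(0, 1)]
    kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return recon, kl

def log_marginal(x, w, b, sigma2):
    """Exact log p(x): integrating out z gives N(x; b, w^2 + sigma2)."""
    return log_normal(x, b, w ** 2 + sigma2)

# Arbitrary illustrative values for the data point and decoder parameters.
x, w, b, sigma2 = 1.3, 0.8, 0.2, 0.5
for m, s2 in [(0.0, 1.0), (0.5, 0.3), (-1.0, 2.0)]:
    recon, kl = elbo_terms(x, w, b, sigma2, m, s2)
    print(f"q=N({m}, {s2}): (i)={recon:.3f}  (ii)={kl:.3f}  "
          f"ELBO={recon - kl:.3f}  log p(x)={log_marginal(x, w, b, sigma2):.3f}")
```

Whatever encoder parameters are tried, the printed ELBO stays below the exact log-likelihood, which is the sense in which the objective is a lower bound rather than a reconstruction cost.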
Latent Variable Models
A Latent Variable Model (LVM) is a way to specify complex distributions over high dimensional spaces by composing simple distributions, and VAEs provide one way to train such models. An LVM is specified by fixing a prior $p(z)$ and a parameterised family of conditional distributions $p_\theta(x|z)$; the latter is called the decoder or the generator, interchangeably, in the literature.
For a fixed $\theta$, we get a distribution $p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$ over the data-space. Training an LVM requires (a) picking a divergence between $p_\theta(x)$ and the true data distribution $p_{\mathcal{D}}(x)$; (b) choosing $\theta$ to minimise this.
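Hang on a second: in VAEs, we maximise a lower bound on the log-likelihood, not minimise a divergence, right? In fact, it turns out that if we choose the following KL as our divergence,
$$\mathrm{KL}\left[p_{\mathcal{D}} \,\|\, p_\theta\right] = \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\log p_{\mathcal{D}}(x)\right] - \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\log p_\theta(x)\right],$$
then since the left expectation doesn’t depend on $\theta$, minimising the divergence is equivalent to maximising the right expectation, which happens to be the log-likelihood.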
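Since the KL is a divergence, we have that $\mathrm{KL}\left[p_{\mathcal{D}} \,\|\, p_\theta\right] \geq 0$ with equality if and only if $p_\theta = p_{\mathcal{D}}$. This means that the maximum possible value of $\mathbb{E}_{p_{\mathcal{D}}(x)}\left[\log p_\theta(x)\right]$ occurs when $p_\theta = p_{\mathcal{D}}$, at which point $\mathbb{E}_{p_{\mathcal{D}}(x)}\left[\log p_\theta(x)\right] = \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\log p_{\mathcal{D}}(x)\right]$. Because the VAE objective is a lower bound on the log-likelihood, it can never exceed this value, so this is the global optimum of the VAE objective.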
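The equivalence between minimising the KL and maximising the expected log-likelihood can be checked numerically. The sketch below uses a made-up three-outcome data distribution and a made-up one-parameter model family (`p_data`, `model` and the grid `thetas` are all hypothetical, chosen only for illustration), and confirms that the KL equals a constant minus the expected log-likelihood, so both are optimised by the same parameter.

```python
import numpy as np

# Hypothetical data distribution over three outcomes.
p_data = np.array([0.5, 0.3, 0.2])

def model(theta):
    """Hypothetical one-parameter family: softmax of theta * [0, 1, 2]."""
    logits = theta * np.arange(3)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl(p, q):
    """KL[p || q] for discrete distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def expected_log_lik(p, q):
    """E_p[log q(x)], the expected log-likelihood of the model q under data p."""
    return float(np.sum(p * np.log(q)))

thetas = np.linspace(-2.0, 2.0, 401)
kls = np.array([kl(p_data, model(t)) for t in thetas])
lls = np.array([expected_log_lik(p_data, model(t)) for t in thetas])

# The 'left expectation' E_p[log p(x)] does not depend on theta.
neg_entropy = float(np.sum(p_data * np.log(p_data)))

print("KL == const - loglik:", np.allclose(kls, neg_entropy - lls))
print("argmin KL:", thetas[np.argmin(kls)], " argmax loglik:", thetas[np.argmax(lls)])
```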
Although $p(z)$ and $p_\theta(x|z)$ are usually chosen to be simple and easy to evaluate, $p_\theta(x)$ is generally difficult to evaluate since it involves computing an integral over $z$. So $\log p_\theta(x)$ can’t easily be evaluated, but the variational lower bound of this quantity, $\mathcal{L}(\theta, \phi; x)$, can be. This involves introducing a new family of conditional distributions $q_\phi(z|x)$ which we call the approximate posterior. Provided we have made sensible choices about the family of distributions $q_\phi(z|x)$, $\mathcal{L}(\theta, \phi; x)$ will be simple to evaluate (and differentiate through), but the price we pay is the gap between the true posterior $p_\theta(z|x)$ and the approximate posterior $q_\phi(z|x)$: $\log p_\theta(x) - \mathcal{L}(\theta, \phi; x) = \mathrm{KL}\left[q_\phi(z|x) \,\|\, p_\theta(z|x)\right] \geq 0$. This is derived in more detail here.
While it is indeed tempting to look at the definition of $\mathcal{L}$ above and think ‘reconstruction + regulariser’ as many people do, it’s important to remember that the encoder $q_\phi(z|x)$ was only introduced as a trick: we’re actually trying to train an LVM and the thing we want to maximise is $\log p_\theta(x)$. The encoder doesn’t actually have anything to do with this term beyond a bit of mathematical gymnastics that gives us an easily computable approximation to $\log p_\theta(x)$.
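The ‘price we pay’ can be made exact on a small example. The sketch below assumes a made-up model with a binary latent and three possible observations (the arrays `prior` and `likelihood` are hypothetical), for which the marginal and the true posterior can be computed by direct summation. It checks the standard identity that the log-likelihood equals the lower bound plus the KL from the approximate posterior to the true posterior, so the bound is tight exactly when the two coincide.

```python
import numpy as np

# Hypothetical discrete model: z in {0, 1}, x in {0, 1, 2}.
prior = np.array([0.6, 0.4])              # p(z)
likelihood = np.array([[0.7, 0.2, 0.1],   # p(x | z=0)
                       [0.1, 0.3, 0.6]])  # p(x | z=1)

def elbo(x, q):
    """Lower bound for observation x and approximate posterior q over z."""
    recon = np.sum(q * np.log(likelihood[:, x]))
    kl_to_prior = np.sum(q * (np.log(q) - np.log(prior)))
    return float(recon - kl_to_prior)

def log_marginal(x):
    """Exact log p(x), available here because z takes only two values."""
    return float(np.log(np.sum(prior * likelihood[:, x])))

def posterior(x):
    """True posterior p(z | x) by Bayes' rule."""
    joint = prior * likelihood[:, x]
    return joint / joint.sum()

def gap(x, q):
    """KL[q || p(z|x)]: the slack between the bound and log p(x)."""
    post = posterior(x)
    return float(np.sum(q * (np.log(q) - np.log(post))))

q = np.array([0.5, 0.5])  # an arbitrary approximate posterior
for x in range(3):
    print(x, log_marginal(x), elbo(x, q) + gap(x, q), elbo(x, posterior(x)))
```

For each observation the three printed numbers agree: the log-likelihood, the bound plus the gap, and the bound evaluated at the true posterior.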
For our purposes, we will define a decoder (i.e. a family of conditional distributions $p_\theta(x|z)$) to be powerful with respect to $p_{\mathcal{D}}(x)$ if there exists a $\theta^*$ such that $p_{\theta^*}(x|z) = p_{\mathcal{D}}(x)$ for all $x$ and $z$. This is a property of both the family of decoders and the data itself. In words, a decoder is powerful if it is possible to perfectly describe the data distribution without using the latent variable.
When people talk about powerful decoders and ‘ignoring the latent variables’, they are often referring to a case in which the data is a complex dataset of images, and the decoder is a very expressive auto-regressive architecture (e.g. PixelCNN).
However, this also happens in much simpler cases: suppose that $p_{\mathcal{D}}(x) = \mathcal{N}(x; \mu_{\mathcal{D}}, \Sigma_{\mathcal{D}})$ is Gaussian and that we use a Gaussian decoder $p_\theta(x|z) = \mathcal{N}\left(x; \mu_\theta(z), \Sigma_\theta(z)\right)$, where $\mu_\theta$ and $\Sigma_\theta$ are parameterised by neural networks. In this case, the decoder is also powerful with respect to $p_{\mathcal{D}}(x)$, provided that the neural networks are capable of modelling the constant functions $\mu_\theta(z) = \mu_{\mathcal{D}}$ and $\Sigma_\theta(z) = \Sigma_{\mathcal{D}}$.
As a brief aside, suppose we use a Gaussian decoder, but with non-Gaussian $p_{\mathcal{D}}(x)$. The decoder can be made more expressive by adding more layers to the network, but it will not be possible to make the decoder powerful with respect to $p_{\mathcal{D}}(x)$ by only adding more and more layers; doing so would require conditional distributions more expressive than Gaussians.
It’s quite easy to prove, using the fact that the variational lower bound can never exceed the log-likelihood, that ‘ignoring the latent variable’ in VAEs with decoders that are powerful with respect to the data is actually optimal behaviour.
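Claim: Suppose that (i) there exists $\theta^*$ such that $p_{\theta^*}(x|z) = p_{\mathcal{D}}(x)$ for all $x$ and $z$, and (ii) there exists $\phi^*$ such that $q_{\phi^*}(z|x) = p(z)$ for all $x$ and $z$. Then $(\theta^*, \phi^*)$ is a globally optimal solution to the VAE objective.
Proof: If $p_{\theta^*}(x|z) = p_{\mathcal{D}}(x)$ for all $z$ then $p_{\theta^*}(x) = \int p_{\theta^*}(x|z)\, p(z)\, dz = p_{\mathcal{D}}(x)$, and thus by Bayes’ rule $p_{\theta^*}(z|x) = p(z) = q_{\phi^*}(z|x)$, and so the variational lower bound is tight. That is,
$$\mathbb{E}_{p_{\mathcal{D}}(x)}\left[\mathcal{L}(\theta^*, \phi^*; x)\right] = \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\log p_{\theta^*}(x)\right] = \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\log p_{\mathcal{D}}(x)\right].$$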
Thus the objective of the VAE is at its global optimum.
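The claim can be sanity-checked on the Gaussian example above. The sketch below assumes a one-dimensional Gaussian data distribution with made-up mean and variance (`mu_d`, `var_d`), a prior $N(0, 1)$, a linear Gaussian decoder $N(x; wz + b, \sigma^2)$ and a Gaussian encoder $N(z; m, s^2)$; these choices and all function names are hypothetical. Setting `w = 0` makes the decoder ignore $z$, and setting the encoder equal to the prior makes the KL term vanish, at which point the objective equals $\log p_{\mathcal{D}}(x)$ at every $x$.

```python
import numpy as np

mu_d, var_d = 1.5, 4.0  # hypothetical Gaussian data distribution N(1.5, 4.0)

def log_p_data(x):
    """Log-density of the (assumed) data distribution."""
    return -0.5 * (np.log(2 * np.pi * var_d) + (x - mu_d) ** 2 / var_d)

def elbo(x, w, b, sigma2, m, s2):
    """Closed-form ELBO: prior N(0,1), decoder N(x; w z + b, sigma2), encoder N(z; m, s2)."""
    recon = -0.5 * np.log(2 * np.pi * sigma2) \
        - ((x - b - w * m) ** 2 + w ** 2 * s2) / (2 * sigma2)
    kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return recon - kl

xs = np.linspace(-5.0, 8.0, 27)

# (theta*, phi*): decoder ignores z (w = 0), encoder equals the prior.
elbo_ignoring = elbo(xs, w=0.0, b=mu_d, sigma2=var_d, m=0.0, s2=1.0)

# Same encoder, but a decoder that does use z: the bound drops below log p_D(x).
elbo_other = elbo(xs, w=1.0, b=mu_d, sigma2=var_d, m=0.0, s2=1.0)

print("max |ELBO - log p_D| at (theta*, phi*):",
      np.max(np.abs(elbo_ignoring - log_p_data(xs))))
print("z-dependent decoder strictly worse:", np.all(elbo_other < log_p_data(xs)))
```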
What if $p_\theta(x) = p_{\mathcal{D}}(x)$ but $p_\theta(x|z)$ isn’t independent of $z$?
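If we have powerful decoders, it may well be that there is a setting of the parameters $\theta'$ such that $p_{\theta'}(x) = p_{\mathcal{D}}(x)$ and for which $p_{\theta'}(x|z)$ does actually depend on $z$. In this case, for any $\phi$ we have
$$\mathbb{E}_{p_{\mathcal{D}}(x)}\left[\mathcal{L}(\theta', \phi; x)\right] = \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\log p_{\mathcal{D}}(x)\right] - \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\mathrm{KL}\left[q_\phi(z|x) \,\|\, p_{\theta'}(z|x)\right]\right]$$
and so $(\theta', \phi)$ will be strictly worse than the global optimum for any $\phi$ for which $q_\phi(z|x) \neq p_{\theta'}(z|x)$. If $p_{\theta'}(x|z)$ depends on $z$, the posterior distribution $p_{\theta'}(z|x)$ is likely to be complex. Since $q_\phi(z|x)$ must by design be a reasonably simple family of distributions, it is unlikely that there exists a $\phi$ such that $q_\phi(z|x) = p_{\theta'}(z|x)$ for all $x$, and hence it is likely that for any $\phi$,
$$\mathbb{E}_{p_{\mathcal{D}}(x)}\left[\mathcal{L}(\theta', \phi; x)\right] < \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\mathcal{L}(\theta^*, \phi^*; x)\right],$$
which is to say that the solution $(\theta^*, \phi^*)$ will be preferred by the VAE over $(\theta', \phi)$.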
Put differently, and subject to some caveats about the richness of the family of distributions $q_\phi(z|x)$: if there is an optimal solution which ignores the latent code, it is probably the unique optimal solution.
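The caveat about the richness of the encoder family can be seen in a toy case. The sketch below is an illustration under assumptions not made in the post: data distributed as $N(0, 1)$, prior $N(0, 1)$, a decoder $N(x; az, 1 - a^2)$ whose marginal is exactly $N(0, 1)$ even though it depends on $z$, and an encoder $N(z; cx, s^2)$. The function name `expected_elbo` and the value `a = 0.8` are hypothetical. In this linear Gaussian setting the true posterior is itself Gaussian, so a sufficiently rich encoder ties with the latent-ignoring solution; once the encoder family is restricted (here, forbidden from depending on $x$), the latent-using solution is strictly worse.

```python
import numpy as np

def expected_elbo(w, sigma2, c, s2):
    """E_{x ~ N(0,1)}[ELBO] for prior N(0,1), decoder N(x; w z, sigma2), encoder N(z; c x, s2).

    Uses E[(x - w c x)^2] = (1 - w c)^2 and E[(c x)^2] = c^2 under x ~ N(0, 1).
    """
    recon = -0.5 * np.log(2 * np.pi * sigma2) - ((1 - w * c) ** 2 + w ** 2 * s2) / (2 * sigma2)
    kl = 0.5 * (s2 + c ** 2 - 1.0 - np.log(s2))
    return recon - kl

# Global optimum: E[log p_D(x)] for p_D = N(0, 1).
optimum = -0.5 * np.log(2 * np.pi) - 0.5

a = 0.8  # hypothetical strength of the decoder's dependence on z

# Solution that ignores the latent: decoder N(x; 0, 1), encoder equal to the prior.
ignoring = expected_elbo(w=0.0, sigma2=1.0, c=0.0, s2=1.0)

# Latent-using decoder with an encoder rich enough to equal the true posterior N(a x, 1 - a^2).
exact_posterior = expected_elbo(w=a, sigma2=1 - a ** 2, c=a, s2=1 - a ** 2)

# Same decoder, but a restricted encoder that cannot depend on x (c = 0); scan its variance.
s2_grid = np.linspace(0.05, 2.0, 400)
best_restricted = max(expected_elbo(a, 1 - a ** 2, 0.0, s2) for s2 in s2_grid)

print("optimum:", optimum)
print("ignoring latent:", ignoring)
print("using latent, exact posterior:", exact_posterior)
print("using latent, restricted encoder:", best_restricted)
```

The first three printed values coincide, while the restricted-encoder value is clearly lower: with a limited encoder family, only the solution that ignores the latent reaches the optimum.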
If you are still in the mindset that VAEs are autoencoders with objectives of the form ‘reconstruction + regulariser’, the above proof that ignoring the latent variable is optimal when using powerful decoders might be unsatisfying. But remember, VAEs are not autoencoders! They are first and foremost ways to train LVMs. The objective of the VAE is a lower bound on $\log p_\theta(x)$. The encoder is introduced only as a mathematical trick to get a lower bound of $\log p_\theta(x)$ that is computationally tractable. This bound is exact when the latent variables are ignored, so if it is possible to capture the data distribution (i.e. to achieve $p_\theta(x) = p_{\mathcal{D}}(x)$) without using the latent variables, this will be preferred by the VAE.
I’m grateful to Jamie Townsend and Diego Fioravanti for helpful discussions leading to the writing of this post, and to Sebastian Weichwald, Alessandro Ialongo, Niki Kilbertus and Mateo Rojas-Carulla for proofreading it.