Basic properties of the variational lower bound, a.k.a. ELBO (evidence lower bound).
Often in probabilistic modelling, we are interested in maximising the probability of some observed data given the model, by tuning the model parameters $\theta$ to maximise $\prod_{i=1}^N p_\theta(x_i)$, where $x_1, \ldots, x_N$ are the observed data. The fact that we are maximising the product of the $p_\theta(x_i)$ corresponds to an assumption that each $x_i$ is drawn i.i.d. from the true data distribution $p(x)$.
In practice, it's mathematically and computationally much more convenient to consider the logarithm of the product, so that our objective to maximise with respect to $\theta$ is:

$$\log \prod_{i=1}^N p_\theta(x_i) = \sum_{i=1}^N \log p_\theta(x_i).$$
In the rest of this post we'll simplify things by just considering $\log p_\theta(x)$ for a single data point $x$.
Latent Variable Models
In a Latent Variable Model (LVM), as is the case for Variational Autoencoders, our model distribution $p_\theta(x)$ is obtained by combining a simple distribution $p(z)$ over a latent variable $z$ with a parametrised family of conditional distributions $p_\theta(x|z)$, so that our objective can be written

$$\log p_\theta(x) = \log \int p_\theta(x|z)\, p(z)\, dz.$$
Although $p(z)$ and $p_\theta(x|z)$ will generally be simple by choice, it may be impossible to compute $\log p_\theta(x)$ analytically due to the need to solve the integral inside the logarithm. In many practical situations (e.g. anything involving neural networks), we'd not only like to be able to evaluate $\log p_\theta(x)$ but also differentiate it with respect to $\theta$ if we are to fit the model.
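To make the difficulty concrete, here is a minimal numerical sketch (a toy example with invented names and values, not part of the original argument): with a one-dimensional latent and a nonlinear decoder, the integral has no closed form and has to be approximated on a grid, an approach that breaks down as soon as $z$ has more than a handful of dimensions.

```python
# Toy illustration: approximate p_theta(x) = \int p_theta(x|z) p(z) dz on a grid.
# The decoder and all parameter values below are invented for illustration.
import numpy as np
from scipy.stats import norm

def decoder_mean(z, theta=2.0):
    # Hypothetical decoder mean f_theta(z); any nonlinearity makes the integral intractable.
    return np.tanh(theta * z)

def integrand(z, x, sigma=0.25):
    # p_theta(x | z) * p(z), with a Gaussian decoder and a standard normal prior
    return norm.pdf(x, loc=decoder_mean(z), scale=sigma) * norm.pdf(z)

x_obs = 0.5
zs = np.linspace(-6.0, 6.0, 5001)                       # fine grid over the latent space
log_px = np.log(np.sum(integrand(zs, x_obs)) * (zs[1] - zs[0]))
print(log_px)
```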
Variational Inference
The magic of variational inference hinges on the following two key observations.
First, we can choose any distribution $q(z)$, multiply the inside of the integral by $\frac{q(z)}{q(z)} = 1$ and rearrange without changing its value. (This has a strong connection to Importance Sampling, see below.) Thus we can rewrite our objective as

$$\log p_\theta(x) = \log \int q(z)\, \frac{p_\theta(x|z)\, p(z)}{q(z)}\, dz.$$
Second, since $\log$ is concave and the integral can be written as an expectation, we can use Jensen's inequality to swap the $\log$ and the $\int$. This results in a (variational) lower bound consisting of terms we can evaluate, provided we have chosen $p(z)$, $p_\theta(x|z)$ and $q(z)$ suitably:

$$\log p_\theta(x) = \log \int q(z)\, \frac{p_\theta(x|z)\, p(z)}{q(z)}\, dz \;\geq\; \int q(z) \log \frac{p_\theta(x|z)\, p(z)}{q(z)}\, dz \;=:\; \mathcal{L}(q, \theta; x).$$
Recall that the above inequality holds for any $q$. Since we are probably interested in fitting the model to multiple data points, we can substitute $q(z)$ with a conditional distribution $q_\phi(z|x)$, depending on $x$ and a parameter $\phi$. This is the notation you'll often see in the literature (e.g. the original VAE paper, equation (3)):

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - \mathrm{KL}\!\left(q_\phi(z|x)\,\|\,p(z)\right).$$
Note that the terms variational lower bound, evidence lower bound and ELBO are used interchangeably in the literature.
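As a concrete illustration of how this form of the bound is typically estimated in practice, here is a small sketch (with invented toy parameter values, a Gaussian prior, a Gaussian "encoder" $q_\phi(z|x)$ and a Gaussian decoder, not the VAE paper's code): sample once from $q_\phi(z|x)$, evaluate the reconstruction term, and subtract the closed-form KL between two Gaussians.

```python
# Single-sample Monte Carlo estimate of the ELBO for a toy Gaussian model.
# All parameter values below are invented for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy "encoder" output q_phi(z|x) = N(mu_q, s_q^2) and decoder p_theta(x|z) = N(w*z + b, sigma^2)
mu_q, s_q = 0.3, 0.8
w, b, sigma = 1.5, 0.0, 0.5
x = 1.0

z = rng.normal(mu_q, s_q)                                    # z ~ q_phi(z|x)
log_px_given_z = norm.logpdf(x, loc=w * z + b, scale=sigma)  # reconstruction term
kl = np.log(1.0 / s_q) + (s_q**2 + mu_q**2 - 1.0) / 2.0      # KL(N(mu_q, s_q^2) || N(0, 1)), closed form
elbo_estimate = log_px_given_z - kl
print(elbo_estimate)
```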
How tight is the variational lower bound?
By properties of the logarithm and one application of Bayes' rule, it's straightforward to calculate the tightness of this bound:

$$\log p_\theta(x) - \mathcal{L}(q, \theta; x) = \int q(z) \left[ \log p_\theta(x) - \log \frac{p_\theta(x|z)\, p(z)}{q(z)} \right] dz = \int q(z) \log \frac{q(z)}{p_\theta(z|x)}\, dz = \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z|x)\right),$$

where the second equality uses Bayes' rule, $p_\theta(z|x) = p_\theta(x|z)\, p(z) / p_\theta(x)$.
Summary
Writing the above equations in a slightly compressed form, we have

$$\log p_\theta(x) = \mathcal{L}(q, \theta; x) + \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z|x)\right).$$
To repeat this in words: the Jensen gap of the variational lower bound is the KL divergence between $q(z)$ and the true posterior $p_\theta(z|x)$. For a fixed $\theta$, maximising $\mathcal{L}(q, \theta; x)$ with respect to $q$ is equivalent to minimising $\mathrm{KL}(q(z)\,\|\,p_\theta(z|x))$. This is why $q$ is often called the approximate posterior.
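This identity is easy to check numerically in a model where everything is Gaussian and therefore tractable. The following sketch uses a toy linear-Gaussian model with invented parameter values (not anything from the post): it computes $\log p_\theta(x)$, the ELBO for an arbitrary Gaussian $q$, and the KL to the true posterior in closed form, and confirms that the two sides of the identity agree.

```python
# Numerical check of: log p_theta(x) = ELBO + KL(q || p_theta(z|x))
# in a toy linear-Gaussian model: z ~ N(0,1), x|z ~ N(w*z, sigma^2).
import numpy as np
from scipy.stats import norm

w, sigma = 1.5, 0.5
x = 1.2

# Exact marginal and posterior (standard Gaussian conditioning)
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(w**2 + sigma**2))
post_var = sigma**2 / (w**2 + sigma**2)
post_mean = w * x / (w**2 + sigma**2)

# Some arbitrary Gaussian q(z) = N(mu_q, s_q^2)
mu_q, s_q = 0.1, 1.3

def kl_gauss(m1, s1, m2, s2):
    # KL(N(m1, s1^2) || N(m2, s2^2))
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# ELBO = E_q[log p(x|z)] - KL(q || p(z)); the expectation is available in closed form
# because log p(x|z) is quadratic in z: E_q[(x - w z)^2] = (x - w mu_q)^2 + w^2 s_q^2.
expected_log_lik = (-0.5 * np.log(2 * np.pi * sigma**2)
                    - ((x - w * mu_q)**2 + w**2 * s_q**2) / (2 * sigma**2))
elbo = expected_log_lik - kl_gauss(mu_q, s_q, 0.0, 1.0)

gap = kl_gauss(mu_q, s_q, post_mean, np.sqrt(post_var))
print(log_px, elbo + gap)   # the two printed numbers should match
```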
Bonus material: connection to Importance Sampling
In principle, you could think about trying to numerically approximate the integral $p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$ by Monte Carlo sampling: draw a bunch of samples $z_1, \ldots, z_S \sim p(z)$ and estimate the integral as

$$p_\theta(x) \approx \frac{1}{S} \sum_{s=1}^S p_\theta(x|z_s).$$
Of course, this probably wouldn’t help for fitting the model, as performing Monte Carlo integration as part of an inner optimization loop would be painfully slow. But there’s a second reason that this is a sub-optimal course of action.
$p_\theta(x|z)$ is the probability of the particular data point $x$ given $z$. Let's suppose that for each $z$, $p_\theta(x|z)$ only puts a significant amount of probability mass on a small set of $x$, and that this set differs as we vary $z$. (Note: this will be the case with Gaussian decoders with concentrated covariances for most non-trivial datasets.) Then for a fixed $x$, $p_\theta(x|z)$ will be very small for most values of $z$ and massive for a tiny set of values. In other words, our estimator will have extremely large variance.
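Here is a small simulation of that failure mode, assuming the same kind of toy linear-Gaussian model as in the earlier sketches (all values invented): with a concentrated decoder, most prior samples contribute essentially nothing, while a rare few dominate the estimate, so the per-sample weights have a large spread relative to the quantity being estimated.

```python
# Naive Monte Carlo estimate of p_theta(x) with samples from the prior,
# in a toy linear-Gaussian model with a concentrated (small sigma) decoder.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
w, sigma, x = 1.5, 0.05, 1.2                          # small sigma: concentrated decoder
true_px = norm.pdf(x, loc=0.0, scale=np.sqrt(w**2 + sigma**2))

z = rng.standard_normal(10_000)                       # z_s ~ p(z)
weights = norm.pdf(x, loc=w * z, scale=sigma)         # p_theta(x | z_s)
print("true:", true_px, "estimate:", weights.mean(),
      "per-sample relative std:", weights.std() / true_px)
```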
We can improve things by using a trick called Importance Sampling, which really amounts to the observation that for any distribution $q(z)$, multiplying the integrand by $\frac{q(z)}{q(z)} = 1$ and rearranging doesn't change the value of the integral:

$$p_\theta(x) = \int q(z)\, \frac{p_\theta(x|z)\, p(z)}{q(z)}\, dz \approx \frac{1}{S} \sum_{s=1}^S \frac{p_\theta(x|z_s)\, p(z_s)}{q(z_s)}, \qquad z_s \sim q(z).$$
The idea here is that if $q(z)$ is chosen to put more mass on values of $z$ for which $p_\theta(x|z)$ is large, the importance sampling estimator will have lower variance than the naive one. In fact, if we could choose $q(z) = p_\theta(z|x)$, the posterior distribution over $z$ given $x$, our estimator would have variance zero! This means it would be possible to perfectly estimate the integral with only one sample. To see this, observe that by Bayes' rule,

$$\frac{p_\theta(x|z)\, p(z)}{p_\theta(z|x)} = p_\theta(x) \quad \text{for all } z.$$
So regardless of which $z$ we would draw, our one-sample Monte Carlo estimator would give the correct answer. Unfortunately, calculating $p_\theta(z|x)$ itself requires knowing the value of $p_\theta(x)$, so this insight doesn't give us a trick to quickly calculate $p_\theta(x)$! It does, however, give us a connection to the Jensen gap of the variational bound. Since $\frac{p_\theta(x|z)\, p(z)}{p_\theta(z|x)} = p_\theta(x)$ is constant in $z$, $\log p_\theta(x)$ is the expectation of a constant function and thus

$$\log p_\theta(x) = \int p_\theta(z|x) \log \frac{p_\theta(x|z)\, p(z)}{p_\theta(z|x)}\, dz.$$
The right hand side is the variational lower bound $\mathcal{L}(q, \theta; x)$ with $q(z) = p_\theta(z|x)$. This equation says that the bound is tight when the approximate posterior is equal to the true posterior, which we already learned above.
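To close the loop, here is the zero-variance claim checked numerically in the same toy linear-Gaussian setting (again with invented values): every importance weight computed under the exact posterior equals $p_\theta(x)$, so a single sample already gives the right answer.

```python
# Importance sampling with q equal to the exact posterior: every weight
# p_theta(x|z) p(z) / q(z) equals p_theta(x) in this toy linear-Gaussian model.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
w, sigma, x = 1.5, 0.05, 1.2
true_px = norm.pdf(x, loc=0.0, scale=np.sqrt(w**2 + sigma**2))

post_var = sigma**2 / (w**2 + sigma**2)
post_mean = w * x / (w**2 + sigma**2)

z = rng.normal(post_mean, np.sqrt(post_var), size=5)      # z_s ~ p_theta(z|x)
weights = (norm.pdf(x, loc=w * z, scale=sigma) * norm.pdf(z)
           / norm.pdf(z, loc=post_mean, scale=np.sqrt(post_var)))
print("true:", true_px)
print("weights:", weights)   # all (numerically) identical to true_px
```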