Basic properties of the variational lower bound, a.k.a. ELBO (evidence lower bound).
Often in probabilistic modelling, we are interested in maximising the probability of some observed data given the model, by tuning the model parameters $\theta$ to maximise

$$\prod_{n=1}^{N} p_\theta(x_n),$$

where $x_1, \dots, x_N$ are the observed data. The fact that we are maximising the product of the $p_\theta(x_n)$ corresponds to an assumption that each $x_n$ is drawn i.i.d. from the true data distribution $p^*(x)$.
In practice, it’s mathematically and computationally much more convenient to consider the logarithm of the product, so that our objective to maximise with respect to $\theta$ is:

$$\sum_{n=1}^{N} \log p_\theta(x_n).$$

In the rest of this post we’ll simplify things by just considering $\log p_\theta(x)$ for a single data point $x$.
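As a minimal illustrative sketch of this objective, the snippet below models the data as a unit-variance Gaussian and evaluates $\sum_n \log p_\theta(x_n)$ for two candidate parameter values. The data, the model family and the parameter values are all assumptions made just for this example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: synthetic "observed" data and the model family N(mu, 1),
# chosen only to illustrate the log-likelihood objective.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=100)            # x_1, ..., x_N

def log_likelihood(mu, x):
    # sum_n log p_theta(x_n), with theta = mu
    return np.sum(norm.logpdf(x, loc=mu, scale=1.0))

print(log_likelihood(0.0, data))   # poorer fit, lower log-likelihood
print(log_likelihood(2.0, data))   # better fit, higher log-likelihood
```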
Latent Variable Models
In a Latent Variable Model (LVM), as is the case for Variational Autoencoders, our model distribution $p_\theta(x)$ is obtained by combining a simple distribution $p(z)$ over a latent variable $z$ with a parametrised family of conditional distributions $p_\theta(x|z)$, so that our objective can be written

$$\log p_\theta(x) = \log \int p_\theta(x|z)\, p(z)\, dz.$$
Although $p(z)$ and $p_\theta(x|z)$ will generally be simple by choice, it may be impossible to compute $\log p_\theta(x)$ analytically due to the need to solve the integral inside the logarithm. In many practical situations (e.g. anything involving neural networks), we’d not only like to be able to evaluate $\log p_\theta(x)$ but also differentiate it with respect to $\theta$ if we are to fit the model.
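To make the setup concrete, here is a minimal sketch of such an LVM, with a standard normal prior and a small nonlinear "decoder" standing in for a neural network; the functions and parameter values are hypothetical. With a one-dimensional latent we can still brute-force the integral numerically, but that stops being possible (and differentiable at scale) once $z$ has more than a few dimensions.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical LVM:  p(z) = N(0, 1),  p_theta(x|z) = N(decoder_theta(z), 0.1)
def decoder(z, theta=(1.5, 0.5)):
    w, b = theta
    return np.tanh(w * z + b)        # stand-in for a neural network

def log_p_x(x, n_grid=10_000):
    # log p_theta(x) = log \int p_theta(x|z) p(z) dz, approximated by a
    # Riemann sum over a grid -- only feasible because z is one-dimensional.
    z = np.linspace(-8.0, 8.0, n_grid)
    integrand = norm.pdf(x, loc=decoder(z), scale=np.sqrt(0.1)) * norm.pdf(z)
    return np.log(np.sum(integrand) * (z[1] - z[0]))

print(log_p_x(0.3))
```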
Variational Inference
The magic of variational inference hinges on the following two key observations.
First, we can choose any distribution $q(z)$, multiply the inside of the integral by $\frac{q(z)}{q(z)} = 1$ and rearrange without changing its value. (This has a strong connection to Importance Sampling, see below.) Thus we can rewrite our objective as

$$\log p_\theta(x) = \log \int q(z)\, \frac{p_\theta(x|z)\, p(z)}{q(z)}\, dz = \log \mathbb{E}_{q(z)}\!\left[\frac{p_\theta(x|z)\, p(z)}{q(z)}\right].$$
Second, since $\log$ is concave and the integral can be written as an expectation, we can use Jensen’s inequality to swap the $\log$ and the $\mathbb{E}_{q(z)}$. This results in a (variational) lower bound consisting of terms we can evaluate, provided we have chosen $q(z)$, $p(z)$ and $p_\theta(x|z)$ suitably:

$$\log p_\theta(x) \geq \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{q(z)}\right].$$
Recall that the above inequality holds for any $q(z)$. Since we are probably interested in fitting the model to multiple data points, we can substitute $q(z)$ with a conditional distribution $q_\phi(z|x)$, depending on $x$ and a parameter $\phi$. This is the notation you’ll often see in the literature (e.g. the original VAE paper, equation (3)):

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - \mathrm{KL}\!\left(q_\phi(z|x)\,\|\,p(z)\right).$$
Note that the terms variational lower bound, evidence lower bound and ELBO are used interchangeably in the literature.
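Putting the bound into code: the sketch below draws samples from a Gaussian $q_\phi(z|x)$ and averages $\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)$, a standard Monte Carlo estimate of the ELBO. The model is a toy linear-Gaussian one, assumed here only so that the exact $\log p_\theta(x)$ is available for comparison, and the variational parameters are picked arbitrarily.

```python
import numpy as np
from scipy.stats import norm

# Toy model (assumed for illustration): p(z) = N(0, 1), p_theta(x|z) = N(z, sigma2)
# Approximate posterior:                q_phi(z|x) = N(a*x + b, s2)
sigma2 = 0.5
a, b, s2 = 0.6, 0.0, 0.4                  # arbitrary variational parameters phi

def elbo_estimate(x, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    m = a * x + b
    z = rng.normal(m, np.sqrt(s2), size=n_samples)          # z_k ~ q_phi(z|x)
    log_p_x_given_z = norm.logpdf(x, loc=z, scale=np.sqrt(sigma2))
    log_p_z = norm.logpdf(z, loc=0.0, scale=1.0)
    log_q_z = norm.logpdf(z, loc=m, scale=np.sqrt(s2))
    return np.mean(log_p_x_given_z + log_p_z - log_q_z)

x = 1.3
print("ELBO estimate:", elbo_estimate(x))
# In this linear-Gaussian toy model the marginal is exact: x ~ N(0, 1 + sigma2)
print("log p(x):     ", norm.logpdf(x, loc=0.0, scale=np.sqrt(1.0 + sigma2)))
```

With enough samples the estimate settles just below the exact $\log p_\theta(x)$, as the bound requires; the remaining gap is the KL divergence discussed in the next section.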
How tight is the variational lower bound?
By properties of the logarithm and one application of Bayes’ rule, it’s straightforward to calculate the tightness of this bound:

$$\log p_\theta(x) - \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{q(z)}\right] = \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x)\, q(z)}{p_\theta(x|z)\, p(z)}\right] = \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p_\theta(z|x)}\right] = \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z|x)\right),$$

where the first equality uses the fact that $\log p_\theta(x)$ does not depend on $z$, and the second uses Bayes’ rule, $p_\theta(z|x) = p_\theta(x|z)\, p(z) / p_\theta(x)$.
Summary
Writing the above equations in a slightly compressed form, we have

$$\log p_\theta(x) = \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{q(z)}\right] + \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z|x)\right).$$

To repeat this in words: the Jensen gap of the variational lower bound is the KL divergence between $q(z)$ and the true posterior $p_\theta(z|x)$. For a fixed $\theta$, maximising the variational lower bound with respect to $q$ is equivalent to minimising $\mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z|x)\right)$. This is why $q$ is often called the approximate posterior.
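In the toy linear-Gaussian model used above everything is conjugate, so this decomposition can be checked in closed form. The sketch below, using the same assumed parameter values, computes $\log p_\theta(x)$, the ELBO and $\mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z|x)\right)$ analytically and confirms that the first equals the sum of the other two.

```python
import numpy as np
from scipy.stats import norm

# Same assumed toy model: p(z) = N(0, 1), p_theta(x|z) = N(z, sigma2), q(z) = N(m, s2)
sigma2, x = 0.5, 1.3
m, s2 = 0.6 * x, 0.4

# Exact log-evidence: x ~ N(0, 1 + sigma2)
log_p_x = norm.logpdf(x, loc=0.0, scale=np.sqrt(1.0 + sigma2))

# Analytic ELBO = E_q[log p(x|z)] - KL(q || p(z))
expected_log_lik = -0.5 * np.log(2 * np.pi * sigma2) - ((x - m) ** 2 + s2) / (2 * sigma2)
kl_q_prior = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
elbo = expected_log_lik - kl_q_prior

# True posterior p(z|x) = N(x / (1 + sigma2), sigma2 / (1 + sigma2))
post_mean, post_var = x / (1.0 + sigma2), sigma2 / (1.0 + sigma2)
kl_q_post = (np.log(np.sqrt(post_var / s2))
             + (s2 + (m - post_mean) ** 2) / (2 * post_var) - 0.5)

print(log_p_x, elbo + kl_q_post)     # identical up to floating point error
```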
Bonus material: connection to Importance Sampling
In principle, you could think about trying to numerically approximate the integral $p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$ by Monte Carlo sampling: draw a bunch of samples $z_1, \dots, z_K \sim p(z)$ and estimate the integral as

$$p_\theta(x) \approx \frac{1}{K} \sum_{k=1}^{K} p_\theta(x|z_k).$$
Of course, this probably wouldn’t help for fitting the model, as performing Monte Carlo integration as part of an inner optimisation loop would be painfully slow. But there’s a second reason that this is a sub-optimal course of action.
$p_\theta(x|z)$ is the probability of the particular data point $x$ given $z$. Let’s suppose that for each $z$, $p_\theta(x|z)$ only puts a significant amount of probability mass on a small set of $x$, and that this set differs as we vary $z$. (Note: this will be the case with Gaussian decoders with concentrated covariances for most non-trivial datasets.) Then for a fixed $x$, $p_\theta(x|z)$ will be very small for most values of $z$ and massive for a tiny set of values. In other words, our estimator $\frac{1}{K} \sum_{k=1}^{K} p_\theta(x|z_k)$ will have extremely large variance.
We can improve things by using a trick called Importance Sampling, which really amounts to the observation that for any distribution $q(z)$, multiplying the integrand by $\frac{q(z)}{q(z)} = 1$ and rearranging doesn’t change the value of the integral:

$$p_\theta(x) = \int q(z)\, \frac{p_\theta(x|z)\, p(z)}{q(z)}\, dz \approx \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x|z_k)\, p(z_k)}{q(z_k)}, \qquad z_k \sim q(z).$$
The idea here is that if $q(z)$ is chosen to put more mass on values of $z$ for which $p_\theta(x|z)$ is large, the importance sampling estimator will have lower variance than the naive one. In fact, if we could choose $q(z) = p_\theta(z|x)$, the posterior distribution over $z$, our estimator would have variance zero! This means it would be possible to perfectly estimate the integral with only one sample. To see this, observe that by Bayes’ rule,

$$\frac{p_\theta(x|z)\, p(z)}{p_\theta(z|x)} = p_\theta(x) \qquad \text{for every } z.$$
So regardless of which $z$ we would draw, our one-sample Monte Carlo estimator would give the correct answer. Unfortunately, calculating $p_\theta(z|x)$ itself requires knowing the value of $p_\theta(x)$, so this insight doesn’t give us a trick to quickly calculate $p_\theta(x)$! It does, however, give us a connection to the Jensen gap of the variational bound. Since $\frac{p_\theta(x|z)\, p(z)}{p_\theta(z|x)} = p_\theta(x)$ is constant in $z$, $\mathbb{E}_{p_\theta(z|x)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{p_\theta(z|x)}\right]$ is the expectation of a constant function and thus

$$\log p_\theta(x) = \log \mathbb{E}_{p_\theta(z|x)}\!\left[\frac{p_\theta(x|z)\, p(z)}{p_\theta(z|x)}\right] = \mathbb{E}_{p_\theta(z|x)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{p_\theta(z|x)}\right].$$

The right hand side is the variational lower bound with $q(z) = p_\theta(z|x)$. This equation says that the bound is tight when the approximate posterior is equal to the true posterior, which we already learned above.
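To close the loop numerically, the sketch below reuses a variant of the same assumed linear-Gaussian toy model (where the true posterior happens to be available in closed form), with a deliberately concentrated decoder variance to mimic the high-variance regime described above. It compares the naive prior-sampling estimator of $p_\theta(x)$ with the importance sampling estimator that uses the posterior as its proposal: every importance weight equals $p_\theta(x)$ exactly, so a single sample already gives the right answer, while the naive estimates have a much larger spread.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: p(z) = N(0, 1), p_theta(x|z) = N(z, sigma2)
# sigma2 is small so that p_theta(x|z) is concentrated, as in the discussion above.
sigma2, x = 0.05, 1.3
rng = np.random.default_rng(0)

# Naive Monte Carlo: z_k ~ p(z), average p(x|z_k)
z = rng.normal(0.0, 1.0, size=1000)
naive = norm.pdf(x, loc=z, scale=np.sqrt(sigma2))

# Importance sampling with the true posterior as proposal:
# p(z|x) = N(x / (1 + sigma2), sigma2 / (1 + sigma2))
post_mean, post_std = x / (1.0 + sigma2), np.sqrt(sigma2 / (1.0 + sigma2))
z_post = rng.normal(post_mean, post_std, size=1000)
weights = (norm.pdf(x, loc=z_post, scale=np.sqrt(sigma2)) * norm.pdf(z_post)
           / norm.pdf(z_post, loc=post_mean, scale=post_std))

print("true p(x):              ", norm.pdf(x, loc=0.0, scale=np.sqrt(1.0 + sigma2)))
print("naive mean / std:       ", naive.mean(), naive.std())
print("posterior-IS mean / std:", weights.mean(), weights.std())   # std ~ 0
```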