Hi, welcome to week three. This time we will see an algorithm called variational inference. This is an algorithm for computing the posterior probability approximately. But first of all, let's see why we even care about computing an approximate posterior.

Here we see Bayes' formula, p*(z) = p(x|z) p(z) / p(x), which helps us compute the posterior over the latent variables z given the data x. We will denote this posterior distribution as p*(z). When the prior is conjugate to the likelihood, it is really easy to compute the posterior. However, in most other cases it is really hard.

One important case is variational autoencoders, which we will see in week five. In variational autoencoders, we model the likelihood with neural networks: it is a normal distribution over the data whose mean is some neural network mu(z) and whose variance is some other neural network sigma^2(z). In this case there is no conjugacy, and we can't compute the posterior using Bayes' formula.

But do we actually need the exact posterior? For example, here is some distribution, and it doesn't seem to belong to any known family of distributions. However, we could approximate it with a Gaussian, and for most practical purposes that would really be a good approximation. For example, it would match the mean, the variance and, approximately, the shape.

So throughout this week we'll see a method that helps us find the best approximation of the full posterior. It works as follows. First, we select some family of distributions Q, which we'll call the variational family. For example, this could be the family of normal distributions with arbitrary mean and a diagonal covariance matrix. What we do next is approximate the full posterior p*(z) with some variational distribution q(z), and we find the best-matching distribution using the KL divergence: we minimize the KL divergence between q and p* over the family of distributions Q.
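To make this concrete, here is a minimal sketch (my own illustration, not code from the course) of that procedure in one dimension: Q is the family of Gaussians N(mu, sigma^2), the target p*(z) is a made-up two-component mixture standing in for an intractable posterior, and we minimize KL(q || p*) numerically on a grid.

```python
# A minimal sketch, assuming a toy 1-D target density (not from the course).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

z = np.linspace(-10.0, 10.0, 2001)   # integration grid
dz = z[1] - z[0]

# Toy "true posterior" p*(z): an asymmetric two-component mixture,
# renormalized on the grid.
p_star = 0.7 * norm.pdf(z, loc=-1.0, scale=1.0) + 0.3 * norm.pdf(z, loc=2.0, scale=0.5)
p_star /= p_star.sum() * dz

def kl_q_pstar(params):
    """KL(q || p*) = integral of q(z) * log(q(z) / p*(z)) dz, on the grid."""
    mu, log_sigma = params
    q = norm.pdf(z, loc=mu, scale=np.exp(log_sigma))
    q = np.clip(q, 1e-300, None)     # avoid log(0) in the far tails
    return np.sum(q * (np.log(q) - np.log(p_star))) * dz

# Search the variational family Q for the member closest to p*.
res = minimize(kl_q_pstar, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(f"best Gaussian in Q: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

Two caveats: grid integration is only feasible in very low dimensions, and real implementations instead optimize the equivalent objective derived later in this video with stochastic gradients. Also, minimizing KL(q || p*) in this direction is mode-seeking, so with well-separated modes q typically locks onto one of them rather than averaging them.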
Depending on which Q we select, we can obtain different results. If Q is too small, the true posterior will not lie in it, and we'll end up with some distribution that does not match the full posterior; the gap between the full posterior and our approximation is exactly the KL divergence. If we select a Q large enough that it contains the true posterior, the approximation can match it exactly. However, for larger Qs, variational inference is harder to carry out. For example, if we select Q to be the family of all possible distributions, the only way to compute the posterior would be, for example, Bayes' formula, and we've already seen that that is hard.

There is one problem with this approach. As we'll see later, we'll have to evaluate p*(z) at some points. However, we can't evaluate it at even one point, because that would require computing the evidence p(x), which is sometimes really hard. Fortunately, there is a nice property of the KL divergence that we'll see now.

So here is our optimization objective: the KL divergence between our variational distribution and the normalized posterior, which we'll write as p̂(z)/Z, where p̂ is the unnormalized posterior and the normalization constant Z equals the evidence. By definition, the KL divergence is the integral of q(z) times the logarithm of the ratio between the first distribution and the second. Splitting that logarithm, we get two terms: the first is the KL divergence between the variational distribution and the unnormalized posterior, and the second is an integral involving log Z. We can take log Z out of that integral, and what is left is the integral of q(z), which equals one. So finally we have a KL divergence plus a constant, and since we are optimizing this objective, we can drop the constant: it does not depend on the variational distribution. And so here is our final objective, minimizing KL(q(z) || p̂(z)); the full chain of equalities is written out below. In the next video, we'll see a method called mean-field approximation.
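For reference, here is that rewriting in one compact chain (a sketch using the notation above, where p̂(z) = p(x|z) p(z) is the unnormalized posterior and Z = p(x) is the evidence):

```latex
% The derivation from this video written out.
% Notation: \hat{p}(z) = p(x \mid z)\,p(z), and Z = p(x) is the evidence.
\begin{aligned}
\mathrm{KL}\!\left(q(z)\,\middle\|\,\frac{\hat{p}(z)}{Z}\right)
  &= \int q(z)\,\log\frac{q(z)\,Z}{\hat{p}(z)}\,dz \\
  &= \int q(z)\,\log\frac{q(z)}{\hat{p}(z)}\,dz
     \;+\; \log Z \underbrace{\int q(z)\,dz}_{=\,1} \\
  &= \mathrm{KL}\big(q(z)\,\big\|\,\hat{p}(z)\big) \;+\; \log Z .
\end{aligned}
```

Since log Z does not depend on q, minimizing KL(q || p*) over q in Q is equivalent to minimizing KL(q || p̂), and the latter never requires computing the evidence p(x).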