All right, in this video we will see variational Bayes EM, and also summarize the methods that we have seen so far.

Let me remind you what the Expectation Maximization algorithm means. We try to maximize the marginal likelihood, or rather its logarithm: the logarithm of the probability of the data given the parameters, log p(X | Theta). We also derived the variational lower bound for it. It is an expectation over some distribution q(T) of the logarithm of the ratio between the joint distribution over the latent variables and the data given the parameters, and this distribution q(T), which is a distribution over the latent variables: L(q, Theta) = E over q(T) of log( p(X, T | Theta) / q(T) ). We maximize this variational lower bound in an iterative way: on the E-step we maximize the lower bound with respect to q, and on the M-step we maximize it with respect to Theta.

We also proved that on the E-step, maximizing the lower bound with respect to q is equivalent to minimizing the KL divergence between q and the posterior over the latent variables given the parameters and the data. On the M-step, we can instead maximize the expected value of the logarithm of the joint distribution; we can do this because the denominator inside the variational lower bound, q(T), does not depend on Theta, so we can drop it.

Let's look at the E-step more carefully. We minimize the KL divergence between the variational distribution q(T) and the posterior distribution over the latent variables given the data and the parameters, and we do this minimization with respect to q. In week two we showed that the minimum of this KL divergence is attained at the second distribution itself, that is, q(T) = p(T | X, Theta). However, for many models we cannot compute this posterior exactly. For example, in the next module we will see a model called Latent Dirichlet Allocation, and for it, it seems to be impossible to compute the full posterior exactly. So in this case we can try to use variational inference. The new E-step is as follows: we minimize the KL divergence within some variational family; for example, we can use the mean-field approximation here. This method, where on the E-step we perform variational inference, is called variational EM.
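To make the generic E-step and M-step concrete, here is a minimal sketch of exact EM for a one-dimensional Gaussian mixture. This is a hypothetical toy example (the model choice, the function name em_gmm_1d, and all numbers are illustrative assumptions, not the lecturer's code). The E-step computes the exact posterior over the latent cluster assignments, and the M-step maximizes the expected log joint in closed form; in variational EM, the E-step would instead fit an approximate, for example factorized, q.

import numpy as np

def em_gmm_1d(x, K=2, n_iter=50, seed=0):
    """Exact EM for a 1-D Gaussian mixture (toy sketch)."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    # Initialise parameters theta = (weights, means, variances).
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False)
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: q(t_i = k) = p(t_i = k | x_i, theta), the exact posterior
        # over the latent assignments (the "responsibilities").
        log_r = (np.log(pi)
                 - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)          # N x K responsibilities
        # M-step: maximise E_q[log p(X, T | theta)] in closed form.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

# Toy usage: two well-separated clusters.
x = np.concatenate([np.random.normal(-2, 0.5, 200),
                    np.random.normal(3, 1.0, 200)])
print(em_gmm_1d(x))

With this toy data, the recovered means should come out close to -2 and 3.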
All right, we have seen a lot of models, so let's summarize them. Here is our notation again: the data, which is known, we denote by X; the parameters Theta are unknown; and the latent variables T are also unknown.

There are two criteria here. The first is accuracy: the methods that are higher in the list are more accurate, and the methods that are lower are less accurate. The second, on the right, is speed: the first methods are quite slow, and the last ones are really fast. So you have to compromise between accuracy and speed.

The first approach is full inference: we try to find the full joint distribution over the latent variables and the parameters given the data, p(T, Theta | X). It is very accurate, since the inference is exact; however, for many models it is really slow.

Sometimes we can do mean field. In mean field, we approximate the posterior by a product of two distributions: a distribution over the latent variables and a distribution over the parameters, q(T) q(Theta) (a small sketch of such coordinate updates is given below).

We can also use the EM algorithm. In the EM algorithm, we find only a point estimate of the parameters: we still have a distribution over T, but Theta is set to its maximum a posteriori estimate.

However, it turns out that in many cases we still cannot use the EM algorithm, and then we can use variational EM. In variational EM we apply, for example, the mean-field approximation and factorize the distribution over the latent variables into a product over the individual dimensions. So we obtain a factorized distribution over the latent variables and a point estimate of the parameters.
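Both the mean-field and the variational EM rows of this ladder rely on the same trick: approximate an intractable posterior by a factorized q and update one factor at a time while holding the others fixed. Here is a minimal sketch of that coordinate-update pattern on the classic textbook toy problem of inferring the mean and precision of a one-dimensional Gaussian, where q(mu, tau) = q(mu) q(tau). The priors, the function name mean_field_gauss, and the numbers are illustrative assumptions, not course code.

import numpy as np

def mean_field_gauss(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Mean-field VI for a 1-D Gaussian with unknown mean mu and precision tau.
    Approximates p(mu, tau | x) by q(mu) q(tau), with q(mu) = N(mu_n, 1/lam_n)
    and q(tau) = Gamma(a_n, b_n), updating each factor in turn."""
    N, xbar = len(x), x.mean()
    a_n = a0 + 0.5 * (N + 1)          # shape is fixed by the conjugate update
    e_tau = a0 / b0                   # initial guess for E[tau]
    for _ in range(n_iter):
        # Update q(mu), using the current E[tau].
        mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_n = (lam0 + N) * e_tau
        # Update q(tau), using E_q(mu)[(x_i - mu)^2] = (x_i - mu_n)^2 + 1/lam_n.
        e_sq = np.sum((x - mu_n) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (e_sq + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

x = np.random.normal(1.5, 2.0, size=500)
mu_n, lam_n, a_n, b_n = mean_field_gauss(x)
print("E[mu] =", mu_n, " E[tau] =", a_n / b_n)   # true tau is 1 / 2^2 = 0.25

In the lecture's notation, the same pattern would factorize over the latent variables T and the parameters Theta instead of over mu and tau.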
Finally, if we cannot do anything else, we can use crisp EM. In crisp EM, we approximate both the latent variables and the parameters with point estimates, and we do it in an iterative way: we find the maximum-probability value of the latent variables, and on the next step we find the maximum-probability estimate of the parameters. You should be familiar with this kind of method: it is what is used for k-means clustering.

So, we have seen the methods, we have seen variational inference, and in the next module we will see an application of variational inference to Latent Dirichlet Allocation.
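As a final concrete illustration of the crisp EM pattern just described (a hypothetical toy sketch, not course code): in k-means the latent variables are the hard cluster assignments, the parameters are the centroids, and the algorithm alternates point estimates of the two.

import numpy as np

def kmeans(x, K=2, n_iter=20, seed=0):
    """Crisp ("hard") EM sketch: alternate point estimates of the latent
    assignments and of the parameters (the centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), K, replace=False)]
    for _ in range(n_iter):
        # Hard "E-step": assign each point to its nearest centroid.
        d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Hard "M-step": re-estimate each centroid as the mean of its points.
        centroids = np.array([x[assign == k].mean(axis=0) for k in range(K)])
    return assign, centroids

# Toy usage on two well-separated 2-D blobs.
x = np.vstack([np.random.normal(0.0, 1.0, size=(100, 2)),
               np.random.normal(5.0, 1.0, size=(100, 2))])
assign, centroids = kmeans(x, K=2)
print(centroids)   # should land near (0, 0) and (5, 5)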