In the previous video, we decided to use a Gaussian mixture model to fit our dataset and solve the clustering problem. But how? How can we do better than general stochastic gradient descent? In this video, we will discuss some intuition that leads to the expectation maximization algorithm for this particular case.

Recall that the density of each data point in our Gaussian mixture model is a weighted sum of three or, in general, as many as you want, Gaussian distributions. To start with, we will need to introduce a latent variable here, because it will make reasoning about our model much easier. How can we introduce a latent variable here? Well, we can do something like this. We can say that each data point was generated using some information from a latent variable T. It's like we have one latent variable T for each data point X, and it causes X; it explains something about X. The reasonable thing to assume about T here is that it takes three values, 1, 2, or 3, and it tells us from which Gaussian this particular data point came.

We don't actually know, for any data point, from which Gaussian it came. It's a latent variable, right? We never observe it, neither in training nor in testing, but it can still be helpful: if we knew the latent variables, they would tell us something useful about our model. Later, when we fit the Gaussian mixture model to our data, we can find the distribution of the latent variable given the data. We can ask the model a question: what do you think is the value of the latent variable T for this particular data point? The answer to this question is basically the clustering, right? It gives us the beliefs about which Gaussian this data point came from.

If we decide to use this kind of latent variable model, then it is reasonable to assume that the latent variable T has a prior distribution pi, which is exactly the weights of our Gaussians. So the latent variable T equals some cluster number, for example 1, with probability pi_1. And the likelihood, the density of the data point X given that we know from which Gaussian it came, is just a Gaussian distribution, because it came from that Gaussian.
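Here is a minimal one-dimensional sketch of this latent-variable view, with prior weights pi over T and a Gaussian likelihood per component. The weights, means, and standard deviations below are made up purely for illustration, not values from the course:

```python
# A minimal 1-D sketch of the latent-variable view of a mixture:
# p(t = c) = pi_c (prior), p(x | t = c) = N(x | mu_c, sigma_c) (likelihood),
# and the mixture density p(x | theta) = sum_c pi_c * N(x | mu_c, sigma_c).
# All parameter values here are arbitrary, chosen only for illustration.
import numpy as np
from scipy.stats import norm

pi = np.array([0.5, 0.3, 0.2])      # prior over the latent variable t (mixture weights)
mu = np.array([-2.0, 0.0, 3.0])     # means of the three Gaussians
sigma = np.array([0.5, 1.0, 0.8])   # standard deviations of the three Gaussians

def mixture_density(x):
    # weighted sum of the three Gaussian densities at point x
    return sum(pi[c] * norm.pdf(x, mu[c], sigma[c]) for c in range(3))

print(mixture_density(1.3))         # density of one arbitrary test point
```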
So for each data point, if we know that it came from Gaussian number c, its density is just the density of Gaussian number c with that Gaussian's parameters.

Now that we have introduced this kind of latent variable model, let's look at what P of X given theta represents in it. We changed our model, so we have to check that it still gives us the same results as the original one. If we write down P of X given theta, given the parameters, in this latent variable model, we get the following. It's just the sum rule of probability: P of X given theta equals a sum with respect to T, so we are marginalizing T over all its possible values from 1 to 3, and we are summing the joint distribution of X and T, which equals the likelihood times the prior. If you substitute the definitions of the prior and the likelihood into this summation, you can see that the last formula on this slide is exactly equal to the first one. We introduced the latent variable in such a way that when we marginalize it out, we get exactly what we had before. This means we can now use the latent variable model, train it on the observed data X, and it will give us exactly the same results as we would get from the original model.

Let's now try to build some intuition about how to train this latent variable model. Say we have a dataset, and say it's one-dimensional: each data point from 1 to N is just one number, for simplicity of illustration. Recall that our goal is maximum likelihood estimation, that is, finding the parameters, right? So how can we find the parameters? Well, it turns out that if we know the sources, if we know the values of the latent variables for each data point, then finding the parameters theta is easy. We can say that the blue data points are the points which we know came from the first Gaussian, look at them separately from the other, orange data points, and just fit one Gaussian to them, which we already know how to do, because it's just fitting a dataset of blue points to a Gaussian distribution.
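To make the "known sources" case concrete, here is a tiny sketch of fitting the blue Gaussian once we know exactly which points are blue; this is just the usual maximum likelihood estimate from week one. The data points and the total dataset size below are hypothetical:

```python
# Fitting one Gaussian to the points whose source we (hypothetically) know.
# The numbers are made up; only the formulas matter.
import numpy as np

blue_points = np.array([-2.3, -1.9, -2.1, -2.6, -1.7])  # points known to come from Gaussian 1
N_total = 10                                             # hypothetical size of the full dataset

mu_blue = blue_points.mean()                # ML estimate of the blue mean
sigma_blue = blue_points.std()              # ML estimate of the blue standard deviation
pi_blue = len(blue_points) / N_total        # weight = fraction of points assigned to blue
```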
You can see the formulas at the bottom of the slide, but this is something we already did in week one. So if we know the sources, the values of the latent variables, it's easy to estimate the parameters theta.

Actually, even if we don't have these hard assignments, but rather soft assignments, some posterior distribution over T, the same idea works. This means that for each data point we don't assume it belongs to just one cluster, but rather that it belongs to all clusters, all Gaussians, simultaneously, with probabilities given by the posterior P of T given X and the parameters. If we have these probabilities, it is also easy to estimate the parameters. Instead of summing over just the blue points and averaging their positions to get the location of the blue Gaussian, we sum over all the points, but with weights; a small sketch of this weighted averaging appears below.

If the posterior P of T = 1 is 1 for some data point, it means this particular data point is completely blue: it certainly belongs to the blue Gaussian, and it is used just like a blue data point in the averaging, with weight 1. If for some data point P of T = 1 is 0, it means the data point is certainly orange, and we simply don't use it at all when computing the mean of the blue Gaussian. But if for some data point P of T = 1 is, for example, 0.8, it means the data point is not certain: it thinks it belongs to the blue Gaussian, but it's not sure. It will strongly affect the position of the blue Gaussian and affect the position of the orange Gaussian a little bit. We will derive these formulas later from more rigorous considerations, but for now, just believe me that if you know the posterior probabilities over T, you can easily estimate the parameters this way.

The bottom line here is that if we know the sources, whether as soft assignments or hard assignments, then we can easily estimate the parameters of our Gaussian mixture model. But in practice we don't, right? We don't know the sources, so how can we estimate the sources?
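Before turning to that question, here is the promised sketch of the weighted averaging for one Gaussian. Both the data points and the posterior weights below are made up purely for illustration:

```python
# Re-estimating one Gaussian's parameters from soft assignments.
# q[i] plays the role of the posterior p(t_i = blue | x_i, theta); values are invented.
import numpy as np

x = np.array([-2.1, -1.8, -2.4, 0.9, 1.2, 1.5])      # 1-D data points
q = np.array([0.99, 0.95, 0.97, 0.20, 0.05, 0.80])   # weight of each point for the "blue" Gaussian

# Weighted analogues of the usual mean / variance formulas:
mu_blue = np.sum(q * x) / np.sum(q)                          # weighted average position
var_blue = np.sum(q * (x - mu_blue) ** 2) / np.sum(q)        # weighted variance
pi_blue = np.mean(q)                                         # fraction of "blue" mass in the data
```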
Well, it turns out that if we know the parameters, the Gaussians, their locations and their variances, then we can easily estimate the sources, because we can just use Bayes' rule. By Bayes' rule, the soft assignment, the posterior probability that T equals, for example, the blue Gaussian for some particular data point, is proportional to the joint probability, which is the likelihood times the prior. And if we know the parameters theta, we can easily compute this likelihood and prior at any given point, so we can easily compute the posterior probability, which is basically the sources, the soft assignments of each data point to the clusters. You might think that the normalization constant here could be problematic, but it turns out we have just two values here, two probabilities: the distribution P of T given the data and theta can take only two possible values in this example. We can explicitly normalize it by summing the two unnormalized probabilities; it's no big deal.

To summarize, we have kind of a chicken and egg problem. If we know the Gaussian parameters, we can easily estimate the sources. If we know the sources, we can easily estimate the Gaussian parameters. What can we do in this case? Well, the expectation maximization algorithm, in this particular case, suggests a very natural thing. First of all, let's initialize the parameters somehow, for example randomly, and then, in a loop, repeat two steps until convergence. First, fix the parameters, assume that they are the true ones, and estimate the sources. Then, on the next sub-step, use these sources to re-estimate the parameters, to update our beliefs about them. This way, after repeating these two steps for long enough, we will hopefully obtain a reasonable solution to our distribution fitting problem: we will be able to fit a Gaussian mixture model to our data.
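Putting the two steps together, here is a rough sketch of the loop for a one-dimensional, two-component mixture. The synthetic data, the initialization, and the fixed number of iterations are all arbitrary choices for illustration, not a prescription from the course:

```python
# A rough sketch of the alternating steps described above for a 1-D,
# two-component Gaussian mixture.
import numpy as np
from scipy.stats import norm

# Synthetic 1-D data drawn from two arbitrary Gaussians, just for illustration.
x = np.concatenate([np.random.normal(-2, 0.5, 100), np.random.normal(3, 1.0, 100)])

pi = np.array([0.5, 0.5])        # rough initialization of the parameters
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):              # repeat the two steps, hopefully until convergence
    # Step 1 (E-step): fix the parameters and estimate the sources via Bayes' rule,
    # normalizing the two unnormalized probabilities explicitly.
    unnorm = np.stack([pi[c] * norm.pdf(x, mu[c], sigma[c]) for c in range(2)])
    q = unnorm / unnorm.sum(axis=0)          # posterior p(t = c | x, theta)

    # Step 2 (M-step): fix the sources and re-estimate the parameters
    # with the weighted averages sketched earlier.
    for c in range(2):
        mu[c] = np.sum(q[c] * x) / np.sum(q[c])
        sigma[c] = np.sqrt(np.sum(q[c] * (x - mu[c]) ** 2) / np.sum(q[c]))
        pi[c] = np.mean(q[c])
```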