In the previous video we sketched an algorithm to fit a Gaussian Mixture Model to a data set. Now let's see how it can look on a particular set of data. So take this one-dimensional data set of points again, and let's try to find two Gaussians which explain this data well. We start with initialization, which here means choosing two random Gaussians and placing them somewhere. Let's say we put one Gaussian around 175 and the other around 180, and the variances are around, I don't know, 5. Something like that; it is a random initialization.

On the next step we have to estimate the sources, that is, the posterior probabilities of the latent variables: from which Gaussian each data point came. Let's do it here. You can see, for example, that the rightmost data points are certainly orange, because the orange Gaussian explains them much better than the blue one, right? That makes sense. The data points in between our randomly initialized Gaussians are kind of not settled: they are half blue, half orange, and we are not sure about their exact assignment to these Gaussians. The interesting story is about the leftmost point. On the one hand, both Gaussians think this data point is kind of weird, because it has really low density according to both of them, so neither of them really expects it to appear here. But on the other hand, the blue Gaussian's density is so much higher that this data point is almost completely blue. You can see that the blue density here is around 0.01, which is low, almost zero, while the orange density is around 0.00002, which is much lower. So the posterior probability of blue is around 1, which means this data point is almost certainly blue. Although neither Gaussian thinks this data point should exist, the blue Gaussian explains it much better, just because it is closer. Okay, so this is the end of the first sub-step of the first iteration.
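To make that first sub-step (the E-step) concrete, here is a minimal numerical sketch in Python of how those posterior probabilities, the responsibilities, can be computed for two one-dimensional Gaussians. The specific heights and parameter values below are illustrative assumptions, not the lecture's actual data.

```python
import numpy as np
from scipy.stats import norm

x = np.array([160.0, 172.0, 176.0, 178.0, 181.0, 185.0])  # hypothetical heights

mu = np.array([175.0, 180.0])  # hypothetical random initialization of the means
sigma = np.array([5.0, 5.0])   # initial spread of about 5 for each Gaussian
pi = np.array([0.5, 0.5])      # equal mixture weights to start

# Density of every point under each Gaussian: shape (n_points, 2)
dens = np.stack([norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)

# Responsibilities: posterior probability that each point came from each Gaussian.
# Even when both densities are tiny (e.g. around 0.01 for blue vs 0.00002 for
# orange on the leftmost point), normalizing makes the closer Gaussian's
# posterior come out near 1.
resp = pi * dens
resp /= resp.sum(axis=1, keepdims=True)
print(resp.round(3))  # each row sums to 1: "how blue / how orange" a point is
```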
We now have this kind of soft assignment: for each data point we know its latent variable, or at least a distribution over it. So now let's re-fit our Gaussians, let's re-estimate the parameters. It will look something like this: we fit one Gaussian to all the blue points, we fit the other Gaussian to all the orange points, and we get two new Gaussians. Note that the blue Gaussian is now shifted considerably to the left, just because it was assigned lots of points on the left-hand side of this plot, and it has to explain them, so it shifts to sit around these blue data points. This is the end of iteration one, and it is already a reasonable solution. We have not converged yet, but it is already an approximation of the solution, and from it we can already get some reasonable clustering: we can call all the orange points one cluster, all the blue points the other cluster, and for the points in between we can say we are not certain, so we can't say anything definite about these half-blue, half-orange data points.

But anyway, let's proceed a few iterations more. On the next iteration we re-assign: we recompute the assignments of the latent variables, so the colors of the data points change. Look carefully here; they change just a little bit. You can see that some of the data points which were half blue, half orange shifted to be just a little bit blue and much more orange. That is because our blue Gaussian is now further to the left, so it explains the data points around height 175 a little bit worse, and those data points became more orange than blue. But overall the situation didn't change much, so we are really close to convergence. On the next sub-step of the second iteration we estimate the parameters: the Gaussians become a little bit more certain about their locations, and their variances shrink. And finally, one more iteration: look carefully at the colors of the data points, they shift just a little bit more, and again we re-estimate the Gaussians. We have converged to a kind of solution of our original Gaussian Mixture Model fitting problem.
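As a complement to the E-step snippet above, here is a hedged sketch of the parameter re-estimation (the M-step we just ran a few times by hand). It assumes the x and resp arrays from that snippet, and the function name m_step is my own choice.

```python
import numpy as np

def m_step(x, resp):
    """Re-fit the two Gaussians using the soft assignments (responsibilities)."""
    nk = resp.sum(axis=0)                      # effective number of points per Gaussian
    pi = nk / len(x)                           # new mixture weights
    mu = (resp * x[:, None]).sum(axis=0) / nk  # responsibility-weighted means
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk  # weighted variances
    return pi, mu, np.sqrt(var)

# pi, mu, sigma = m_step(x, resp)
# The blue mean shifts toward the points it was (softly) assigned, exactly as in the demo.
```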
So this was the Expectation Maximisation algorithm for a Gaussian Mixture Model on this one-dimensional data. Note, however, that this algorithm doesn't give you an optimal solution: it doesn't give you the global maximum of your likelihood. Actually, it turns out that the problem of finding the global maximum of this likelihood is NP-hard, so you can't expect anyone to give you an optimal solution in reasonable time. But, well, it gives you something reasonable, right? Sometimes, though, this Expectation Maximisation suffers a lot from local maxima. We can see an example here. Let's initialize our Gaussians somewhere else, not where we initialized them in the first demo. Say we initialize both Gaussians on the left-hand side of the plot. On the first sub-step of the first iteration we estimate the sources, the distribution over the latent variables, and it looks like this: all the points to the right of about height 165 will be certainly orange, because the orange Gaussian explains them better, and the others will be certainly blue. It's not a very good assignment, but it's not the final solution yet, right? It is just the first sub-step of the first iteration, so it shouldn't hurt us too much. But it turns out that because of this mistake, because we assigned so many points to be orange, on the next sub-step we have to estimate the variance of the orange Gaussian to be high, like this, just because it has so many points on both the right-hand side and the left-hand side. And now we have a problem. At the start of iteration two, look at the leftmost point: it becomes a little bit orange, because now it is explained by the orange Gaussian about as well as by the blue Gaussian, since the orange Gaussian is high in this region. This is a disaster, because when we re-fit the Gaussians, the orange Gaussian will think that it not only has lots of orange points on the left and on the right, but that it also owns this one leftmost point. Well, not fully, just a little chunk of it, but it still contributes to the variance estimate, and the variance estimate of the orange Gaussian will be really high.
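This failure mode is easy to reproduce once the two sub-steps are wrapped in a loop. Below is a rough end-to-end sketch that combines the E- and M-step snippets above and also returns the training log-likelihood, so different initializations can be compared. The function name run_em, the iteration cap, and the tolerance are my assumptions, and there are no safeguards against degenerate components.

```python
import numpy as np
from scipy.stats import norm

def run_em(x, mu, sigma, pi, n_iter=50, tol=1e-6):
    """Alternate E- and M-steps until the log-likelihood stops improving."""
    prev_ll = -np.inf
    ll = prev_ll
    for _ in range(n_iter):
        # E-step: densities and responsibilities under the current parameters
        dens = np.stack([norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
        weighted = pi * dens
        ll = np.log(weighted.sum(axis=1)).sum()  # training log-likelihood
        if ll - prev_ll < tol:                   # converged, possibly to a local maximum
            break
        prev_ll = ll
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted re-estimation of the parameters
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma, ll
```

Starting this loop with both means on the far left of the data, as in the second demo, should end with a noticeably lower log-likelihood than the initialization from the first demo.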
The orange Gaussian ends up thinking it owns all the points on this plot, which makes its variance estimate stay high, while the blue Gaussian's variance estimate shrinks a little bit because it becomes more certain in its estimate, and we are stuck in this kind of situation. We can do a few more iterations, and we converge to this local maximum, which is certainly not optimal. The bottom line here is that Expectation Maximisation suffers from local maxima. As usual in optimization, the way to deal with this is to start several times from different random initializations and choose the best run. The nice thing here is that you can track the log-likelihood, and you can pick the best run according to its training log-likelihood, so at least you can compare the different runs and choose the best one.

Summary: the Gaussian Mixture Model is a flexible probabilistic model which you can fit to your data. It allows you to solve the clustering problem, and it also gives you a probabilistic model of your data, so you can, for example, sample new data points from it. The Expectation Maximisation algorithm is a way to train this Gaussian Mixture Model, and other latent variable models too, as we will see in the next few videos; but for now it is just for the Gaussian Mixture Model. Sometimes it can be faster than Stochastic Gradient Descent, and it also helps you handle complicated constraints, like keeping the covariance matrix positive semi-definite on each iteration. And finally, Expectation Maximisation suffers from local maxima. But, you know, that is kind of expected, because the overall problem is NP-hard, and Stochastic Gradient Descent would suffer from it as well: finding the optimal solution in reasonable time is simply not possible.
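As a small code sketch of the restart strategy mentioned above: it reuses the hypothetical run_em function from the earlier snippet, and the number of restarts and the initialization ranges are assumptions.

```python
import numpy as np

def fit_best_gmm(x, n_restarts=10, seed=0):
    """Run EM from several random initializations; keep the best log-likelihood."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        mu = rng.uniform(x.min(), x.max(), size=2)  # random means inside the data range
        sigma = np.full(2, x.std())                 # broad initial spread
        pi = np.array([0.5, 0.5])
        result = run_em(x, mu, sigma, pi)           # (pi, mu, sigma, log_likelihood)
        if best is None or result[-1] > best[-1]:   # compare training log-likelihoods
            best = result
    return best
```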