On the M-step of the Expectation Maximization algorithm for the restricted Gaussian mixture model, we have to maximize the following expression. We have to find the maximum of the sum, over the individual objects in our training data set, of the expected value with respect to the variational distribution q of t_i of the joint log-likelihood of x_i and t_i given the parameters, and here the parameters are just mu. So we maximize this expression with respect to mu: mu_1 up to mu_K, where K is the number of clusters. We could maximize this thing explicitly by taking the derivative and setting it to zero, but recall that in the previous video we maximized the same expression for the general Gaussian mixture model, so we can reuse the results from that video here.

Recall that the optimal parameter for mu number c, for example, was the following: the average over all the data points in the training set, weighted by the variational distribution q of t_i equals c, which is the weight assigned to the Gaussian number c, of the data points x_i, divided by the normalization, which is the sum of all the weights q of t_i equals c. Recall also that in our particular case, on the E-step, we are using delta functions instead of the full variational distribution q. Our q looks like this: q of t_i equals one if t_i equals c_i-star, which we have already pre-computed on the E-step, and zero otherwise.

We can now simplify the expression by plugging in this definition of q of t_i. Any object whose t_i is not equal to c_i-star, that is, any object that was not assigned to the Gaussian number c, has weight q equal to zero in both the numerator and the denominator, so it does not contribute to the sum. Thus, we can write it down like this: the expression becomes a sum over only the objects assigned to the Gaussian number c.
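In symbols, the quantities just described look roughly like this (a sketch in standard notation following the spoken description; N, the number of training points, and K, the number of clusters, are assumed labels):

\[
\max_{\mu_1,\dots,\mu_K}\ \sum_{i=1}^{N}\mathbb{E}_{q(t_i)}\bigl[\log p(x_i,\, t_i \mid \mu)\bigr]
\]
\[
\mu_c \;=\; \frac{\sum_{i=1}^{N} q(t_i = c)\, x_i}{\sum_{i=1}^{N} q(t_i = c)},
\qquad
q(t_i = c) \;=\;
\begin{cases}
1 & \text{if } c = c_i^{\ast},\\
0 & \text{otherwise.}
\end{cases}
\]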
That is, for all i such that c_i-star equals c, we sum up these data points and divide them by the normalization, which is just the number of such data points: the number of i such that c_i-star equals c. And this is exactly the formula from the second sub-step of the K-means algorithm, the centroid update.

To summarize, we have just shown that if we take the M-step of the expectation maximization algorithm applied to this restricted Gaussian mixture model, where some parameters are fixed and only the locations mu of the Gaussians are left free, and if we use the restricted q, the delta function from the restricted E-step, then on the M-step we obtain this kind of formula: we update each mu_c to be the center of all the data points assigned to cluster number c, exactly as we had in K-means.

So to summarize, K-means is actually a special case of applying the expectation maximization algorithm to a Gaussian mixture model. But first of all, this Gaussian mixture model has to be a restricted one, where we fix the covariance matrices to be the identity and the prior probability of each Gaussian c to be uniform. We also have to use a simplified E-step: instead of considering the full posterior distributions on the E-step, we approximate them with delta functions. This makes the E-step and the M-step faster, but it also restricts the flexibility of the model and the efficiency of the EM algorithm. But anyway, this kind of restriction still gives you a valid EM algorithm, which is still guaranteed to converge and has nice properties.

It's really cool: we have just taken a usual machine learning algorithm for clustering, which is not probabilistic in any sense, and shown that it is a special case of the EM algorithm applied to some restricted Gaussian mixture model. Using this knowledge we can, for example, change something in K-means and know how to adapt the training algorithm to these changes. For example, we can introduce missing data: this way we can build a missing-data K-means without heuristics like setting the missing values to zero, but by just extending this restricted EM algorithm and handling the missing data naturally.
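To make this equivalence concrete, here is a minimal NumPy sketch of the restricted EM updates. This is not code from the lecture; the function names, array shapes, and the random toy data are illustrative assumptions.

```python
import numpy as np

def restricted_e_step(X, mu):
    """Simplified E-step: q(t_i) is a delta function at the closest mean,
    c_i* = argmin_c ||x_i - mu_c||^2 (covariances fixed to the identity,
    uniform prior over the Gaussians)."""
    # Squared distance from every point to every mean, shape (N, K).
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def restricted_m_step(X, c_star, K):
    """M-step with the delta-function q: mu_c is the average of the points
    assigned to cluster c -- the K-means centroid update (assumes every
    cluster received at least one point)."""
    return np.stack([X[c_star == c].mean(axis=0) for c in range(K)])

# Illustrative usage with random data and random initial means.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu = rng.normal(size=(3, 2))
for _ in range(10):  # alternating restricted E- and M-steps
    c_star = restricted_e_step(X, mu)
    mu = restricted_m_step(X, c_star, K=3)
```

Running this loop is exactly Lloyd's K-means iteration: the restricted E-step plays the role of the assignment step, and the restricted M-step is the centroid update.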
Actually, this trick of simplifying the E-step by considering only a restricted set of distributions q is strongly connected to something called variational inference, which we will discuss in week four.