1
00:00:00,090 --> 00:00:01,540
Most of the supervised learning algorithms

2
00:00:01,690 --> 00:00:02,890
we've seen, things like linear

3
00:00:03,130 --> 00:00:04,730
regression, logistic regression and

4
00:00:04,930 --> 00:00:05,850
so on. All of those

5
00:00:06,300 --> 00:00:08,089
algorithms have an optimization objective

6
00:00:08,670 --> 00:00:10,920
or some cost function that the algorithm was trying to minimize.

7
00:00:11,920 --> 00:00:13,180
It turns out that K-means also

8
00:00:13,770 --> 00:00:15,730
has an optimization objective or

9
00:00:15,870 --> 00:00:18,720
a cost function that is trying to minimize.

10
00:00:19,630 --> 00:00:20,180
And in this video, I'd like to tell

11
00:00:20,230 --> 00:00:23,620
you what that optimization objective is.

12
00:00:23,730 --> 00:00:24,420
And the reason I want to do so

13
00:00:24,750 --> 00:00:26,960
is because this will be useful to us for two purposes.

14
00:00:28,020 --> 00:00:29,330
First, knowing what is the

15
00:00:29,480 --> 00:00:30,890
optimization objective of K-means

16
00:00:31,150 --> 00:00:32,390
will help us to

17
00:00:32,690 --> 00:00:33,970
debug the learning algorithm and

18
00:00:34,070 --> 00:00:35,080
just make sure that K-means is

19
00:00:35,300 --> 00:00:37,100
running correctly, and second,

20
00:00:37,610 --> 00:00:39,290
and perhaps even more importantly, in

21
00:00:39,530 --> 00:00:41,290
a later video we'll talk

22
00:00:41,490 --> 00:00:42,580
about how we can use this to

23
00:00:42,730 --> 00:00:44,000
help K-means find better clusters

24
00:00:44,070 --> 00:00:46,290
and avoid local optima, but we'll do that in a later video that follows this one.

25
00:00:46,410 --> 00:00:47,330
Just as a quick reminder, while K-means is

26
00:00:49,680 --> 00:00:52,870
running we're going to be

27
00:00:54,450 --> 00:00:55,820
keeping track of two sets of variables.

28
00:00:56,430 --> 00:00:58,390
First is the CI's and

29
00:00:58,700 --> 00:00:59,830
that keeps track of the

30
00:01:00,190 --> 00:01:01,600
index or the number of the cluster

31
00:01:02,730 --> 00:01:05,040
to which an example x(i) is currently assigned.

32
00:01:05,230 --> 00:01:05,960
And then, the other set of variables

33
00:01:06,540 --> 00:01:07,580
we use as Mu subscript

34
00:01:08,120 --> 00:01:09,410
K, which is the location

35
00:01:10,140 --> 00:01:12,110
of cluster centroid K. And,

36
00:01:12,380 --> 00:01:13,750
again, for K-means

37
00:01:14,030 --> 00:01:17,230
we use capital K to denote the total number of clusters.

38
00:01:17,890 --> 00:01:19,310
And here lower case K,

39
00:01:20,010 --> 00:01:20,910
you know, is going to be an

40
00:01:21,040 --> 00:01:22,650
index into the cluster

41
00:01:22,970 --> 00:01:23,930
centroids, and so lower

42
00:01:24,030 --> 00:01:24,940
case k is going to be

43
00:01:25,140 --> 00:01:26,390
a number between 1 and

44
00:01:26,600 --> 00:01:29,630
capital K. Now, here's

45
00:01:29,840 --> 00:01:31,040
one more bit of notation which

46
00:01:31,270 --> 00:01:32,280
is going to use Mu

47
00:01:32,630 --> 00:01:34,560
subscript c(i) to denote

48
00:01:34,970 --> 00:01:36,660
the cluster centroid of the

49
00:01:36,780 --> 00:01:38,400
cluster to which example x(i)

50
00:01:38,880 --> 00:01:40,500
has been assigned and

51
00:01:40,710 --> 00:01:42,030
to explain that notation

52
00:01:42,450 --> 00:01:43,450
a little bit more, let's

53
00:01:43,660 --> 00:01:45,600
say that x(i) has been

54
00:01:45,740 --> 00:01:47,760
assigned to cluster number five.

55
00:01:48,880 --> 00:01:49,830
What that means is that c(i),

56
00:01:50,850 --> 00:01:52,290
that is the index of x(i),

57
00:01:53,130 --> 00:01:54,300
that that is equal to 5.

58
00:01:54,420 --> 00:01:57,640
Right? Because you know, having c(i) equals 5,

59
00:01:57,800 --> 00:01:59,270
that's what it means for the

60
00:02:00,500 --> 00:02:01,720
example x(i) to be

61
00:02:01,910 --> 00:02:03,440
assigned to cluster number 5.

62
00:02:03,510 --> 00:02:05,700
And so Mu subscript

63
00:02:06,290 --> 00:02:07,960
c(i) is going to

64
00:02:08,100 --> 00:02:09,630
be equal to Mu subscript

65
00:02:10,080 --> 00:02:12,260
5 because c(i) is equal

66
00:02:13,700 --> 00:02:14,100
to 5.

67
00:02:15,100 --> 00:02:16,570
This Mu substitute c(i) is the

68
00:02:16,660 --> 00:02:18,420
cluster centroid of cluster number

69
00:02:18,730 --> 00:02:19,670
5, which is the cluster

70
00:02:20,120 --> 00:02:22,480
to which my example x(i) has been assigned.

71
00:02:23,470 --> 00:02:24,730
With this notation, we're now

72
00:02:24,960 --> 00:02:26,040
ready to write out what

73
00:02:26,200 --> 00:02:28,150
is the optimization objective of

74
00:02:28,290 --> 00:02:30,360
the K Mu clustering algorithm.

75
00:02:30,760 --> 00:02:30,800
And here it is.

76
00:02:31,330 --> 00:02:32,940
The cost function that K-means

77
00:02:33,040 --> 00:02:34,380
is minimizing is the

78
00:02:34,570 --> 00:02:35,770
function J of all of

79
00:02:35,880 --> 00:02:37,470
these parameters c1 through

80
00:02:37,890 --> 00:02:39,610
cM, Mu1 through MuK, that

81
00:02:39,790 --> 00:02:41,570
K-means is varying as the algorithm runs.

82
00:02:42,100 --> 00:02:43,930
And the optimization objective is shown

83
00:02:44,160 --> 00:02:45,520
on the right, is the average of

84
00:02:45,610 --> 00:02:46,430
one over M of the sum

85
00:02:46,620 --> 00:02:48,730
of i equals one through M of this term here

86
00:02:50,400 --> 00:02:52,670
that I've just drawn the red box around.

87
00:02:52,870 --> 00:02:54,680
The squared distance between

88
00:02:55,160 --> 00:02:57,540
each example x(i) and the

89
00:02:57,690 --> 00:02:58,740
location of the cluster

90
00:02:59,130 --> 00:03:00,210
centroid to which x(i)

91
00:03:01,320 --> 00:03:01,920
has been assigned.

92
00:03:03,240 --> 00:03:06,070
So let me just draw this in, let me explain this.

93
00:03:06,240 --> 00:03:07,800
Here is the location of training

94
00:03:08,190 --> 00:03:09,780
example x(i), and here's the location

95
00:03:10,410 --> 00:03:11,760
of the cluster centroid to which

96
00:03:11,970 --> 00:03:13,660
example x(i) has been assigned.

97
00:03:14,560 --> 00:03:17,080
So to explain this in pictures, if here is X1, X2.

98
00:03:17,420 --> 00:03:19,540
And if a point

99
00:03:19,760 --> 00:03:21,210
here, is my example

100
00:03:22,080 --> 00:03:23,060
x(i), so if that

101
00:03:23,110 --> 00:03:24,840
is equal to my example x(i),

102
00:03:25,860 --> 00:03:27,000
and if x(i) has been assigned

103
00:03:27,240 --> 00:03:28,270
to some cluster centroid, and

104
00:03:28,340 --> 00:03:30,240
I'll denote my cluster centroid with a cross.

105
00:03:30,630 --> 00:03:32,130
So if that's the location of,

106
00:03:32,300 --> 00:03:33,830
you know, Mu 5, let's

107
00:03:34,370 --> 00:03:35,640
say, if x(i) has been

108
00:03:35,850 --> 00:03:37,960
assigned to cluster centroid 5 in my example up there.

109
00:03:38,810 --> 00:03:40,660
Then, the squared distance, that's

110
00:03:40,940 --> 00:03:41,840
the squared of the distance

111
00:03:43,810 --> 00:03:46,010
between the point x(i) and this

112
00:03:46,220 --> 00:03:48,400
cluster centroid, to which x(i) has been assigned.

113
00:03:49,570 --> 00:03:50,720
And what K-means can be shown

114
00:03:51,070 --> 00:03:52,540
to be doing is that, it

115
00:03:52,680 --> 00:03:54,480
is trying to find parameters c(i)

116
00:03:55,270 --> 00:03:57,410
and Mu(i), try to

117
00:03:57,570 --> 00:03:58,840
find cMU to try to

118
00:03:58,960 --> 00:04:00,450
minimize this cost function J.

119
00:04:01,440 --> 00:04:03,180
This cost function is sometimes

120
00:04:03,680 --> 00:04:06,770
also called the distortion cost

121
00:04:07,060 --> 00:04:10,030
function or the distortion of

122
00:04:10,240 --> 00:04:12,130
the K-means algorithm.

123
00:04:12,790 --> 00:04:13,360
And, just to provide a little bit more

124
00:04:13,630 --> 00:04:15,750
detail, here's the K-means algorithm,

125
00:04:15,820 --> 00:04:16,450
Here's exactly the algorithm as we have it, in the real

126
00:04:16,610 --> 00:04:17,960
form the earlier slide.

127
00:04:18,950 --> 00:04:20,200
And what this first step

128
00:04:21,030 --> 00:04:23,120
of this algorithm is, this was

129
00:04:23,830 --> 00:04:25,910
the cluster assignment step

130
00:04:27,920 --> 00:04:29,850
where we assign each

131
00:04:30,030 --> 00:04:32,910
point to the cluster centroid, and

132
00:04:33,010 --> 00:04:34,830
it's possible to show mathematically that

133
00:04:35,050 --> 00:04:36,210
what the cluster assignment step

134
00:04:36,450 --> 00:04:38,560
is doing is exactly minimizing

135
00:04:40,770 --> 00:04:42,950
J with respect

136
00:04:43,420 --> 00:04:45,900
to the variables, C1, C2

137
00:04:46,380 --> 00:04:48,050
and so on, up

138
00:04:48,170 --> 00:04:52,030
to C(m), while holding the

139
00:04:52,480 --> 00:04:54,240
closest centroids, MU1 up to

140
00:04:54,720 --> 00:04:57,000
MUK fixed.

141
00:04:58,580 --> 00:04:59,640
So, what the first assignment step

142
00:04:59,900 --> 00:05:00,990
does is you know, it doesn't change

143
00:05:01,240 --> 00:05:02,850
the cluster centroids, but what it's

144
00:05:02,960 --> 00:05:05,730
doing is, exactly picking the values of C1, C2, up to CM

145
00:05:07,790 --> 00:05:10,240
that minimizes the cost

146
00:05:10,500 --> 00:05:11,790
function or the distortion function,

147
00:05:12,510 --> 00:05:14,440
J. And it's possible to prove

148
00:05:14,670 --> 00:05:16,550
that mathematically but, I won't do so here.

149
00:05:17,170 --> 00:05:18,210
That has a pretty intuitive meaning

150
00:05:18,610 --> 00:05:19,630
of just, yo know, well let's assign

151
00:05:20,090 --> 00:05:21,040
these points to the cluster centroid

152
00:05:21,530 --> 00:05:22,820
that is closest to it, because

153
00:05:23,120 --> 00:05:24,160
that's what minimizes the square

154
00:05:24,660 --> 00:05:26,860
of distance between the points and the corresponding cluster centroid.

155
00:05:27,840 --> 00:05:29,090
And then the other part of

156
00:05:29,790 --> 00:05:32,880
the second step of K-means, this second step over here.

157
00:05:33,960 --> 00:05:35,480
This second step was the move

158
00:05:35,690 --> 00:05:38,770
centroid step and,

159
00:05:39,000 --> 00:05:40,020
once again, I won't prove it,

160
00:05:40,510 --> 00:05:41,250
but it can be shown

161
00:05:41,520 --> 00:05:42,590
mathematically, that what the

162
00:05:43,140 --> 00:05:44,910
root centroid step does, is

163
00:05:45,150 --> 00:05:46,740
it chooses the values

164
00:05:47,260 --> 00:05:49,370
of mu that minimizes J.

165
00:05:50,150 --> 00:05:51,270
So it minimizes the cost function

166
00:05:51,650 --> 00:05:53,000
J with respect to,

167
00:05:53,380 --> 00:05:54,710
where wrt is my

168
00:05:54,920 --> 00:05:56,930
abbreviation for with respect to.

169
00:05:57,030 --> 00:05:58,380
But it minimizes J with respect

170
00:05:58,790 --> 00:06:01,930
to the locations of the cluster centroids, Mu1 through MuK.

171
00:06:02,040 --> 00:06:05,690
So, what K-means really

172
00:06:05,790 --> 00:06:06,910
is doing is it's taking the

173
00:06:07,010 --> 00:06:08,380
two sets of variables and

174
00:06:09,070 --> 00:06:11,210
partitioning them into two halves right here.

175
00:06:11,550 --> 00:06:14,490
First the C set of variables and then you have the Mu sets of variables.

176
00:06:15,450 --> 00:06:15,990
And what it does is it first

177
00:06:16,560 --> 00:06:17,750
minimizes J with respect

178
00:06:18,050 --> 00:06:19,350
to the variable C, and then minimizes

179
00:06:19,700 --> 00:06:20,610
J with respect the variables

180
00:06:21,120 --> 00:06:22,590
Mu, and then it keeps on iterating.

181
00:06:25,180 --> 00:06:26,680
And so that's all that K-means does.

182
00:06:27,700 --> 00:06:28,570
And, now that we understand

183
00:06:29,150 --> 00:06:30,870
K-means, let's try to

184
00:06:31,030 --> 00:06:32,190
minimize this cost function J.  We

185
00:06:32,430 --> 00:06:33,640
can also use this to

186
00:06:33,800 --> 00:06:34,890
try to debug our learning

187
00:06:35,090 --> 00:06:36,350
algorithm and just kind

188
00:06:36,520 --> 00:06:37,980
of make sure that our implementation

189
00:06:38,900 --> 00:06:39,950
of K-means is running correctly.

190
00:06:41,220 --> 00:06:42,560
So, we now understand the

191
00:06:43,070 --> 00:06:44,260
K-means algorithm as trying to

192
00:06:44,610 --> 00:06:45,960
optimize this cost function J,

193
00:06:46,640 --> 00:06:48,790
which is also called the dispulsion function.

194
00:06:50,650 --> 00:06:51,600
We can use that to debug K-means

195
00:06:52,090 --> 00:06:53,060
and help me show that K-means

196
00:06:53,130 --> 00:06:54,050
is converging, and that it's

197
00:06:54,510 --> 00:06:56,160
running properly, and in the

198
00:06:56,240 --> 00:06:57,460
next video, we'll also see

199
00:06:57,690 --> 00:06:59,040
how we can us this to

200
00:06:59,120 --> 00:07:00,650
help K-means find better clusters

201
00:07:01,300 --> 00:07:03,240
and help K-means to avoid local optima.