1
00:00:00,300 --> 00:00:02,220
In the clustering problem we are

2
00:00:02,360 --> 00:00:03,630
given an unlabeled data

3
00:00:03,950 --> 00:00:05,040
set and we would like

4
00:00:05,200 --> 00:00:06,480
to have an algorithm automatically

5
00:00:07,320 --> 00:00:08,700
group the data into coherent

6
00:00:09,340 --> 00:00:11,000
subsets or into coherent clusters for us.

7
00:00:12,380 --> 00:00:14,160
The K Means algorithm is by

8
00:00:14,310 --> 00:00:15,860
far the most popular, by

9
00:00:16,090 --> 00:00:17,410
far the most widely used clustering

10
00:00:17,780 --> 00:00:19,380
algorithm, and in this

11
00:00:19,550 --> 00:00:20,320
video I would like to tell

12
00:00:20,570 --> 00:00:23,400
you what the K Means Algorithm is and how it works.

13
00:00:27,000 --> 00:00:29,310
The K means clustering algorithm is best illustrated in pictures.

14
00:00:29,960 --> 00:00:30,770
Let's say I want to take

15
00:00:31,080 --> 00:00:32,330
an unlabeled data set like

16
00:00:32,490 --> 00:00:34,040
the one shown here, and I

17
00:00:34,100 --> 00:00:36,450
want to group the data into two clusters.

18
00:00:37,710 --> 00:00:38,740
If I run the K Means clustering

19
00:00:39,080 --> 00:00:41,560
algorithm, here is what

20
00:00:41,910 --> 00:00:44,190
I'm going to do. The first step is to randomly initialize two

21
00:00:44,410 --> 00:00:45,920
points, called the cluster centroids.

22
00:00:46,700 --> 00:00:48,170
So, these two crosses here,

23
00:00:49,010 --> 00:00:51,730
these are called the Cluster Centroids

24
00:00:53,270 --> 00:00:54,320
and I have two of them

25
00:00:55,100 --> 00:00:57,840
because I want to group my data into two clusters.

26
00:00:59,130 --> 00:01:02,400
K Means is an iterative algorithm and it does two things.

27
00:01:03,480 --> 00:01:04,790
First is a cluster assignment

28
00:01:05,330 --> 00:01:07,800
step, and second is a move centroid step.

29
00:01:08,360 --> 00:01:09,630
So, let me tell you what those things mean.

30
00:01:11,170 --> 00:01:12,520
The first of the two steps in the

31
00:01:12,700 --> 00:01:14,930
loop of K means, is this cluster assignment step.

32
00:01:15,840 --> 00:01:17,070
What that means is that, it's

33
00:01:17,220 --> 00:01:18,360
going through each of the

34
00:01:18,700 --> 00:01:19,880
examples, each of these green

35
00:01:20,170 --> 00:01:22,120
dots shown here and depending

36
00:01:22,580 --> 00:01:24,140
on whether it's closer to the

37
00:01:24,350 --> 00:01:25,530
red cluster centroid or the

38
00:01:25,620 --> 00:01:27,390
blue cluster centroid, it is going

39
00:01:27,560 --> 00:01:28,570
to assign each of the

40
00:01:28,670 --> 00:01:30,670
data points to one of the two cluster centroids.

41
00:01:32,040 --> 00:01:33,350
Specifically, what I mean

42
00:01:33,460 --> 00:01:34,610
by that, is to go through your

43
00:01:34,730 --> 00:01:36,930
data set and color each

44
00:01:37,130 --> 00:01:38,510
of the points either red or

45
00:01:38,810 --> 00:01:39,890
blue, depending on whether

46
00:01:40,160 --> 00:01:41,060
it is closer to the red

47
00:01:41,170 --> 00:01:42,150
cluster centroid or the blue

48
00:01:42,470 --> 00:01:45,210
cluster centroid, and I've done that in this diagram here.

49
00:01:46,930 --> 00:01:48,700
So, that was the cluster assignment step.

50
00:01:49,780 --> 00:01:52,270
The other part of K means, in the

51
00:01:52,410 --> 00:01:53,390
loop of K means, is the move

52
00:01:53,590 --> 00:01:54,860
centroid step, and what

53
00:01:55,020 --> 00:01:55,730
we are going to do is, we

54
00:01:55,800 --> 00:01:56,890
are going to take the two cluster centroids,

55
00:01:57,390 --> 00:01:58,550
that is, the red cross and

56
00:01:58,830 --> 00:02:00,270
the blue cross, and we are

57
00:02:00,420 --> 00:02:01,420
going to move them to the average

58
00:02:02,070 --> 00:02:03,900
of the points colored the same colour.

59
00:02:04,880 --> 00:02:05,700
So what we are going

60
00:02:05,730 --> 00:02:06,510
to do is look at all the

61
00:02:06,630 --> 00:02:07,810
red points and compute the

62
00:02:08,240 --> 00:02:09,520
average, really the mean

63
00:02:10,080 --> 00:02:11,500
of the location of all the red points,

64
00:02:11,650 --> 00:02:13,690
and we are going to move the red cluster centroid there.

65
00:02:14,190 --> 00:02:15,260
And the same things for the

66
00:02:15,460 --> 00:02:16,370
blue cluster centroid, look at all

67
00:02:16,560 --> 00:02:17,720
the blue dots and compute their

68
00:02:17,840 --> 00:02:19,710
mean, and then move the blue cluster centroid there.

69
00:02:20,320 --> 00:02:20,880
So, let me do that now.

70
00:02:21,170 --> 00:02:22,990
We're going to move the cluster centroids as follows

71
00:02:24,590 --> 00:02:27,350
and I've now moved them to their new means.

72
00:02:28,300 --> 00:02:29,760
The red one moved like that

73
00:02:29,820 --> 00:02:31,350
and the blue one moved

74
00:02:31,510 --> 00:02:34,460
like that and the red one moved like that.

75
00:02:34,620 --> 00:02:35,460
And then we go back to another cluster

76
00:02:35,910 --> 00:02:36,920
assignment step, so we're again

77
00:02:37,190 --> 00:02:38,090
going to look at all of

78
00:02:38,160 --> 00:02:39,670
my unlabeled examples and depending

79
00:02:40,090 --> 00:02:42,840
on whether it's closer the red or the blue cluster centroid,

80
00:02:43,340 --> 00:02:45,150
I'm going to color them either red or blue.

81
00:02:45,640 --> 00:02:47,160
I'm going to assign each point

82
00:02:47,530 --> 00:02:48,550
to one of the two cluster centroids, so let me do that now.

83
00:02:51,450 --> 00:02:52,260
And so the colors of some of the points just changed.

84
00:02:53,400 --> 00:02:55,690
And then I'm going to do another move centroid step.

85
00:02:56,040 --> 00:02:56,810
So I'm going to compute the

86
00:02:57,070 --> 00:02:57,880
average of all the blue points,

87
00:02:58,110 --> 00:02:59,000
compute the average of all

88
00:02:59,040 --> 00:03:00,360
the red points and move my

89
00:03:00,480 --> 00:03:03,770
cluster centroids like this, and

90
00:03:03,930 --> 00:03:05,650
so, let's do that again.

91
00:03:06,160 --> 00:03:07,810
Let me do one more cluster assignment step.

92
00:03:08,320 --> 00:03:09,450
So colour each point red

93
00:03:09,620 --> 00:03:10,840
or blue, based on what it's

94
00:03:11,170 --> 00:03:13,070
closer to and then

95
00:03:13,310 --> 00:03:20,000
do another move centroid step and we're done.

96
00:03:20,350 --> 00:03:21,230
And in fact if you

97
00:03:21,290 --> 00:03:23,250
keep running additional iterations of

98
00:03:23,500 --> 00:03:26,020
K means from here the

99
00:03:26,160 --> 00:03:27,240
cluster centroids will not change

100
00:03:27,540 --> 00:03:28,770
any further and the colours of

101
00:03:28,830 --> 00:03:29,760
the points will not change any

102
00:03:29,940 --> 00:03:31,520
further. And so, this is

103
00:03:31,810 --> 00:03:33,520
the, at this point,

104
00:03:33,770 --> 00:03:35,290
K means has converged and it's

105
00:03:35,400 --> 00:03:36,430
done a pretty good job finding

106
00:03:37,470 --> 00:03:38,750
the two clusters in this data.

107
00:03:39,360 --> 00:03:40,310
Let's write out the K means algorithm more formally.

108
00:03:42,150 --> 00:03:43,930
The K means algorithm takes two inputs.

109
00:03:44,570 --> 00:03:46,200
One is a parameter K,

110
00:03:46,450 --> 00:03:47,260
which is the number of clusters

111
00:03:47,830 --> 00:03:48,900
you want to find in the data.

112
00:03:49,640 --> 00:03:50,820
I'll later say how we might

113
00:03:51,170 --> 00:03:53,290
go about trying to choose k, but

114
00:03:53,470 --> 00:03:54,600
for now let's just say that

115
00:03:55,110 --> 00:03:56,210
we've decided we want a

116
00:03:56,360 --> 00:03:57,600
certain number of clusters and we're

117
00:03:57,690 --> 00:03:58,810
going to tell the algorithm how many

118
00:03:59,040 --> 00:04:00,730
clusters we think there are in the data set.

119
00:04:01,170 --> 00:04:02,120
And then K means also

120
00:04:02,490 --> 00:04:03,430
takes as input this sort

121
00:04:03,880 --> 00:04:05,060
of unlabeled training set of

122
00:04:05,250 --> 00:04:06,530
just the Xs and

123
00:04:06,710 --> 00:04:08,430
because this is unsupervised learning, we

124
00:04:08,520 --> 00:04:10,690
don't have the labels Y anymore.

125
00:04:10,980 --> 00:04:12,470
And for unsupervised learning of

126
00:04:12,740 --> 00:04:14,020
the K means I'm going to

127
00:04:14,550 --> 00:04:16,160
use the convention that XI

128
00:04:16,420 --> 00:04:17,750
is an RN dimensional vector.

129
00:04:18,280 --> 00:04:19,190
And that's why my training examples

130
00:04:19,750 --> 00:04:22,460
are now N dimensional rather N plus one dimensional vectors.

131
00:04:24,340 --> 00:04:25,430
This is what the K means algorithm does.

132
00:04:27,180 --> 00:04:28,630
The first step is that it

133
00:04:28,790 --> 00:04:31,170
randomly initializes k cluster

134
00:04:31,570 --> 00:04:33,550
centroids which we will

135
00:04:33,820 --> 00:04:34,610
call mu 1, mu 2, up

136
00:04:34,840 --> 00:04:36,250
to mu k. And so

137
00:04:36,650 --> 00:04:38,450
in the earlier diagram, the

138
00:04:38,550 --> 00:04:40,770
cluster centroids corresponded to the

139
00:04:41,060 --> 00:04:42,240
location of the red cross

140
00:04:42,660 --> 00:04:44,020
and the location of the blue cross.

141
00:04:44,410 --> 00:04:45,640
So there we had two cluster

142
00:04:45,960 --> 00:04:47,000
centroids, so maybe the red

143
00:04:47,170 --> 00:04:48,470
cross was mu 1

144
00:04:48,650 --> 00:04:49,940
and the blue cross was mu

145
00:04:50,300 --> 00:04:51,360
2, and more generally we would have

146
00:04:51,820 --> 00:04:53,830
k cluster centroids rather than just 2.

147
00:04:54,520 --> 00:04:56,240
Then the inner loop

148
00:04:56,520 --> 00:04:57,360
of k means does the following,

149
00:04:57,830 --> 00:04:59,020
we're going to repeatedly do the following.

150
00:05:00,070 --> 00:05:01,950
First for each of

151
00:05:02,160 --> 00:05:03,920
my training examples, I'm going

152
00:05:04,110 --> 00:05:05,950
to set this variable CI

153
00:05:06,790 --> 00:05:07,960
to be the index 1 through

154
00:05:08,170 --> 00:05:10,520
K of the cluster centroid closest to XI.

155
00:05:11,170 --> 00:05:13,810
So this was my cluster assignment

156
00:05:14,330 --> 00:05:16,870
step, where we

157
00:05:17,000 --> 00:05:18,680
took each of my examples and

158
00:05:18,980 --> 00:05:20,740
coloured it either red

159
00:05:21,050 --> 00:05:22,050
or blue, depending on which

160
00:05:22,380 --> 00:05:23,940
cluster centroid it was closest to.

161
00:05:24,140 --> 00:05:25,090
So CI is going to be

162
00:05:25,280 --> 00:05:26,280
a number from 1 to

163
00:05:26,380 --> 00:05:27,680
K that tells us, you

164
00:05:27,780 --> 00:05:28,760
know, is it closer to the

165
00:05:28,920 --> 00:05:29,820
red cross or is it

166
00:05:29,900 --> 00:05:31,170
closer to the blue cross,

167
00:05:32,200 --> 00:05:33,210
and another way of writing this

168
00:05:33,580 --> 00:05:35,350
is I'm going to,

169
00:05:35,620 --> 00:05:37,820
to compute Ci, I'm

170
00:05:37,890 --> 00:05:39,120
going to take my Ith

171
00:05:39,380 --> 00:05:41,170
example Xi and and

172
00:05:41,360 --> 00:05:42,670
I'm going to measure it's distance

173
00:05:43,900 --> 00:05:44,860
to each of my cluster centroids,

174
00:05:45,410 --> 00:05:46,690
this is mu and then

175
00:05:47,060 --> 00:05:48,640
lower-case k, right, so

176
00:05:48,890 --> 00:05:50,630
capital K is the total

177
00:05:50,910 --> 00:05:51,900
number centroids and I'm going

178
00:05:52,100 --> 00:05:53,160
to use lower case k here

179
00:05:53,770 --> 00:05:55,140
to index into the different centroids.

180
00:05:56,240 --> 00:05:58,470
But so, Ci is going to, I'm going

181
00:05:58,550 --> 00:06:00,110
to minimize over my values

182
00:06:00,550 --> 00:06:01,930
of k and find the

183
00:06:02,120 --> 00:06:03,650
value of K that minimizes this

184
00:06:03,900 --> 00:06:04,750
distance between Xi and the

185
00:06:04,800 --> 00:06:06,130
cluster centroid, and then,

186
00:06:06,340 --> 00:06:08,990
you know, the

187
00:06:09,070 --> 00:06:10,350
value of k that minimizes

188
00:06:10,940 --> 00:06:12,160
this, that's what gets set in

189
00:06:12,300 --> 00:06:14,100
Ci. So, here's

190
00:06:14,360 --> 00:06:16,470
another way of writing out what Ci is.

191
00:06:18,050 --> 00:06:19,150
If I write the norm between

192
00:06:19,270 --> 00:06:21,500
Xi minus Mu-k,

193
00:06:23,000 --> 00:06:24,120
then this is the distance between

194
00:06:24,630 --> 00:06:26,040
my ith training example

195
00:06:26,180 --> 00:06:27,350
Xi and the cluster centroid

196
00:06:28,140 --> 00:06:30,280
Mu subscript K, this is--this

197
00:06:31,150 --> 00:06:32,830
here, that's a lowercase K. So uppercase

198
00:06:33,320 --> 00:06:34,710
K is going to be

199
00:06:34,980 --> 00:06:36,210
used to denote the total

200
00:06:36,450 --> 00:06:38,020
number of cluster centroids,

201
00:06:38,770 --> 00:06:40,430
and this lowercase K's

202
00:06:40,790 --> 00:06:41,840
a number between one and

203
00:06:41,960 --> 00:06:42,940
capital K. I'm just using

204
00:06:43,210 --> 00:06:44,450
lower case K to index

205
00:06:44,930 --> 00:06:45,990
into my different cluster centroids.

206
00:06:47,130 --> 00:06:49,020
Next is lower case k. So

207
00:06:50,050 --> 00:06:51,330
that's the distance between the example and the cluster centroid

208
00:06:51,490 --> 00:06:52,810
and so what I'm going to

209
00:06:53,050 --> 00:06:54,330
do is find the value

210
00:06:55,250 --> 00:06:56,390
of K, of lower case

211
00:06:56,710 --> 00:06:58,900
k that minimizes this, and

212
00:06:59,080 --> 00:07:00,320
so the value of

213
00:07:00,480 --> 00:07:02,100
k that minimizes you know,

214
00:07:02,280 --> 00:07:03,610
that's what I'm going to

215
00:07:04,000 --> 00:07:06,560
set as Ci, and

216
00:07:06,760 --> 00:07:07,850
by convention here I've written

217
00:07:08,190 --> 00:07:09,430
the distance between Xi and

218
00:07:09,480 --> 00:07:11,310
the cluster centroid, by convention

219
00:07:11,820 --> 00:07:13,330
people actually tend to write this as the squared distance.

220
00:07:13,780 --> 00:07:15,370
So we think of Ci as picking

221
00:07:15,660 --> 00:07:16,860
the cluster centroid with the smallest

222
00:07:17,450 --> 00:07:20,110
squared distance to my training example Xi.

223
00:07:20,750 --> 00:07:22,080
But of course minimizing squared distance,

224
00:07:22,500 --> 00:07:23,700
and minimizing distance that should

225
00:07:23,880 --> 00:07:25,670
give you the same value of Ci,

226
00:07:25,830 --> 00:07:26,670
but we usually put in the

227
00:07:26,750 --> 00:07:28,120
square there, just as the

228
00:07:28,430 --> 00:07:31,020
convention that people use for K means.

229
00:07:31,170 --> 00:07:32,320
So that was the cluster assignment step.

230
00:07:33,480 --> 00:07:34,750
The other in the loop

231
00:07:34,980 --> 00:07:37,740
of K means does the move centroid step.

232
00:07:40,540 --> 00:07:41,750
And what that does is for

233
00:07:42,160 --> 00:07:43,460
each of my cluster centroids, so

234
00:07:43,550 --> 00:07:44,740
for lower case k equals 1 through

235
00:07:44,870 --> 00:07:46,190
K, it sets Mu-k equals

236
00:07:46,710 --> 00:07:48,460
to the average of the points assigned to cluster.

237
00:07:49,270 --> 00:07:50,720
So as a concrete example, let's say

238
00:07:50,910 --> 00:07:52,100
that one of my cluster

239
00:07:52,340 --> 00:07:53,420
centroids, let's say cluster centroid

240
00:07:53,750 --> 00:07:55,030
two, has training examples,

241
00:07:55,820 --> 00:08:02,390
you know, 1, 5, 6, and 10 assigned to it.

242
00:08:03,220 --> 00:08:04,270
And what this means is,

243
00:08:04,470 --> 00:08:05,510
really this means that C1 equals

244
00:08:06,560 --> 00:08:09,180
to C5 equals to

245
00:08:10,690 --> 00:08:12,180
C6 equals to and

246
00:08:12,300 --> 00:08:13,730
similarly well c10 equals, too, right?

247
00:08:14,970 --> 00:08:17,070
If we got that

248
00:08:17,160 --> 00:08:18,940
from the cluster assignment step, then

249
00:08:19,190 --> 00:08:21,250
that means examples 1,5,6 and

250
00:08:21,450 --> 00:08:22,960
10 were assigned to the cluster centroid two.

251
00:08:24,020 --> 00:08:25,210
Then in this move centroid step,

252
00:08:25,540 --> 00:08:26,580
what I'm going to do is just

253
00:08:27,180 --> 00:08:29,290
compute the average of these four things.

254
00:08:31,340 --> 00:08:33,950
So X1 plus X5 plus X6

255
00:08:34,270 --> 00:08:35,620
plus X10.

256
00:08:35,890 --> 00:08:37,190
And now I'm going

257
00:08:37,380 --> 00:08:38,630
to average them so here I

258
00:08:38,920 --> 00:08:40,020
have four points assigned to

259
00:08:40,100 --> 00:08:41,700
this cluster centroid, just take

260
00:08:42,280 --> 00:08:43,240
one quarter of that.

261
00:08:43,980 --> 00:08:45,890
And now Mu2 is going to

262
00:08:46,100 --> 00:08:47,910
be an n-dimensional vector.

263
00:08:48,420 --> 00:08:49,480
Because each of these

264
00:08:49,700 --> 00:08:51,050
example x1, x5, x6, x10

265
00:08:52,160 --> 00:08:53,170
each of them were an n-dimensional

266
00:08:53,700 --> 00:08:55,150
vector, and I'm going to

267
00:08:55,240 --> 00:08:56,270
add up these things and, you

268
00:08:56,550 --> 00:08:57,870
know, divide by four because I

269
00:08:57,940 --> 00:08:59,320
have four points assigned to this

270
00:08:59,490 --> 00:09:00,730
cluster centroid, I end up

271
00:09:01,280 --> 00:09:02,770
with my move centroid step,

272
00:09:03,870 --> 00:09:05,260
for my cluster centroid mu-2.

273
00:09:05,870 --> 00:09:06,850
This has the effect of moving

274
00:09:07,210 --> 00:09:08,950
mu-2 to the average of

275
00:09:09,130 --> 00:09:10,620
the four points listed here.

276
00:09:12,430 --> 00:09:13,850
One thing that I've asked is, well here we said, let's

277
00:09:14,080 --> 00:09:16,600
let mu-k be the average of the points assigned to the cluster.

278
00:09:17,500 --> 00:09:18,900
But what if there is

279
00:09:18,960 --> 00:09:21,310
a cluster centroid no points

280
00:09:21,690 --> 00:09:23,000
with zero points assigned to it.

281
00:09:23,280 --> 00:09:24,300
In that case the more common

282
00:09:24,650 --> 00:09:25,720
thing to do is to just

283
00:09:26,140 --> 00:09:27,220
eliminate that cluster centroid.

284
00:09:27,830 --> 00:09:28,630
And if you do that, you end

285
00:09:28,840 --> 00:09:30,260
up with K minus one clusters

286
00:09:31,350 --> 00:09:33,840
instead of k clusters.

287
00:09:34,400 --> 00:09:35,620
Sometimes if you really need

288
00:09:35,830 --> 00:09:37,380
k clusters, then the other

289
00:09:37,490 --> 00:09:38,220
thing you can do if you

290
00:09:38,290 --> 00:09:39,530
have a cluster centroid with no

291
00:09:39,740 --> 00:09:41,170
points assigned to it is you can

292
00:09:41,260 --> 00:09:42,590
just randomly reinitialize that cluster

293
00:09:43,450 --> 00:09:44,870
centroid, but it's more

294
00:09:45,170 --> 00:09:46,590
common to just eliminate a

295
00:09:46,670 --> 00:09:48,210
cluster if somewhere during

296
00:09:48,410 --> 00:09:49,690
K means it with no points

297
00:09:50,290 --> 00:09:52,020
assigned to that cluster centroid, and

298
00:09:52,140 --> 00:09:53,340
that can happen, altthough in practice

299
00:09:53,820 --> 00:09:55,590
it happens not that often.

300
00:09:55,810 --> 00:09:57,280
So that's the K means Algorithm.

301
00:09:59,330 --> 00:10:00,220
Before wrapping up this video

302
00:10:00,620 --> 00:10:01,290
I just want to tell you

303
00:10:01,350 --> 00:10:02,710
about one other common application

304
00:10:03,350 --> 00:10:04,680
of K Means and that's

305
00:10:04,920 --> 00:10:06,840
to the problems with non well separated clusters.

306
00:10:08,160 --> 00:10:08,640
Here's what I mean.

307
00:10:09,280 --> 00:10:10,320
So far we've been picturing K Means

308
00:10:10,950 --> 00:10:12,090
and applying it to data

309
00:10:12,330 --> 00:10:13,520
sets like that shown here where

310
00:10:14,150 --> 00:10:15,590
we have three pretty

311
00:10:15,900 --> 00:10:17,380
well separated clusters, and we'd

312
00:10:17,670 --> 00:10:19,930
like an algorithm to find maybe the 3 clusters for us.

313
00:10:20,750 --> 00:10:21,840
But it turns out that

314
00:10:21,980 --> 00:10:23,180
very often K Means is also

315
00:10:23,600 --> 00:10:24,860
applied to data sets that

316
00:10:25,170 --> 00:10:26,240
look like this where there may

317
00:10:26,330 --> 00:10:28,300
not be several very

318
00:10:28,550 --> 00:10:30,250
well separated clusters.

319
00:10:30,830 --> 00:10:32,960
Here is an example application, to t-shirt sizing.

320
00:10:34,070 --> 00:10:34,650
Let's say you are a t-shirt

321
00:10:35,270 --> 00:10:37,360
manufacturer you've done is you've gone

322
00:10:38,030 --> 00:10:39,310
to the population that you

323
00:10:39,380 --> 00:10:40,520
want to sell t-shirts to, and

324
00:10:40,800 --> 00:10:42,190
you've collected a number of

325
00:10:42,580 --> 00:10:43,770
examples of the height and weight

326
00:10:44,270 --> 00:10:45,350
of these people in your

327
00:10:45,680 --> 00:10:46,740
population and so, well I

328
00:10:47,220 --> 00:10:48,280
guess height and weight tend to

329
00:10:48,370 --> 00:10:50,310
be positively highlighted so maybe

330
00:10:50,540 --> 00:10:51,160
you end up with a data

331
00:10:51,430 --> 00:10:52,590
set like this, you know, with

332
00:10:52,830 --> 00:10:53,910
a sample or set of

333
00:10:53,980 --> 00:10:56,000
examples of different peoples heights and weight.

334
00:10:56,530 --> 00:10:57,880
Let's say you want to size your t shirts.

335
00:10:58,570 --> 00:10:59,810
Let's say I want to design

336
00:11:00,330 --> 00:11:01,480
and sell t shirts of three

337
00:11:01,660 --> 00:11:03,970
sizes, small, medium and large.

338
00:11:04,660 --> 00:11:06,420
So how big should I make my small one?

339
00:11:06,550 --> 00:11:07,320
How big should I my medium?

340
00:11:07,690 --> 00:11:09,300
And how big should I make my large t-shirts.

341
00:11:10,370 --> 00:11:11,290
One way to do that would

342
00:11:11,410 --> 00:11:12,050
to be to run my k means

343
00:11:12,270 --> 00:11:13,570
clustering logarithm on this data

344
00:11:13,830 --> 00:11:14,640
set that I have shown on

345
00:11:14,820 --> 00:11:16,570
the right and maybe what

346
00:11:16,770 --> 00:11:18,270
K Means will do is group

347
00:11:18,600 --> 00:11:20,460
all of these points into one

348
00:11:20,660 --> 00:11:22,120
cluster and group all

349
00:11:22,340 --> 00:11:24,150
of these points into a

350
00:11:24,190 --> 00:11:25,530
second cluster and group

351
00:11:25,740 --> 00:11:28,080
all of those points into a third cluster.

352
00:11:28,520 --> 00:11:29,870
So, even though the data, you

353
00:11:30,020 --> 00:11:30,960
know, before hand it didn't

354
00:11:31,060 --> 00:11:31,990
seem like we had 3

355
00:11:32,050 --> 00:11:33,920
well separated clusters, K Means will

356
00:11:34,050 --> 00:11:36,870
kind of separate out the data

357
00:11:37,320 --> 00:11:38,560
into multiple pluses for you.

358
00:11:39,270 --> 00:11:40,220
And what you can do is then

359
00:11:40,420 --> 00:11:42,010
look at this first population of

360
00:11:42,130 --> 00:11:44,340
people and look at

361
00:11:44,500 --> 00:11:45,590
them and, you know, look

362
00:11:45,780 --> 00:11:46,810
at the height and weight, and

363
00:11:46,900 --> 00:11:47,900
try to design a small

364
00:11:48,350 --> 00:11:49,540
t-shirt so that it kind

365
00:11:49,710 --> 00:11:51,160
of fits this first population of

366
00:11:51,310 --> 00:11:52,830
people well and then design

367
00:11:53,150 --> 00:11:55,800
a medium t-shirt and design a large t-shirt.

368
00:11:56,510 --> 00:11:57,860
And this is in fact kind of

369
00:11:57,990 --> 00:11:59,740
an example of market segmentation

370
00:12:01,140 --> 00:12:02,800
where you're using K Means to separate your

371
00:12:02,940 --> 00:12:04,320
market into 3 different segments.

372
00:12:05,220 --> 00:12:06,500
So you can design a product

373
00:12:07,000 --> 00:12:08,260
separately that is a small, medium, and large t-shirts,

374
00:12:09,880 --> 00:12:11,230
that tries to suit

375
00:12:11,650 --> 00:12:12,770
the needs of each of your

376
00:12:12,920 --> 00:12:15,310
3 separate sub-populations well.

377
00:12:16,220 --> 00:12:17,570
So that's the K Means algorithm.

378
00:12:18,240 --> 00:12:19,080
And by now you should

379
00:12:19,300 --> 00:12:20,500
know how to implement the K

380
00:12:20,630 --> 00:12:22,510
Means Algorithm and kind of get it to work for some problems.

381
00:12:23,420 --> 00:12:24,380
But in the next few videos

382
00:12:24,860 --> 00:12:26,430
what I want to do is

383
00:12:26,580 --> 00:12:27,690
really get more deeply into the nuts

384
00:12:28,010 --> 00:12:29,210
and bolts of K means and

385
00:12:29,510 --> 00:12:32,080
to talk a bit about how to actually get this to work really well.