1 00:00:00,000 --> 00:00:07,193 [MUSIC] 2 00:00:07,193 --> 00:00:13,547 Well let's look at an algorithm for doing clustering that uses this 3 00:00:13,547 --> 00:00:20,290 metric of just looking at the distance to the cluster center. 4 00:00:20,290 --> 00:00:23,730 Okay, so here, we see the data that we're gonna wanna cluster. 5 00:00:24,860 --> 00:00:30,769 And this algorithm, which is called the k-means algorithm, 6 00:00:30,769 --> 00:00:36,466 starts by assuming that you are gonna end up with k clusters. 7 00:00:36,466 --> 00:00:39,150 So you specify the number of clusters ahead of time. 8 00:00:39,150 --> 00:00:43,640 So the reason the algorithm is called k-means is we have k clusters, and 9 00:00:43,640 --> 00:00:47,370 we're looking at the means of the clusters, just the cluster centers, 10 00:00:47,370 --> 00:00:50,750 when we're assigning points to the different clusters. 11 00:00:50,750 --> 00:00:54,180 Okay, so let's talk about how we initialize the algorithm. 12 00:00:54,180 --> 00:00:56,670 So the first thing we do is we just one example. 13 00:00:56,670 --> 00:01:00,710 There are many ways to initialize where we put the cluster centers. 14 00:01:00,710 --> 00:01:04,620 We will talk about these things more in-depth later. 15 00:01:04,620 --> 00:01:05,410 But for now, 16 00:01:05,410 --> 00:01:09,460 let's just assume that we randomly assigned three different cluster centers. 17 00:01:09,460 --> 00:01:14,210 If we're going to assume some, three means algorithm here. 18 00:01:14,210 --> 00:01:18,340 And then the first step is we're going to assign every observation 19 00:01:18,340 --> 00:01:19,980 to the closest cluster center. 20 00:01:19,980 --> 00:01:23,510 So this observation here gets assigned to the red cluster. 21 00:01:24,850 --> 00:01:32,730 These observations are all closest to the green cluster, or this green center here. 22 00:01:32,730 --> 00:01:36,700 And these are all closest to the blue center. 23 00:01:38,140 --> 00:01:42,680 And a way to do this is something that's just called a Voronoi tessellation. 24 00:01:42,680 --> 00:01:48,092 So we look at the cluster centers and well, we can do our define, 25 00:01:48,092 --> 00:01:51,716 let me switch again to this magenta color. 26 00:01:51,716 --> 00:01:55,160 So here the cluster centers. 27 00:01:55,160 --> 00:02:02,730 And we can just define these regions here. 28 00:02:02,730 --> 00:02:07,998 And these regions represent areas where any observation we might get, 29 00:02:07,998 --> 00:02:12,920 so any new observation, that's a very bad color on red. 30 00:02:12,920 --> 00:02:14,659 What other colors do I have? 31 00:02:14,659 --> 00:02:16,385 How about white? 32 00:02:16,385 --> 00:02:19,590 Cool, white. 33 00:02:19,590 --> 00:02:24,530 So some new observation I get here, if it falls within this red region, 34 00:02:24,530 --> 00:02:27,350 I know it's closest to this red cluster center. 35 00:02:27,350 --> 00:02:30,285 So that's what these colored regions are representing. 36 00:02:32,328 --> 00:02:34,700 Okay so, that's the first step of the algorithm. 37 00:02:34,700 --> 00:02:41,670 So, what I end up with are observations that are assigned to clusters. 38 00:02:41,670 --> 00:02:44,980 But I just randomly initialize those clusters centers, so I probably don't 39 00:02:44,980 --> 00:02:47,870 believe that that really represents the structure underlying the data. 40 00:02:47,870 --> 00:02:53,930 So what I wanna do is I wanna iterate this process, where I then wanna update what my 41 00:02:53,930 --> 00:02:58,530 definition is of the cluster center based on the observations that I've assigned. 42 00:02:58,530 --> 00:03:02,950 So if you remember this red cluster here, 43 00:03:02,950 --> 00:03:05,600 it just had one observation assigned to it. 44 00:03:05,600 --> 00:03:09,880 So when I go to revise the cluster center for that cluster, 45 00:03:09,880 --> 00:03:14,480 it just moves to the previous observation. 46 00:03:14,480 --> 00:03:19,662 But for this green cluster, if I look at the previous cluster center here, 47 00:03:19,662 --> 00:03:23,033 well I'm gonna move it to the center of mass of all 48 00:03:23,033 --> 00:03:27,663 these observations that have been assigned to the green cluster. 49 00:03:27,663 --> 00:03:29,793 And the center of mass of all of them is here, 50 00:03:29,793 --> 00:03:32,480 so this becomes the new cluster center. 51 00:03:32,480 --> 00:03:35,175 And likewise, I do this for all of the blue observations. 52 00:03:35,175 --> 00:03:38,790 This is the new cluster center for this blue observation. 53 00:03:40,140 --> 00:03:43,696 Okay, so now I have a new set of cluster centers. 54 00:03:43,696 --> 00:03:48,204 And what I can do is redraw this Voronoi tessellation, and 55 00:03:48,204 --> 00:03:53,310 reassign my observations to the nearest cluster center. 56 00:03:53,310 --> 00:03:56,223 And then I iterate this process until convergence. 57 00:03:56,223 --> 00:03:56,723 [MUSIC]