[MUSIC] Well, let's start talking about clustering algorithms, and to begin with, let's talk about the bread-and-butter clustering algorithm called k-means.

The k-means algorithm assumes the simple definition of score we described before, where you just measure the distance between an observation and a cluster center as the way of associating data points with specific clusters.

In particular here, these little gray boxes are going to be our data points to cluster, so this might be data point x_i. The first step of the algorithm is to initialize our cluster centers. These circles represent our cluster centers, and we're going to denote them with the Greek letter mu, so we have mu_1, mu_2, and mu_3. Here we're looking at an example with k = 3 clusters, so we have three cluster centers.

Having defined those cluster centers, the next step of the algorithm is to assign observations to clusters based on the minimum distance to a cluster center. In equations, that's written as

    z_i = arg min_j ||mu_j - x_i||^2

where mu_j is our jth cluster center, x_i is our observation, and z_i is the cluster label for observation i. Remember, this is going to be our inferred cluster label for the ith observation, whereas in supervised learning we assumed we were given a label y_i.

What we're doing here with this function is this thing called arg min, so let's describe what arg min does. Arg min returns the index j of the cluster whose center is closest to observation x_i. This is in contrast to min: if we think about minimizing a function, min returns the value of the objective, which here would be the minimum value of this distance calculation. So just to be clear, by writing arg min we're asking, as we search over all cluster centers mu_j (mu_1, mu_2, mu_3) with the ith observation fixed and the cluster center varying, which index gave the smallest distance. Had we just written min, we would return the minimum distance itself, rather than the index of the cluster that achieved that minimum distance.
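To make the assignment step concrete, here is a minimal sketch in Python with NumPy. The names assign_clusters, data, and centers are just illustrative choices, not anything defined in the lecture; it assumes the observations are rows of one array and the cluster centers are rows of another.

```python
import numpy as np

def assign_clusters(data, centers):
    """Assign each row of `data` to the index of the nearest row of `centers`.

    This is the arg min step: for each observation x_i we return the index j
    minimizing ||mu_j - x_i||^2, not the minimum distance value itself.
    """
    # Squared Euclidean distance from every point to every center,
    # giving an array of shape (num_points, num_centers).
    diffs = data[:, np.newaxis, :] - centers[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # argmin returns the index of the closest center; using min/np.min here
    # would instead return the smallest distance, as discussed above.
    return np.argmin(sq_dists, axis=1)
```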
Okay, it's important to explain this equation, because we're going to see it throughout the rest of the module.

So the point here is that we're computing distances to cluster centers, and all of these colored points were ones where the distance to... okay, this blue is not visible at all, let me switch colors just for this one. The distances to these cluster centers were smaller than had we computed distances to any of the other cluster centers, and that's how the points got assigned to those clusters.

This partitioning of the 2D space into differently colored regions is called a Voronoi tessellation. What it represents is that if any point falls into, for example, this blue region here, then its distance to this cluster center is smaller than its distance to any of the other cluster centers. So this is the region in which any observation will get assigned to the blue cluster, any observation here will get assigned to the green cluster, and finally, any observation here will get assigned to the red cluster. But I want to emphasize that this is for visualization only. You don't need to compute it to do k-means.

Okay, so far we've initialized our cluster centers, and we've assigned each of our observations to whichever cluster center it was closest to. The next step in the k-means algorithm is to update the positions of the cluster centers based on the points that were assigned to them. In particular, we're going to compute the mean of the data points in each cluster, or equivalently, we can think of this as the center of mass of those data points, and that becomes our new cluster center. In equations, the new center for the jth cluster is

    mu_j = (1 / n_j) * sum_{i : z_i = j} x_i

where n_j is the number of observations in cluster j, and the sum runs over all observations i such that z_i = j, which means observation i is in cluster j. So just to make sure this is clear: if we look at this sum, how many terms appear in it? n_j terms. That's the size of the set, the number of observations whose cluster indicator is j.
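Here is a matching sketch of this update step, continuing with the illustrative names from before. It assumes every cluster has at least one assigned point (n_j > 0); handling empty clusters is a separate design choice not covered here.

```python
def update_centers(data, assignments, k):
    """Recompute each center as the mean (center of mass) of its points.

    Implements mu_j = (1 / n_j) * sum of x_i over all i with z_i == j.
    Assumes each cluster j has n_j > 0 assigned points.
    """
    new_centers = np.empty((k, data.shape[1]))
    for j in range(k):
        # The n_j rows of `data` whose cluster indicator z_i equals j.
        members = data[assignments == j]
        # Summing those rows and dividing by n_j is exactly the mean.
        new_centers[j] = members.mean(axis=0)
    return new_centers
```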
Okay, so this equation just says we sum over all our observations in cluster j, and then we divide by the total number of observations in the cluster, and that gives us the average, or the center of mass, of that cluster.

Finally, the last stage of k-means (it won't actually be a single last step) is to repeat the two steps we just did. So we're going to keep iterating between assigning observations to cluster centers and updating the cluster centers, going on and on and on until convergence, and we're going to talk about the notion of convergence a little bit later. [MUSIC]
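Putting the two steps together, a sketch of the full iteration, using the assign_clusters and update_centers functions from the earlier sketches, might look like the following. Stopping when the assignments no longer change is one common convergence criterion (the lecture returns to convergence later), and initializing by sampling k data points at random is just one simple choice among many.

```python
def kmeans(data, k, max_iter=100, seed=0):
    """Plain k-means sketch: initialize, then alternate the assignment and
    update steps until the assignments stop changing or max_iter is hit."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random
    # (one simple, commonly used initialization; not the only option).
    centers = data[rng.choice(len(data), size=k, replace=False)]
    assignments = None
    for _ in range(max_iter):
        new_assignments = assign_clusters(data, centers)
        if assignments is not None and np.array_equal(assignments, new_assignments):
            break  # converged: no observation changed cluster
        assignments = new_assignments
        centers = update_centers(data, assignments, k)
    return centers, assignments
```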