1 00:00:00,090 --> 00:00:01,540 Most of the supervised learning algorithms 2 00:00:01,690 --> 00:00:02,890 we've seen, things like linear 3 00:00:03,130 --> 00:00:04,730 regression, logistic regression and 4 00:00:04,930 --> 00:00:05,850 so on. All of those 5 00:00:06,300 --> 00:00:08,089 algorithms have an optimization objective 6 00:00:08,670 --> 00:00:10,920 or some cost function that the algorithm was trying to minimize. 7 00:00:11,920 --> 00:00:13,180 It turns out that K-means also 8 00:00:13,770 --> 00:00:15,730 has an optimization objective or 9 00:00:15,870 --> 00:00:18,720 a cost function that is trying to minimize. 10 00:00:19,630 --> 00:00:20,180 And in this video, I'd like to tell 11 00:00:20,230 --> 00:00:23,620 you what that optimization objective is. 12 00:00:23,730 --> 00:00:24,420 And the reason I want to do so 13 00:00:24,750 --> 00:00:26,960 is because this will be useful to us for two purposes. 14 00:00:28,020 --> 00:00:29,330 First, knowing what is the 15 00:00:29,480 --> 00:00:30,890 optimization objective of K-means 16 00:00:31,150 --> 00:00:32,390 will help us to 17 00:00:32,690 --> 00:00:33,970 debug the learning algorithm and 18 00:00:34,070 --> 00:00:35,080 just make sure that K-means is 19 00:00:35,300 --> 00:00:37,100 running correctly, and second, 20 00:00:37,610 --> 00:00:39,290 and perhaps even more importantly, in 21 00:00:39,530 --> 00:00:41,290 a later video we'll talk 22 00:00:41,490 --> 00:00:42,580 about how we can use this to 23 00:00:42,730 --> 00:00:44,000 help K-means find better clusters 24 00:00:44,070 --> 00:00:46,290 and avoid local optima, but we'll do that in a later video that follows this one. 25 00:00:46,410 --> 00:00:47,330 Just as a quick reminder, while K-means is 26 00:00:49,680 --> 00:00:52,870 running we're going to be 27 00:00:54,450 --> 00:00:55,820 keeping track of two sets of variables. 28 00:00:56,430 --> 00:00:58,390 First is the CI's and 29 00:00:58,700 --> 00:00:59,830 that keeps track of the 30 00:01:00,190 --> 00:01:01,600 index or the number of the cluster 31 00:01:02,730 --> 00:01:05,040 to which an example x(i) is currently assigned. 32 00:01:05,230 --> 00:01:05,960 And then, the other set of variables 33 00:01:06,540 --> 00:01:07,580 we use as Mu subscript 34 00:01:08,120 --> 00:01:09,410 K, which is the location 35 00:01:10,140 --> 00:01:12,110 of cluster centroid K. And, 36 00:01:12,380 --> 00:01:13,750 again, for K-means 37 00:01:14,030 --> 00:01:17,230 we use capital K to denote the total number of clusters. 38 00:01:17,890 --> 00:01:19,310 And here lower case K, 39 00:01:20,010 --> 00:01:20,910 you know, is going to be an 40 00:01:21,040 --> 00:01:22,650 index into the cluster 41 00:01:22,970 --> 00:01:23,930 centroids, and so lower 42 00:01:24,030 --> 00:01:24,940 case k is going to be 43 00:01:25,140 --> 00:01:26,390 a number between 1 and 44 00:01:26,600 --> 00:01:29,630 capital K. Now, here's 45 00:01:29,840 --> 00:01:31,040 one more bit of notation which 46 00:01:31,270 --> 00:01:32,280 is going to use Mu 47 00:01:32,630 --> 00:01:34,560 subscript c(i) to denote 48 00:01:34,970 --> 00:01:36,660 the cluster centroid of the 49 00:01:36,780 --> 00:01:38,400 cluster to which example x(i) 50 00:01:38,880 --> 00:01:40,500 has been assigned and 51 00:01:40,710 --> 00:01:42,030 to explain that notation 52 00:01:42,450 --> 00:01:43,450 a little bit more, let's 53 00:01:43,660 --> 00:01:45,600 say that x(i) has been 54 00:01:45,740 --> 00:01:47,760 assigned to cluster number five. 55 00:01:48,880 --> 00:01:49,830 What that means is that c(i), 56 00:01:50,850 --> 00:01:52,290 that is the index of x(i), 57 00:01:53,130 --> 00:01:54,300 that that is equal to 5. 58 00:01:54,420 --> 00:01:57,640 Right? Because you know, having c(i) equals 5, 59 00:01:57,800 --> 00:01:59,270 that's what it means for the 60 00:02:00,500 --> 00:02:01,720 example x(i) to be 61 00:02:01,910 --> 00:02:03,440 assigned to cluster number 5. 62 00:02:03,510 --> 00:02:05,700 And so Mu subscript 63 00:02:06,290 --> 00:02:07,960 c(i) is going to 64 00:02:08,100 --> 00:02:09,630 be equal to Mu subscript 65 00:02:10,080 --> 00:02:12,260 5 because c(i) is equal 66 00:02:13,700 --> 00:02:14,100 to 5. 67 00:02:15,100 --> 00:02:16,570 This Mu substitute c(i) is the 68 00:02:16,660 --> 00:02:18,420 cluster centroid of cluster number 69 00:02:18,730 --> 00:02:19,670 5, which is the cluster 70 00:02:20,120 --> 00:02:22,480 to which my example x(i) has been assigned. 71 00:02:23,470 --> 00:02:24,730 With this notation, we're now 72 00:02:24,960 --> 00:02:26,040 ready to write out what 73 00:02:26,200 --> 00:02:28,150 is the optimization objective of 74 00:02:28,290 --> 00:02:30,360 the K Mu clustering algorithm. 75 00:02:30,760 --> 00:02:30,800 And here it is. 76 00:02:31,330 --> 00:02:32,940 The cost function that K-means 77 00:02:33,040 --> 00:02:34,380 is minimizing is the 78 00:02:34,570 --> 00:02:35,770 function J of all of 79 00:02:35,880 --> 00:02:37,470 these parameters c1 through 80 00:02:37,890 --> 00:02:39,610 cM, Mu1 through MuK, that 81 00:02:39,790 --> 00:02:41,570 K-means is varying as the algorithm runs. 82 00:02:42,100 --> 00:02:43,930 And the optimization objective is shown 83 00:02:44,160 --> 00:02:45,520 on the right, is the average of 84 00:02:45,610 --> 00:02:46,430 one over M of the sum 85 00:02:46,620 --> 00:02:48,730 of i equals one through M of this term here 86 00:02:50,400 --> 00:02:52,670 that I've just drawn the red box around. 87 00:02:52,870 --> 00:02:54,680 The squared distance between 88 00:02:55,160 --> 00:02:57,540 each example x(i) and the 89 00:02:57,690 --> 00:02:58,740 location of the cluster 90 00:02:59,130 --> 00:03:00,210 centroid to which x(i) 91 00:03:01,320 --> 00:03:01,920 has been assigned. 92 00:03:03,240 --> 00:03:06,070 So let me just draw this in, let me explain this. 93 00:03:06,240 --> 00:03:07,800 Here is the location of training 94 00:03:08,190 --> 00:03:09,780 example x(i), and here's the location 95 00:03:10,410 --> 00:03:11,760 of the cluster centroid to which 96 00:03:11,970 --> 00:03:13,660 example x(i) has been assigned. 97 00:03:14,560 --> 00:03:17,080 So to explain this in pictures, if here is X1, X2. 98 00:03:17,420 --> 00:03:19,540 And if a point 99 00:03:19,760 --> 00:03:21,210 here, is my example 100 00:03:22,080 --> 00:03:23,060 x(i), so if that 101 00:03:23,110 --> 00:03:24,840 is equal to my example x(i), 102 00:03:25,860 --> 00:03:27,000 and if x(i) has been assigned 103 00:03:27,240 --> 00:03:28,270 to some cluster centroid, and 104 00:03:28,340 --> 00:03:30,240 I'll denote my cluster centroid with a cross. 105 00:03:30,630 --> 00:03:32,130 So if that's the location of, 106 00:03:32,300 --> 00:03:33,830 you know, Mu 5, let's 107 00:03:34,370 --> 00:03:35,640 say, if x(i) has been 108 00:03:35,850 --> 00:03:37,960 assigned to cluster centroid 5 in my example up there. 109 00:03:38,810 --> 00:03:40,660 Then, the squared distance, that's 110 00:03:40,940 --> 00:03:41,840 the squared of the distance 111 00:03:43,810 --> 00:03:46,010 between the point x(i) and this 112 00:03:46,220 --> 00:03:48,400 cluster centroid, to which x(i) has been assigned. 113 00:03:49,570 --> 00:03:50,720 And what K-means can be shown 114 00:03:51,070 --> 00:03:52,540 to be doing is that, it 115 00:03:52,680 --> 00:03:54,480 is trying to find parameters c(i) 116 00:03:55,270 --> 00:03:57,410 and Mu(i), try to 117 00:03:57,570 --> 00:03:58,840 find cMU to try to 118 00:03:58,960 --> 00:04:00,450 minimize this cost function J. 119 00:04:01,440 --> 00:04:03,180 This cost function is sometimes 120 00:04:03,680 --> 00:04:06,770 also called the distortion cost 121 00:04:07,060 --> 00:04:10,030 function or the distortion of 122 00:04:10,240 --> 00:04:12,130 the K-means algorithm. 123 00:04:12,790 --> 00:04:13,360 And, just to provide a little bit more 124 00:04:13,630 --> 00:04:15,750 detail, here's the K-means algorithm, 125 00:04:15,820 --> 00:04:16,450 Here's exactly the algorithm as we have it, in the real 126 00:04:16,610 --> 00:04:17,960 form the earlier slide. 127 00:04:18,950 --> 00:04:20,200 And what this first step 128 00:04:21,030 --> 00:04:23,120 of this algorithm is, this was 129 00:04:23,830 --> 00:04:25,910 the cluster assignment step 130 00:04:27,920 --> 00:04:29,850 where we assign each 131 00:04:30,030 --> 00:04:32,910 point to the cluster centroid, and 132 00:04:33,010 --> 00:04:34,830 it's possible to show mathematically that 133 00:04:35,050 --> 00:04:36,210 what the cluster assignment step 134 00:04:36,450 --> 00:04:38,560 is doing is exactly minimizing 135 00:04:40,770 --> 00:04:42,950 J with respect 136 00:04:43,420 --> 00:04:45,900 to the variables, C1, C2 137 00:04:46,380 --> 00:04:48,050 and so on, up 138 00:04:48,170 --> 00:04:52,030 to C(m), while holding the 139 00:04:52,480 --> 00:04:54,240 closest centroids, MU1 up to 140 00:04:54,720 --> 00:04:57,000 MUK fixed. 141 00:04:58,580 --> 00:04:59,640 So, what the first assignment step 142 00:04:59,900 --> 00:05:00,990 does is you know, it doesn't change 143 00:05:01,240 --> 00:05:02,850 the cluster centroids, but what it's 144 00:05:02,960 --> 00:05:05,730 doing is, exactly picking the values of C1, C2, up to CM 145 00:05:07,790 --> 00:05:10,240 that minimizes the cost 146 00:05:10,500 --> 00:05:11,790 function or the distortion function, 147 00:05:12,510 --> 00:05:14,440 J. And it's possible to prove 148 00:05:14,670 --> 00:05:16,550 that mathematically but, I won't do so here. 149 00:05:17,170 --> 00:05:18,210 That has a pretty intuitive meaning 150 00:05:18,610 --> 00:05:19,630 of just, yo know, well let's assign 151 00:05:20,090 --> 00:05:21,040 these points to the cluster centroid 152 00:05:21,530 --> 00:05:22,820 that is closest to it, because 153 00:05:23,120 --> 00:05:24,160 that's what minimizes the square 154 00:05:24,660 --> 00:05:26,860 of distance between the points and the corresponding cluster centroid. 155 00:05:27,840 --> 00:05:29,090 And then the other part of 156 00:05:29,790 --> 00:05:32,880 the second step of K-means, this second step over here. 157 00:05:33,960 --> 00:05:35,480 This second step was the move 158 00:05:35,690 --> 00:05:38,770 centroid step and, 159 00:05:39,000 --> 00:05:40,020 once again, I won't prove it, 160 00:05:40,510 --> 00:05:41,250 but it can be shown 161 00:05:41,520 --> 00:05:42,590 mathematically, that what the 162 00:05:43,140 --> 00:05:44,910 root centroid step does, is 163 00:05:45,150 --> 00:05:46,740 it chooses the values 164 00:05:47,260 --> 00:05:49,370 of mu that minimizes J. 165 00:05:50,150 --> 00:05:51,270 So it minimizes the cost function 166 00:05:51,650 --> 00:05:53,000 J with respect to, 167 00:05:53,380 --> 00:05:54,710 where wrt is my 168 00:05:54,920 --> 00:05:56,930 abbreviation for with respect to. 169 00:05:57,030 --> 00:05:58,380 But it minimizes J with respect 170 00:05:58,790 --> 00:06:01,930 to the locations of the cluster centroids, Mu1 through MuK. 171 00:06:02,040 --> 00:06:05,690 So, what K-means really 172 00:06:05,790 --> 00:06:06,910 is doing is it's taking the 173 00:06:07,010 --> 00:06:08,380 two sets of variables and 174 00:06:09,070 --> 00:06:11,210 partitioning them into two halves right here. 175 00:06:11,550 --> 00:06:14,490 First the C set of variables and then you have the Mu sets of variables. 176 00:06:15,450 --> 00:06:15,990 And what it does is it first 177 00:06:16,560 --> 00:06:17,750 minimizes J with respect 178 00:06:18,050 --> 00:06:19,350 to the variable C, and then minimizes 179 00:06:19,700 --> 00:06:20,610 J with respect the variables 180 00:06:21,120 --> 00:06:22,590 Mu, and then it keeps on iterating. 181 00:06:25,180 --> 00:06:26,680 And so that's all that K-means does. 182 00:06:27,700 --> 00:06:28,570 And, now that we understand 183 00:06:29,150 --> 00:06:30,870 K-means, let's try to 184 00:06:31,030 --> 00:06:32,190 minimize this cost function J. We 185 00:06:32,430 --> 00:06:33,640 can also use this to 186 00:06:33,800 --> 00:06:34,890 try to debug our learning 187 00:06:35,090 --> 00:06:36,350 algorithm and just kind 188 00:06:36,520 --> 00:06:37,980 of make sure that our implementation 189 00:06:38,900 --> 00:06:39,950 of K-means is running correctly. 190 00:06:41,220 --> 00:06:42,560 So, we now understand the 191 00:06:43,070 --> 00:06:44,260 K-means algorithm as trying to 192 00:06:44,610 --> 00:06:45,960 optimize this cost function J, 193 00:06:46,640 --> 00:06:48,790 which is also called the dispulsion function. 194 00:06:50,650 --> 00:06:51,600 We can use that to debug K-means 195 00:06:52,090 --> 00:06:53,060 and help me show that K-means 196 00:06:53,130 --> 00:06:54,050 is converging, and that it's 197 00:06:54,510 --> 00:06:56,160 running properly, and in the 198 00:06:56,240 --> 00:06:57,460 next video, we'll also see 199 00:06:57,690 --> 00:06:59,040 how we can us this to 200 00:06:59,120 --> 00:07:00,650 help K-means find better clusters 201 00:07:01,300 --> 00:07:03,240 and help K-means to avoid local optima.