In the last video, we started to talk about the kernels idea and how it can be used to define new features for the support vector machine. In this video, I'd like to fill in some of the missing details and also say a few words about how to use these ideas in practice, such as how they pertain to the bias-variance trade-off in support vector machines.

In the last video, I talked about the process of picking a few landmarks, l(1), l(2), l(3), and that allowed us to define the similarity function, also called the kernel; in this example, this particular similarity function is a Gaussian kernel. And that allowed us to build this form of a hypothesis. But where do we get these landmarks from? Where do we get l(1), l(2), l(3) from? And it also seems that, for complex learning problems, we may want a lot more landmarks than just three that we choose by hand.

So in practice, this is how the landmarks are chosen: given a machine learning problem, we have some data set of positive and negative examples, and the idea is that we're going to take the training examples and, for every training example we have, put a landmark at exactly the same location as that training example. So if I have one training example x(1), then I'm going to choose my first landmark to be at exactly the same location as my first training example. And if I have a different training example x(2), we're going to set the second landmark to be at the location of my second training example. On the figure on the right, I used red and blue dots just as illustration; the color of the dots is not significant.
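To make that similarity function concrete, here is a minimal sketch of a Gaussian kernel in Python with NumPy; the function name and the particular value of sigma are just for illustration.

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """Gaussian (RBF) similarity between an example x and a landmark l:
    exp(-||x - l||^2 / (2 * sigma^2)). It equals 1 when x sits exactly at
    the landmark and falls toward 0 as x moves away from it."""
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# The similarity is 1 at the landmark itself, and small far away from it.
print(gaussian_kernel([3.0, 5.0], [3.0, 5.0]))        # 1.0
print(gaussian_kernel([3.0, 5.0], [9.0, 1.0], 2.0))   # close to 0
```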
What I'm going to end up with using this method is m landmarks, l(1), l(2), down to l(m), if I have m training examples, with one landmark per location of each of my training examples. And this is nice because it is saying that my features are basically going to measure how close an example is to one of the things I saw in my training set. So, to write this out a little more concretely: given m training examples, I'm going to choose the locations of my landmarks to be exactly the locations of my m training examples.

When you are given an example x, and this x can be something in the training set, in the cross validation set, or in the test set, we are going to compute these features f1, f2, and so on, where l(1) is actually equal to x(1) and so on. These then give me a feature vector. So let me write f as the feature vector: I'm going to take f1, f2, and so on, down to fm, and just group them into a feature vector. And, just by convention, if we want, we can add an extra feature f0, which is always equal to 1. This plays a role similar to what we had previously for x0, which was our intercept term.

So, for example, if we have a training example (x(i), y(i)), the features we would compute for this training example are as follows: given x(i), we will map it to f1(i), which is the similarity between x(i) and l(1). I'm going to abbreviate similarity as "sim" instead of writing out the whole word.
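As a sketch of that feature computation, under the same assumptions as the snippet above, the code below maps a single example x to the feature vector f = (f0, f1, ..., fm), where the landmarks are simply the rows of the training matrix; the function names are illustrative only.

```python
import numpy as np

def compute_features(x, landmarks, sigma=1.0):
    """Map one example x to f = [f0, f1, ..., fm], where fi is the Gaussian
    similarity between x and the i-th landmark, and f0 = 1 is the intercept
    feature."""
    x = np.asarray(x, dtype=float)
    diffs = np.asarray(landmarks, dtype=float) - x       # shape (m, n)
    sq_dists = np.sum(diffs ** 2, axis=1)                # ||x - l(i)||^2 for each landmark
    f = np.exp(-sq_dists / (2.0 * sigma ** 2))           # f1 ... fm
    return np.concatenate(([1.0], f))                    # prepend f0 = 1

# Landmarks are placed exactly at the training examples.
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # m = 3 examples
f = compute_features([3.0, 4.0], X_train, sigma=1.0)
print(f)   # f0 = 1, and f2 = 1 because x equals the second landmark
```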
And f2(i) equals the similarity between x(i) and l(2), and so on, down to fm(i), which equals the similarity between x(i) and l(m). And somewhere in the middle of this list, at the i-th component, I will actually have one feature component, f subscript i of (i), which is the similarity between x(i) and l(i). But l(i) is equal to x(i), so fi(i) is just the similarity between x(i) and itself. And if you're using the Gaussian kernel, this is e to the minus zero over 2 sigma squared, so this will be equal to 1, and that's okay. So one of my features for this training example is going to be equal to 1.

And then, similar to what I have above, I can take all of these m features and group them into a feature vector. So instead of representing my example using x(i), which is an n-dimensional or (n+1)-dimensional vector depending on whether we include the intercept term, we can now instead represent the training example using this feature vector f. I'm going to write this as f superscript (i), which takes all of these things and stacks them into a vector: f1(i) down to fm(i), and, usually, we'll also add f0(i), where f0(i) is equal to 1. And so this vector gives me my new feature vector with which to represent my training example.

So given these kernels and similarity functions, here's how we use a support vector machine. If you already have a learned set of parameters theta, then given a value of x for which you want to make a prediction, what we do is compute the features f, which is now an (m+1)-dimensional feature vector.
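As a side sketch before the prediction step: since each training example x(i) gets its own feature vector f(i), you can compute them all at once as an m-by-(m+1) matrix. This again assumes the Gaussian kernel and the illustrative names used above.

```python
import numpy as np

def training_feature_matrix(X, sigma=1.0):
    """Compute F, where row i is the feature vector f(i) for training example
    x(i), using the training examples themselves as landmarks. F has shape
    (m, m + 1): column 0 is f0 = 1, and F[i, i + 1] = 1 because f_i(i) is the
    similarity between x(i) and itself."""
    X = np.asarray(X, dtype=float)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # (m, m)
    F = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), F])

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
F = training_feature_matrix(X_train, sigma=1.0)
print(F.shape)              # (3, 4): m = 3 examples, m + 1 features each
print(np.diag(F[:, 1:]))    # [1. 1. 1.] -- each example matches itself
```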
And we have m here because we have m training examples and thus m landmarks. What we do is predict 1 if theta transpose f is greater than or equal to 0. Theta transpose f, of course, is just theta 0 f0 plus theta 1 f1 plus dot dot dot plus theta m fm. And so my parameter vector theta is also now going to be an (m+1)-dimensional vector, where m is the number of landmarks, which equals the training set size. So m was the training set size, and the parameter vector theta is going to be (m+1)-dimensional. That's how you make a prediction if you already have a setting for the parameters theta.

How do you get the parameters theta? Well, you do that using the SVM learning algorithm, and specifically you solve this minimization problem: you minimize, over the parameters theta, C times this cost function, which we had before. Only now, instead of making predictions using theta transpose x(i) with our original features x(i), we've taken the features x(i) and replaced them with the new features, so we are using theta transpose f(i) to make a prediction on the i-th training example; you can see that change in both places here. And it's by solving this minimization problem that you get the parameters for your support vector machine.

And one last detail: in this optimization problem we really have n equal to m features. That is, the effective number of features we have is the dimension of f, so n is actually going to be equal to m.
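Here is a rough sketch, under the same assumptions as above, of how prediction and the structure of that training objective look once the features are replaced by f. The cost1 and cost0 functions below are hinge-style stand-ins for the cost curves from the earlier SVM videos, just to show the shape of the objective, not something you would actually train with.

```python
import numpy as np

def predict(theta, f):
    """Predict 1 if theta^T f >= 0, else 0."""
    return 1 if np.dot(theta, f) >= 0 else 0

def svm_objective(theta, F, y, C):
    """Structure of the SVM-with-kernels objective:
    C * sum_i [ y_i * cost1(theta^T f(i)) + (1 - y_i) * cost0(theta^T f(i)) ]
      + 0.5 * sum_{j=1}^{m} theta_j^2        (theta_0 is not regularized)."""
    z = F @ theta                               # theta^T f(i) for every example i
    cost1 = np.maximum(0.0, 1.0 - z)            # hinge-style cost when y = 1
    cost0 = np.maximum(0.0, 1.0 + z)            # hinge-style cost when y = 0
    data_term = C * np.sum(y * cost1 + (1 - y) * cost0)
    reg_term = 0.5 * np.sum(theta[1:] ** 2)     # sum runs over j = 1 ... m
    return data_term + reg_term
```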
So, if you want, you can think of that regularization sum as really being a sum from j equals 1 through m. One way to think about this is that n is equal to m, because f is the new feature vector, so we have m plus 1 features, with the plus 1 coming from the intercept term. And here, we still sum from j equals 1 through m because, similar to our earlier videos on regularization, we still do not regularize the parameter theta 0, which is why this is a sum from j equals 1 through m instead of from j equals 0 through m. So that's the support vector machine learning algorithm.

There's one mathematical detail aside that I should mention, which is that in the way the support vector machine is implemented, this last term is actually done a little bit differently. You don't really need to know about this detail in order to use support vector machines, and in fact the equations written down here should give you all the intuitions you need. But in the way the support vector machine is implemented, that term, the sum over j of theta j squared, can also be written as theta transpose theta if we ignore the parameter theta 0; that is, taking theta 1 down to theta m and ignoring theta 0, the sum over j of theta j squared can be written as theta transpose theta. And what most support vector machine implementations do is replace this theta transpose theta with theta transpose times some matrix that depends on the kernel you use, times theta. And so this gives us a slightly different distance metric.
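Purely to illustrate that rewriting in code: the matrix below is just a placeholder, since the actual matrix an SVM package uses depends on the kernel and on implementation details this video deliberately leaves out.

```python
import numpy as np

theta = np.array([0.5, -1.2, 0.3, 2.0])   # theta_0, theta_1, ..., theta_m with m = 3

# sum_{j=1}^{m} theta_j^2 is theta' * theta with theta_0 ignored:
print(np.isclose(np.sum(theta[1:] ** 2), theta[1:] @ theta[1:]))   # True

# Implementations instead use theta' * M * theta for some kernel-dependent
# matrix M (here just a placeholder, not the matrix any real package uses).
M = np.eye(3)
print(theta[1:] @ M @ theta[1:])
```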
So we use a slightly different measure: instead of minimizing exactly the norm of theta squared, we minimize something slightly similar to it, a rescaled version of the parameter vector theta that depends on the kernel. But this is kind of a mathematical detail that allows the support vector machine software to run much more efficiently.

And the reason the support vector machine does this is that, with this modification, it can scale to much bigger training sets. For example, if you have a training set with 10,000 training examples, then, the way we define landmarks, we end up with 10,000 landmarks, and so theta becomes 10,000 dimensional. Maybe that still works, but when m becomes really, really big, say 50,000 or 100,000, then solving for all of these parameters can become expensive for the support vector machine optimization software, that is, for solving the minimization problem that I drew here.

So, as a mathematical detail, which again you really don't need to know about, the implementation actually modifies that last term a little bit to optimize something slightly different than just the norm squared of theta. If you want, you can feel free to think of this as an implementation detail that does change the objective a bit, but is done primarily for reasons of computational efficiency, so usually you don't really have to worry about it.

And by the way, in case you're wondering why we don't apply the kernels idea to other algorithms as well, like logistic regression, it turns out that, if you want, you can actually apply the kernels idea and define the same sort of features using landmarks and so on for logistic regression.
But the computational tricks that apply to support vector machines don't generalize well to other algorithms like logistic regression. So using kernels with logistic regression is going to be very slow, whereas, because of computational tricks like the one embodied in how that last term is modified, and the details of how the support vector machine software is implemented, support vector machines and kernels tend to go particularly well together. Logistic regression with kernels, you can do it, but it would run very slowly, and it won't be able to take advantage of the advanced optimization techniques that people have figured out for the particular case of running a support vector machine with a kernel.

But all of this pertains only to how you actually implement the software to minimize the cost function. I will say more about that in the next video, but you really don't need to know how to write software to minimize this cost function, because you can find very good off-the-shelf software for doing so. And just as I wouldn't recommend writing code to invert a matrix or to compute a square root, I do not recommend writing software to minimize this cost function yourself; instead, use the off-the-shelf software packages that people have developed, since those packages already embody these numerical optimization tricks, so you don't really have to worry about them.

But one other thing that is worth knowing is, when you're applying a support vector machine, how do you choose its parameters? The last thing I want to do in this video is say a little bit about the bias and variance trade-offs when using a support vector machine.
When using an SVM, one of the things you need to choose is the parameter C, which was in the optimization objective, and you'll recall that C plays a role similar to 1 over lambda, where lambda was the regularization parameter we had for logistic regression. So if you have a large value of C, this corresponds to what we had back in logistic regression with a small value of lambda, meaning not using much regularization, and you tend to get a hypothesis with lower bias and higher variance. Whereas if you use a smaller value of C, this corresponds to using logistic regression with a large value of lambda, and that corresponds to a hypothesis with higher bias and lower variance. So a hypothesis trained with large C has higher variance and is more prone to overfitting, whereas a hypothesis trained with small C has higher bias and is thus more prone to underfitting.

So this parameter C is one of the parameters we need to choose. The other one is the parameter sigma squared, which appears in the Gaussian kernel. So consider what happens in the similarity function, which was e to the minus the norm of x minus l, squared, over 2 sigma squared, when sigma squared is large.
In this example, if I have only one feature x1 and a landmark at some location, then if sigma squared is large, the Gaussian kernel tends to fall off relatively slowly, and so my feature f(i) is a smoother function that varies more slowly. This gives you a hypothesis with higher bias and lower variance, because with a Gaussian kernel that falls off smoothly you tend to get a hypothesis that varies slowly, or varies smoothly, as you change the input x. Whereas, in contrast, if sigma squared is small, then with that same landmark and my one feature x1, my Gaussian kernel, my similarity function, varies more abruptly. In both cases the value at the landmark itself is 1, but if sigma squared is small, my features vary less smoothly, with higher slopes and higher derivatives. Using this, you end up fitting hypotheses with lower bias, but you can have higher variance.

If you look at this week's programming exercise, you actually get to play around with some of these ideas yourself and see these effects yourself.

So, that was the support vector machine with kernels algorithm. And hopefully this discussion of bias and variance will give you some sense of how you can expect this algorithm to behave as well.
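As a closing practical note: when you use an off-the-shelf package rather than writing the optimizer yourself, these two knobs usually show up directly. Here is a minimal sketch with scikit-learn's SVC, assuming scikit-learn is installed; note that its gamma parameter corresponds to 1 / (2 * sigma squared), so a large sigma squared means a small gamma.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # a simple nonlinear boundary

# Large C, small sigma^2 (large gamma): lower bias, higher variance, risk of overfitting.
flexible = SVC(kernel="rbf", C=100.0, gamma=5.0).fit(X, y)

# Small C, large sigma^2 (small gamma): higher bias, lower variance, risk of underfitting.
smooth = SVC(kernel="rbf", C=0.01, gamma=0.1).fit(X, y)

print(flexible.score(X, y), smooth.score(X, y))   # training accuracy of each
```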