So far we've been talking about SVMs at a fairly abstract level. In this video I'd like to talk about what you actually need to do in order to run or to use an SVM.

The support vector machine algorithm poses a particular optimization problem. But as I briefly mentioned in an earlier video, I really do not recommend writing your own software to solve for the parameters theta yourself. Just as, today, very few of us, essentially none of us, would think of writing code ourselves to invert a matrix or take the square root of a number, and so on; we just call some library function to do that. In the same way, the software for solving the SVM optimization problem is very complex, and there have been researchers doing essentially numerical optimization research for many years to come up with good software libraries and good software packages to do this. So I strongly recommend just using one of the highly optimized software libraries rather than trying to implement something yourself. There are lots of good software libraries out there; the two that I happen to use most often are liblinear and libsvm, but there are really lots of good software libraries for doing this that you can link to from many of the major programming languages you may be using to code up learning algorithms.

Even though you shouldn't be writing your own SVM optimization software, there are a few things you need to do. First is to come up with some choice of the parameter C; we talked a little bit about the bias/variance properties of this in an earlier video. Second, you also need to choose the kernel, or the similarity function, that you want to use. One choice might be to decide not to use any kernel.
The idea of no kernel is also called a linear kernel. So if someone says they use an SVM with a linear kernel, what that means is they use an SVM without a kernel; it's the version of the SVM that just uses theta transpose x, that is, it predicts y = 1 if theta 0 plus theta 1 x1 plus and so on plus theta n xn is greater than or equal to 0. You can think of the term linear kernel as meaning the version of the SVM that just gives you a standard linear classifier.

That would be one reasonable choice for some problems, and there are many software libraries (liblinear is one example out of many) that can train an SVM without a kernel, also called a linear kernel. So why would you want to do this? If you have a large number of features, so n is large, and m, the number of training examples, is small, then you already have a huge number of features: x is in R^(n+1). With a huge number of features and a small training set, you may want to just fit a linear decision boundary and not try to fit a very complicated nonlinear function, because you might not have enough data, and you would risk overfitting if you tried to fit a very complicated function in a very high-dimensional feature space with a small training set. So this would be one reasonable setting where you might decide to just not use a kernel, or equivalently, to use what's called a linear kernel.
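Just as a rough illustration (this is a sketch I'm adding here, not something from any particular package), prediction with a linear kernel in Octave might look something like this, assuming the parameter vector theta has already been learned by an SVM library and x doesn't yet include the intercept term:

  % Minimal sketch: predict with an SVM that uses no kernel (a linear kernel).
  % theta is a learned (n+1)-dimensional column vector; x is an n-dimensional example.
  function pred = predict_linear(theta, x)
    x = [1; x(:)];              % prepend the intercept term x0 = 1
    pred = (theta' * x) >= 0;   % predict y = 1 when theta' * x >= 0
  end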
A second choice for the kernel that you might make is the Gaussian kernel, and this is what we had previously. If you do this, then the other choice you need to make is to choose the parameter sigma squared. We also talked a little bit about the bias/variance tradeoff here: if sigma squared is large, then you tend to have a higher-bias, lower-variance classifier, but if sigma squared is small, then you have a higher-variance, lower-bias classifier.

So when would you choose a Gaussian kernel? Well, if your features x are in R^n, and if n is small and, ideally, m is large. For example, if we have a two-dimensional training set like the example I drew earlier, so n is equal to 2 but we have a pretty large number of training examples, then maybe you want to use a kernel to fit a more complex nonlinear decision boundary, and the Gaussian kernel would be a fine way to do this. I'll say a little more towards the end of the video about when you might choose a linear kernel, a Gaussian kernel, and so on.

Concretely, if you decide to use a Gaussian kernel, here's what you need to do. Depending on what support vector machine software package you use, it may ask you to implement a kernel function, or to implement the similarity function. So if you're using an Octave or MATLAB implementation of an SVM, it may ask you to provide a function that computes a particular feature of the kernel.
So this is really computing f subscript i for one particular value of i, where f here is just a single real number, so maybe I should write this as f(i). What you need to do is write a kernel function that takes as input a training example or a test example, some vector x, and takes as input one of the landmarks; I've written x1 and x2 here because the landmarks are really training examples as well. What you need to do is write software that takes as input x1 and x2, computes this sort of similarity function between them, and returns a real number.

So what some support vector machine packages do is expect you to provide this kernel function that takes as input x1 and x2 and returns a real number. The package will then take it from there: it will automatically generate all the features, that is, automatically take x and map it to f1, f2, down to f(m) using this function that you write, and train the support vector machine from there. But sometimes you do need to provide this function yourself. If you are using the Gaussian kernel, some SVM implementations will also include the Gaussian kernel and a few other kernels built in, since the Gaussian kernel is probably the most common kernel. Gaussian and linear kernels are really the two most popular kernels by far.
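To make that concrete, here's a minimal Octave sketch of the kind of similarity function a package might ask you to supply; the function name and signature are just my own illustration, not any particular package's API. It computes the Gaussian kernel between a point x1 and a landmark x2 with bandwidth parameter sigma:

  % Minimal sketch of the Gaussian kernel (similarity function):
  % k(x1, x2) = exp(-||x1 - x2||^2 / (2 * sigma^2))
  function sim = gaussian_kernel(x1, x2, sigma)
    x1 = x1(:); x2 = x2(:);                            % force column vectors
    sim = exp(-sum((x1 - x2) .^ 2) / (2 * sigma ^ 2));
  end

For example, gaussian_kernel(x, x, 1) returns 1 for any x, since identical points have squared distance 0.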
Just one implementational note: if you have features of very different scales, it is important to perform feature scaling before using the Gaussian kernel. Here's why. Imagine computing the norm between x and l, that is, this term here, the numerator term over there. What the norm between x and l is really doing is saying: let's compute the vector v, which is equal to x minus l, and then compute the norm of that vector v, the difference between x and l. The norm of v squared is really equal to v1 squared plus v2 squared plus dot dot dot plus vn squared; the square on the left-hand side is what makes this correct. Here x is in R^n, or R^(n+1), but I'm going to ignore x0, so let's pretend x is in R^n. So this is equal to that, and written differently, this is going to be (x1 minus l1) squared, plus (x2 minus l2) squared, plus dot dot dot plus (xn minus ln) squared.

Now suppose your features take on very different ranges of values. Take housing prediction, for example, where your data is some data about houses. If the first feature, x1, is the size of the house in the range of thousands of square feet, but your second feature, x2, is the number of bedrooms, in the range of one to five, then x1 minus l1 is going to be huge; it could be like a thousand squared, whereas x2 minus l2 is going to be much smaller. If that's the case, then those distances will be almost essentially dominated by the sizes of the houses, and the number of bedrooms would be largely ignored. So to avoid this, and in order to make the machine work well, do perform feature scaling. That will make sure that the SVM gives a comparable amount of attention to all of your different features, and not just, in this example, to the sizes of the houses, which would otherwise dwarf the other features.
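As an illustration of that advice, here's a minimal Octave sketch of feature scaling (mean normalization plus dividing by the standard deviation) applied before using the Gaussian kernel; the function name and the exact scaling choice are my own assumptions, not a prescribed recipe:

  % Minimal sketch: scale each feature (column of X) to mean 0 and standard deviation 1.
  % X is an m-by-n matrix of training examples.
  function [X_norm, mu, sigma] = feature_scale(X)
    mu = mean(X);                   % per-feature mean (1-by-n row vector)
    sigma = std(X);                 % per-feature standard deviation
    sigma(sigma == 0) = 1;          % guard against constant features
    X_norm = (X - mu) ./ sigma;     % implicit broadcasting (use bsxfun in older MATLAB)
  end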
When you use support vector machines, chances are that by far the two most common kernels you'll use will be the linear kernel, meaning no kernel, or the Gaussian kernel that we talked about.

Just one note of warning: not all similarity functions you might come up with are valid kernels. The Gaussian kernel, the linear kernel, and the other kernels you sometimes see others use all need to satisfy a technical condition called Mercer's Theorem. The reason you need this is that support vector machine algorithms, or implementations of the SVM, have lots of clever numerical optimization tricks in order to solve for the parameters theta efficiently, and in the original design the decision was made to restrict attention only to kernels that satisfy this technical condition called Mercer's Theorem. What that does is make sure that all of these SVM software packages can use that large class of optimizations and get the parameters theta very quickly.

So what most people end up doing is using either the linear or the Gaussian kernel, but there are a few other kernels that also satisfy Mercer's Theorem and that you may run across other people using, although I personally end up using other kernels very, very rarely, if at all. Just to mention some of the other kernels that you may run across: one is the polynomial kernel. For that, the similarity between x and l is defined in one of several ways; for example, you can take x transpose l, squared. So here's one measure of how similar x and l are: if x and l are very close to each other, then the inner product will tend to be large. And so this is a slightly unusual kernel.
It is not used that often, but you may run across some people using it. That is one version of a polynomial kernel; another is x transpose l, cubed. These are all examples of the polynomial kernel: (x transpose l plus 1) cubed, (x transpose l plus some number other than 1, say 5) to the power of 4, and so on. So the polynomial kernel actually has two parameters: one is the number you add over here (it could be 0; the first example is really plus 0), and the other is the degree of the polynomial over there. The more general form of the polynomial kernel is (x transpose l plus some constant), raised to some degree, and both of those are parameters of the polynomial kernel. The polynomial kernel almost always, or usually, performs worse than the Gaussian kernel, and it is not used that much, but it is something you may run across. Usually it is used only for data where x and l are all strictly non-negative, which ensures that these inner products are never negative. It captures the intuition that if x and l are very similar to each other, then maybe the inner product between them will be large. It has some other properties as well, but people tend not to use it much.
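Just to make those two parameters concrete, here's a minimal Octave sketch of the general polynomial kernel, (x transpose l plus a constant c) raised to a degree d; the names are purely illustrative:

  % Minimal sketch of the polynomial kernel: k(x, l) = (x' * l + c) ^ d
  function sim = polynomial_kernel(x, l, c, d)
    x = x(:); l = l(:);        % force column vectors
    sim = (x' * l + c) ^ d;    % inner product, shifted by c, raised to degree d
  end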
And then, depending on what you're doing, there are other, more esoteric kernels as well that you may come across. There's a string kernel, which is sometimes used if your input data is text strings or other types of strings. There are things like the chi-square kernel, the histogram intersection kernel, and so on. These are more esoteric kernels that you can use to measure similarity between different objects. So for example, if you're trying to do some sort of text classification problem, where the input x is a string, then maybe you'd want to find the similarity between two strings using the string kernel. But I personally end up very rarely, if at all, using these more esoteric kernels. I think I might have used the chi-square kernel maybe once in my life, and the histogram intersection kernel maybe once or twice in my life; I've actually never used the string kernel myself. But in case you run across these in other applications, a quick web search, a quick Google search or quick Bing search, should turn up definitions of these other kernels as well.

Just two last details I want to talk about in this video. One is multiclass classification. Suppose you have four classes or, more generally, K classes, and you want to output some appropriate decision boundary between your multiple classes. Many SVM packages already have built-in multiclass classification functionality, so if you're using a package like that, you can just use that built-in functionality and it should work fine. Otherwise, one way to do this is to use the one-versus-all method that we talked about when we were developing logistic regression. What you do is train K SVMs if you have K classes, one to distinguish each of the classes from the rest, and this will give you K parameter vectors:
theta 1, which tries to distinguish class y equals 1 from all of the other classes; then a second parameter vector, theta 2, which is what you get when you treat y equals 2 as the positive class and all the others as the negative class; and so on, up to a parameter vector theta K, which is the parameter vector for distinguishing the final class K from everything else. Then, lastly, this is exactly the same as the one-versus-all method we had for logistic regression: you just predict the class i with the largest theta transpose x. So that's how to do multiclass classification with an SVM. For the more common cases, there is a good chance that whatever software package you use will already have built-in multiclass classification functionality, so you won't need to worry about this yourself.
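In case you do end up coding one-versus-all yourself, here's a minimal Octave sketch of just the prediction step, under my own assumption that Theta is a K-by-(n+1) matrix whose i-th row is the parameter vector trained with class i as the positive class:

  % Minimal sketch: one-versus-all prediction, picking the class i
  % with the largest value of theta_i' * x.
  function class = predict_one_vs_all(Theta, x)
    x = [1; x(:)];                  % add the intercept term x0 = 1
    [~, class] = max(Theta * x);    % index of the largest score is the predicted class
  end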
Finally, we developed support vector machines starting off with logistic regression and then modifying the cost function a little bit. The last thing I want to do in this video is say a little bit about when you would use one of these two algorithms. Let's say n is the number of features and m is the number of training examples. When should we use one algorithm versus the other?

Well, if n is large relative to your training set size, so the number of features is much larger than m, this might be, for example, a text classification problem where the dimension of the feature vector is maybe 10,000 and your training set size is maybe 10 up to 1,000. Imagine a spam classification problem, classifying email spam, where you have 10,000 features corresponding to 10,000 words but maybe only 10 training examples, or maybe up to 1,000 examples. If n is large relative to m, then what I would usually do is use logistic regression, or use an SVM without a kernel, that is, with a linear kernel. Because if you have so many features with a smaller training set, a linear function will probably do fine, and you don't really have enough data to fit a very complicated nonlinear function.

Now, if n is small and m is intermediate, and what I mean by this is that n is maybe anywhere from 1 to 1,000 features, and the number of training examples is maybe anywhere from 10 up to 10,000, maybe up to 50,000 examples, so m is pretty big, like maybe 10,000 but not a million, then often an SVM with a Gaussian kernel will work well. We talked about this earlier as well, with the one concrete example: this would be if you have a two-dimensional training set, so n is equal to 2, with a pretty large number of training examples drawn in, where a Gaussian kernel will do a pretty good job separating positive and negative classes.

A third setting of interest is if n is small but m is large. So n is, again, maybe 1 to 1,000, it could be larger, but m is maybe 50,000 or greater: 50,000, 100,000, a million, or more. You have very, very large training set sizes. If this is the case, then an SVM with the Gaussian kernel will be somewhat slow to run.
Today's SVM packages, if you're using a Gaussian kernel, tend to struggle a bit here. Maybe 50,000 examples is okay, but if you have a million training examples, or even 100,000, with a massive value of m, today's SVM packages are very good but can still struggle a little bit when you have a massive training set size and you're using a Gaussian kernel. So in that case, what I would usually do is try to manually create more features and then use logistic regression or an SVM without a kernel.

In case you look at this slide and see logistic regression or SVM without a kernel paired together in both of these places, there's a reason for that: logistic regression and the SVM without a kernel are really pretty similar algorithms, and either one will usually do pretty similar things and give pretty similar performance, but depending on your implementational details, one may be more efficient than the other. Where one of these algorithms applies, either logistic regression or the SVM without a kernel, the other one is likely to work pretty well too. But the real power of the SVM comes when you use different kernels to learn complex nonlinear functions. And this regime, when you have maybe up to 10,000 examples, maybe up to 50,000, and your number of features is reasonably large, is a very common regime, and maybe that's the regime where a support vector machine with a kernel will shine. You can do things that are much harder to do with logistic regression. And finally, where do neural networks fit in?
Well, for all of these problems, for all of these different regimes, a well-designed neural network is likely to work well too. The one disadvantage, or the one reason you might sometimes not use a neural network, is that for some of these problems the neural network might be slow to train; if you have a very good SVM implementation package, it could run quite a bit faster than your neural network. And, although we didn't show this earlier, it turns out that the optimization problem the SVM solves is a convex optimization problem, so good SVM optimization software packages will always find the global minimum, or something close to it. So for the SVM you don't need to worry about local optima. In practice, local optima aren't a huge problem for neural networks, but this is one less thing to worry about if you're using an SVM. And depending on your problem, the neural network may be slower, especially in this sort of regime, than the SVM.

In case the guidelines I gave here seem a little bit vague, and you're looking at some problem and find the guidelines a bit vague so you're still not entirely sure whether to use this algorithm or that algorithm, that's actually okay. When I face a machine learning problem, sometimes it's actually just not clear which is the best algorithm to use. But as you saw in the earlier videos, the algorithm does matter, yet what often matters even more is things like how much data you have, how skilled you are, how good you are at doing error analysis and debugging learning algorithms, figuring out how to design new features, and figuring out what other features to give your learning algorithm, and so on.
Often those things will matter more than whether you are using logistic regression or an SVM. But having said that, the SVM is still widely perceived as one of the most powerful learning algorithms, and there is this regime in which it's a very effective way to learn complex nonlinear functions. And so, with logistic regression, neural networks, and SVMs, with these learning algorithms, I think you're very well positioned to build state-of-the-art machine learning systems for a wide range of applications, and this is another very powerful tool to have in your arsenal, one that is used all over the place, in Silicon Valley, in industry, and in academia, to build many high-performance machine learning systems.