Sometimes people talk about support vector machines as large margin classifiers. In this video I'd like to tell you what that means, and this will also give us a useful picture of what an SVM hypothesis may look like.

Here's my cost function for the support vector machine, where on the left I've plotted my cost_1(z) function that I use for positive examples, and on the right I've plotted my cost_0(z) function, with z on the horizontal axis. Now, let's think about what it takes to make these cost functions small. If you have a positive example, so if y is equal to 1, then cost_1(z) is zero only when z is greater than or equal to 1. So in other words, if you have a positive example, we really want theta transpose x to be greater than or equal to 1. Conversely, if y is equal to zero, look at this cost_0(z) function: it's only in the region where z is less than or equal to -1 that the cost is zero. And this is an interesting property of the support vector machine, which is that if you have a positive example, so if y is equal to one, then all we really need is for theta transpose x to be greater than or equal to zero. That would mean we classify the example correctly, because if theta transpose x is greater than or equal to zero, our hypothesis will predict one. And similarly, if you have a negative example, then really all you need is for theta transpose x to be less than zero, and that will make sure we got the example right. But the support vector machine wants a bit more than that. It says, don't just barely get the example right; don't have theta transpose x just a little bit bigger than zero.
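As a rough illustration of the two curves being described (this is my own sketch, not code from the course, and the exact slope of each piece is an assumption; only the regions where the cost is zero are fixed by the lecture), the two hinge-style pieces look something like this:

```python
import numpy as np

def cost_1(z):
    """Cost used for positive examples (y = 1): zero once z >= 1."""
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    """Cost used for negative examples (y = 0): zero once z <= -1."""
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3, 3, 7)
print(cost_1(z))  # zero wherever z >= 1
print(cost_0(z))  # zero wherever z <= -1
```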
What I really want is for theta transpose x to be quite a lot bigger than zero, say greater than or equal to one, and for negative examples I want it to be much less than zero, maybe less than or equal to -1. And so this builds an extra safety factor, or safety margin factor, into the support vector machine. Logistic regression does something similar too, of course, but let's see what the consequences of this are in the context of the support vector machine.

Concretely, what I'd like to do next is consider the case where we set this constant C to be a very large value. So let's imagine we set C to a very large value, maybe a hundred thousand, some huge number. Let's see what the support vector machine will do. If C is very, very large, then when minimizing this optimization objective, we're going to be highly motivated to choose parameters so that this first term is equal to zero. So let's try to understand the optimization problem in terms of what it would take to make this first term in the objective equal to zero, because, you know, maybe we'll set C to some huge constant, and this should give us additional intuition about what sort of hypotheses a support vector machine learns. We saw already that whenever you have a training example with a label of y = 1, if you want to make that first term zero, what you need is to find a value of theta so that theta transpose x(i) is greater than or equal to 1.
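For reference, the objective being discussed here, written out the way it appears on the slide as far as I can reconstruct it from the description (with m training examples and n features):

$$
\min_{\theta}\; C \sum_{i=1}^{m}\left[ y^{(i)}\,\mathrm{cost}_1\!\left(\theta^{T}x^{(i)}\right) + \left(1-y^{(i)}\right)\mathrm{cost}_0\!\left(\theta^{T}x^{(i)}\right) \right] \;+\; \frac{1}{2}\sum_{j=1}^{n}\theta_j^{2}
$$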
And similarly, whenever we have an example with label zero, in order to make sure that cost_0(z) is zero, we need theta transpose x(i) to be less than or equal to -1. So if we think of our optimization problem as now really choosing parameters so that this first term is equal to zero, what we're left with is the following optimization problem. We're going to minimize that first term, which is zero — C times zero, because we're going to choose parameters so that it's equal to zero — plus one half times that second term. And since the first term is C times zero, let's just cross it out, because I know it's going to be zero. This will be subject to the constraints that theta transpose x(i) is greater than or equal to one whenever y(i) is equal to one, and theta transpose x(i) is less than or equal to minus one whenever you have a negative example.

It turns out that when you solve this optimization problem, when you minimize this as a function of the parameters theta, you get a very interesting decision boundary. Concretely, if you look at a data set like this with positive and negative examples, this data is linearly separable. By that, I mean that there exists a straight line — although there are many different straight lines — that can separate the positive and negative examples perfectly. For example, here is one decision boundary that separates the positive and negative examples, but somehow that doesn't look like a very natural one, right? Or, drawing an even worse one, here's another decision boundary that separates the positive and negative examples, but just barely.
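So, under the assumption that a huge C drives the first term to zero, the problem described here reduces to the following (again my transcription of what's on the slide):

$$
\begin{aligned}
\min_{\theta}\;& \frac{1}{2}\sum_{j=1}^{n}\theta_j^{2} \\
\text{s.t. }\;& \theta^{T}x^{(i)} \ge 1 \quad \text{if } y^{(i)} = 1, \\
& \theta^{T}x^{(i)} \le -1 \quad \text{if } y^{(i)} = 0.
\end{aligned}
$$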
But neither of those seems like a particularly good choice. The support vector machine will instead choose this decision boundary, which I'm drawing in black. And that seems like a much better decision boundary than either of the ones that I drew in magenta or in green. The black line seems like a more robust separator; it does a better job of separating the positive and negative examples. Mathematically, what that means is that this black decision boundary has a larger distance from the training examples.

That distance is called the margin. When I draw these two extra blue lines, we see that the black decision boundary has a larger minimum distance from any of my training examples, whereas the magenta and the green lines come awfully close to the training examples, and so they seem to do a less good job of separating the positive and negative classes than my black line. This distance is called the margin of the support vector machine, and this gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible. So the support vector machine is sometimes also called a large margin classifier, and this is actually a consequence of the optimization problem we wrote down on the previous slide. I know you might be wondering how the optimization problem I wrote down on the previous slide leads to this large margin classifier. I haven't explained that yet, and in the next video I'm going to sketch a little bit of the intuition about why that optimization problem gives us this large margin classifier. But this is a useful feature to keep in mind if you are trying to understand what sorts of hypotheses an SVM will choose: that is, it tries to separate the positive and negative examples with as big a margin as possible.
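To make the notion of margin a little more concrete, here is a small sketch of my own (not from the lecture) that measures the smallest geometric distance from a set of points to a given linear boundary theta transpose x = 0; the toy points and the particular theta are made-up values, and for simplicity there is no intercept term:

```python
import numpy as np

def minimum_margin(theta, X):
    """Smallest distance from the rows of X to the hyperplane theta^T x = 0."""
    distances = np.abs(X @ theta) / np.linalg.norm(theta)
    return distances.min()

# Toy points and a toy boundary -- purely illustrative values.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
theta = np.array([1.0, 1.0])
print(minimum_margin(theta, X))  # the margin of this particular boundary
```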
I want to say one last thing about large margin classifiers and this intuition. We wrote out this large margin classification setting for the case where C, that regularization parameter, was very large — I think I set it to a hundred thousand or something. So given a dataset like this, maybe we'll choose the decision boundary that separates the positive and negative examples with a large margin. Now, the SVM is actually slightly more sophisticated than this large margin view might suggest. In particular, if all you're doing is using a large margin classifier, then your learning algorithm can be sensitive to outliers. So let's just add an extra positive example like the one shown on the screen. Having added that one example, it seems as if, to separate the data with a large margin, maybe I'll end up learning a decision boundary like that, right? That is the magenta line, and it's really not clear that, based on that single outlier, that single example, it's actually a good idea to change my decision boundary from the black one over to the magenta one.

So if the regularization parameter C were very large, then this is actually what the SVM will do: it will change the decision boundary from the black one to the magenta one. But if C were reasonably small — if you use a C that is not too large — then you still end up with this black decision boundary. And of course, if the data were not linearly separable, so if you had some positive examples in here, or if you had some negative examples in here, then the SVM will also do the right thing.
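If you want to see the effect of C for yourself, here is a minimal sketch using scikit-learn's linear-kernel SVM (scikit-learn is not part of the course; the toy data with its single positive outlier is made up, and the exact boundaries you get will depend on those values):

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data plus one positive "outlier" near the negatives.
X = np.array([[2, 2], [3, 3], [4, 2],          # positives
              [-2, -2], [-3, -1], [-2, -3],    # negatives
              [-1.5, -1.5]])                   # outlier labeled positive
y = np.array([1, 1, 1, 0, 0, 0, 1])

for C in (100000.0, 1.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # With a huge C the boundary bends to accommodate the outlier;
    # with a modest C it stays closer to the "black line" described above.
    print(C, clf.coef_, clf.intercept_)
```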
And so this picture of a large margin classifier really gives the right intuition only for the case where the regularization parameter C is very large. Just to remind you, C plays a role similar to one over lambda, where lambda is the regularization parameter we had previously. So it's only if one over lambda is very large, or equivalently if lambda is very small, that you end up with things like this magenta decision boundary. In practice, when applying support vector machines with a C that is not very, very large like that, the SVM can do a better job of ignoring the few outliers like the one here, and it will also do fine and do reasonable things even if your data is not linearly separable. When we talk about bias and variance in the context of support vector machines, which we'll do a little bit later, hopefully all of these trade-offs involving the regularization parameter will become clearer then.

So I hope that gives some intuition about how the support vector machine functions as a large margin classifier that tries to separate the data with a large margin. Technically, this picture is true only when the parameter C is very large, but it is a useful way to think about support vector machines. There was one missing step in this video, which is why the optimization problem we wrote down on these slides actually leads to the large margin classifier. I didn't do that in this video; in the next video I will sketch a little bit more of the math behind it, to explain the separate reasoning of how the optimization problem we wrote out results in a large margin classifier.
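To spell out that correspondence between C and lambda: if A denotes the training-cost term and B the regularization term (the letters are just shorthand, as in the earlier comparison of the two objectives), the earlier regularized objectives had the form A + lambda B while the SVM objective has the form CA + B, and dividing the first form through by lambda shows the two have the same minimizer when C = 1/lambda:

$$
\min_{\theta}\; A(\theta) + \lambda B(\theta)
\;\Longleftrightarrow\;
\min_{\theta}\; \tfrac{1}{\lambda}A(\theta) + B(\theta),
\qquad\text{so } C \approx \tfrac{1}{\lambda}.
$$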