By now you've seen a range of different learning algorithms. Within supervised learning, the performance of many algorithms will be pretty similar, and what matters will less often be whether you use learning algorithm A or learning algorithm B; what will often matter more are things like the amount of data you train these algorithms on, your skill in applying these algorithms, your choice of the features you design to give the learning algorithms, how you choose the regularization parameter, and things like that. But there's one more algorithm that is very powerful and very widely used, both in industry and in academia, and that's called the support vector machine. Compared to both logistic regression and neural networks, the support vector machine, or SVM, sometimes gives a cleaner and sometimes a more powerful way of learning complex nonlinear functions. So I'd like to take the next videos to talk about that. Later in this course, I will do a quick survey of a range of different supervised learning algorithms, just to very briefly describe them, but the support vector machine, given its popularity, will be the last of the supervised learning algorithms that I'll spend a significant amount of time on in this course.

As with our development of other learning algorithms, we're going to start by talking about the optimization objective. So let's get started on this algorithm.

In order to describe the support vector machine, I'm actually going to start with logistic regression and show how we can modify it a bit to get what is essentially the support vector machine.
So, in logistic regression we have our familiar form of the hypothesis there, and the sigmoid activation function shown on the right. In order to explain some of the math, I'm going to use z to denote theta transpose x here.

Now let's think about what we would like logistic regression to do. If we have an example with y equal to 1, and by this I mean an example in either the training set, the test set, or the cross validation set where y is equal to 1, then we're hoping that h of x will be close to 1; that is, we're hoping to correctly classify that example. Having h of x close to 1 means that theta transpose x must be much larger than 0. The double greater-than sign there means much, much greater than 0, and that's because it's when z, that is, theta transpose x, is much bigger than 0, far to the right of this figure, that the output of logistic regression becomes close to 1.

Conversely, if we have an example where y is equal to 0, then what we're hoping for is that the hypothesis will output a value close to 0, and that corresponds to theta transpose x, or z, being much less than 0.

If you look at the cost function of logistic regression, what you find is that each example (x, y) contributes a term like this to the overall cost function. For the overall cost function we would also have a sum over all the training examples and a 1 over m term, but this expression here is the term that a single training example contributes to the overall objective function for logistic regression.
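For reference, here is the standard logistic regression setup this discussion assumes, written out explicitly since the slide itself isn't reproduced in the transcript: the hypothesis, the shorthand z, and the term a single example (x, y) contributes to the cost.

\[
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad z = \theta^T x
\]
\[
\mathrm{cost}(x, y) = -\,y \log h_\theta(x) \;-\; (1 - y)\log\bigl(1 - h_\theta(x)\bigr)
\]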
Now, if I take the definition of my hypothesis and plug it in over here, what I get is that each training example contributes this term, ignoring the 1 over m, to my overall cost function for logistic regression. Now let's consider the two cases of when y is equal to 1 and when y is equal to 0.

In the first case, let's suppose that y is equal to 1. In that case, only this first term in the objective matters, because this 1 minus y term will be equal to 0 if y is equal to 1. So, for an example (x, y) with y equal to 1, what we get is this term: minus log 1 over 1 plus e to the negative z, where, as on the last slide, I'm using z to denote theta transpose x. And of course, in the cost we actually had this factor y in front, but we just said that y is equal to 1, so that factor equals 1 and I've simplified it away in the expression written down here.

If we plot this function as a function of z, what you find is that you get the curve shown on the lower left of the slide, and we also see that when z is large, that is, when theta transpose x is large, we get a very small value, a very small contribution to the cost function. This kind of explains why, when logistic regression sees a positive example with y equals 1, it tries to set theta transpose x to be very large, because that corresponds to this term in the cost function being small.
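To spell out the algebra behind that curve: substituting the sigmoid into the y = 1 term gives

\[
-\log \frac{1}{1 + e^{-z}} = \log\bigl(1 + e^{-z}\bigr),
\]

which tends to 0 as z grows large and positive, and grows roughly linearly as z becomes very negative.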
Now, to build the support vector machine, here is what we are going to do. We are going to take this cost function, this minus log 1 over 1 plus e to the negative z, and modify it a little bit. Let me take this point 1 over here, and let me draw the cost function that I'm going to use. The new cost function is going to be flat from here on out, and then I'm going to draw something that grows as a straight line, similar to logistic regression, but a straight line in this portion. So that's the curve I just drew in magenta, and it's a pretty close approximation to the cost function used by logistic regression, except that it is now made out of two line segments: this flat portion on the right, and this straight line portion on the left. Don't worry too much about the slope of the straight line portion; it doesn't matter that much. But that's the new cost function we're going to use when y is equal to 1, and you can imagine it does something pretty similar to logistic regression. It turns out, though, that this will give the support vector machine computational advantages: it will give us, later on, an easier optimization problem that will be easier for software to solve.

We just talked about the case of y equal to 1. The other case is when y is equal to 0. In that case, if you look at the cost, then only this second term will apply, because the first term goes away: if y is equal to 0, then that first term is 0 here. So you're left with only the second term of the expression above, and the cost of an example, that is, its contribution to the cost function, is going to be given by this term over here.
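The y = 0 term simplifies the same way; substituting the sigmoid into minus log of 1 minus h gives

\[
-\log\Bigl(1 - \frac{1}{1 + e^{-z}}\Bigr) = -\log \frac{1}{1 + e^{z}} = \log\bigl(1 + e^{z}\bigr),
\]

the mirror image: close to 0 when z is very negative, and growing roughly linearly when z is large.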
If you plot that as a function of z, with z here on the horizontal axis, you end up with this curve. For the support vector machine, once again, we're going to replace this blue line with something similar: a new cost that is flat out here, that is 0 out here, and that then grows as a straight line, like so.

So let me give these two functions names. This function on the left I'm going to call cost subscript 1 of z, and this function on the right I'm going to call cost subscript 0 of z. The subscript just refers to the cost corresponding to y equal to 1 versus y equal to 0.

Armed with these definitions, we are now ready to build the support vector machine. Here is the cost function J of theta that we have for logistic regression. In case this equation looks a bit unfamiliar, it's because previously we had a minus sign outside, but here I've instead moved the minus signs inside these expressions; that just makes the equations look a little different. For the support vector machine, what we are going to do is essentially take this and replace it with cost 1 of z, that is, cost 1 of theta transpose x, and take this and replace it with cost 0 of z, that is, cost 0 of theta transpose x, where the cost 1 function is what we had on the previous slide, that looks like this, and the cost 0 function, again from the previous slide, looks like this.
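Here is a minimal sketch of these two functions in Python. The lecture fixes the kink for cost 1 at z = 1 but deliberately leaves the slope unspecified, so a unit slope is assumed here, along with the symmetric kink at z = -1 for cost 0:

```python
import numpy as np

def cost1(z):
    """SVM surrogate cost for a y = 1 example: zero once z >= 1,
    then growing as a straight line as z decreases (unit slope assumed)."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """SVM surrogate cost for a y = 0 example: zero once z <= -1,
    then growing as a straight line as z increases (unit slope assumed)."""
    return np.maximum(0.0, 1.0 + z)

# Compare against the smooth logistic term each one approximates:
z = np.linspace(-3.0, 3.0, 7)
print(cost1(z))                # piecewise-linear cost for y = 1
print(np.log(1 + np.exp(-z)))  # smooth logistic counterpart for y = 1
```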
So, what we have for the support vector machine is a minimization problem of 1 over m times the sum over my training examples of y(i) times cost 1 of theta transpose x(i), plus 1 minus y(i) times cost 0 of theta transpose x(i), and then plus my usual regularization term, like so. Now, by convention for the support vector machine, we actually write things slightly differently; we parameterize this just very slightly differently.

First, we're going to get rid of the 1 over m term. This just happens to be a slightly different convention that people use for support vector machines compared to logistic regression. Here's what I mean: I'm just going to get rid of this 1 over m term, and this should give me the same optimal value of theta, because 1 over m is just a constant. So whether I solve this minimization problem with the 1 over m in front or not, I should end up with the same optimal value of theta.

To give you a concrete example, suppose I had a minimization problem: minimize over a real number u of u minus 5 squared, plus 1. Well, the minimum of this happens to be u equals 5. Now, if I take this objective function and multiply it by 10, so that my minimization problem is now the minimum over u of 10 times u minus 5 squared, plus 10, well, the value of u that minimizes this is still u equals 5. So multiplying something that you are minimizing by some constant, 10 in this case, does not change the value of u that minimizes the function.
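To see that concretely in code, here is a tiny sketch of that u example; scaling the objective by 10 changes the minimum value but not the minimizing u:

```python
import numpy as np

u = np.linspace(0.0, 10.0, 1001)

f = (u - 5) ** 2 + 1           # original objective, minimized at u = 5
g = 10 * (u - 5) ** 2 + 10     # same objective scaled by 10

# Both curves bottom out at the same u, even though the minimum values differ.
print(u[np.argmin(f)], u[np.argmin(g)])  # -> 5.0 5.0
print(f.min(), g.min())                  # -> 1.0 10.0
```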
In the same way, what I've done by crossing out the m is just multiply my objective function by some constant m, and that doesn't change the value of theta that achieves the minimum.

The second bit of notational change is just the standard convention when using the SVM instead of logistic regression, and it is the following. For logistic regression, we had two terms in our objective function. The first is this term, which is the cost that comes from the training set, and the second is this term, which is the regularization term. And the way we controlled the trade-off between these was to minimize A plus lambda times B, where I'm using A to denote the first term and B to denote the second term, without the lambda. By setting different values for the regularization parameter lambda, we could trade off the relative weight between how much we want to fit the training set well, that is, minimizing A, versus how much we care about keeping the values of the parameters small, that is, the term B. For the support vector machine, just by convention, we're going to use a different parameter. Instead of using lambda to control the relative weighting between the first and second terms, we're going to use a parameter which by convention is called C, and we will instead minimize C times A plus B.
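Written out, the objective we end up with for the support vector machine is:

\[
\min_\theta \; C \sum_{i=1}^{m} \Bigl[\, y^{(i)} \, \mathrm{cost}_1\bigl(\theta^T x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \, \mathrm{cost}_0\bigl(\theta^T x^{(i)}\bigr) \Bigr] \;+\; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2
\]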
So, for logistic regression, if we set a very large value of lambda, that means giving B a very high weight; here, if we set C to be a very small value, that corresponds to giving B a much larger weight than A. So this is just a different way of controlling the trade-off, just a different way of parameterizing how much we care about optimizing the first term versus how much we care about optimizing the second term.

If you want, you can think of the parameter C as playing a role similar to 1 over lambda. It's not that these two equations or these two expressions will be equal, that C equals 1 over lambda; rather, if C were equal to 1 over lambda, then these two optimization objectives would give you the same optimal value of theta. So, just filling that in, I'm going to cross out the lambda here and write in the constant C there. That gives us our overall optimization objective function for the support vector machine, and when you minimize that function, what you get are the parameters learned by the SVM.

Finally, unlike logistic regression, the support vector machine doesn't output a probability. Instead, we have this cost function, which we minimize to get the parameters theta, and what the support vector machine does is make a prediction of y being equal to 1 or 0 directly. So the hypothesis will predict 1 if theta transpose x is greater than or equal to 0, and predict 0 otherwise. And so, having learned the parameters theta, this is the form of the hypothesis for the support vector machine.
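As a minimal sketch of that objective and hypothesis in Python, reusing the unit-slope cost 1 and cost 0 assumed earlier, and assuming X carries an intercept column and y holds 0/1 labels (the intercept is left out of the regularization sum, following the usual convention in this course):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """The SVM objective built in this video: C times the summed
    per-example costs, plus one half the sum of squared parameters."""
    z = X @ theta
    cost1 = np.maximum(0.0, 1.0 - z)  # applies to y = 1 examples
    cost0 = np.maximum(0.0, 1.0 + z)  # applies to y = 0 examples
    data_term = np.sum(y * cost1 + (1 - y) * cost0)
    reg_term = 0.5 * np.sum(theta[1:] ** 2)  # skip the intercept theta_0
    return C * data_term + reg_term

def svm_predict(theta, X):
    """The SVM hypothesis: predict 1 if theta^T x >= 0, else 0."""
    return (X @ theta >= 0).astype(int)
```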
So, that was a mathematical definition of what a support vector machine does. In the next few videos, let's try to get back to intuition about what this optimization objective leads to and what sorts of hypotheses an SVM will learn, and also talk about how to modify this just a little bit so it can learn complex nonlinear functions.