[MUSIC] Now we've seen multiple ways in which overfitting can be bad for classification, especially for logistic regression, and how very large parameters can be a really bad thing. So what we're going to do next is introduce a notion of regularization, just like we did in regression, to penalize these really large parameters in order to get a more reasonable outcome.

We're still talking about the same logistic regression model, where we take data, do some feature extraction, and fit this model: one over one plus e to the minus w transpose x. But the quality metric for this machine learning algorithm is going to change, to push us away from really large coefficients. In particular, we're going to balance how well we fit the data with the magnitude of the coefficients, so as to avoid these massive coefficients.

In the context of logistic regression, we're balancing two things to measure total quality. The measure of fit, which is the data likelihood, the thing where bigger is better, how well I fit the data; and then the magnitude of the coefficients, where coefficients that are too big are problematic. So we have one thing that we want to be big, the likelihood, and another thing we want to be small, the magnitude of the coefficients, and we're going to optimize the quality minus this complexity metric. And so we want to balance between the two.

So what do those mean? Let's instantiate them more clearly in the context of logistic regression. The quality metric in logistic regression is the data likelihood, and we talked about it quite a bit in the previous module. Now, one little side note that we're going to use in this module: we don't typically optimize the data likelihood directly. We optimize the log of the data likelihood, because that makes the math a lot simpler and it makes the gradients behave a lot better. In the optional section of the previous module, we talked about this quite a bit and explored it in detail. If you skipped that section, just think of the log as a way to make those numbers less extreme. So we take the log. So the measure of quality is going to be the log of the data likelihood, and we're going to make that log as big as possible.
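To make the model and its quality metric concrete, here is a minimal sketch of the logistic function and the log of the data likelihood for binary labels in {+1, -1}. The function names (`sigmoid`, `log_likelihood`) and the NumPy setup are illustrative assumptions for this sketch, not code from the course itself.

```python
import numpy as np

def sigmoid(scores):
    # P(y = +1 | x, w) = 1 / (1 + exp(-w^T x))
    return 1.0 / (1.0 + np.exp(-scores))

def log_likelihood(X, y, w):
    # X: (N, D) feature matrix, y: (N,) labels in {+1, -1}, w: (D,) coefficients.
    # Log of the data likelihood: sum over data points of log P(y_i | x_i, w).
    scores = X @ w
    # For labels in {+1, -1}, P(y_i | x_i, w) = sigmoid(y_i * w^T x_i),
    # so the per-point contribution is log(sigmoid(y_i * score_i)).
    return np.sum(np.log(sigmoid(y * scores)))
```

This is the quantity we want to be as big as possible; the regularized objective discussed next subtracts a penalty on the coefficients from it.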
So we see that the likelihood is what we're going to try to make big when we optimize. But at the same time, we're trying to make something small, which is the magnitude of the coefficients. There are different metrics for the magnitude of the coefficients, just like we explored in regression, and there are two that we're going to use in this module. One is the sum of the squares, also called the square of the L2 norm. It's denoted by ||w||_2^2, and it's very simple: it's the square of the first coefficient, plus the square of the second coefficient, plus the square of the third coefficient, and so on, plus the square of the last coefficient, w_D squared. That's if you use the L2 norm.

We can also use the sum of the absolute values, also called the L1 norm, and it's denoted by ||w||_1. Instead of the squares, it's the absolute value of w_0, plus the absolute value of w_1, plus the absolute value of w_2, all the way to the absolute value of the last coefficient.

Now, in the regression course we explored these notions quite a bit, but the main reason we take the square or the absolute value is that we want to make sure to penalize highly positive and highly negative coefficients in the same way; squaring a value or taking its absolute value makes the contribution positive, and we want to make these norms as small as possible. So both of these approaches penalize large weights. Actually, I should say penalize large coefficients.

However, as we saw in the regression course, by using the L1 norm I'm also going to get what's called a sparse solution. Sparsity doesn't only play a role in regression; it also plays a role in classification. In this module we're going to explore a little bit of both of these concepts, and we're going to start with the L2 norm, the sum of the squares.

So now that we've reviewed these concepts, we can formalize the problem, the quality that we're trying to maximize. I want to maximize, over my choice of parameters w, a trade-off between two things: the likelihood of my data, actually the log of it, so the log of the data likelihood, and some notion of penalty for the magnitude of the coefficients, and we'll start with this L2 penalty notion.
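As a small sketch of the two penalties and of the trade-off just described, the code below reuses the `log_likelihood` sketch above. The helper names and the trade-off parameter `l2_weight` (the knob that balances fit against coefficient magnitude, which this section has not yet introduced by name) are assumptions for illustration, not the course's own notation.

```python
def l2_penalty(w):
    # Square of the L2 norm: w_0^2 + w_1^2 + ... + w_D^2
    return np.sum(w ** 2)

def l1_penalty(w):
    # L1 norm: |w_0| + |w_1| + ... + |w_D|
    return np.sum(np.abs(w))

def regularized_quality(X, y, w, l2_weight):
    # Total quality to maximize: log of the data likelihood
    # minus a weighted L2 penalty on the coefficients.
    # (Whether to penalize the intercept is a modeling choice not covered here.)
    return log_likelihood(X, y, w) - l2_weight * l2_penalty(w)
```

Swapping `l2_penalty` for `l1_penalty` in the objective gives the L1-regularized variant, which tends to produce the sparse solutions mentioned above.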
[MUSIC]