[MUSIC]
So far we've discussed how to regularize logistic regression, and we briefly mentioned what happens when we have L1-regularized logistic regression. Let's talk about it in a little more detail. Unlike in the regression course, we're not going to derive the learning algorithm for L1-regularized logistic regression; it can be derived in much the same way as you did for the lasso. We're just going to show the impact it has on our data and on our learned models.

So recall the notion of sparsity. A model is sparse when many of the wj's are equal to zero, and that can help us with both efficiency and interpretability of the models, as we saw in regression. For example, let's say we have a lot of data and a lot of features, so the number of w's can be 100 billion, 100 billion possible values. This happens in practice in all sorts of settings; for example, many of the spam filters out there have hundreds of billions of parameters, or coefficients, that they learn from data.

This creates a couple of problems. It can be expensive to make a prediction, because you have to go through 100 billion values. However, if I have a sparse solution where many of these w's are actually equal to zero, then when I'm trying to make a prediction, the prediction is the sign of the sum of wj times the feature hj(xi), and I only have to look at the non-zero coefficients wj; everything else can be ignored. So if I have 100 billion coefficients but only, say, 100,000 of those are non-zero, then it's going to be much faster to make a prediction. This makes a huge difference in practice.

The other impact of sparsity, of having many coefficients be zero, is that it can help you interpret the non-zero coefficients. You can look at the small number of non-zero coefficients and try to make an interpretation of why a prediction gets made. Such interpretations can be useful in practice in many ways.

So how do you learn a logistic regression classifier with a sparsity-inducing penalty? You take the same log-likelihood function l(w), but add an extra L1 penalty, which is the sum of the absolute value of w0, the absolute value of w1, all the way to the absolute value of wD.
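Written out in the lecture's notation, the change being described is in the penalty term: the sum of absolute values (the L1 norm of the coefficient vector) in place of the sum of squares (the L2 penalty) used before:

\[
\underbrace{|w_0| + |w_1| + \dots + |w_D|}_{\text{L1 penalty: } \lVert \mathbf{w} \rVert_1}
\qquad \text{versus} \qquad
\underbrace{w_0^2 + w_1^2 + \dots + w_D^2}_{\text{L2 penalty: } \lVert \mathbf{w} \rVert_2^2}
\]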
So by just changing the sum of squares to the sum of absolute values, we get what's called L1-regularized logistic regression, which gives you sparse solutions. That small change leads to sparse solutions.

Just like we did with L2 regularization, here we're also going to have a parameter lambda which controls how much regularization, how much penalty, we introduce. The objective becomes the log-likelihood of the data minus lambda times the sum of those absolute values, the L1 penalty. When lambda equals 0, we have no regularization, which leads us to the standard MLE solution, just like we had in the case of L2 regularization. When lambda is equal to infinity, we have only the penalty, so all the weight is on regularization, and that leads to w hat being all zero coefficients. The case we really care about is when lambda is somewhere in between 0 and infinity, which leads to what are called sparse solutions, where some of the wj hats are non-zero but hopefully many of the wj hats are exactly 0. That's what we're going to aim for.

So let's revisit those coefficient paths. Here I'm showing you the coefficient paths for the L2 penalty. You see that when the lambda parameter is low, you learn large coefficients, and when the lambda parameter gets larger, you get smaller coefficients. They go from large to small, but they're never exactly 0; the coefficients never become exactly 0. If you look, however, at the coefficient paths when the regularization is L1, things get much more interesting. For example, in the beginning the coefficient of the smiley face has a large positive value, but eventually it becomes exactly zero from here on. Similarly, the coefficient of the frowny face starts as a large negative value, but eventually, over here, it becomes 0. So it goes from large all the way to exactly zero, and we see that for many of the other words.
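This L1-versus-L2 behavior can be reproduced in spirit with off-the-shelf tools. Here is a minimal sketch, assuming scikit-learn and a synthetic dataset rather than the review data from the lecture (both my own choices, not the course's tooling); note that scikit-learn's C parameter is the inverse of the lambda discussed here, so small C means strong regularization:

```python
# Minimal sketch: counting how many coefficients the L1 penalty drives to zero.
# Assumes scikit-learn and synthetic data, not the review dataset from the lecture.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# scikit-learn's C is 1 / lambda, so sweeping C from large to small
# corresponds to increasing the regularization penalty lambda.
for C in [100.0, 10.0, 1.0, 0.1, 0.01]:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X, y)
    n_nonzero = int(np.sum(model.coef_ != 0))
    print(f"C={C:>6}: {n_nonzero} non-zero coefficients out of {X.shape[1]}")

# With penalty="l2" the same sweep shrinks the coefficients toward zero
# but essentially never makes them exactly zero.
```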
Coming back to the coefficient path plot: for example, in the beginning the coefficient of the word "hate" is pretty high, and that's a pretty important word, but around here "hate" becomes irrelevant.

As a quick reminder, these are product reviews, and we're trying to figure out whether a review is positive or negative for the product. We can look at which coefficient stays non-zero for the longest time, and that's exactly this line over here, the one that never hits 0. It's the coefficient of the word "disappointed".

So you might be disappointed to learn that the frowny face is not the one that survives the longest. In the beginning, the coefficient of "disappointed" is not as large, not as significant, as that of the frowny face, but it's the one that stays negative the longest. [LAUGH] "Disappointed" probably wins because it appears in more reviews, and when you say "disappointed" you're really writing a negative review, so that coefficient stays non-zero for a long time.

So you see these transitions: coefficients that start out small go to zero early on, the smiley face lasts for a while and then becomes zero, the frowny face lasts longer and then becomes exactly zero, and for sufficiently large lambdas all of the coefficients are zero except the coefficient of "disappointed" at this point.
[MUSIC]