So far we've discussed how to regularize logistic regression, and we briefly mentioned what happens when we have L1 regularized logistic regression. Let's talk about it in a little more detail. Unlike in the regression course, we're not going to derive the learning algorithm for L1 regularized logistic regression, which would be similar to what you did with lasso. We're just going to show the impact it has on our learned models.

So recall the notion of sparsity. A model is sparse when many of the wj's are equal to zero, and that can help us with both efficiency and interpretability of the model, as we saw in regression. For example, let's say we have a lot of data and a lot of features, so the number of w's can be 100 billion. This happens in practice in all sorts of settings. For example, many of the spam filters out there have hundreds of billions of coefficients that they learn from data.

That creates a couple of problems. It can be expensive to make a prediction, because you have to go through 100 billion values. However, if I have a sparse solution where many of these w's are actually equal to zero, then when I'm trying to make a prediction, the prediction is the sign of the sum over j of wj times the feature hj(xi), and I only have to look at the nonzero coefficients wj; everything else can be ignored. So if I have 100 billion coefficients but only, say, 100,000 of those are nonzero, then it's going to be much faster to make a prediction. This makes a huge difference in practice. The other impact of having many coefficients be exactly zero is that it helps you interpret the nonzero coefficients: you can look at the small number of nonzero coefficients and try to interpret why a prediction gets made. Such interpretations are useful in practice in many ways.

So how do you learn a logistic regression classifier with a sparsity-inducing penalty? You take the same log-likelihood function l(w), but add an extra L1 penalty, which is the sum of the absolute value of w0, the absolute value of w1, all the way to the absolute value of wd. By just changing the sum of squares to the sum of absolute values, we get what's called L1 regularized logistic regression, which gives you sparse solutions. That small change leads to sparse solutions.

Just like we did with L2 regularization, we're also going to have a parameter lambda which controls how much regularization, how much penalty, we introduce. The objective becomes the log-likelihood of the data minus lambda times the sum of these absolute values, the L1 penalty. When lambda is equal to 0, we have no regularization, which leads us to the standard MLE solution, just like in the case of L2 regularization. When lambda is equal to infinity, all the weight is on the penalty, and that's going to lead to w hat being all zeros, all 0 coefficients. The case we really care about is lambda somewhere in between 0 and infinity, which leads to what are called sparse solutions, where many of the wj hats are going to be exactly 0. That's what we're going to aim for.
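To make this concrete, here is a minimal sketch, not from the lecture itself, of fitting L1 regularized logistic regression with scikit-learn on synthetic data. The dataset, the feature count, and the penalty strength are illustrative assumptions; note that scikit-learn parameterizes the penalty with C, which plays the role of 1/lambda.

```python
# A hedged sketch: L1-regularized logistic regression on synthetic data.
# The data, feature count, and penalty strength are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=10, random_state=0)

# scikit-learn uses C ~ 1/lambda: a smaller C means a stronger L1 penalty,
# which drives more coefficients to exactly zero.
model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear')
model.fit(X, y)

w = model.coef_.ravel()
nonzero = np.flatnonzero(w)
print(f"{len(nonzero)} of {len(w)} coefficients are nonzero")

# Prediction only touches the nonzero coefficients:
# y_hat = sign(w0 + sum over nonzero j of w_j * h_j(x)).
scores = X[:, nonzero] @ w[nonzero] + model.intercept_
y_hat = (scores > 0).astype(int)
```

With a penalty this strong, most of the 200 coefficients typically come out exactly zero, so the prediction step only needs to touch the handful of surviving features, which is the efficiency gain described above.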
So let's revisit those coefficient paths. Here I'm showing you the coefficient paths for the L2 penalty. You see that when the lambda parameter is low, you learn large coefficients, and when lambda gets larger, you get smaller coefficients. So they go from large to small, but they never become exactly 0.

If you look, however, at the coefficient paths when the regularization is L1, things get much more interesting. For example, in the beginning the coefficient of the smiley face has a large positive value, but eventually it becomes exactly zero and stays there. Similarly, the coefficient of the frowny face starts as a large negative value, but eventually it too becomes 0. So it goes from large all the way to exactly zero, and we see that for many of the other words. For example, in the beginning the coefficient of the word hate is pretty high, and that's a pretty important word, but beyond a certain lambda, hate becomes irrelevant. As a quick reminder, these are product reviews, and we're trying to figure out whether each review is positive or negative.

We can also look at which coefficient stays nonzero for the longest. That's exactly this line over here, which never hits 0 in the plotted range, and it's the coefficient of the word disappointed. So you might be disappointed to learn that the frowny face is not the one that lasts the longest. In the beginning, the coefficient of disappointed is not as large, not as significant, as the frowny face, but it's the one that stays negative the longest. So, you might be disappointed to know that the frowny face is not as important as disappointed. [LAUGH] That's probably because disappointed is prevalent in more reviews, and when you say disappointed, you're really writing a negative review, so that coefficient lasts for a long time.

So you see these transitions: the coefficients of less important words go to zero earlier on. The smiley face lasts for a while and then becomes zero, the frowny face lasts longer and then becomes exactly zero, and for sufficiently large lambdas, all of them are zero except the coefficient of disappointed.
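The coefficient-path picture described here can be sketched by refitting the L1 regularized model over a grid of lambda values and recording the learned coefficients at each step. The sketch below is an assumed reconstruction on synthetic data, not the lecture's review dataset; the lambda grid and plotting details are my own choices.

```python
# A sketch of tracing L1 coefficient paths: sweep lambda, refit, record coefficients.
# Synthetic data stands in for the lecture's product-review features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

lambdas = np.logspace(-2, 3, 30)                  # penalty strengths to sweep
paths = []
for lam in lambdas:
    model = LogisticRegression(penalty='l1', C=1.0 / lam, solver='liblinear')
    model.fit(X, y)
    paths.append(model.coef_.ravel())
paths = np.array(paths)                           # shape (n_lambdas, n_features)

# Each curve starts large at small lambda and hits exactly zero at some point;
# the last curve to reach zero plays the role of "disappointed" in the lecture.
plt.plot(lambdas, paths)
plt.xscale('log')
plt.xlabel('lambda (L1 penalty strength)')
plt.ylabel('learned coefficient')
plt.title('L1 coefficient paths (sketch)')
plt.show()
```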