In the last video, we talked about precision and recall as an evaluation metric for classification problems with skewed classes. For many applications, we'll want to somehow control the trade-off between precision and recall. Let me tell you how to do that, and also show you some even more effective ways to use precision and recall as an evaluation metric for learning algorithms.

As a reminder, here are the definitions of precision and recall from the previous video. Let's continue our cancer classification example, where y equals one if the patient has cancer and y equals zero otherwise, and let's say we've trained a logistic regression classifier which outputs probabilities between zero and one. So, as usual, we're going to predict y equals one if h(x) is greater than or equal to 0.5, and predict zero if the hypothesis outputs a value less than 0.5, and this classifier will give us some value for precision and some value for recall.

But now suppose we want to predict that a patient has cancer only if we're very confident that they really do. Because if you go to a patient and tell them that they have cancer, it's going to give them a huge shock, since this is seriously bad news, and they may end up going through a pretty painful treatment process. So maybe we want to tell someone that we think they have cancer only if we're very confident. One way to do this would be to modify the algorithm so that, instead of setting the threshold at 0.5, we predict y equals one only if h(x) is greater than or equal to 0.7. So we'll tell someone they have cancer only if we think there's at least a 70% chance that they really do. If you do this, then you're predicting cancer only when you're more confident, and so you end up with a classifier that has higher precision: of all the patients you go to and tell "we think you have cancer," a higher fraction will actually turn out to have cancer, because we made those predictions only when we were pretty confident. But in contrast, this classifier will have lower recall, because we're now going to predict y equals one on a smaller number of patients. We could even take this further: instead of setting the threshold at 0.7, we could set it at 0.9 and predict y equals one only if we're more than 90% certain that the patient has cancer. A large fraction of those patients will indeed turn out to have cancer, so this is a higher-precision classifier, but it will have lower recall, because we will correctly detect fewer of the patients who actually do have cancer.

Now consider a different example. Suppose we want to avoid missing too many actual cases of cancer; that is, we want to avoid false negatives. In particular, if a patient actually has cancer but we fail to tell them so, that can be really bad, because if we tell a patient that they don't have cancer, they're not going to go for treatment, and if it turns out they do have cancer, they may not get treated at all. That would be a really bad outcome: they failed to get treated because we told them they don't have cancer, when it turns out they actually did.
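To make the thresholding idea concrete, here is a minimal sketch in Python. It assumes a scikit-learn style workflow and uses synthetic data from make_classification as a stand-in for the cancer dataset; the variable names and the data are illustrative, not from the lecture. Raising the threshold from 0.5 to 0.7 or 0.9 should push precision up and recall down.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cancer data: y = 1 is the rare "has cancer" class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_val)[:, 1]          # h(x): estimated probability that y = 1

for threshold in (0.5, 0.7, 0.9):
    preds = (probs >= threshold).astype(int)    # predict y = 1 only above the threshold
    p = precision_score(y_val, preds, zero_division=0)
    r = recall_score(y_val, preds)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```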
When in doubt, we want to predict that y equals one. So when in doubt, we want to predict that they have cancer, so that at least they look into it further and it can get treated, in case they do turn out to have cancer. In this case, rather than setting a higher probability threshold, we might instead take this value and set it to a lower value, maybe 0.3. By doing so, we're saying that if we think there's more than a 30% chance that they have cancer, we'd better be conservative and tell them that they may have cancer, so they can seek treatment if necessary. In this case, what we'd have is a higher-recall classifier, because we're going to be correctly flagging a higher fraction of all the patients who actually do have cancer, but we're going to end up with lower precision, because a higher fraction of the patients that we said have cancer will turn out not to have cancer after all.

And by the way, just as an aside, when I've talked about this with students before, it's pretty amazing: some of my students point out how I can tell the story both ways, why we might want higher precision or why we might want higher recall, and the story really does work both ways. But regardless of the details of this particular example, the more general principle is: depending on whether you want higher precision with lower recall, or higher recall with lower precision, you can end up predicting y equals one whenever h(x) is greater than some threshold. So, in general, for most classifiers there is going to be a trade-off between precision and recall, and as you vary the value of this threshold, you can actually plot a curve that trades off precision and recall. A point up here would correspond to a very high value of the threshold, maybe a threshold of 0.99, meaning we predict y equals one only when we're at least 99% confident, at least 0.99 probability, that the patient has cancer. That gives high precision but relatively low recall. Whereas a point down here corresponds to a much lower value of the threshold, maybe 0.01, meaning that when in doubt at all, we predict y equals one. If you do that, you end up with a much lower-precision, higher-recall classifier. And as you vary the threshold, if you want, you can actually trace out a curve for your classifier to see the range of different values you can get for precision and recall. By the way, the precision-recall curve can look like many different shapes: sometimes it will look like this, and sometimes it will look like that. There are many possible shapes for the precision-recall curve, depending on the details of the classifier.

So this raises another interesting question, which is: is there a way to choose this threshold automatically? Or, more generally, if we have a few different algorithms, or a few different ideas for algorithms, how do we compare different precision-recall numbers? Concretely, suppose we have three different learning algorithms; or maybe these are actually the same algorithm, just with different values for the threshold. How do we decide which of these is best? One of the things we talked about earlier is the importance of a single real-number evaluation metric: the idea of having one number that just tells you how well your classifier is doing.
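The precision-recall curve described here can be traced the same way: sweep the threshold from low to high and record the precision/recall pair at each value. This sketch reuses probs and y_val from the snippet above (so those names are assumptions carried over from the illustrative setup) and plots recall against precision with matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score

# Reusing y_val and probs from the previous sketch: sweep the threshold
# and record the precision/recall pair at each value.
thresholds = np.linspace(0.01, 0.99, 50)
precisions, recalls = [], []
for t in thresholds:
    preds = (probs >= t).astype(int)
    precisions.append(precision_score(y_val, preds, zero_division=0))
    recalls.append(recall_score(y_val, preds))

# Low thresholds land at high recall / low precision; high thresholds do the opposite.
plt.plot(recalls, precisions, marker=".")
plt.xlabel("recall")
plt.ylabel("precision")
plt.show()
```

scikit-learn also provides precision_recall_curve, which computes the same kind of curve directly from the predicted probabilities.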
But by switching to the precision-recall metric, we've actually lost that. We now have two real numbers, and so we often end up facing situations like this: if we're trying to compare algorithm 1 to algorithm 2, we end up asking ourselves, is a precision of 0.5 and a recall of 0.4 better or worse than a precision of 0.7 and a recall of 0.1? If every time you try out a new algorithm you end up having to sit around and think, well, maybe 0.5 and 0.4 is better than 0.7 and 0.1, or maybe not, I don't know, that really slows down your decision-making process for deciding which changes are useful to incorporate into your algorithm. Whereas, in contrast, if we had a single real-number evaluation metric, a number that just tells us whether algorithm 1 or algorithm 2 is better, that helps us much more quickly decide which algorithm to go with, and helps us much more quickly evaluate different changes that we may be contemplating for an algorithm.

So, how can we get a single real-number evaluation metric? One natural thing you might try is to look at the average of precision and recall. Using P and R to denote precision and recall, you could just compute the average (P + R) / 2 and pick whichever classifier has the highest average value. But this turns out not to be such a good solution, because, similar to the example we had earlier, if we have a classifier that predicts y equals one all the time, then it can get a very high recall but will end up with a very low value of precision. Conversely, if you have a classifier that predicts y equals zero almost all the time, that is, one that predicts y equals one very sparingly (this corresponds to setting a very high threshold, using the notation of the previous slide), then you can end up with very high precision but very low recall. So neither of the two extremes, a very high threshold or a very low threshold, gives a particularly good classifier, and the way we recognize that is by seeing that we end up with either a very low precision or a very low recall.

And if you just take the average of precision and recall, then in this example the average is actually highest for algorithm 3, even though you can get that sort of performance simply by predicting y equals one all the time, and that's just not a very good classifier: a classifier that always predicts y equals one isn't useful if all it does is print out y equals one. So algorithm 1 or algorithm 2 would be more useful than algorithm 3, but in this example algorithm 3 has a higher average of precision and recall than algorithms 1 and 2. So we usually think of this average of precision and recall as not a particularly good way to evaluate a learning algorithm.

In contrast, there is a different way of combining precision and recall. It's called the F score, and it's computed as F = 2PR / (P + R). So, in this example, here are the F scores, and from these F scores we'd say that algorithm 1 has the highest F score, algorithm 2 has the second highest, and algorithm 3 has the lowest, and so, if we go by the F score, we'd probably pick algorithm 1 over the others. The F score, which is also called the F1 score and is usually written "F1 score" as I have here, is often just called the F score.
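As a quick check of that comparison, here is a small sketch that computes both the plain average and the F score for the precision/recall pairs mentioned above. The first two pairs come from the text; the pair for algorithm 3 is an assumed, illustrative value for an "always predict y = 1" classifier, not a number from the lecture.

```python
def f_score(p, r):
    """F score: 2PR / (P + R), taken to be 0 when precision and recall are both 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# (precision, recall) pairs; algorithm 3's pair is an illustrative assumption
# for a classifier that predicts y = 1 all the time (recall 1, tiny precision).
algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}

for name, (p, r) in algorithms.items():
    print(f"{name}: average={(p + r) / 2:.3f}  F score={f_score(p, r):.3f}")
```

With these values, the average ranks algorithm 3 first, while the F score ranks algorithm 1 first, which is exactly the point being made here.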
The way to think about the F score is that it's a little bit like taking the average of precision and recall, but it gives the lower of precision and recall, whichever one that is, a higher weight. You can see in the numerator here that the F score takes the product of precision and recall, and so if either precision equals zero or recall equals zero, the F score will be equal to zero. In that sense it combines precision and recall, but for the F score to be large, both precision and recall have to be pretty large. I should say that there are many different possible formulas for combining precision and recall; this F score formula is really just one out of a much larger number of possibilities, but historically or traditionally this is what people in machine learning use. And the term F score doesn't really mean anything, so don't worry about why it's called the F score or the F1 score.

This formula usually gives you the effect that you want, because if either precision equals zero or recall equals zero, it gives you a very low F score, and so to have a high F score, both precision and recall need to be reasonably close to one. Concretely, if P equals zero or R equals zero, the F score is zero, whereas for a perfect classifier, with precision equal to one and recall equal to one, the F score is 2 * 1 * 1 / (1 + 1), which is equal to one. For intermediate values between zero and one, the F score usually gives a reasonable rank ordering of different classifiers.

So in this video we talked about the notion of trading off between precision and recall, and how we can vary the threshold that we use to decide whether to predict y equals one or y equals zero: this threshold that says whether we need to be at least seventy percent confident, or ninety percent confident, or whatever, before we predict y equals one. By varying the threshold, you can control the trade-off between precision and recall. We also talked about the F score, which takes precision and recall and gives you a single real-number evaluation metric. And of course, if your goal is to automatically set that threshold for deciding whether to predict y equals one or y equals zero, one pretty reasonable way to do that would be to try a range of different values of the threshold, evaluate these different thresholds on, say, your cross-validation set, and then pick whichever value of the threshold gives you the highest F score on your cross-validation set. That would be a pretty reasonable way to automatically choose the threshold for your classifier as well.
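Here is a minimal sketch of that threshold-selection procedure, again reusing probs and y_val (the cross-validation probabilities and labels) from the earlier illustrative snippets; those names are assumptions from that setup, not from the lecture.

```python
import numpy as np
from sklearn.metrics import f1_score

# Try a range of thresholds on the cross-validation set and keep the one
# that gives the highest F score.
candidate_thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, (probs >= t).astype(int)) for t in candidate_thresholds]
best = candidate_thresholds[int(np.argmax(scores))]
print(f"best threshold: {best:.2f}  (cross-validation F score: {max(scores):.2f})")
```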