Coins and dice provide a nice simple model of how to calculate probabilities, but everyday life is a lot more complicated and it's not taken up with gambling. At least I hope your life is not taken up with gambling. So in order to make probabilities more applicable to everyday life, we need to look at slightly more complicated methods. Now, because these methods are more complicated, this lecture is going to be an honors lecture. It's optional. It will not be on the quiz, so don't get worried about that. But it is still useful, and it's fascinating, and it'll help you avoid some mistakes that people make and that create a lot of problems. So I hope you will stick with it and listen to this lecture. And there will be exercises to help you figure out whether you understand the material or not. But don't get too worried, because it's not going to be on the quiz. The real problem that we'll be facing in this lecture is the problem of tests. We use tests all the time. We use tests to figure out whether you have a certain medical condition. We use tests to predict the weather or to predict people's future behavior. We have certain indicators of how they're going to act: whether they're going to commit a crime or not commit a crime, but also whether they're going to do well in school or fail. We always use these tests when we don't know for certain, but we want some kind of evidence or some kind of indicator. The problem is, none of these tests are perfect. They always contain errors of various sorts, and what we're going to have to do is to see how to take those errors of different sorts and build them together into a method, and then a formula, for calculating how reliable the test is for detecting the thing that we want to detect. This problem is a lot like a problem we faced earlier when we were talking about applying generalizations to particular cases, because here we're going to be applying probabilities to particular cases. 
So it'll seem familiar to you in certain parts, but you'll see that this case is a little trickier. The best examples occur in medicine. So just imagine that you go to your doctor for a regular checkup. You don't have any special symptoms, but he decides to do a few screening tests. And unfortunately, and very worryingly, it turns out that you test positive on one test for a particular form of cancer, a certain kind of medical condition. Well, what that means is that you might have cancer. Might? Great. You want to know whether you do have cancer, but of course, finding out for sure whether or not you have cancer is going to take further tests, and those tests might be expensive. They might be dangerous, they're going to be invasive in various ways. So you really want to know what's the probability, given that you tested positive on this one test, that you really have cancer. Now clearly, that probability is going to depend on a number of facts about this type of cancer, about the type of test, and so on. And I am not a doctor, I am not giving you medical advice. If you test positive on a test, go talk to your doctor, don't trust me, because I am just making up numbers here. But let's make up a few numbers and figure out what the likelihood is of having cancer, given that you tested positive. So, let's imagine that the base rate of this particular type of cancer in the population is 0.3%. That is three out of 1,000, or 0.003. And to say that's the base rate, or it's sometimes called the prevalence of the condition in the population, that's simply to say that out of 1,000 people chosen randomly in the population, you'd have about three that have this condition. It's just a percentage of the general population. So, that's the condition. What about the test? Well, the first thing we want to know is the sensitivity of the test. The sensitivity of the test, we're going to assume, is 0.99, 
and what that means is that out of 100 people who have this condition, 99 of them will test positive. So, this test is pretty good at figuring out, from among the people who have the condition, which ones do. 99 of those 100 people who have the condition will test positive. The other feature is specificity. And what that means is the percentage of the people who don't have the condition who will test negative. The point here is that you're not going to get a positive result for people who don't have the condition. Right? Because you want it to be specific to this particular condition, and not get a bunch of positives for people who have other types of conditions, or no medical condition at all. So the specificity, we're going to assume, in this particular case we're talking about, is also 99 percent. Now, what we want to know is the probability that you have the cancer, the condition, given that you tested positive on the test. But notice that the sensitivity tells you the probability that you will test positive, given that you have the condition. We want to know the reverse of that, the probability that you have the condition, given that you tested positive. And that's what we have to do a little calculation to figure out. But before we do that calculation, I want you to think about these figures that I've given you, the prevalence in the population, the sensitivity of the test, the specificity of the test, and just make a guess. Just start out by writing down on a piece of paper what you think the probability is that you would have the cancer, given that you tested positive on the test. Take a minute and think about it, and write it down. But we don't want to just guess about medical conditions, about probabilities that really matter as much as this one does. Instead, we want to calculate what the probability really is. 
So let's go through it carefully, and show you how to use what I'll call the Box Method, in order to calculate the real likelihood that you have the condition, given that you got a positive test result. What we need to do is to divide the population into four different groups: the group that has the condition and tested positive, the group that has the condition and tested negative, the group that doesn't have the condition and tested positive, and the group that doesn't have the condition and tested negative. And this chart will show you a nice simple way of organizing all of that information. Because this row, the top row, tells you all the people who tested positive. The bottom row tells you the people who tested negative. Then, the left column gives you the people who do have the medical condition, in this case some kind of cancer, and the right column tells you the people who do not have that condition. Now, what we need to do is to start filling it out with numbers. The first thing we need to specify is the population. In this case, we want to start with a big enough population that we're not going to have a lot of fractions in the other boxes. So let's just imagine that the population is 100,000. You could make it one million or ten million. It doesn't matter, because we're going to be interested in the ratios of the different groups. We can use that 100,000 to fill out the other boxes if we know the prevalence or the base rate. Because the base rate tells you what percentage of that 100,000 actually do have the condition and what percentage don't have the condition. We imagined, remember we're just making up numbers here, but we imagined that the prevalence of this condition is 0.3%, and that means out of 100,000 people, there will be 300 who do have the medical condition. Well, if there are 300 who have it and there are 100,000 total, we can figure out how many don't have the medical condition by just subtracting, which means 99,700 do not have the medical condition. Okay? 
Now, we've divided the population into our two columns: the ones that do and the ones that don't have the medical condition. The next step is to figure out how many are going to test positive and how many are going to test negative out of each of these groups. For that, we first need the sensitivity. The sensitivity tells us the percentage of the cases that have the condition who will test positive. So the people who have the condition are the 300. The ones who test positive are going to go up in this area. And we know from the sensitivity being 0.99, or 99%, that the number in that area should be 99% of 300, or 297. And of course, if that's the number that tests positive, then the remainder are going to test negative, and that means that we'll have three, which shouldn't surprise you, because if 99% of the cases that have it test positive, then 1% will test negative, and 1% of 300 is 3. Good. So we've got the first column done. Now, the next question is going to be the specificity. We can use the specificity to figure out what goes in that next column. If the specificity is 0.99, and we know that 99,700 people do not have the condition out of our sample of 100,000, well, that means that 99% of 99,700 are going to test negative, because the specificity is the percentage of cases without the condition that test negative. And that means that we'll have 98,703 among people who do not have the condition who test negative. How many are going to test positive? The rest of them. So 99,700 minus 98,703 is going to be 997. And of course that shouldn't be surprising again, because 1% of 99,700 is 997. We've only got two boxes left to fill out. How do you fill out those? Well, this box in the upper right is the total number of people in this population of 100,000 who test positive. And so we can get that by adding the ones that do have the condition and test positive and the ones that don't have the condition and test positive. Just add them together, and you get 1,294. 
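The box-filling steps described so far can be sketched in a few lines of Python. This is only an illustration using the lecture's made-up numbers; the variable names are my own, and the rates are written as whole-number percentages so that the arithmetic stays exact.

```python
# Made-up numbers from the lecture, kept as integers to avoid rounding noise.
population = 100_000
base_rate_per_1000 = 3   # 0.3% prevalence: 3 people in every 1,000
sensitivity_pct = 99     # % of people WITH the condition who test positive
specificity_pct = 99     # % of people WITHOUT the condition who test negative

# Step 1: split the population into the two columns.
have = population * base_rate_per_1000 // 1000   # 300
have_not = population - have                     # 99,700

# Step 2: fill the left column using the sensitivity.
true_pos = have * sensitivity_pct // 100         # 297
false_neg = have - true_pos                      # 3

# Step 3: fill the right column using the specificity.
true_neg = have_not * specificity_pct // 100     # 98,703
false_pos = have_not - true_neg                  # 997

# Step 4: total for the top (tested positive) row.
total_pos = true_pos + false_pos                 # 1,294

print(have, have_not, true_pos, false_neg, true_neg, false_pos, total_pos)
```

Changing `population` to one million or ten million scales every box by the same factor, which is why the choice of population size does not affect the ratios.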
And you do the same on the next row, because that blank is the area that has all the people who test negative. Three people who have the condition test negative, and 98,703 people who do not have the condition test negative. So the total is going to be 98,706. And we can check to make sure that we got it right by just adding them together: 1,294 plus 98,706 is equal to 100,000. We got it right, okay. So, now we've divided the population into those people who have the condition and those people who don't have the condition, and we know how many of each of those groups test positive and how many of each of those groups test negative. The real question is, what's the probability that I have cancer, or the medical condition, given that I tested positive? How do we figure that out? Well, the total number of positive tests was 1,294, and the number of people who tested positive who really had the condition was 297, so the probability of actually having the condition, given that you tested positive, is 297 out of 1,294, or about 0.23. That's 23%, less than one out of four. Is that what you guessed? Most people, including most doctors, when they hear that the test is 99% sensitive and 99% specific, will guess a lot higher than one in four. Oh my gosh, I'm a doctor and I never would have thought that. Now, don't worry, she's not a physician, she's a meta-physician. But in this case, the probability really is just one in four that you have that medical condition. Now, how did that happen? The reason was that the prevalence or the base rate was so low that even a small rate of false positives, given the massive numbers of people who don't have the condition, will mean that there are more false positives, more than three times as many, than there are true positives. And that's why the probability is just one in four, actually a little less than one in four, that you have the medical condition even when you tested positive. I want to add a quick caveat here in order to avoid misinterpretation. 
The point here is that if you have a screening test for a condition with a very low base rate or prevalence, and you don't have any symptoms that put you in a special category, then you need to get another test before you jump to any conclusions about having the medical condition. Because, when you take that other test, the fact that you tested positive on the first test puts you in a smaller class with a much higher base rate or prevalence, and now the probability is going to go up. Most doctors know that. And that's why after the first test they don't jump to conclusions and they order another test. But many patients don't realize that, and they get extremely worried after a single test, even when they don't have any symptoms. So that's the mistake that we're trying to avoid here. Now that's surprising, but it actually applies to many different areas of life. It applies, for example, to medical tests for all kinds of other diseases, not just cancer, or colon cancer, but pretty much every disease where the prevalence is extremely low. It applies also to drug tests. If somebody gets a positive drug test, does that mean that they really were using drugs? Well, if it's a population where the base rate or prevalence of drug use is quite low, then it might not. Of course, if you assume that the prevalence or base rate is quite high, then you're going to believe that drug test, but you need to know the facts about what the prevalence or base rate really is in order to calculate accurately the probability that this person really was using drugs. The same applies to evidence in legal trials. Take eyewitnesses, for example. It's very tricky. Someone's trying to use their eyes as a test for what they see. They might identify a friend. Or they might just say that the car in the hit-and-run accident was a Porsche. Well, how good are they at identifying Porsches? 
If they get it right most of the time, but not always, and sometimes they don't get it right when it is a Porsche, then we've got the sensitivity and specificity of what they identify, and we can use that to calculate how likely it is that their evidence in the trial really is reliable or not. Another example is the prediction of future behavior. We might have some kind of marker, such that a certain group of people with that marker have a certain likelihood of committing crimes. But if crimes are very rare in that community and every other, then a test which has a pretty good sensitivity and specificity still might not be good enough when we're talking about something like crime that's actually very rare, and has a very low prevalence or base rate in most communities. And the same applies to failing out of school. Are SAT scores or GRE scores going to be good predictors of who's going to fail out of school? Well, if very few people fail out of school, so that the prevalence or base rate is very low, then even if they're pretty sensitive and specific, they might not be good predictors. So this same type of problem arises in a lot of different areas, and I'm not going to go through more examples right now, but we'll have plenty of examples in the exercises at the end of this chapter. I want to end, though, by saying a few things that are a bit more technical about this method. First, there's a lot of terminology to learn, because when you read about using this method in other areas for other types of topics, then you'll run into these terms, and it's a good idea to know them. So first, the cases where the person does have the condition and also tests positive are called hits or true positives. Different people use different terms. The cases where the person tests positive, but they don't have the condition, are called false positives or false alarms. 
The cases where a person really does have the condition but tests negative are called misses or false negatives. And the cases where the person does not have the condition and the test comes out negative are called true negatives, because they're negative and it's true that they don't have the condition. If we put together the false negatives and the true negatives, we get the total set of negatives. And if we put together the true positives and the false positives, we get the total set of positives. And of course we have the general population, and within that population a percentage that have the condition and a percentage that don't have the condition. Now, what's the base rate? The base rate, in this population, is simply the set that has the condition divided by the total population, which is box seven divided by box nine. If we use E for the evidence, and H for the hypothesis being true, that the condition really does exist, then that's the probability of H. And the sensitivity is going to be the total number of true positives divided by the total number of people with the condition, because it's the percentage of people who have the condition who test positive. Okay, so that's the probability of E given H, and it's box one divided by box seven. The specificity, in contrast, is the ratio of true negatives to the total number of people who do not have the condition. That is, the probability of not E, that is, not having the evidence of a positive test result, given not H, given that you're in this second column, where the hypothesis is false, because you don't have the condition. So that's box five divided by box eight. That's the specificity. So we can define all of these in terms of each other. The hits divided by the total with that condition is going to be the sensitivity, and you can use this terminology to guide your way through this box. And the big question is, again, going to be, what's the solution? 
What's the probability of the hypothesis, having the condition, given the evidence, that is, a positive test result? That's going to be box one divided by box three, and as we saw in the case that we just went through, that gives you the probability of having the medical condition, or colon cancer, given a positive test result. That's called the posterior probability or, in symbols, the probability of the hypothesis, given the evidence. So I hope this terminology helps you understand some of the discussions of this if you go on and read about it. This procedure that we've been discussing is actually just an application of a famous theorem called Bayes' Theorem, after Thomas Bayes, an eighteenth-century English clergyman, who was also a mathematician, and proved this extremely important theorem in probability theory. Now, some of you out there will use the boxes and it'll make sense to you, but some Courserians, I assume, are mathematicians and they want to see the mathematics behind it. So now, I want to show you how to derive Bayes' Theorem from the rules of probability that we learned in earlier lectures. So for all you math nerds out there, here it goes. You start with rule 2G and apply it to the probability that the evidence and the hypothesis are both true, and by the rule, that probability is equal to the probability of the evidence times the probability of the hypothesis, given the evidence. You have to have that conditional probability, because they're not independent. Then you simply divide both sides of that by the probability of the evidence, a little simple algebra. And you end up with the probability of the hypothesis, given the evidence, is equal to the probability of the evidence and the hypothesis divided by the probability of the evidence. Now we can do a little trick, and this was ingenious. Substitute for E something that's logically equivalent to E, namely the evidence and the hypothesis, or the evidence and not the hypothesis. 
Now if you think about it, you'll see that those are equivalent, because either the hypothesis has to be true or not the hypothesis is true. One or the other has to be true, and that means that the evidence and the hypothesis, or the evidence and not the hypothesis, is going to be equivalent to E. So this is equivalent to this, and because they are equivalent, we can substitute one for the other within the formula for probability without affecting the truth values. So we just substitute this formula in here for the E up there. And we end up with the probability of the hypothesis, given the evidence, is equal to the probability of the evidence and the hypothesis divided by the probability of the evidence and the hypothesis, or the evidence and not the hypothesis. Now that's not supposed to make much sense, but it helps with the derivation. The next step is to apply rule three, because we have a disjunction. And notice that the disjuncts are mutually exclusive. It cannot be true both that the evidence and the hypothesis is true, and also that the evidence and not the hypothesis is true, because it can't be both H and not H. So, we can apply the simple version of rule three, and that means that the probability of E and H, or E and not H, is equal to the probability of E and H plus the probability of E and not H. We're just applying that rule three for disjunction that we learned a few lectures ago. Now we apply rule 2G again, because we have the probability of a conjunction up in the top. And since these are not independent of each other, we hope not, if it's a hypothesis and the evidence for it, then we have to use the conditional probability. 
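Written in symbols, the derivation up to the application of rule three reads as follows. The notation is my own shorthand rather than anything taken from the lecture slides: P(·) for probability, ∧ for "and", ∨ for "or", ¬ for "not", and it assumes P(E) is greater than zero so that the division is allowed.

```latex
\begin{align*}
P(E \wedge H) &= P(E)\,P(H \mid E)
  && \text{(rule 2G)} \\
P(H \mid E) &= \frac{P(E \wedge H)}{P(E)}
  && \text{(divide both sides by } P(E)\text{)} \\
&= \frac{P(E \wedge H)}{P\big((E \wedge H) \vee (E \wedge \neg H)\big)}
  && \text{(substitute an equivalent for } E\text{)} \\
&= \frac{P(E \wedge H)}{P(E \wedge H) + P(E \wedge \neg H)}
  && \text{(rule 3, mutually exclusive disjuncts)}
\end{align*}
```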
And using rule 2G, we find that the probability of the hypothesis, given the evidence, is equal to the probability of the hypothesis times the probability of the evidence, given the hypothesis, divided by the probability of the hypothesis times the probability of the evidence, given the hypothesis, plus the probability of the hypothesis being false, that is, the probability of not H, times the probability of the evidence, given not H, or the hypothesis being false. And that's a mouthful, and it's a long formula. But that's the mathematical formula that Bayes proved in the 18th century, and that provides the mathematical basis for that whole system of boxes that we talked about before. But if you don't like the mathematical proof, if that's too confusing for you, then use the boxes. And if you don't like the boxes, use the mathematical proof. They're both going to work. Just pick the one that works for you. In fact, you don't have to pick either of them, because remember, this is an honors lecture. It's optional, and it won't be on the quiz. But if you do want to try this method and make sure you understand it, we'll have a bunch of exercises for you, where you can test your skills.
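To see that the formula and the boxes really do agree, here is a minimal Python sketch that computes the posterior probability both ways for the lecture's made-up numbers. The function names `posterior_bayes` and `posterior_boxes` are hypothetical, chosen only for this illustration.

```python
def posterior_bayes(prior, sensitivity, specificity):
    """P(H|E) by the formula: P(H)P(E|H) / [P(H)P(E|H) + P(not H)P(E|not H)].

    H = having the condition, E = a positive test result.
    """
    p_e_given_h = sensitivity              # P(E | H)
    p_e_given_not_h = 1.0 - specificity    # P(E | not H), the false-positive rate
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1.0 - prior) * p_e_given_not_h)


def posterior_boxes(population, prior, sensitivity, specificity):
    """P(H|E) by the Box Method: true positives divided by all positives."""
    have = population * prior                    # left column total
    have_not = population - have                 # right column total
    true_pos = have * sensitivity                # hits
    false_pos = have_not * (1.0 - specificity)   # false alarms
    return true_pos / (true_pos + false_pos)


p_formula = posterior_bayes(0.003, 0.99, 0.99)
p_boxes = posterior_boxes(100_000, 0.003, 0.99, 0.99)
print(round(p_formula, 4), round(p_boxes, 4))
```

Both values come out at roughly 0.23, matching 297 out of 1,294. The population size cancels out in the Box Method, which is why the formula does not need it.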