1 00:00:02,600 --> 00:00:09,282 Coins and dice provide a nice simple model of how to calculate probabilities, 2 00:00:09,282 --> 00:00:14,208 but everyday life is a lot more complicated and it's not taken up with 3 00:00:14,208 --> 00:00:17,561 gambling. At least I hope your life is not taken 4 00:00:17,561 --> 00:00:21,392 up with gambling. So in order to make probabilities more 5 00:00:21,392 --> 00:00:26,387 applicable to everyday life, we need to look at slightly more complicated 6 00:00:26,387 --> 00:00:29,329 methods. Now, because these methods are more 7 00:00:29,329 --> 00:00:33,366 complicated, this lecture is going to be an honors lecture. 8 00:00:33,366 --> 00:00:37,129 It's optional. It will not be on the quiz, so don't get 9 00:00:37,129 --> 00:00:40,757 worried about that. But it is still useful, and it's 10 00:00:40,757 --> 00:00:44,467 fascinating, and it'll help you avoid some mistakes 11 00:00:44,467 --> 00:00:48,177 that people make and that create a lot of problems. 12 00:00:48,177 --> 00:00:52,541 So I hope you will stick with it and listen to this lecture. 13 00:00:52,541 --> 00:00:58,215 And there will be exercises to help you figure out whether you understand the 14 00:00:58,215 --> 00:01:01,561 material or not. But don't get too worried, because it's 15 00:01:01,561 --> 00:01:07,337 not going to be on the quiz. The real problem that we'll be facing in 16 00:01:07,337 --> 00:01:13,301 this lecture is the problem of tests. We use tests all the time. We use tests 17 00:01:13,301 --> 00:01:16,653 to figure out whether you have a certain medical 18 00:01:16,653 --> 00:01:21,483 condition. We use tests to predict the weather or to predict people's future 19 00:01:21,483 --> 00:01:24,380 behavior. We have certain indicators of how 20 00:01:24,380 --> 00:01:28,207 they're going to act: either commit a crime or not commit a 21 00:01:28,207 --> 00:01:32,980 crime, but also whether they're going to pass, do well in school, or fail. 
22 00:01:32,980 --> 00:01:39,237 We always use these tests when we don't know for certain, but we want some kind 23 00:01:39,237 --> 00:01:45,112 of evidence or some kind of indicator. The problem is, none of these tests are 24 00:01:45,112 --> 00:01:48,622 perfect. They always contain errors of various 25 00:01:48,622 --> 00:01:54,422 sorts, and what we're going to have to do is to see how to take those errors of 26 00:01:54,422 --> 00:02:00,297 different sorts and build them together into a method, and then a formula for 27 00:02:00,297 --> 00:02:06,325 calculating how reliable the method is for detecting the thing that we want to 28 00:02:06,325 --> 00:02:09,716 detect. This problem is a lot like a problem we 29 00:02:09,716 --> 00:02:15,031 faced earlier when we were talking about applying generalizations to particular 30 00:02:15,031 --> 00:02:19,482 cases, because here we're going to be applying probabilities to particular 31 00:02:19,482 --> 00:02:22,605 cases. So it'll seem familiar to you in certain 32 00:02:22,605 --> 00:02:26,591 parts, but you'll see that this case is a little trickier. 33 00:02:26,591 --> 00:02:31,374 The best examples occur in medicine. So just imagine that you go to your 34 00:02:31,374 --> 00:02:35,958 doctor for a regular checkup. You don't have any special symptoms, but 35 00:02:35,958 --> 00:02:41,780 he decides to do a few screening tests. And unfortunately, and very worryingly, 36 00:02:41,780 --> 00:02:48,580 it turns out that you test positive on one test for a particular form of cancer, 37 00:02:48,580 --> 00:02:54,878 a certain kind of medical condition. Well, what that means is that you might 38 00:02:54,878 --> 00:02:56,169 have cancer. Might? 39 00:02:56,169 --> 00:02:59,255 Great. 
You want to know whether you do have 40 00:02:59,255 --> 00:03:04,708 cancer, but of course, finding out for sure whether or not you have cancer is 41 00:03:04,708 --> 00:03:10,233 going to take further tests, and those tests might be expensive. They might be 42 00:03:10,233 --> 00:03:14,179 dangerous, they're going to be invasive in various ways. 43 00:03:14,179 --> 00:03:19,345 So you really want to know what's the probability, given that you tested 44 00:03:19,345 --> 00:03:23,680 positive on this one test, that you really have cancer. 45 00:03:23,680 --> 00:03:29,847 Now clearly, that probability is going to depend on a number of facts about this 46 00:03:29,847 --> 00:03:33,290 type of cancer, about the type of test, and so on. 47 00:03:33,290 --> 00:03:37,378 And I am not a doctor, I am not giving you medical advice. 48 00:03:37,378 --> 00:03:43,044 If you test positive on a test, go talk to your doctor, don't trust me because I 49 00:03:43,044 --> 00:03:47,921 am just making up numbers here. But let's make up a few numbers and 50 00:03:47,921 --> 00:03:53,263 figure out what the likelihood is of having cancer, given that you tested 51 00:03:53,263 --> 00:03:56,715 positive. So, let's imagine that the base rate of 52 00:03:56,715 --> 00:04:00,682 this particular type of cancer in the population is 0.3%. 53 00:04:00,682 --> 00:04:06,117 That is three out of 1,000, or 0.003. And to say that's the base rate or it's 54 00:04:06,117 --> 00:04:11,552 sometimes called the prevalence of the condition in the population, that's 55 00:04:11,552 --> 00:04:17,355 simply to say that out of 1,000 people chosen randomly in the population, you'd 56 00:04:17,355 --> 00:04:21,240 have about three that have this condition. 57 00:04:21,240 --> 00:04:25,300 It's just a percentage of the general population. 58 00:04:25,300 --> 00:04:28,916 So, that's the condition, what about the test? 
59 00:04:28,916 --> 00:04:34,505 Well the first thing we want to know is the sensitivity of the test. 60 00:04:34,505 --> 00:04:40,999 The sensitivity of the test, we're going to assume, is 0.99, and what that means is 61 00:04:40,999 --> 00:04:45,036 that out of 100 people who have this condition, 62 00:04:45,036 --> 00:04:51,080 99 of them will test positive. So, this test is pretty good at figuring 63 00:04:51,080 --> 00:04:55,310 out, from among the people who have the condition, 64 00:04:55,310 --> 00:04:59,627 which ones do. 99 of those 100 people who have the 65 00:04:59,627 --> 00:05:04,980 condition will test positive. The other feature is specificity. 66 00:05:04,980 --> 00:05:11,974 And what that means is the percentage of the people who don't have the condition 67 00:05:11,974 --> 00:05:16,051 who will test negative. The point here is that you don't want 68 00:05:16,051 --> 00:05:20,233 to get a positive result for people who don't have the condition. 69 00:05:20,233 --> 00:05:22,939 Right? Because you want it to be specific to 70 00:05:22,939 --> 00:05:27,798 this particular condition, and not get a bunch of positives for people who have 71 00:05:27,798 --> 00:05:31,303 other types of conditions, or no medical condition at all. 72 00:05:31,303 --> 00:05:35,670 So the specificity, we're going to assume, in this particular case we're 73 00:05:35,670 --> 00:05:42,039 talking about, is also 99 percent. Now, what we want to know is the 74 00:05:42,039 --> 00:05:47,010 probability that you have the cancer, the condition, 75 00:05:47,010 --> 00:05:52,720 given that you tested positive on the test. But notice that the sensitivity 76 00:05:52,720 --> 00:05:58,658 tells you the probability that you will test positive, given that you have the 77 00:05:58,658 --> 00:06:02,053 condition. We want to know the opposite of that, the 78 00:06:02,053 --> 00:06:06,614 probability that you have the condition, given that you tested positive. 
79 00:06:06,614 --> 00:06:10,790 And that's what we have to do a little calculation to figure out. 80 00:06:10,790 --> 00:06:15,865 But before we do that calculation, I want you to think about these figures that 81 00:06:15,865 --> 00:06:20,747 I've given you, the prevalence in the population, the sensitivity of the test, 82 00:06:20,747 --> 00:06:23,960 the specificity of the test, and just make a guess. 83 00:06:23,960 --> 00:06:30,468 Just start out by writing down on a piece of paper what you think the probability 84 00:06:30,468 --> 00:06:36,580 is that you would have the cancer, given that you tested positive on the test. 85 00:06:36,580 --> 00:06:40,670 Take a minute and think about it, and write it down. 86 00:06:40,670 --> 00:06:46,367 But we don't want to just guess about medical conditions, about probabilities 87 00:06:46,367 --> 00:06:49,557 that really matter as much as this one does. 88 00:06:49,557 --> 00:06:54,875 Instead, we want to calculate what the probability really is. So let's go 89 00:06:54,875 --> 00:07:00,800 through it carefully, and show you how to use what I'll call the Box Method, in 90 00:07:00,800 --> 00:07:06,725 order to calculate the real likelihood that you have the condition, given that 91 00:07:06,725 --> 00:07:12,415 you got a positive test result. What we need to do is to divide the 92 00:07:12,415 --> 00:07:18,801 population into four different groups: the group that has the condition and 93 00:07:18,801 --> 00:07:22,393 tested positive, the group that has the condition and 94 00:07:22,393 --> 00:07:27,680 tested negative, the group that doesn't have the condition and tested positive, 95 00:07:27,680 --> 00:07:32,763 and the group that doesn't have the condition and tested negative. And this 96 00:07:32,763 --> 00:07:37,914 chart will show you a nice simple way of organizing all of that information. 
97 00:07:37,914 --> 00:07:43,681 Because this row, the top row, tells you all the people who tested 98 00:07:43,681 --> 00:07:47,603 positive. The bottom row tells you the people who 99 00:07:47,603 --> 00:07:51,684 tested negative. Then, the left column gives you the 100 00:07:51,684 --> 00:07:57,607 people who do have the medical condition, in this case some kind of cancer, 101 00:07:57,607 --> 00:08:03,450 and the right column tells you the people who do not have that condition. 102 00:08:03,450 --> 00:08:07,418 Now, what we need to do is to start filling it out with numbers. 103 00:08:07,418 --> 00:08:11,008 Now, the first thing we need to specify is the population. 104 00:08:11,008 --> 00:08:15,921 In this case, we want to start with a big enough population that we're not going to 105 00:08:15,921 --> 00:08:18,629 have a lot of fractions in the other boxes. 106 00:08:18,629 --> 00:08:21,967 So let's just imagine that the population is 100,000. 107 00:08:21,967 --> 00:08:26,124 You could make it one million or ten million. It doesn't matter, because we're going to 108 00:08:26,124 --> 00:08:30,360 be interested in the ratios of the different groups. 109 00:08:30,360 --> 00:08:36,123 We can use that 100,000 to fill out the other boxes if we know the prevalence or 110 00:08:36,123 --> 00:08:39,707 the base rate. Because the base rate tells you what 111 00:08:39,707 --> 00:08:45,119 percentage of that 100,000 actually do have the condition and don't have the 112 00:08:45,119 --> 00:08:48,563 condition. We imagined, remember we're just making 113 00:08:48,563 --> 00:08:54,116 up numbers here, but we imagined that the prevalence of this condition is 0.3%, and 114 00:08:54,116 --> 00:08:59,317 that means out of 100,000 people, there will be 300 who do have the medical 115 00:08:59,317 --> 00:09:03,083 condition. 
Well if there are 300 who have it and 116 00:09:03,083 --> 00:09:07,531 there are 100,000 total, we can figure out how many don't have the medical 117 00:09:07,531 --> 00:09:11,438 condition by just subtracting, which means, 99,700 do not have the 118 00:09:11,438 --> 00:09:12,878 medical condition. Okay? 119 00:09:12,878 --> 00:09:16,753 Now, we've divided the population into our two columns. 120 00:09:16,753 --> 00:09:21,631 The ones that do and the ones that don't have the medical condition. 121 00:09:21,631 --> 00:09:26,223 The next step is to figure out how many are going to test positive. 122 00:09:26,223 --> 00:09:30,887 And how many are going to test negative out of each of these groups. 123 00:09:30,887 --> 00:09:36,555 For that, we first need the sensitivity. The sensitivity tells us the percentage 124 00:09:36,555 --> 00:09:40,860 of the cases that have the condition who will test positive. 125 00:09:40,860 --> 00:09:46,229 So the people who have the condition are the 300. 126 00:09:46,229 --> 00:09:52,365 The ones who test positive are going to go up in this area. 127 00:09:52,365 --> 00:10:01,021 And we know from the sensitivity being 0.99 or 99% that the number in that area 128 00:10:01,021 --> 00:10:07,389 should be 99% of 300, or 297. And of course, if that's the number that 129 00:10:07,389 --> 00:10:12,699 tests positive then the remainder are going to test negative, and that means 130 00:10:12,699 --> 00:10:17,209 that we'll have three, which shouldn't surprise you because if 131 00:10:17,209 --> 00:10:22,665 99% of the cases that have it test positive, then 1% will test negative and 132 00:10:22,665 --> 00:10:24,571 1% of 300 is 3. Good. 133 00:10:24,571 --> 00:10:30,899 So we got the first column done. Now, the next question is going to be the 134 00:10:30,899 --> 00:10:35,690 specificity. We can use the specificity to figure out 135 00:10:35,690 --> 00:10:40,864 what goes in that next column. 
If the specificity is 0.99, 136 00:10:40,864 --> 00:10:50,537 and we know that 99,700 people do not have the condition out of our sample of 137 00:10:50,537 --> 00:10:54,995 100,000, well, that means that 99% of 99,700 are 138 00:10:54,995 --> 00:11:00,831 going to test negative. Because the specificity is the percentage 139 00:11:00,831 --> 00:11:05,391 of cases without the condition that test negative. 140 00:11:05,391 --> 00:11:11,866 And that means that we'll have 98,703 among people who do not have the 141 00:11:11,866 --> 00:11:19,818 condition who test negative. How many are going to test positive? 142 00:11:19,818 --> 00:11:25,765 The rest of them. So 99,700 minus 98,703 is going to be 143 00:11:25,765 --> 00:11:30,048 997. And of course that shouldn't be 144 00:11:30,048 --> 00:11:36,160 surprising again, because 1% of 99,700 is 997. 145 00:11:36,160 --> 00:11:40,999 We've only got two boxes left to fill out. How do you fill out those? 146 00:11:40,999 --> 00:11:47,104 Well this box in the upper right is the total number of people in this population 147 00:11:47,104 --> 00:11:52,390 of 100,000 who test positive. And so we can get that by adding the ones 148 00:11:52,390 --> 00:11:58,197 that do have the condition and test positive and the ones that don't have the 149 00:11:58,197 --> 00:12:01,920 condition and test positive. Just add them together, 150 00:12:01,920 --> 00:12:09,823 and you get 1,294. And you do the same on the next row 151 00:12:09,823 --> 00:12:18,064 because that blank is the area that has all the people who test negative, 152 00:12:18,064 --> 00:12:23,764 and three people who have the condition test negative. 153 00:12:23,764 --> 00:12:29,887 98,703 people who do not have the condition test negative. 154 00:12:29,887 --> 00:12:37,700 So the total is going to be 98,706. And we can check to make sure that we got 155 00:12:37,700 --> 00:12:44,680 it right by just adding them together, 1,294 plus 98,706 is equal to 100,000. 
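The box that has just been filled in can be reproduced with a minimal Python sketch. It uses the lecture's made-up numbers (a population of 100,000, a base rate of 0.3%, and a sensitivity and specificity of 99%). The function name `box_method` and the dictionary keys are illustrative choices, not terms from the lecture, and exact fractions are used only so that the counts stay whole.

```python
from fractions import Fraction

def box_method(population, base_rate, sensitivity, specificity):
    """Fill in the box of counts: rows are test results, columns are condition status."""
    # Bottom row: split the population into the two columns using the base rate.
    with_condition = population * base_rate
    without_condition = population - with_condition
    # Left column: sensitivity is the share of people with the condition who test positive.
    true_pos = with_condition * sensitivity
    false_neg = with_condition - true_pos
    # Right column: specificity is the share of people without the condition who test negative.
    true_neg = without_condition * specificity
    false_pos = without_condition - true_neg
    return {
        "true_pos": true_pos, "false_pos": false_pos, "total_pos": true_pos + false_pos,
        "false_neg": false_neg, "true_neg": true_neg, "total_neg": false_neg + true_neg,
        "with_condition": with_condition, "without_condition": without_condition,
        "population": population,
    }

# The lecture's made-up figures, written as exact fractions.
box = box_method(100_000, Fraction(3, 1000), Fraction(99, 100), Fraction(99, 100))
for name, count in box.items():
    print(f"{name:>18}: {int(count):>7,}")
```

Printed in this order, the first three entries are the top row of the box, the next three are the middle row, and the last three are the bottom row of totals.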
156 00:12:44,680 --> 00:12:49,343 [SOUND] We got it right, okay. So, now we've divided the population into 157 00:12:49,343 --> 00:12:55,405 those people who have the condition, those people who don't have the condition, 158 00:12:55,405 --> 00:13:01,622 and we know how many of each of those groups test positive and how many of each 159 00:13:01,622 --> 00:13:06,441 of those groups test negative. The real question is what's the 160 00:13:06,441 --> 00:13:12,348 probability that I have cancer or the medical condition given that I tested 161 00:13:12,348 --> 00:13:15,301 positive. How do we figure that out? 162 00:13:15,301 --> 00:13:20,183 Well, the total number of positive tests was 163 00:13:20,183 --> 00:13:25,132 1,294, and the people who tested positive who 164 00:13:25,132 --> 00:13:33,877 really had the condition was 297, so it looks like the probability of actually 165 00:13:33,877 --> 00:13:42,062 having the condition given that you tested positive is 297 out of 1,294 or 166 00:13:42,062 --> 00:13:46,547 0.23. That's 23%, less than one out of four. 167 00:13:46,547 --> 00:13:52,727 Is that what you guessed? Most people, including most doctors, when 168 00:13:52,727 --> 00:13:59,011 they hear that the test is 99% sensitive and 99% specific will guess a lot higher 169 00:13:59,011 --> 00:14:03,489 than one in four. Oh my gosh, I'm a doctor and I never 170 00:14:03,489 --> 00:14:08,595 would have thought that. Now, don't worry, she's not a physician, 171 00:14:08,595 --> 00:14:13,501 she's a meta-physician. [INAUDIBLE] But in this case, the 172 00:14:13,501 --> 00:14:18,936 probability really is just one in four that you have that medical condition. 173 00:14:18,936 --> 00:14:23,584 Now, how did that happen? 
The reason was that the prevalence or the 174 00:14:23,584 --> 00:14:28,758 base rate was so low, that even a small rate of false positives, given the 175 00:14:28,758 --> 00:14:34,108 massive numbers of people who don't have the condition, will mean that there are 176 00:14:34,108 --> 00:14:39,458 more false positives, more than three times as many as there are true positives, and that's 177 00:14:39,458 --> 00:14:44,808 why the probability is just one in four, actually a little less than one in four, 178 00:14:44,808 --> 00:14:49,222 that you have the medical condition even when you tested positive. 179 00:14:49,222 --> 00:14:53,770 I want to add a quick caveat here in order to avoid misinterpretation. 180 00:14:53,770 --> 00:14:59,063 The point here is that if you have a screening test for a condition 181 00:14:59,063 --> 00:15:05,094 with a very low base rate or prevalence, and you don't have any symptoms that put 182 00:15:05,094 --> 00:15:11,344 you in a special category, then you need to get another test before you jump to 183 00:15:11,344 --> 00:15:15,840 any conclusions about having the medical condition. 184 00:15:15,840 --> 00:15:21,211 Because, if you have that other test, the fact that you tested positive on the first test 185 00:15:21,211 --> 00:15:26,384 puts you in a smaller class with a much higher base rate or prevalence, and now 186 00:15:26,384 --> 00:15:30,161 the probability is going to go up. Most doctors know that. 187 00:15:30,161 --> 00:15:35,591 And that's why after the first test they don't jump to conclusions and they order 188 00:15:35,591 --> 00:15:38,901 another test. But many patients don't realise that, 189 00:15:38,901 --> 00:15:44,000 and they get extremely worried after a single test, even when they don't have 190 00:15:44,000 --> 00:15:47,509 any symptoms. So that's the mistake that we're trying 191 00:15:47,509 --> 00:15:50,092 to avoid here. 
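To see how much a second test can move the probability, here is a small Python sketch. It rests on an assumption the lecture does not state: the second test has the same 99% sensitivity and specificity as the first, and its errors are independent of the first test's errors. Real follow-up tests often differ on both counts, so the second figure is only an illustration. The function name `posterior` is a hypothetical label for this sketch.

```python
from fractions import Fraction

def posterior(prior, sensitivity, specificity):
    """Probability of having the condition, given one positive test result."""
    true_pos = prior * sensitivity                # share of people who have it and test positive
    false_pos = (1 - prior) * (1 - specificity)   # share who do not have it but test positive
    return true_pos / (true_pos + false_pos)

sens = spec = Fraction(99, 100)

# First positive result: the prior is the made-up base rate of 0.3%.
after_first = posterior(Fraction(3, 1000), sens, spec)

# ASSUMPTION (not from the lecture): the second test is equally accurate and
# independent of the first, so the first posterior serves as the new base rate.
after_second = posterior(after_first, sens, spec)

print(f"after one positive test:  {float(after_first):.1%}")   # 23.0%
print(f"after two positive tests: {float(after_second):.1%}")  # 96.7%
```

Under those assumptions, the first positive result moves you from a class with a 0.3% base rate into one with a base rate of about 23%, which is why a second positive result is so much more informative.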
Now that's surprising, 192 00:15:50,092 --> 00:15:54,440 but it actually applies to many different areas of life. 193 00:15:54,440 --> 00:16:00,032 It applies for example to medical tests with all kinds of other diseases, not 194 00:16:00,032 --> 00:16:04,944 just cancer, or colon cancer, but pretty much every disease where the 195 00:16:04,944 --> 00:16:09,796 prevalence is extremely low. It applies also to drug tests. 196 00:16:09,796 --> 00:16:15,387 If somebody gets a positive drug test, does that mean that they really were 197 00:16:15,387 --> 00:16:19,338 using drugs? Well, if it's a population where the base 198 00:16:19,338 --> 00:16:23,960 rate or prevalence of drug use is quite low, then it might not. 199 00:16:23,960 --> 00:16:28,504 Of course, if you assume that the prevalence or base rate is quite high, 200 00:16:28,504 --> 00:16:33,316 then you're going to believe that drug test, but you need to know the facts 201 00:16:33,316 --> 00:16:38,062 about what the prevalence, or base rate really is in order to calculate 202 00:16:38,062 --> 00:16:42,980 accurately the probability that this person really was using drugs. 203 00:16:42,980 --> 00:16:49,001 Same applies to evidence in legal trials. Take eye witnesses for example. 204 00:16:49,001 --> 00:16:53,769 It's very tricky. Someone's trying to use their eyes as a 205 00:16:53,769 --> 00:16:58,118 test for what they see. They might identify a friend. 206 00:16:58,118 --> 00:17:04,809 Or they might just say that car that did the hit and run accident was a Porsche. 207 00:17:04,809 --> 00:17:09,180 Well, how good are they at identifying Porsches? 
208 00:17:09,180 --> 00:17:13,116 If they get it right most of the time, but not always, 209 00:17:13,116 --> 00:17:18,910 and sometimes they don't get it right when it is a Porsche, then we've got the 210 00:17:18,910 --> 00:17:22,623 sensitivity and specificity of what they identify, 211 00:17:22,623 --> 00:17:28,343 and we can use that to calculate how likely it is that their evidence in the 212 00:17:28,343 --> 00:17:33,546 trial really is reliable. Another example is the prediction of 213 00:17:33,546 --> 00:17:37,790 future behavior. We might have some kind of marker, 214 00:17:37,790 --> 00:17:43,457 so that a certain group of people with that marker have a certain likelihood of 215 00:17:43,457 --> 00:17:49,345 committing crimes, but if crimes are very rare in that community and every other, 216 00:17:49,345 --> 00:17:54,939 then a test which has a pretty good sensitivity and specificity still might 217 00:17:54,939 --> 00:18:00,238 not be good enough when we're talking about something like crime that's 218 00:18:00,238 --> 00:18:05,391 actually very rare, and has a very low prevalence, or base rate, in most 219 00:18:05,391 --> 00:18:09,144 communities. And the same applies to failing out of 220 00:18:09,144 --> 00:18:13,106 school. Are SAT scores or GRE scores going to 221 00:18:13,106 --> 00:18:17,104 be good predictors of who's going to fail out of school? 222 00:18:17,104 --> 00:18:22,118 Well, if very few people fail out of school so that the prevalence or base 223 00:18:22,118 --> 00:18:27,471 rate is very low, then even if they're pretty sensitive and specific, they might 224 00:18:27,471 --> 00:18:31,808 not be good predictors. So this same type of problem arises in a 225 00:18:31,808 --> 00:18:36,687 lot of different areas, and I'm not going to go through more examples right 226 00:18:36,687 --> 00:18:41,837 now, but we'll have plenty of examples in these exercises at the end of this 227 00:18:41,837 --> 00:18:45,727 chapter. 
I want to end, though, by saying a few 228 00:18:45,727 --> 00:18:49,573 things that are a bit more technical about this method. 229 00:18:49,573 --> 00:18:55,028 First, there's a lot of terminology to learn, because when you read about using 230 00:18:55,028 --> 00:19:00,203 this method in other areas for other types of topics, then you'll run into 231 00:19:00,203 --> 00:19:03,760 these terms, and it's a good idea to know them. 232 00:19:03,760 --> 00:19:11,917 So first, the cases where the person does have the 233 00:19:11,917 --> 00:19:17,505 condition and also tests positive are called hits or true positives; different 234 00:19:17,505 --> 00:19:24,040 people use different terms. The cases where 235 00:19:24,040 --> 00:19:29,796 the person tests positive, but they don't have the condition, are called false 236 00:19:29,796 --> 00:19:38,050 positives or false alarms. The cases where a person really does have 237 00:19:38,050 --> 00:19:45,100 the condition but tests negative are called misses or false negatives. 238 00:19:47,180 --> 00:19:52,021 And the cases where the person does not have the condition and the test comes out 239 00:19:52,021 --> 00:19:56,153 negative are called true negatives, because they're negative and it's true 240 00:19:56,153 --> 00:20:02,151 that they don't have the condition. If we put together the false negatives 241 00:20:02,151 --> 00:20:06,961 and the true negatives, we get the total set of negatives. 242 00:20:06,961 --> 00:20:13,965 And if we put together the true positives and the false positives, we get the total 243 00:20:13,965 --> 00:20:18,190 set of positives. And of course we have the general 244 00:20:18,190 --> 00:20:21,505 population, and within that population a percentage 245 00:20:21,505 --> 00:20:26,820 that have the condition and a percentage that don't have the condition. 246 00:20:26,820 --> 00:20:32,504 Now, what's the base rate? 
The base rate, in this population, is 247 00:20:32,504 --> 00:20:38,189 simply the set that has the condition divided by the total population, 248 00:20:38,189 --> 00:20:44,620 which is box seven divided by box nine. If we use E for the evidence, and H for 249 00:20:44,620 --> 00:20:50,584 the hypothesis being true, that the condition really does exist, 250 00:20:50,584 --> 00:20:58,780 then that's the probability of H. And the sensitivity is going to be 251 00:20:58,780 --> 00:21:04,457 the total number of true positives divided by the total number of people 252 00:21:04,457 --> 00:21:09,202 with the condition, because it's the percentage of people who 253 00:21:09,202 --> 00:21:16,318 have the condition who test positive. Okay, so that's the probability of E given 254 00:21:16,318 --> 00:21:23,615 H, and it's box one divided by box seven. The specificity in contrast is the ratio 255 00:21:23,615 --> 00:21:30,541 of true negatives to the total number of people who do not have the 256 00:21:30,541 --> 00:21:34,360 condition. That is, the probability of not E, 257 00:21:34,360 --> 00:21:40,505 that is, not having the evidence of a positive test result, given not H, 258 00:21:40,505 --> 00:21:47,352 given that you're in this second column, where the hypothesis is false, because 259 00:21:47,352 --> 00:21:53,600 you don't have the condition. So that's box five divided by box eight. 260 00:21:53,600 --> 00:21:58,567 That's the specificity. So we can define all of these in terms of 261 00:21:58,567 --> 00:22:02,465 each other. The hits divided by the total with that 262 00:22:02,465 --> 00:22:08,808 condition is going to be the sensitivity, and you can use this terminology to guide 263 00:22:08,808 --> 00:22:13,775 your way through this box. And the big question is, again, going to 264 00:22:13,775 --> 00:22:18,666 be, what's the solution? 
What's the probability of the hypothesis, 265 00:22:18,666 --> 00:22:23,166 having the condition, given the evidence, that is, a positive 266 00:22:23,166 --> 00:22:27,279 test result? That's going to be box one divided by box 267 00:22:27,279 --> 00:22:30,706 three, and as we saw in the case that we just 268 00:22:30,706 --> 00:22:35,732 went through, that gives you the probability of having the medical 269 00:22:35,732 --> 00:22:39,921 condition, or colon cancer, given a positive test result. 270 00:22:39,921 --> 00:22:45,785 That's called the posterior probability or, in symbols, the probability of the 271 00:22:45,785 --> 00:22:50,834 hypothesis, given the evidence. So I hope this terminology helps you 272 00:22:50,834 --> 00:22:56,364 understand some of the discussions of this if you go on and read about it. 273 00:22:56,364 --> 00:23:02,267 This procedure that we've been discussing is actually just an application of a 274 00:23:02,267 --> 00:23:08,319 famous theorem called Bayes' Theorem, after Thomas Bayes, an eighteenth century 275 00:23:08,319 --> 00:23:13,998 English clergyman, who was also a mathematician, and proved this extremely 276 00:23:13,998 --> 00:23:19,301 important theorem in probability theory. Now, some of you out there will use the 277 00:23:19,301 --> 00:23:24,638 boxes and it'll make sense to you, but some Courserians, I assume, are 278 00:23:24,638 --> 00:23:29,348 mathematicians and they want to see the mathematics behind it. 279 00:23:29,348 --> 00:23:34,921 So now, I want to show you how to derive Bayes' theorem from the rules of 280 00:23:34,921 --> 00:23:38,688 probability that we learned in earlier lectures. 281 00:23:38,688 --> 00:23:42,534 So for all you math nerds out there, here it goes. 
282 00:23:42,534 --> 00:23:47,672 You start with rule 2G, apply it to the probability that the 283 00:23:47,672 --> 00:23:50,977 evidence and the hypothesis are both true, 284 00:23:50,977 --> 00:23:57,037 and by the rule, that probability is equal to the probability of the evidence 285 00:23:57,037 --> 00:24:01,681 times the probability of the hypothesis, given the evidence. 286 00:24:01,681 --> 00:24:06,875 You have to have that conditional probability, because they're not 287 00:24:06,875 --> 00:24:13,162 independent. Then you simply divide both sides of that 288 00:24:13,162 --> 00:24:17,540 by the probability of the evidence, a little simple algebra. 289 00:24:17,540 --> 00:24:23,379 And you end up with the probability of the hypothesis, given the evidence, is 290 00:24:23,379 --> 00:24:29,140 equal to the probability of the evidence and the hypothesis divided by the 291 00:24:29,140 --> 00:24:34,825 probability of the evidence. Now we can do a little trick, this was 292 00:24:34,825 --> 00:24:38,205 ingenious. Substitute for E something that's 293 00:24:38,205 --> 00:24:43,401 logically equivalent to E, namely the evidence and the hypothesis or the 294 00:24:43,401 --> 00:24:48,162 evidence and not the hypothesis. Now if you think about it, you'll see 295 00:24:48,162 --> 00:24:53,905 that those are equivalent, because either the hypothesis has to be true or not the 296 00:24:53,905 --> 00:24:57,476 hypothesis is true. One or the other has to be true, 297 00:24:57,476 --> 00:25:03,078 and that means that the evidence and the hypothesis or the evidence and not the 298 00:25:03,078 --> 00:25:05,809 hypothesis is going to be equivalent to E. 299 00:25:05,809 --> 00:25:10,430 So this is equivalent to this, and because they are equivalent, we can 300 00:25:10,430 --> 00:25:15,689 substitute them within the formula for probability without affecting the truth 301 00:25:15,689 --> 00:25:18,723 values. 
So we just substitute this formula in 302 00:25:18,723 --> 00:25:23,276 here for the E up there. And we end up with the probability of the 303 00:25:23,276 --> 00:25:28,656 hypothesis, given the evidence, is equal to the probability of the evidence and 304 00:25:28,656 --> 00:25:34,174 the hypothesis divided by the probability of the evidence and the hypothesis, or 305 00:25:34,174 --> 00:25:39,209 the evidence and not the hypothesis. Now that's not supposed to make much 306 00:25:39,209 --> 00:25:45,685 sense, but it helps with the derivation. The next step is to apply rule three, 307 00:25:45,685 --> 00:25:51,028 because we have a disjunction. And notice that the disjuncts are mutually 308 00:25:51,028 --> 00:25:54,977 exclusive. It cannot be true both that the evidence 309 00:25:54,977 --> 00:26:00,088 and the hypothesis is true, and also, that the evidence and not the 310 00:26:00,088 --> 00:26:04,425 hypothesis is true, because it can't be both H and not H. 311 00:26:04,425 --> 00:26:08,220 So, we can apply the simple version of rule three, 312 00:26:08,220 --> 00:26:14,655 and that means that the probability of E and H, or E and not H, is equal to the 313 00:26:14,655 --> 00:26:19,522 probability of E and H plus the probability of E and not H. 314 00:26:19,522 --> 00:26:25,627 We're just applying that rule three for disjunction that we learned a few 315 00:26:25,627 --> 00:26:30,200 lectures ago. Now we apply rule 2G again, because we 316 00:26:30,200 --> 00:26:36,820 have the probability of a conjunction up in the top. 317 00:26:36,820 --> 00:26:42,476 And since these are not independent of each other, we hope not, if it's a 318 00:26:42,476 --> 00:26:47,897 hypothesis and the evidence for it, then we have to use the conditional 319 00:26:47,897 --> 00:26:51,422 probability. 
And using rule 2G, we find that the 320 00:26:51,422 --> 00:26:57,324 probability of the hypothesis, given the evidence, is equal to the probability of 321 00:26:57,324 --> 00:27:02,415 the hypothesis times the probability of the evidence given the hypothesis, divided 322 00:27:02,415 --> 00:27:07,505 by the probability of the hypothesis times the probability of the evidence, 323 00:27:07,505 --> 00:27:12,375 given the hypothesis, plus the probability of the hypothesis being 324 00:27:12,375 --> 00:27:15,326 false, that is, the probability of not H, times 325 00:27:15,326 --> 00:27:21,700 the probability of the evidence given not H, or the hypothesis being false. 326 00:27:21,700 --> 00:27:24,811 And that's a mouthful, and it's a long formula. 327 00:27:24,811 --> 00:27:30,481 But, that's the mathematical formula that Bayes proved in the 18th century, and 328 00:27:30,481 --> 00:27:36,221 that provides the mathematical basis for that whole system of boxes that we talked 329 00:27:36,221 --> 00:27:39,721 about before. But if you don't like the mathematical 330 00:27:39,721 --> 00:27:43,524 proof, if that's too confusing for you, then use the boxes. 331 00:27:43,524 --> 00:27:47,459 And if you don't like the boxes, use the mathematical proof. 332 00:27:47,459 --> 00:27:51,529 They're both going to work. Just pick the one that works for you. 333 00:27:51,529 --> 00:27:56,532 In fact, you don't have to pick either of them, because remember, this is an honors 334 00:27:56,532 --> 00:27:59,867 lecture. It's optional, and it won't be on the quiz. 335 00:27:59,867 --> 00:28:05,270 But if you do want to try this method and make sure you understand it, we'll have a 336 00:28:05,270 --> 00:28:09,140 bunch of exercises for you, where you can test your skills.
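For readers who want the spoken derivation written out, here is one way to set it down in LaTeX. It follows the steps as described above, using E for the evidence and H for the hypothesis; the symbols \wedge, \vee, and \neg for "and", "or", and "not" are a notational choice made here, and the rule names are the ones used in the lecture. The last lines plug in the lecture's made-up numbers, with the false positive rate taken as one minus the specificity.

```latex
\begin{align*}
P(E \wedge H) &= P(E)\,P(H \mid E)
  && \text{rule 2G} \\
P(H \mid E) &= \frac{P(E \wedge H)}{P(E)}
  && \text{divide both sides by } P(E) \\
&= \frac{P(E \wedge H)}{P\big((E \wedge H) \vee (E \wedge \neg H)\big)}
  && \text{substitute an equivalent of } E \\
&= \frac{P(E \wedge H)}{P(E \wedge H) + P(E \wedge \neg H)}
  && \text{rule three, exclusive disjuncts} \\
&= \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) + P(\neg H)\,P(E \mid \neg H)}
  && \text{rule 2G again} \\[1ex]
P(H \mid E) &= \frac{0.003 \times 0.99}{0.003 \times 0.99 + 0.997 \times 0.01}
  = \frac{0.00297}{0.01294} \approx 0.23
\end{align*}
```

The numerical result matches box one divided by box three from the box method, 297 out of 1,294.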