1
00:00:02,600 --> 00:00:09,282
Coins and dice provide a nice simple 
model of how to calculate probabilities, 

2
00:00:09,282 --> 00:00:14,208
but everyday life is a lot more 
complicated, and it's not taken up with

3
00:00:14,208 --> 00:00:17,561
gambling. 
At least I hope your life is not taken

4
00:00:17,561 --> 00:00:21,392
up with gambling. 
So in order to make probabilities more 

5
00:00:21,392 --> 00:00:26,387
applicable to everyday life, we need to 
look at slightly more complicated 

6
00:00:26,387 --> 00:00:29,329
methods. 
Now, because these methods are more 

7
00:00:29,329 --> 00:00:33,366
complicated, this lecture is going to be 
an honors lecture. 

8
00:00:33,366 --> 00:00:37,129
It's optional. 
It will not be on the quiz, so don't get 

9
00:00:37,129 --> 00:00:40,757
worried about that. 
But it is still useful, and it's 

10
00:00:40,757 --> 00:00:44,467
fascinating, 
and it'll help you avoid some mistakes 

11
00:00:44,467 --> 00:00:48,177
that people make and that create a lot of 
problems. 

12
00:00:48,177 --> 00:00:52,541
So I hope you will stick with it and 
listen to this lecture. 

13
00:00:52,541 --> 00:00:58,215
And there will be exercises to help you 
figure out whether you understand the 

14
00:00:58,215 --> 00:01:01,561
material or not. 
But don't get too worried, because it's 

15
00:01:01,561 --> 00:01:07,337
not going to be on the quiz. 
The real problem that we'll be facing in 

16
00:01:07,337 --> 00:01:13,301
this lecture, is the problem of tests. 
We use tests all the time, we use tests 

17
00:01:13,301 --> 00:01:16,653
to figure out
whether you have a certain medical

18
00:01:16,653 --> 00:01:21,483
condition; we use tests to predict the
weather or to predict people's future 

19
00:01:21,483 --> 00:01:24,380
behavior. 
We have certain indicators of how

20
00:01:24,380 --> 00:01:28,207
they're going to act. 
Either commit a crime or not commit a 

21
00:01:28,207 --> 00:01:32,980
crime, but also whether they're going to 
pass, do well in school, or fail. 

22
00:01:32,980 --> 00:01:39,237
We always use these tests when we don't
know for certain, but we want some kind 

23
00:01:39,237 --> 00:01:45,112
of evidence or some kind of indicator. 
The problem is, none of these tests are 

24
00:01:45,112 --> 00:01:48,622
perfect. 
They always contain errors of various 

25
00:01:48,622 --> 00:01:54,422
sorts, and what we're going to have to do 
is to see how to take those errors of 

26
00:01:54,422 --> 00:02:00,297
different sorts and build them together 
into a method, and then a formula for 

27
00:02:00,297 --> 00:02:06,325
calculating how reliable the method is 
for detecting the thing that we want to 

28
00:02:06,325 --> 00:02:09,716
detect. 
This problem is a lot like a problem we 

29
00:02:09,716 --> 00:02:15,031
faced earlier when we were talking about 
applying generalizations to particular 

30
00:02:15,031 --> 00:02:19,482
cases, because here we're going to be 
applying probabilities to particular 

31
00:02:19,482 --> 00:02:22,605
cases. 
So it'll seem familiar to you in certain 

32
00:02:22,605 --> 00:02:26,591
parts but you'll see that this case is a 
little trickier. 

33
00:02:26,591 --> 00:02:31,374
The best examples occur in medicine. 
So just imagine that you go to your 

34
00:02:31,374 --> 00:02:35,958
doctor for a regular checkup. 
You don't have any special symptoms, but 

35
00:02:35,958 --> 00:02:41,780
he decides to do a few screening tests. 
And unfortunately, and very worryingly, 

36
00:02:41,780 --> 00:02:48,580
it turns out that you test positive on 
one test for a particular form of cancer, 

37
00:02:48,580 --> 00:02:54,878
a certain kind of medical condition. 
Well, what that means is that you might 

38
00:02:54,878 --> 00:02:56,169
have cancer. 
Might? 

39
00:02:56,169 --> 00:02:59,255
Great. 
You want to know whether you do have 

40
00:02:59,255 --> 00:03:04,708
cancer, but of course, finding out for 
sure whether or not you have cancer is 

41
00:03:04,708 --> 00:03:10,233
going to take further tests, and those 
tests might be expensive. They might be 

42
00:03:10,233 --> 00:03:14,179
dangerous, they're going to be invasive 
in various ways. 

43
00:03:14,179 --> 00:03:19,345
So you really want to know what's the 
probability, given that you tested 

44
00:03:19,345 --> 00:03:23,680
positive on this one test, that you 
really have cancer. 

45
00:03:23,680 --> 00:03:29,847
Now clearly, that probability is going to 
depend on a number of facts about this

46
00:03:29,847 --> 00:03:33,290
type of cancer, about the type of test, 
and so on. 

47
00:03:33,290 --> 00:03:37,378
And I am not a doctor, I am not giving 
you medical advice. 

48
00:03:37,378 --> 00:03:43,044
If you test positive on a test, go talk 
to your doctor, don't trust me because I 

49
00:03:43,044 --> 00:03:47,921
am just making up numbers here. 
But let's make up a few numbers and

50
00:03:47,921 --> 00:03:53,263
figure out what the likelihood is of 
having cancer, given that you tested 

51
00:03:53,263 --> 00:03:56,715
positive. 
So, let's imagine that the base rate of 

52
00:03:56,715 --> 00:04:00,682
this particular type of cancer in the 
population is 0.3%.

53
00:04:00,682 --> 00:04:06,117
That is three out of 1,000, or 0.003.
And to say that's the base rate or it's 

54
00:04:06,117 --> 00:04:11,552
sometimes called the prevalence of the 
condition in the population, that's 

55
00:04:11,552 --> 00:04:17,355
simply to say that out of 1,000 people 
chosen randomly in the population, you'd 

56
00:04:17,355 --> 00:04:21,240
have about three that have this 
condition. 

57
00:04:21,240 --> 00:04:25,300
It's just a percentage of the general 
population. 

58
00:04:25,300 --> 00:04:28,916
So, that's the condition, what about the 
test? 

59
00:04:28,916 --> 00:04:34,505
Well the first thing we want to know is 
the sensitivity of the test. 

60
00:04:34,505 --> 00:04:40,999
The sensitivity of the test, we're going 
to assume is 0.99, and what that means is

61
00:04:40,999 --> 00:04:45,036
that out of 100 people, who have this 
condition, 

62
00:04:45,036 --> 00:04:51,080
99 of them will test positive. 
So, this test is pretty good at figuring 

63
00:04:51,080 --> 00:04:55,310
out, from among the people who have the 
condition, 

64
00:04:55,310 --> 00:04:59,627
which ones do. 
99 of those 100 people who have the 

65
00:04:59,627 --> 00:05:04,980
condition will test positive. 
The other feature is specificity. 

66
00:05:04,980 --> 00:05:11,974
And what that means is the percentage of 
the people who don't have the condition 

67
00:05:11,974 --> 00:05:16,051
who will test negative. 
The point here is that you're not 

68
00:05:16,051 --> 00:05:20,233
going to get a positive result for people 
who don't have the condition. 

69
00:05:20,233 --> 00:05:22,939
Right? 
Because you want it to be specific to 

70
00:05:22,939 --> 00:05:27,798
this particular condition, and not get a 
bunch of positives for people who have 

71
00:05:27,798 --> 00:05:31,303
other types of conditions, or no medical 
condition at all. 

72
00:05:31,303 --> 00:05:35,670
So the specificity, we're going to 
assume, in this particular case we're 

73
00:05:35,670 --> 00:05:42,039
talking about, is also 99 percent. 
Now, what we want to know is the 

74
00:05:42,039 --> 00:05:47,010
probability, that you have the cancer, 
the condition, 

75
00:05:47,010 --> 00:05:52,720
given that you tested positive on the 
test. But notice that the sensitivity 

76
00:05:52,720 --> 00:05:58,658
tells you the probability that you will 
test positive, given that you have the 

77
00:05:58,658 --> 00:06:02,053
condition. 
We want to know the opposite of that, the 

78
00:06:02,053 --> 00:06:06,614
probability that you have the condition, 
given that you tested positive. 

79
00:06:06,614 --> 00:06:10,790
And that's what we have to do a little 
calculation to figure out. 

80
00:06:10,790 --> 00:06:15,865
But before we do that calculation, I want 
you to think about these figures that 

81
00:06:15,865 --> 00:06:20,747
I've given you, the prevalence in the 
population, the sensitivity of the test, 

82
00:06:20,747 --> 00:06:23,960
the specificity of the test, and just 
make a guess. 

83
00:06:23,960 --> 00:06:30,468
Just start out by writing down on a piece 
of paper what you think the probability 

84
00:06:30,468 --> 00:06:36,580
is that you would have the cancer, given 
that you tested positive on the test. 

85
00:06:36,580 --> 00:06:40,670
Take a minute and think about it, and 
write it down. 

86
00:06:40,670 --> 00:06:46,367
But we don't want to just guess about 
medical conditions, about probabilities 

87
00:06:46,367 --> 00:06:49,557
that really matter as much as this one 
does.

88
00:06:49,557 --> 00:06:54,875
Instead, we want to calculate what the 
probability really is. So let's go 

89
00:06:54,875 --> 00:07:00,800
through it carefully, and show you how to 
use what I'll call the Box Method, in 

90
00:07:00,800 --> 00:07:06,725
order to calculate the real likelihood 
that you have the condition, given that 

91
00:07:06,725 --> 00:07:12,415
you got a positive test result. 
What we need to do is to divide the 

92
00:07:12,415 --> 00:07:18,801
population into four different groups. 
The group that has the condition and 

93
00:07:18,801 --> 00:07:22,393
tested positive, 
the group that has the condition and 

94
00:07:22,393 --> 00:07:27,680
tested negative, the group that doesn't 
have the condition and tested positive 

95
00:07:27,680 --> 00:07:32,763
and the group that doesn't have the 
condition and tested negative. And this 

96
00:07:32,763 --> 00:07:37,914
chart will show you a nice simple way of 
organizing all of that information. 

97
00:07:37,914 --> 00:07:43,681
Because this row, the top row, 
tells you all the people who tested 

98
00:07:43,681 --> 00:07:47,603
positive. 
The bottom row tells you the people who 

99
00:07:47,603 --> 00:07:51,684
tested negative. 
Then, the left column gives you the 

100
00:07:51,684 --> 00:07:57,607
people who do have the medical condition, 
in this case some kind of cancer, 

101
00:07:57,607 --> 00:08:03,450
and the right column tells you the people 
who do not have that condition. 

102
00:08:03,450 --> 00:08:07,418
Now, what we need to do is to start 
filling it out with numbers. 

103
00:08:07,418 --> 00:08:11,008
Now, the first thing we need to specify 
is the population. 

104
00:08:11,008 --> 00:08:15,921
In this case, we want to start with a big 
enough population that we're not going to 

105
00:08:15,921 --> 00:08:18,629
have a lot of fractions in the other 
boxes. 

106
00:08:18,629 --> 00:08:21,967
So let's just imagine that the population 
is 100,000. 

107
00:08:21,967 --> 00:08:26,124
Make it one million or ten million. 
It doesn't matter, because we're going to 

108
00:08:26,124 --> 00:08:30,360
be interested in the ratios of the 
different groups. 

109
00:08:30,360 --> 00:08:36,123
We can use that 100,000 to fill out the 
other boxes if we know the prevalence or 

110
00:08:36,123 --> 00:08:39,707
the base rate. 
Because the base rate tells you what 

111
00:08:39,707 --> 00:08:45,119
percentage of that 100,000 actually do 
have the condition and don't have the 

112
00:08:45,119 --> 00:08:48,563
condition. 
We imagined, remember we're just making 

113
00:08:48,563 --> 00:08:54,116
up numbers here, but we imagined that the 
prevalence of this condition is 0.3%, and

114
00:08:54,116 --> 00:08:59,317
that means out of 100,000 people, there 
will be 300 who do have the medical 

115
00:08:59,317 --> 00:09:03,083
condition. 
Well if there are 300 who have it and 

116
00:09:03,083 --> 00:09:07,531
there are 100,000 total, we can figure 
out how many don't have the medical 

117
00:09:07,531 --> 00:09:11,438
condition by just subtracting, 
which means, 99,700 do not have the 

118
00:09:11,438 --> 00:09:12,878
medical condition. 
Okay? 

119
00:09:12,878 --> 00:09:16,753
Now, we've divided the population into 
our two columns. 

120
00:09:16,753 --> 00:09:21,631
The ones that do and the ones that don't 
have the medical condition. 

121
00:09:21,631 --> 00:09:26,223
The next step is to figure out how many 
are going to test positive. 

122
00:09:26,223 --> 00:09:30,887
And how many are going to test negative 
out of each of these groups. 

123
00:09:30,887 --> 00:09:36,555
For that, we first need the sensitivity. 
The sensitivity tells us the percentage 

124
00:09:36,555 --> 00:09:40,860
of the cases that have the condition who 
will test positive. 

125
00:09:40,860 --> 00:09:46,229
So the people who have the condition are 
the 300. 

126
00:09:46,229 --> 00:09:52,365
The ones who test positive are going to 
go up in this area. 

127
00:09:52,365 --> 00:10:01,021
And we know from the sensitivity being 
0.99 or 99% that the number in that area 

128
00:10:01,021 --> 00:10:07,389
should be 99% of 300, or 297. 
And of course, if that's the number that 

129
00:10:07,389 --> 00:10:12,699
tests positive then the remainder are 
going to test negative, and that means 

130
00:10:12,699 --> 00:10:17,209
that we'll have three, 
which shouldn't surprise you because if 

131
00:10:17,209 --> 00:10:22,665
99% of the cases that have it test 
positive, then 1% will test negative and 

132
00:10:22,665 --> 00:10:24,571
1% of 300 is 3. 
Good. 

133
00:10:24,571 --> 00:10:30,899
So we got the first column done. 
Now, the next question is going to be the 

134
00:10:30,899 --> 00:10:35,690
specificity. 
We can use the specificity to figure out 

135
00:10:35,690 --> 00:10:40,864
what goes in that next column. 
If the specificity is 0.99,

136
00:10:40,864 --> 00:10:50,537
and we know that 99,700 people do not 
have the condition out of our sample of 

137
00:10:50,537 --> 00:10:54,995
100,000. 
Well, that means that 99% of 99,700 are 

138
00:10:54,995 --> 00:11:00,831
going to test negative. 
Because the specificity is the percentage

139
00:11:00,831 --> 00:11:05,391
of cases without the condition that test 
negative. 

140
00:11:05,391 --> 00:11:11,866
And that means that we'll have 98,703 
among people who do not have the 

141
00:11:11,866 --> 00:11:19,818
condition who test negative. 
How many are going to test positive?

142
00:11:19,818 --> 00:11:25,765
The rest of them. 
So 99,700 minus 98,703 is going to be 

143
00:11:25,765 --> 00:11:30,048
997. 
And of course that shouldn't be 

144
00:11:30,048 --> 00:11:36,160
surprising again, because 1% of 99,700 is 
997. 

145
00:11:36,160 --> 00:11:40,999
We only got two boxes left to fill out. 
How do you fill out those? 

146
00:11:40,999 --> 00:11:47,104
Well this box in the upper right is the 
total number of people in this population 

147
00:11:47,104 --> 00:11:52,390
of 100,000 who test positive. 
And so we can get that by adding the ones 

148
00:11:52,390 --> 00:11:58,197
that do have the condition and test 
positive and the ones that don't have the 

149
00:11:58,197 --> 00:12:01,920
condition and test positive. Just add 
them together, 

150
00:12:01,920 --> 00:12:09,823
and you get 1294. 
And you do the same on the next row 

151
00:12:09,823 --> 00:12:18,064
because that blank is the area that has 
all the people who test negative, 

152
00:12:18,064 --> 00:12:23,764
and three people who have the condition 
test negative. 

153
00:12:23,764 --> 00:12:29,887
98,703 people who do not have the 
condition test negative. 

154
00:12:29,887 --> 00:12:37,700
So the total is going to be 98,706. 
And we can check to make sure that we got 

155
00:12:37,700 --> 00:12:44,680
it right by just adding them together, 
1,294 plus 98,706 is equal to 100,000.
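If you would rather let a computer fill in the boxes, here is a minimal Python sketch of the steps just described. The function name `fill_boxes` and the dictionary keys are made up for illustration, and the numbers are the same invented ones used in the lecture. Exact fractions are used so that no rounding creeps into the counts.

```python
from fractions import Fraction


def fill_boxes(population, base_rate, sensitivity, specificity):
    """Split a population into the four groups of the box method.

    Hypothetical helper for illustration; it is not part of the lecture.
    """
    have = population * base_rate        # left column: have the condition
    have_not = population - have         # right column: do not have it
    true_pos = have * sensitivity        # have it and test positive
    false_neg = have - true_pos          # have it but test negative
    true_neg = have_not * specificity    # do not have it and test negative
    false_pos = have_not - true_neg      # do not have it but test positive
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        "positives": true_pos + false_pos,   # top row total
        "negatives": false_neg + true_neg,   # bottom row total
    }


# The lecture's made-up numbers: 100,000 people, 0.3% base rate,
# 99% sensitivity, 99% specificity.
boxes = fill_boxes(100_000, Fraction(3, 1000), Fraction(99, 100), Fraction(99, 100))
for name, count in boxes.items():
    print(name, count)
```

The last two entries are the row totals, so adding them together is the same check as in the lecture: they should come back to the full population.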

156
00:12:44,680 --> 00:12:49,343
[SOUND] We got it right, okay. 
So, now we divided the population into 

157
00:12:49,343 --> 00:12:55,405
those people who have the condition, 
those people who don't have the condition 

158
00:12:55,405 --> 00:13:01,622
and we know how many of each of those 
groups test positive and how many of each 

159
00:13:01,622 --> 00:13:06,441
of those groups test negative. 
The real question is what's the 

160
00:13:06,441 --> 00:13:12,348
probability that I have cancer or the 
medical condition given that I tested 

161
00:13:12,348 --> 00:13:15,301
positive. 
How do we figure that out? 

162
00:13:15,301 --> 00:13:20,183
Well, 
the total number of positive tests was 

163
00:13:20,183 --> 00:13:25,132
1294, 
and the people who tested positive who 

164
00:13:25,132 --> 00:13:33,877
really had the condition was 297, so it 
looks like the probability of actually 

165
00:13:33,877 --> 00:13:42,062
having the condition given that you 
tested positive is 297 out of 1294 or 

166
00:13:42,062 --> 00:13:46,547
0.23.
That's 23%, less than one out of four.
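The last step is just one division, which can be checked in a couple of lines of Python. The variable names here are made up for illustration; the two counts are the ones from the boxes above.

```python
true_positives = 297    # have the condition and tested positive
all_positives = 1294    # everyone who tested positive (297 + 997)

# Probability of having the condition, given a positive test.
posterior = true_positives / all_positives
print(f"{posterior:.1%}")  # about 23%, a little less than one in four
```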

167
00:13:46,547 --> 00:13:52,727
Is that what you guessed? 
Most people, including most doctors, when 

168
00:13:52,727 --> 00:13:59,011
they hear that the test is 99% sensitive 
and 99% specific will guess a lot higher 

169
00:13:59,011 --> 00:14:03,489
than one in four.
Oh my gosh, I'm a doctor and I never 

170
00:14:03,489 --> 00:14:08,595
would have thought that. 
Now, don't worry, she's not a physician, 

171
00:14:08,595 --> 00:14:13,501
she's a meta-physician. 
[INAUDIBLE] But in this case, the 

172
00:14:13,501 --> 00:14:18,936
probability really is just one in four 
that you have that medical condition. 

173
00:14:18,936 --> 00:14:23,584
Now, how did that happen? 
The reason was that the prevalence or the 

174
00:14:23,584 --> 00:14:28,758
base rate was so low, that even a small 
rate of false positives, given the 

175
00:14:28,758 --> 00:14:34,108
massive numbers of people who don't have 
the condition, will mean that there are 

176
00:14:34,108 --> 00:14:39,458
more false positives, three times as many 
as there are true positives and that's 

177
00:14:39,458 --> 00:14:44,808
why the probability is just one in four 
actually a little less than one in four 

178
00:14:44,808 --> 00:14:49,222
that you have the medical condition even 
when you tested positive. 
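To see how much the base rate matters, here is a small Python sketch that keeps the test fixed at 99% sensitivity and 99% specificity and changes only the base rate. The function name `posterior` is made up for illustration, and the base rates of 3% and 30% are extra made-up values for comparison; only 0.3% comes from the lecture.

```python
def posterior(base_rate, sensitivity=0.99, specificity=0.99):
    """Probability of having the condition, given a positive test."""
    true_pos = base_rate * sensitivity                # share who have it and test positive
    false_pos = (1 - base_rate) * (1 - specificity)   # share who don't have it but test positive
    return true_pos / (true_pos + false_pos)


# 0.3% is the lecture's base rate; 3% and 30% are hypothetical comparisons.
for base_rate in (0.003, 0.03, 0.3):
    print(f"base rate {base_rate:.1%}: posterior {posterior(base_rate):.1%}")
```

With the same test, the answer climbs steeply as the condition becomes more common, because the false positives stop swamping the true positives.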

179
00:14:49,222 --> 00:14:53,770
I want to add a quick caveat here in 
order to avoid misinterpretation. 

180
00:14:53,770 --> 00:14:59,063
Because the point here is that if you 
have a screening test for a condition 

181
00:14:59,063 --> 00:15:05,094
with a very low base rate or prevalence,
and you don't have any symptoms that put

182
00:15:05,094 --> 00:15:11,344
you in a special category then you need 
to get another test before you jump to 

183
00:15:11,344 --> 00:15:15,840
any conclusions about having the medical 
condition. 

184
00:15:15,840 --> 00:15:21,211
Because, when you have that other test, the
fact that you tested positive on the first test

185
00:15:21,211 --> 00:15:26,384
puts you in a smaller class with a much
higher base rate or prevalence, and now

186
00:15:26,384 --> 00:15:30,161
the probability is going to go up. 
Most doctors know that. 
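Here is a rough Python sketch of why a second test helps. It rests on assumptions that are not in the lecture: the second test is taken to have the same made-up 99% sensitivity and specificity, and its errors are taken to be independent of the first test's errors. Real follow-up tests can differ on both counts. The function name `update` is hypothetical.

```python
def update(prior, sensitivity=0.99, specificity=0.99):
    """Probability of the condition after one more positive result."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


after_first = update(0.003)          # start from the 0.3% base rate
after_second = update(after_first)   # the first result becomes the new base rate

print(f"after one positive test:  {after_first:.1%}")
print(f"after two positive tests: {after_second:.1%}")
```

Under those assumptions, the people who tested positive once form a group with a base rate of about 23% instead of 0.3%, so a second positive result is much stronger evidence.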

187
00:15:30,161 --> 00:15:35,591
And that's why after the first test they 
don't jump to conclusions and they order 

188
00:15:35,591 --> 00:15:38,901
another test. 
But many patients don't realise that, 

189
00:15:38,901 --> 00:15:44,000
and they get extremely worried after a 
single test, even when they don't have 

190
00:15:44,000 --> 00:15:47,509
any symptoms. 
So that's the mistake that we're trying 

191
00:15:47,509 --> 00:15:50,092
to avoid here. 
Now that's surprising, 

192
00:15:50,092 --> 00:15:54,440
but it actually applies to many different 
areas of life. 

193
00:15:54,440 --> 00:16:00,032
It applies for example to medical tests 
with all kinds of other diseases, not 

194
00:16:00,032 --> 00:16:04,944
just cancer, or colon cancer, 
but pretty much every disease where the 

195
00:16:04,944 --> 00:16:09,796
prevalence is extremely low. 
It applies also to drug tests. 

196
00:16:09,796 --> 00:16:15,387
If somebody gets a positive drug test, 
does that mean that they really were 

197
00:16:15,387 --> 00:16:19,338
using drugs? 
Well, if it's a population where the base 

198
00:16:19,338 --> 00:16:23,960
rate or prevalence of drug use is quite 
low, then it might not. 

199
00:16:23,960 --> 00:16:28,504
Of course, if you assume that the 
prevalence or base rate is quite high, 

200
00:16:28,504 --> 00:16:33,316
then you're going to believe that drug 
test, but you need to know the facts 

201
00:16:33,316 --> 00:16:38,062
about what the prevalence, or base rate 
really is in order to calculate 

202
00:16:38,062 --> 00:16:42,980
accurately the probability that this 
person really was using drugs. 

203
00:16:42,980 --> 00:16:49,001
Same applies to evidence in legal trials. 
Take eye witnesses for example. 

204
00:16:49,001 --> 00:16:53,769
It's very tricky. 
Someone's trying to use their eyes as a 

205
00:16:53,769 --> 00:16:58,118
test for what they see. 
They might identify a friend. 

206
00:16:58,118 --> 00:17:04,809
Or they might just say that car that did 
the hit and run accident was a Porsche. 

207
00:17:04,809 --> 00:17:09,180
Well, how good are they at identifying 
Porsches? 

208
00:17:09,180 --> 00:17:13,116
If they get it right most of the time, 
but not always, 

209
00:17:13,116 --> 00:17:18,910
and sometimes they don't get it right 
when it is a Porsche, then we've got the 

210
00:17:18,910 --> 00:17:22,623
sensitivity and specificity of what they 
identify, 

211
00:17:22,623 --> 00:17:28,343
and we can use that to calculate how 
likely it is that their evidence in the 

212
00:17:28,343 --> 00:17:33,546
trial really is reliable or not. 
Another example is the prediction of 

213
00:17:33,546 --> 00:17:37,790
future behavior. 
We might have some kind of marker,

214
00:17:37,790 --> 00:17:43,457
such that a certain group of people with that
marker have a certain likelihood of 

215
00:17:43,457 --> 00:17:49,345
committing crimes, but if crimes are very 
rare in that community and every other, 

216
00:17:49,345 --> 00:17:54,939
then a test which has a pretty good 
sensitivity and specificity still might 

217
00:17:54,939 --> 00:18:00,238
not be good enough when we're talking 
about something like crime that's 

218
00:18:00,238 --> 00:18:05,391
actually very rare, and has a very low 
prevalence, or base rate in most 

219
00:18:05,391 --> 00:18:09,144
communities. 
And the same applies to failing out of 

220
00:18:09,144 --> 00:18:13,106
school. 
Are SAT scores or GRE scores going to

221
00:18:13,106 --> 00:18:17,104
be good predictors of who's going to
fail out of school?

222
00:18:17,104 --> 00:18:22,118
Well, if very few people fail out of 
school so that the prevalence or base

223
00:18:22,118 --> 00:18:27,471
rate is very low, then even if they're 
pretty sensitive and specific, they might 

224
00:18:27,471 --> 00:18:31,808
not be good predictors. 
So this same type of problem arises in a 

225
00:18:31,808 --> 00:18:36,687
lot of different areas, and I'm not 
going to go through more examples right 

226
00:18:36,687 --> 00:18:41,837
now, but we'll have plenty of examples in 
these exercises at the end of this 

227
00:18:41,837 --> 00:18:45,727
chapter. 
I want to end, though, by saying a few 

228
00:18:45,727 --> 00:18:49,573
things that are a bit more technical 
about this method. 

229
00:18:49,573 --> 00:18:55,028
First, there's a lot of terminology to 
learn, because when you read about using 

230
00:18:55,028 --> 00:19:00,203
this method in other areas for other 
types of topics, then you'll run into 

231
00:19:00,203 --> 00:19:03,760
these terms, and it's a good idea to know 
them. 

232
00:19:03,760 --> 00:19:11,917
So first,
the cases where the person does have the

233
00:19:11,917 --> 00:19:17,505
condition and also tests positive are 
called hits or true positives, different 

234
00:19:17,505 --> 00:19:24,040
people use different terms. 
The cases where

235
00:19:24,040 --> 00:19:29,796
the person tests positive, but they don't
have the condition, are called false 

236
00:19:29,796 --> 00:19:38,050
positives or false alarms. 
The cases where a person really does have 

237
00:19:38,050 --> 00:19:45,100
the condition but tests negative are 
called misses or false negatives. 

238
00:19:47,180 --> 00:19:52,021
And the cases where the person does not 
have the condition and the test comes out 

239
00:19:52,021 --> 00:19:56,153
negative are called true negatives,
because they're negative and it's true 

240
00:19:56,153 --> 00:20:02,151
that they don't have the condition. 
If we put together the false negatives 

241
00:20:02,151 --> 00:20:06,961
and the true negatives,
we get the total set of negatives.

242
00:20:06,961 --> 00:20:13,965
And if we put together the true positives 
and the false positives, we get the total 

243
00:20:13,965 --> 00:20:18,190
set of positives. 
And of course we have the general 

244
00:20:18,190 --> 00:20:21,505
population. 
And within that population a percentage 

245
00:20:21,505 --> 00:20:26,820
that have the condition and a percentage 
that don't have the condition. 

246
00:20:26,820 --> 00:20:32,504
Now, what's the base rate? 
The base rate, in this population, is 

247
00:20:32,504 --> 00:20:38,189
simply the set that has the condition,
divided by the total population.

248
00:20:38,189 --> 00:20:44,620
That is box seven divided by box nine.
If we use e for the evidence, and h for 

249
00:20:44,620 --> 00:20:50,584
the hypothesis being true,
that the condition really does exist,

250
00:20:50,584 --> 00:20:58,780
then the base rate is the probability of h.
And the sensitivity is going to be

251
00:20:58,780 --> 00:21:04,457
the total number of true positives
divided by the total number of people 

252
00:21:04,457 --> 00:21:09,202
with the condition. 
Because it's the percentage of people who 

253
00:21:09,202 --> 00:21:16,318
have the condition who also test positive.
Okay, so that's the probability of e given

254
00:21:16,318 --> 00:21:23,615
h, and it's box one divided by box seven. 
The specificity in contrast is the ratio 

255
00:21:23,615 --> 00:21:30,541
of true negatives to the total
number of people who do not have the 

256
00:21:30,541 --> 00:21:34,360
condition. 
That is, the probability of Not E, 

257
00:21:34,360 --> 00:21:40,505
that is, not having the evidence of a 
positive test result, given not H, 

258
00:21:40,505 --> 00:21:47,352
given that you're in this second column, 
where the hypothesis is false, because 

259
00:21:47,352 --> 00:21:53,600
you don't have the condition. 
So that's box five divided by box eight. 

260
00:21:53,600 --> 00:21:58,567
That's the specificity. 
So we can define all of these in terms of 

261
00:21:58,567 --> 00:22:02,465
each other. 
The hits divided by the total with that 

262
00:22:02,465 --> 00:22:08,808
condition is going to be the sensitivity, 
and you can use this terminology to guide 

263
00:22:08,808 --> 00:22:13,775
your way through this box. 
And the big question is, again, going to 

264
00:22:13,775 --> 00:22:18,666
be, what's the solution? 
What's the probability of the hypothesis 

265
00:22:18,666 --> 00:22:23,166
having the condition, 
given the evidence, that is a positive 

266
00:22:23,166 --> 00:22:27,279
test result. 
That's going to be box one divided by box 

267
00:22:27,279 --> 00:22:30,706
three, 
and as we saw in the case that we just 

268
00:22:30,706 --> 00:22:35,732
went through, that gives you the 
probability of having the medical 

269
00:22:35,732 --> 00:22:39,921
condition, or colon cancer, given a 
positive test result. 
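The box numbers can be written out as a small Python sketch. The lecture only names boxes one, three, five, seven, eight, and nine; the layout assumed here, three rows read left to right, is consistent with those, but the positions of boxes two, four, and six are an inference rather than something stated in the lecture.

```python
# Assumed layout of the chart:
#   row 1 (test positive):  box 1 = hits,   box 2 = false alarms,    box 3 = all positives
#   row 2 (test negative):  box 4 = misses, box 5 = true negatives,  box 6 = all negatives
#   row 3 (totals):         box 7 = have it, box 8 = don't have it,  box 9 = population
box = {
    1: 297, 2: 997,   3: 1294,
    4: 3,   5: 98703, 6: 98706,
    7: 300, 8: 99700, 9: 100000,
}

base_rate = box[7] / box[9]     # P(h)
sensitivity = box[1] / box[7]   # P(e | h)
specificity = box[5] / box[8]   # P(not e | not h)
posterior = box[1] / box[3]     # P(h | e)

print(base_rate, sensitivity, specificity, round(posterior, 3))
```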

270
00:22:39,921 --> 00:22:45,785
That's called the posterior probability 
or, in symbols, the probability of the 

271
00:22:45,785 --> 00:22:50,834
hypothesis, given the evidence. 
So I hope this terminology helps you 

272
00:22:50,834 --> 00:22:56,364
understand some of the discussions of 
this if you go on and read about it.

273
00:22:56,364 --> 00:23:02,267
This procedure that we've been discussing 
is actually just an application of a 

274
00:23:02,267 --> 00:23:08,319
famous theorem called Bayes' Theorem, 
after Thomas Bayes, an eighteenth century 

275
00:23:08,319 --> 00:23:13,998
English clergyman, who was also a
mathematician, and proved this extremely important

276
00:23:13,998 --> 00:23:19,301
theorem in probability theory. 
Now, some of you out there will use the 

277
00:23:19,301 --> 00:23:24,638
boxes and it'll make sense to you, but 
some Courserians, I assume, are

278
00:23:24,638 --> 00:23:29,348
mathematicians and they want to see the 
mathematics behind it. 

279
00:23:29,348 --> 00:23:34,921
So now, I want to show you how to
derive Bayes' theorem from the rules of

280
00:23:34,921 --> 00:23:38,688
probability that we learned in earlier 
lectures. 

281
00:23:38,688 --> 00:23:42,534
So for all you Math nerds out there, here 
it goes. 

282
00:23:42,534 --> 00:23:47,672
You start with rule 2G, 
apply it to the probability that the 

283
00:23:47,672 --> 00:23:50,977
evidence and the hypothesis are both 
true, 

284
00:23:50,977 --> 00:23:57,037
and by the rule, that probability is 
equal to the probability of the evidence 

285
00:23:57,037 --> 00:24:01,681
times the probability of the hypothesis, 
given the evidence. 

286
00:24:01,681 --> 00:24:06,875
You have to have that conditional 
probability, because they're not 

287
00:24:06,875 --> 00:24:13,162
independent. 
Then you simply divide both sides of that 

288
00:24:13,162 --> 00:24:17,540
by the probability of the evidence, a 
little simple algebra. 

289
00:24:17,540 --> 00:24:23,379
And you end up with the probability of 
the hypothesis, given the evidence, is 

290
00:24:23,379 --> 00:24:29,140
equal to the probability of the evidence 
and the hypothesis divided by the 

291
00:24:29,140 --> 00:24:34,825
probability of the evidence. 
Now we can do a little trick, this was 

292
00:24:34,825 --> 00:24:38,205
ingenious. 
Substitute for E something that's 

293
00:24:38,205 --> 00:24:43,401
logically equivalent to E, namely the 
evidence and the hypothesis, or the

294
00:24:43,401 --> 00:24:48,162
evidence and not the hypothesis.
Now if you think about it, you'll see 

295
00:24:48,162 --> 00:24:53,905
that those are equivalent, because either 
the hypothesis has to be true or not the 

296
00:24:53,905 --> 00:24:57,476
hypothesis is true. 
One or the other has to be true, 

297
00:24:57,476 --> 00:25:03,078
and that means that the evidence and the 
hypothesis or the evidence and not the 

298
00:25:03,078 --> 00:25:05,809
hypothesis is going to be equivalent to 
E. 

299
00:25:05,809 --> 00:25:10,430
So this is equivalent to this, and 
because they are equivalent, we can 

300
00:25:10,430 --> 00:25:15,689
substitute them within the formula for 
probability without affecting the truth 

301
00:25:15,689 --> 00:25:18,723
values. 
So we just substitute this formula in 

302
00:25:18,723 --> 00:25:23,276
here for the E up there. 
And we end up with the probability of the 

303
00:25:23,276 --> 00:25:28,656
hypothesis, given the evidence, is equal 
to the probability of the evidence and 

304
00:25:28,656 --> 00:25:34,174
the hypothesis divided by the probability 
of the evidence and the hypothesis, or 

305
00:25:34,174 --> 00:25:39,209
the evidence and not the hypothesis. 
Now that's not supposed to make much 

306
00:25:39,209 --> 00:25:45,685
sense, but it helps with the derivation. 
The next step is to apply rule three, 

307
00:25:45,685 --> 00:25:51,028
because we have a disjunction. 
And notice that the disjuncts are mutually

308
00:25:51,028 --> 00:25:54,977
exclusive. 
It cannot be true both that the evidence 

309
00:25:54,977 --> 00:26:00,088
and the hypothesis is true, 
and also, that the evidence and not the 

310
00:26:00,088 --> 00:26:04,425
hypothesis is true, 
because it can't be both h and not h. 

311
00:26:04,425 --> 00:26:08,220
So, we can apply the simple version of 
rule three, 

312
00:26:08,220 --> 00:26:14,655
and that means that the probability of E 
and H, or E and Not H, is equal to the 

313
00:26:14,655 --> 00:26:19,522
probability of E and H plus the 
probability of E and Not H. 

314
00:26:19,522 --> 00:26:25,627
We're just applying that rule three for 
disjunction that we learned a few

315
00:26:25,627 --> 00:26:30,200
lectures ago. 
Now we apply rule 2G again, because we 

316
00:26:30,200 --> 00:26:36,820
have the probability of a conjunction up 
in the top. 

317
00:26:36,820 --> 00:26:42,476
And since these are not independent of 
each other, we hope not, if it's a 

318
00:26:42,476 --> 00:26:47,897
hypothesis and the evidence for it, then
we have to use the conditional 

319
00:26:47,897 --> 00:26:51,422
probability. 
And using rule 2G, we find that the 

320
00:26:51,422 --> 00:26:57,324
probability of the hypothesis, given the 
evidence, is equal to the probability of 

321
00:26:57,324 --> 00:27:02,415
the hypothesis times the probability of 
the evidence given the hypothesis divided 

322
00:27:02,415 --> 00:27:07,505
by the probability of the hypothesis 
times the probability of the evidence, 

323
00:27:07,505 --> 00:27:12,375
given the hypothesis, plus the 
probability of the hypothesis being 

324
00:27:12,375 --> 00:27:15,326
false. 
That is, the probability of not h times 

325
00:27:15,326 --> 00:27:21,700
the probability of the evidence given, 
not H, or the hypothesis being false. 
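Since that is hard to follow by ear, here are the steps of the derivation written out in symbols, using e for the evidence and h for the hypothesis as above. The notation (the wedge for "and", the vee for "or", the negation sign for "not") is a common convention chosen here for illustration; the rule labels are the lecture's own.

```latex
\begin{align*}
P(e \wedge h) &= P(e)\,P(h \mid e)
  && \text{rule 2G} \\
P(h \mid e) &= \frac{P(e \wedge h)}{P(e)}
  && \text{divide both sides by } P(e) \\
&= \frac{P(e \wedge h)}{P\big((e \wedge h) \vee (e \wedge \neg h)\big)}
  && \text{substitute an equivalent of } e \\
&= \frac{P(e \wedge h)}{P(e \wedge h) + P(e \wedge \neg h)}
  && \text{rule 3, mutually exclusive disjuncts} \\
&= \frac{P(h)\,P(e \mid h)}{P(h)\,P(e \mid h) + P(\neg h)\,P(e \mid \neg h)}
  && \text{rule 2G again}
\end{align*}
```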

326
00:27:21,700 --> 00:27:24,811
And that's a mouthful, 
and it's a long formula.

327
00:27:24,811 --> 00:27:30,481
But, that's the mathematical formula that 
Bayes proved in the 18th century, and 

328
00:27:30,481 --> 00:27:36,221
that provides the mathematical basis for
that whole system of boxes that we talked 

329
00:27:36,221 --> 00:27:39,721
about before. 
But if you don't like the mathematical 

330
00:27:39,721 --> 00:27:43,524
proof, if that's too confusing for you, 
then use the boxes. 

331
00:27:43,524 --> 00:27:47,459
And if you don't like the boxes, use the 
mathematical proof. 

332
00:27:47,459 --> 00:27:51,529
They're both going to work. 
Just pick the one that works for you. 
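As a quick check that the two routes really agree, here is a minimal Python sketch that plugs the lecture's made-up numbers into the formula and compares the result with the answer from the boxes. The variable names are made up for illustration, and exact fractions are used so the comparison is not blurred by rounding.

```python
from fractions import Fraction

p_h = Fraction(3, 1000)                   # base rate, P(h)
p_e_given_h = Fraction(99, 100)           # sensitivity, P(e | h)
p_e_given_not_h = 1 - Fraction(99, 100)   # one minus specificity, P(e | not h)

# Bayes' theorem: P(h | e)
bayes = (p_h * p_e_given_h) / (
    p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h
)

# Box method: true positives divided by all positives.
box_method = Fraction(297, 1294)

print(bayes == box_method)
```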

333
00:27:51,529 --> 00:27:56,532
In fact, you don't have to pick either of 
them, because remember, this is an honors 

334
00:27:56,532 --> 00:27:59,867
lecture. It's optional, and it won't be 
on the quiz. 

335
00:27:59,867 --> 00:28:05,270
But if you do want to try this method and 
make sure you understand it, we'll have a 

336
00:28:05,270 --> 00:28:09,140
bunch of exercises for you, where you can 
test your skills.