The term human-level performance is sometimes used casually in research articles. But let me show you how we can define it a bit more precisely, and in particular use the definition of the phrase human-level performance that is most useful for helping you drive progress in your machine learning project.

So remember from our last video that one of the uses of the phrase human-level error is that it gives us a way of estimating Bayes error: what is the best possible error any function could, either now or in the future, ever achieve? So bearing that in mind, let's look at a medical image classification example. Let's say that you want to look at a radiology image like this and make a diagnosis classification decision.

Suppose that a typical, untrained human achieves 3% error on this task. A typical doctor, maybe a typical radiologist, achieves 1% error. An experienced doctor does even better, 0.7% error. And a team of experienced doctors, that is, if you get a team of experienced doctors and have them all look at the image and discuss and debate it, achieves a consensus opinion with 0.5% error. So the question I want to pose to you is: how should you define human-level error? Is human-level error 3%, 1%, 0.7%, or 0.5%? Feel free to pause this video to think about it if you wish. To answer that question, I would urge you to bear in mind that one of the most useful ways to think of human-level error is as a proxy or an estimate for Bayes error.

Here's how I would define human-level error. If you want a proxy or an estimate for Bayes error, then given that a team of experienced doctors discussing and debating can achieve 0.5% error, we know that Bayes error is less than or equal to 0.5%. Because some system, this team of doctors, can achieve 0.5% error, by definition the optimal error has to be 0.5% or lower. We don't know how much better it is; maybe an even larger team of even more experienced doctors could do better still, so maybe it's a little bit better than 0.5%. But we know the optimal error cannot be higher than 0.5%. So what I would do in this setting is use 0.5% as our estimate for Bayes error.
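To make this concrete, here is a minimal sketch in Python, using the hypothetical error rates from this example, of how you might turn the different human performance levels into an estimate of Bayes error by taking the best, that is the lowest, error any human or group of humans achieves:

```python
# Hypothetical error rates from the medical imaging example.
human_errors = {
    "typical human": 0.03,
    "typical doctor": 0.01,
    "experienced doctor": 0.007,
    "team of experienced doctors": 0.005,
}

# Bayes error can be no higher than the best human-level performance,
# so the minimum observed error serves as our proxy for Bayes error.
bayes_error_estimate = min(human_errors.values())
print(f"Bayes error estimate <= {bayes_error_estimate:.1%}")  # prints 0.5%
```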
So I would define human-level performance as 0.5%, at least if you're hoping to use human-level error in the analysis of bias and variance as we saw in the last video.

Now, for the purpose of publishing a research paper, or for the purpose of deploying a system, maybe there's a different definition of human-level error that you can use, which is: so long as you surpass the performance of a typical doctor. That seems like a very useful result if accomplished, and surpassing a single radiologist's, a single doctor's, performance might mean the system is good enough to deploy in some context.

So maybe the takeaway from this is to be clear about what your purpose is in defining the term human-level error. If it is to show that you can surpass a single human, and therefore argue for deploying your system in some context, maybe the typical doctor's performance is the appropriate definition. But if your goal is a proxy for Bayes error, then the team of doctors' 0.5% is the appropriate definition. To see why this matters, let's look at an error analysis example.

Let's say, for a medical image diagnosis example, that your training error is 5% and your dev error is 6%. And from the example on the previous slide, take our human-level performance, which I'm going to think of as a proxy for Bayes error. Depending on whether you defined it as a typical doctor's performance, an experienced doctor's, or a team of doctors', you would have either 1%, 0.7%, or 0.5% for this. And remember our definitions from the previous video: the gap between the Bayes error estimate and the training error is a measure of the avoidable bias, and the gap between training error and dev error is a measure, or an estimate, of how much of a variance problem you have in your learning algorithm.

So in this first example, whichever of these choices you make, the measure of avoidable bias will be something like 4%: somewhere between 4%, if you use 1% as human-level error, and 4.5%, if you use 0.5%, whereas the variance is 1%. So in this example, I would say it doesn't really matter which of the definitions of human-level error you use, whether the typical doctor's error, the single experienced doctor's error, or the team of experienced doctors' error. Whether the avoidable bias is 4% or 4.5%, it is clearly bigger than the variance problem.
And so in this case, you should focus on bias reduction techniques, such as training a bigger network.

Now let's look at a second example. Let's say your training error is 1% and your dev error is 5%. Then again it seems almost academic whether the human-level performance is 1%, 0.7%, or 0.5%, because whichever of these definitions you use, your measure of avoidable bias will be somewhere between 0%, if you use 1%, and 0.5%, if you use 0.5%. That's the gap between human-level performance and your training error, whereas the gap between training and dev error is 4%. So this 4% is going to be much bigger than the avoidable bias either way, and it suggests you should focus on variance reduction techniques such as regularization or getting a bigger training set.

But where it really matters is if your training error is 0.7%, so you're doing really well now, and your dev error is 0.8%. In this case, it really matters that you use 0.5% as your estimate for Bayes error, because then your measure of avoidable bias is 0.2%, which is twice as big as your measure of variance, which is just 0.1%. So this suggests that maybe bias and variance are both problems, but the avoidable bias is a bit bigger of a problem. And in this example, 0.5%, as we discussed on the previous slide, was the best estimate of Bayes error, because a team of human doctors could achieve that performance. If you had used 0.7% as your proxy for Bayes error, you would have estimated the avoidable bias as pretty much 0%, and you might have missed the fact that you actually should try to do better on your training set.

So I hope this also gives a sense of why making progress on a machine learning problem gets harder as you approach human-level performance. In this example, once you've reached 0.7% error, unless you're very careful about estimating Bayes error, you might not know how far away you are from Bayes error, and therefore how much you should be trying to reduce the avoidable bias. In fact, if all you knew was that a single typical doctor achieves 1% error, it might be very difficult to know whether you should be trying to fit your training set even better.
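Here is a minimal Python sketch of this error analysis, using just the hypothetical numbers from these three examples. It computes avoidable bias and variance for each scenario, and then shows how the third example's conclusion depends on which human-level proxy you pick:

```python
def error_analysis(train_err, dev_err, bayes_est):
    """Avoidable bias = training error - Bayes error estimate;
    variance = dev error - training error. All rates are fractions."""
    return train_err - bayes_est, dev_err - train_err

# The three examples, using the team of doctors' 0.5% as the Bayes proxy.
examples = [(0.05, 0.06), (0.01, 0.05), (0.007, 0.008)]
for train_err, dev_err in examples:
    bias, var = error_analysis(train_err, dev_err, bayes_est=0.005)
    focus = "bias reduction" if bias > var else "variance reduction"
    print(f"avoidable bias {bias:.1%}, variance {var:.1%} -> focus on {focus}")

# In the third example, the choice of proxy changes the conclusion:
for label, proxy in [("typical doctor", 0.01),
                     ("experienced doctor", 0.007),
                     ("team of doctors", 0.005)]:
    bias, _ = error_analysis(0.007, 0.008, bayes_est=proxy)
    print(f"proxy {label} ({proxy:.1%}): avoidable bias = {bias:+.1%}")
# Only the 0.5% proxy reveals the remaining 0.2% of avoidable bias; the
# 0.7% proxy reports 0%, and the 1% proxy goes negative (already surpassed).
```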
And this problem arises only when you're doing very well on your problem already, only when you're at 0.7% and 0.8% error, really close to human-level performance. In the two examples on the left, when you were further away from human-level performance, it was easier to target your focus on bias or variance. So this is maybe an illustration of why, as you approach human-level performance, it's actually harder to tease out the bias and variance effects, and therefore why progress on your machine learning project just gets harder as you're doing really well.

So just to summarize what we've talked about. If you're trying to understand bias and variance where you have an estimate of human-level error, for a task that humans can do quite well, you can use human-level error as a proxy or an approximation for Bayes error. The difference between your training error and your estimate of Bayes error then tells you how much of a problem avoidable bias is, how much avoidable bias there is. And the difference between training error and dev error tells you how much of a problem variance is, that is, whether your algorithm is able to generalize from the training set to the dev set.

The big difference between our discussion here and what we saw in an earlier course is that instead of comparing training error to 0% and just calling that the estimate of the bias, in this video we have a more nuanced analysis in which there is no particular expectation that you should get 0% error. Sometimes Bayes error is nonzero, and sometimes it's just not possible for anything to do better than a certain threshold of error.

In the earlier course, we were measuring training error, seeing how much bigger it was than zero, and just using that to try to understand how big our bias is. That turns out to work just fine for problems where Bayes error is nearly 0%, such as recognizing cats: humans are near perfect at that, so Bayes error is also nearly 0%, and the approach works okay. But consider problems where the data is noisy, like speech recognition on very noisy audio, where it's just impossible sometimes to hear what was said and to get the correct transcription.
For problems like that, having a better estimate of Bayes error can help you better estimate avoidable bias and variance, and therefore make better decisions on whether to focus on bias reduction tactics or on variance reduction tactics.

So to recap: having an estimate of human-level performance gives you an estimate of Bayes error, and this allows you to more quickly make decisions as to whether you should focus on trying to reduce the bias or trying to reduce the variance of your algorithm. These techniques will tend to work well until you surpass human-level performance, whereupon you might no longer have a good estimate of Bayes error that helps you make this decision clearly.

Now, one of the exciting developments in deep learning has been that for more and more tasks we're actually able to surpass human-level performance. In the next video, let's talk more about the process of surpassing human-level performance.