In this video, I'd like to talk about how to evaluate a hypothesis that has been learned by your algorithm. In later videos, we will build on this to talk about how to prevent the problems of overfitting and underfitting as well.

When we fit the parameters of our learning algorithm, we think about choosing the parameters to minimize the training error. One might think that getting a really low value of training error would be a good thing, but we have already seen that just because a hypothesis has low training error, that doesn't mean it is necessarily a good hypothesis. And we've already seen an example of how a hypothesis can overfit, and therefore fail to generalize to new examples not in the training set.

So how do you tell if a hypothesis might be overfitting? In this simple example we could plot the hypothesis h(x) and just see what was going on. But in general, for problems with more than one feature, for problems with a large number of features like these, it becomes hard or maybe impossible to plot what the hypothesis looks like, and so we need some other way to evaluate our hypothesis.

The standard way to evaluate a learned hypothesis is as follows. Suppose we have a data set like this. Here I have shown just 10 training examples, but of course usually we may have dozens or hundreds or maybe thousands of training examples. In order to make sure we can evaluate our hypothesis, what we are going to do is split the data we have into two portions. The first portion is going to be our usual training set, and the second portion is going to be our test set. A pretty typical split of all the data we have into a training set and a test set might be around a 70%/30% split, with more of the data going to the training set and relatively less to the test set. And so now, if we have some data set, we assign, say, 70% of the data to be our training set, where here "m" is, as usual, our number of training examples, and the remainder of our data might then be assigned to become our test set.
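As a rough sketch of that split in Python (the data and array names here are hypothetical; the lecture itself shows no code):

```python
import numpy as np

# Hypothetical data: 10 examples of a single feature x with labels y.
X = np.linspace(0, 9, 10).reshape(-1, 1)
y = 2 * X.ravel() + 1

# 70%/30% split; in the lecture's notation, m is the number of
# training examples and the remaining examples form the test set.
m = int(0.7 * len(X))
X_train, y_train = X[:m], y[:m]
X_test, y_test = X[m:], y[m:]
```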
Here, I'm going to use the notation m subscript test to denote the number of test examples, and in general, the subscript "test" is going to denote examples that come from the test set, so that (x^(1)_test, y^(1)_test) is my first test example, which I guess in this figure might be this example over here.

Finally, one last detail: whereas here I've drawn this as though the first 70% goes to the training set and the last 30% to the test set, if there is any sort of ordering to the data, it would be better to send a random 70% of your data to the training set and a random 30% of your data to the test set. So if your data were already randomly ordered, you could just take the first 70% and the last 30%; but if your data were not randomly ordered, it would be better to randomly shuffle, or randomly reorder, the examples before sending the first 70% to the training set and the last 30% to the test set.

Here, then, is a fairly typical procedure for how you would train and test a learning algorithm, say linear regression. First, you learn the parameters theta from the training set; that is, you minimize the usual training error objective J(theta), where J(theta) here is defined using that 70% of all the data you have, that is, only the training data. Then you would compute the test error, which I am going to denote J subscript test. What you do is take the parameters theta that you have learned from the training set, plug them in here, and compute your test set error, which I am going to write as follows. This is basically the average squared error as measured on your test set. It's pretty much what you'd expect: you run every test example through your hypothesis with parameters theta and measure the squared error that your hypothesis makes on your m_test test examples. And of course, this is the definition of the test set error if we are using linear regression and the squared error metric.
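A minimal end-to-end sketch of this procedure, assuming linear regression with the standard squared-error test cost J_test(theta) = 1/(2 m_test) * sum over i of (h_theta(x^(i)_test) - y^(i)_test)^2; the synthetic data, the shuffle, and the use of a closed-form least-squares fit are illustrative assumptions, not anything the lecture prescribes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with an ordering to it (the feature is sorted),
# so we shuffle before splitting.
X = np.c_[np.ones(100), np.sort(rng.normal(size=100))]  # intercept + feature
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)

perm = rng.permutation(len(X))  # random 70%/30% rather than first/last
X, y = X[perm], y[perm]
m = int(0.7 * len(X))
X_train, y_train = X[:m], y[:m]
X_test, y_test = X[m:], y[m:]

# Step 1: learn theta by minimizing the training objective J(theta)
# on the training set only (solved here in closed form).
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Step 2: plug the learned theta into the test set error
# J_test(theta) = 1/(2*m_test) * sum_i (h_theta(x_test^(i)) - y_test^(i))^2.
m_test = len(X_test)
J_test = np.sum((X_test @ theta - y_test) ** 2) / (2 * m_test)
print(f"J_test = {J_test:.4f}")
```

The point the sketch mirrors is that theta is fit using X_train only; X_test and y_test are touched only at the very end, when J_test is computed.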
How about if we were doing a classification problem, say using logistic regression instead? In that case, the procedure for training and testing logistic regression is pretty similar. First we learn the parameters from the training data, that first 70% of the data, and then we compute the test error as follows. It's the same objective function as we always use for logistic regression, except that now it is defined using our m_test test examples.

While this definition of the test set error J_test is perfectly reasonable, sometimes there is an alternative test set metric that might be easier to interpret, and that's the misclassification error. It's also called the zero-one misclassification error, with "zero-one" denoting that you either get an example right or you get an example wrong. Here's what I mean. Let me define the error of a prediction h(x), given the label y, to be equal to one if my hypothesis outputs a value greater than or equal to 0.5 and y is equal to zero, or if my hypothesis outputs a value less than 0.5 and y is equal to one. Both of these cases correspond to your hypothesis mislabeling the example, assuming you threshold at 0.5: either it thought the label was more likely to be 1 when it was actually 0, or it thought the label was more likely to be 0 when the label was actually 1. Otherwise, we define this error function to be zero, meaning your hypothesis classified the example correctly.

We can then define the test error, using the misclassification error metric, to be 1/m_test times the sum from i = 1 to m_test of the error of h(x^(i)_test) compared with y^(i)_test. And so that's just my way of writing out that this is exactly the fraction of the examples in my test set that my hypothesis has mislabeled.
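To make both test metrics concrete, here is a minimal Python sketch, assuming the usual logistic hypothesis h_theta(x) = sigmoid(theta^T x); the function names are mine, not the lecture's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_J_test(theta, X_test, y_test):
    """The same logistic regression cost as always, but defined over
    the m_test test examples: -1/m_test times the log-likelihood sum."""
    h = sigmoid(X_test @ theta)
    return -np.mean(y_test * np.log(h) + (1 - y_test) * np.log(1 - h))

def misclassification_error(theta, X_test, y_test):
    """Zero-one misclassification error: err(h(x), y) is 1 when the
    hypothesis, thresholded at 0.5, mislabels the example, else 0;
    the test error is the fraction of test examples mislabeled."""
    predictions = sigmoid(X_test @ theta) >= 0.5
    return np.mean(predictions != y_test.astype(bool))
```

As the lecture suggests, the second number is often the easier one to read: a value of 0.08 from misclassification_error just says that 8% of the test examples were mislabeled.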
So that's the definition of the test set error using the misclassification error, or zero-one misclassification, metric. And that's the standard technique for evaluating how good a learned hypothesis is. In the next video, we will adapt these ideas to help us do things like choose which features, such as the degree of polynomial, to use with a learning algorithm, or choose the regularization parameter for a learning algorithm.