[MUSIC] So, let's compare boosted decision stumps to a decision tree. This is the plot that we saw in the decision tree module. It's based on a real dataset of loan applications, and we saw that the training error tends to go down, down, and down as we make the decision tree deeper. But the test error, which is related to the true error, goes down for a while, and you have here the best depth, which is maybe seven, but eventually it goes back up. So at depth 18 over here, the training error is down to 8%, a really low training error, but the test error has gone up to 39%, so we're observing overfitting, and in fact we observe a huge gap here. One thing to note, by the way, is that the best decision tree has a classification error on the test set of about 34% or so.

Now, let's see what happens with decision stumps boosted on the same dataset. You get a plot that looks kind of like this: you see that the training error keeps decreasing per iteration. So this is the train error, and as the theorem predicts, it decreases. And the test error in this case is actually going down with iterations, so we're not observing overfitting, at least yet.
And after 18 rounds of boosting, so 18 decision stumps, we have 32% test error, which is better than the decision tree; in fact, it's better than the best decision tree, and the overfitting is much better behaved. This gap here is much smaller. The gap between your training error and the test error is related to that overfitting quantity that we have.

Now, we said we're not observing overfitting yet, so let's run boosted stumps for more iterations and see what happens. Let's see what happens when we keep on boosting, adding more and more decision stumps along the x-axis instead of only one tree, and see what happens to the classification error. We see here that the training error keeps decreasing, just like we expected from the theorem, but the test error has stabilized. The test performance tends to stay constant, and in fact it stays constant for many iterations. So if I pick T anywhere in this range, we will do about the same; any of these values for T would be fine.

So now we've seen how boosting seems to be stabilizing, but the question is: do we observe overfitting in boosting? And in fact, we do observe overfitting with boosting, but boosting tends to be quite robust to overfitting.
So if you use too many trees or too few trees, there's a huge range where that's really okay. Here, I'm going up to 5,000 decision stumps with the type of boosting we're doing here. The best test error was about 31%. But if I keep going too far, all the way to 5,000, the test error goes up due to overfitting, but it doesn't go up that much; it goes up to 33%. So there is some overfitting here.

Very importantly, in these examples I've been showing you the test error and talking mostly about what happens with overfitting. But as we know, we need to be very careful about how we pick the parameters of our algorithm. So how do we pick capital T, which is when we stop boosting? Do we use five decision stumps or 5,000 decision stumps? This is just like selecting the magic parameters of all sorts of algorithms. Almost every algorithm out there has a parameter that trades off model complexity against the quality of the fit: the depth of a decision tree, the number of features or the magnitude of the weights in logistic regression, and here the number of rounds of boosting. So it's just like lambda in regularization.

We can't use the training data, because the training error tends to go down with iterations of boosting, so it would say that T should be infinite.
Or at least really big. And you should never, never, never, ever use the test data; that's bad. I was just showing you illustrative examples, but you should never do that. So, what should we do? Well, you should use a validation set if you have a lot of data: if you have a big dataset, you select a subset of it just to pick the magic parameters. And if your dataset is not that large, you should use cross-validation. In the regression course, we talked about how to use validation sets and cross-validation to pick magic parameters like lambda, the depth of a decision tree, or the number of rounds of boosting, capital T. [MUSIC]
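The validation-set procedure described above can be sketched as follows. The split sizes and synthetic dataset are illustrative assumptions; the point is that T is chosen on the validation split, and the test set is held back and never consulted during selection.

```python
# Sketch: picking the number of boosting rounds T on a validation set,
# never on the test set. Dataset and split sizes are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
# Hold out a test set first; it is only touched once, after T is chosen.
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
# Carve a validation split out of the remaining training data.
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=1)

# Train once with a large number of rounds; staged_predict then lets us
# evaluate every intermediate T without retraining.
boost = AdaBoostClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

# Validation error after each round; pick the T that minimizes it.
val_err = [1 - (pred == y_va).mean() for pred in boost.staged_predict(X_va)]
best_T = int(np.argmin(val_err)) + 1
print(f"chosen T = {best_T}, validation error = {val_err[best_T - 1]:.3f}")
```

With a smaller dataset, the same selection would instead use cross-validation: average the per-round error across folds and pick the T minimizing the averaged curve.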