[MUSIC] Now let's take that third example we've used to illustrate different machine learning algorithms in this module and explore it in the context of AdaBoost. It's going to give us a lot of insight into how boosting works in practice.

For the first classifier, f1, we work directly off the original data: all points have the same weight. And so the learning process is just standard learning; nothing changes in your learning algorithm, since every data point has the same weight. In this case we're learning a decision stump, and here is the decision boundary that does its best to separate positive examples from negative examples. It splits right around 0, actually at minus 0.07, if you remember from the decision tree classifier. So this is the first decision stump, f1.

Now, to learn the second decision stump, f2, we have to reweight our data based on how well f1 did. We look at our decision boundary and weight the data points that were mistakes higher. In the picture I'm denoting them with bigger minus signs and plus signs. So if you look at the data points on the left, they were mistakes: the minuses on this side and the plus over here were misclassified, so we increased their weights and decreased the weights of everybody else. You can see that the pluses here became bigger and the minuses in this region became larger. That's how we update our weights.

Now let's look at the next step: learning the classifier f2 in the second iteration based on this weighted data. Using the weighted data, we learn the following decision stump. You see that instead of a vertical split we now have a horizontal split, and it's a better split for the weighted data, especially for those heavily weighted points on the left, which is kind of cool. So in the first iteration we decided to split on x1; in the second one we split on x2, with the threshold at x2 greater than or less than 1.3 or so. You'll see that it gets all the minuses correct on top, makes some mistakes on the minuses at the bottom, but gets the pluses correct at the bottom. So as opposed to the vertical split we had before, we now have a horizontal split.
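To make that reweighting step concrete, here is a minimal sketch of the first round and its weight update. This is my own illustration, not the course's code: the toy data, the stump orientation, and the helper names are assumptions; only the minus 0.07 threshold and the standard AdaBoost coefficient formula, w hat = 1/2 ln((1 - weighted error) / weighted error), are taken as given.

```python
# A minimal sketch of AdaBoost's first round and its weight update, assuming
# NumPy and made-up toy data (only the -0.07 threshold and the standard
# coefficient formula are from the lecture).
import numpy as np

def stump_predict(X, feature, threshold):
    """A decision stump: predict +1 above the threshold, -1 at or below it."""
    return np.where(X[:, feature] > threshold, 1.0, -1.0)

# Toy 2-feature data with labels in {-1, +1}; the real data set is the course's.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=100) > 0, 1.0, -1.0)

# First stump f1: every point starts with the same weight 1/n (standard learning).
n = len(y)
weights = np.full(n, 1.0 / n)
y1 = stump_predict(X, feature=0, threshold=-0.07)   # split on x1 near 0

# Coefficient w_hat_1 from the weighted error (clipped to avoid log of 0).
err1 = np.clip(np.sum(weights * (y1 != y)), 1e-10, 1 - 1e-10)
w_hat_1 = 0.5 * np.log((1 - err1) / err1)

# Reweight: mistakes get multiplied by exp(+w_hat_1), correct points by
# exp(-w_hat_1), then normalize -- the "bigger pluses and minuses" in the plot.
weights = weights * np.exp(-w_hat_1 * y * y1)
weights /= weights.sum()
```

Learning f2 is then the same stump-fitting step, except the weighted error is computed with these updated weights, which is what pushes the second split onto x2.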
So now we've learned the two decision stumps, f1 and f2, and the question is: how do we combine them? If you go through the AdaBoost formula, you'll see that w hat 1, the coefficient of the first decision stump, is going to be 0.61, and w hat 2 is going to be 0.53. So we trust the first decision stump a little more than we trust the second one, which makes sense; the second one doesn't seem as good. But when you add them together, you start getting a very interesting decision boundary. The points in the top left are ones where we definitely think that y hat is minus 1, so definite negatives. In the bottom right are some definite positives, y hat equals plus 1. And for the other two regions, we can think of these as regions of higher uncertainty. They're uncertain right now, which makes sense, but as you add more decision stumps we become more sure about the points in the bottom left and the top right.

Now, if you keep the iterations going for 30 rounds, the first thing you notice is that we get all the data points right, so our training error is 0. The second thing you'll notice, and here I'm going to use a technical term for this, is that the decision boundary is crazy. That's our technical term. And if you combine these two insights, we figure out: okay, we don't really trust this classifier, we're probably overfitting the data. It fits the training data perfectly, but it may not do as well on new data. So overfitting is something that can happen in boosting, and we'll talk about it a little bit next.

So let's take a deep breath and summarize what we've done so far. We described simple classifiers, and we said that we're going to learn these simple classifiers and take a vote between them to make predictions. Then we described the AdaBoost algorithm, which is a pretty simple approach to learning a non-simple classifier using this technique of boosting, where you boost up the weights of data points where we're making mistakes. And it's simple to implement in practice. [MUSIC]
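Since the lecture closes on the point that AdaBoost is simple to implement, here is a rough end-to-end sketch under the same assumptions as above (NumPy, made-up toy data, my own helper names). It is not the course's implementation, just an illustration of the loop: fit a stump on the weighted data, compute its coefficient w hat t, boost the weights of its mistakes, and predict with the weighted vote, the sign of the sum over t of w hat t times f t of x.

```python
# A rough end-to-end AdaBoost sketch with decision stumps (my own illustration,
# not the course's code): weighted stump fitting, coefficients
# w_hat_t = 1/2 ln((1 - err_t) / err_t), and a weighted-vote prediction.
import numpy as np

def best_stump(X, y, weights):
    """Exhaustively pick the (feature, threshold, sign) stump with lowest weighted error."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, feature] <= threshold, 1.0, -1.0)
                err = np.sum(weights * (pred != y))
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best

def adaboost(X, y, n_rounds=30):
    n = len(y)
    weights = np.full(n, 1.0 / n)               # start with uniform weights
    stumps, coefficients = [], []
    for _ in range(n_rounds):
        err, feature, threshold, sign = best_stump(X, y, weights)
        err = np.clip(err, 1e-10, 1 - 1e-10)    # avoid log of 0
        w_hat = 0.5 * np.log((1 - err) / err)   # trust good stumps more
        pred = sign * np.where(X[:, feature] <= threshold, 1.0, -1.0)
        weights = weights * np.exp(-w_hat * y * pred)   # boost weight of mistakes
        weights /= weights.sum()
        stumps.append((feature, threshold, sign))
        coefficients.append(w_hat)
    return stumps, coefficients

def predict(X, stumps, coefficients):
    """Weighted vote of the stumps: sign(sum_t w_hat_t * f_t(x))."""
    score = np.zeros(len(X))
    for (feature, threshold, sign), w_hat in zip(stumps, coefficients):
        score += w_hat * sign * np.where(X[:, feature] <= threshold, 1.0, -1.0)
    return np.sign(score)

# Toy data just to exercise the sketch; with enough rounds the training error
# typically drops to 0 even though the boundary can look "crazy" (overfitting).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] + 0.3 * X[:, 1] + 0.2 * rng.normal(size=80) > 0, 1.0, -1.0)
stumps, coefs = adaboost(X, y, n_rounds=30)
print("training error:", np.mean(predict(X, stumps, coefs) != y))
```

Driving the training error to 0 this way is exactly the warning sign from the lecture: a perfect fit on the training data with a boundary we probably shouldn't trust on new data.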