[MUSIC]

At the core of boosting is the idea of an ensemble classifier. So let's revisit the idea of, say, a decision stump. A decision stump takes our data as input and might look at some particular feature, let's say income, and ask: is income greater than $100K? If yes, we say the loan is safe, and if not, we say the loan is risky. So the output here, y hat, is +1 or -1, depending on whether we think the loan is safe or risky. Just to be clear, in today's module we're going to use f(x) to denote one of these weak, simple classifiers, or a classifier in general; that's the f.

What boosting does is start not from a single decision stump, but combine many of them together in order to improve the prediction. For example, you may have one decision stump that splits on income, another that splits on credit history, another on how much savings you have, another on market conditions, and so on. So if you're given a particular input, say your income is $120K, the first decision stump, f1, is going to vote that the loan is safe, so it votes +1. But your credit history is bad, so the second decision stump, f2, votes -1. Say the third one, because your savings are low, votes risky, -1, and the fourth one, market conditions, turned out to be good, so it votes +1. The question boosting asks is: how do you combine these -1, +1 votes from the different classifiers, f1 through f4 in this case?

The way we combine them is with a set of weights; this is what's called an ensemble model. The weights here are w1, w2, w3, w4. We multiply the output of each classifier by its weight, add them all up, and take the sign of that sum. If the sign is positive, we output +1; if the sign is negative, we output -1. So it's a weighted voting scheme, and it's a really powerful one. Let's see a little more of what that looks like.

So let's suppose we have a particular set of weights, and we have multiple decision stumps, classifiers, that have each provided their vote. Let's see more explicitly what the output value would be.
For example, if the weight of each classifier is the one shown on the right here, then the output we predict is the sign of: w1, which is 2, times the vote of f1, which is +1; plus w2, which is 1.5, because we believe the second decision stump is not as important as the first one, times its vote, which was -1; plus, for the third one, a weight of 1.5 again, not as important as the first one but as important as the second, times its prediction, which was -1; and finally, for the last one, a weight of 0.5, so it's the least important of all of these classifiers, times the output of f4, which was +1, a safe loan.

So we take these votes and these weights, multiply them together, add them up, and we see that the output of this ensemble classifier F for the input xi is the sign of -0.5, which implies that y hat i is -1. So even though there were two positive votes and two negative votes, and the most important classifier, the first one, voted positive, when you add everything up and take the sign of the weighted sum, you get a risky loan as the output.

Now, this is a simple example of what's called an ensemble classifier: a combination of multiple classifiers. As we'll see, this kind of ensemble model, this kind of combination, is what everybody uses in industry to solve complex classification problems.

Just to make sure we're all on the same page, let me formally define the ensemble learning problem. We're given some data where y is either +1 or -1 (there's also a multiclass version of this, but we're just going to talk about +1 or -1 in today's module), and we have some input x. From that data we learn f1, f2, all the way up to fT, which are the T weak classifiers, or just classifiers, that we learn from data, and we also learn some coefficients, w-hat 1, w-hat 2, all the way up to w-hat T. Once we've learned them, making a prediction is very similar to what you do for logistic regression or a linear classifier: it's just the sign of the weighted sum of the votes from each classifier.
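To make that prediction rule concrete, here is a minimal Python sketch (not from the lecture itself) that reproduces the worked example above; the weights and votes are the ones from the slide, and the variable names are just illustrative.

```python
import numpy as np

# Weights w1..w4 and the votes f1(x)..f4(x) from the worked example:
# income, credit history, savings, market conditions.
w = np.array([2.0, 1.5, 1.5, 0.5])
votes = np.array([+1, -1, -1, +1])

# Ensemble score: weighted sum of the votes = 2 - 1.5 - 1.5 + 0.5 = -0.5
score = np.dot(w, votes)

# Prediction is the sign of the score: +1 (safe) if positive, -1 (risky) otherwise.
y_hat = 1 if score > 0 else -1

print(score)   # -0.5
print(y_hat)   # -1, i.e. the loan is predicted risky
```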
And so if you look at this, it looks a lot like a linear classification model, like logistic regression and all of those; exactly the same form. However, not only are we learning the w's from data, we're actually learning the features as well. In those models, we had h's to represent our features. Here, the features are these f's, the weak classifiers that we're going to learn from data. So we can think of boosting as an approach to learning features from data. And that's really exciting.

[MUSIC]
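As a rough illustration of that last point, here is a small sketch under the assumption that each learned feature f_t is a decision stump on one input feature; the stumps and coefficients below are hand-written placeholders, not the output of any learning algorithm.

```python
def predict_ensemble(x, classifiers, coefficients):
    """Return +1 or -1: the sign of the weighted sum of the classifiers' votes."""
    score = sum(w * f(x) for w, f in zip(coefficients, classifiers))
    return 1 if score > 0 else -1

# Hypothetical decision stumps playing the role of the learned features f1..f4.
f1 = lambda x: 1 if x["income"] > 100_000 else -1
f2 = lambda x: 1 if x["credit"] == "good" else -1
f3 = lambda x: 1 if x["savings"] > 10_000 else -1
f4 = lambda x: 1 if x["market"] == "good" else -1

classifiers = [f1, f2, f3, f4]
coefficients = [2.0, 1.5, 1.5, 0.5]  # the w-hats; in boosting these are learned too

x = {"income": 120_000, "credit": "bad", "savings": 2_000, "market": "good"}
print(predict_ensemble(x, classifiers, coefficients))  # -1: predicted risky
```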