I'd like to take a moment now to summarize what we've seen with boosting and what impact it's had in the real world.

Let's talk about AdaBoost. AdaBoost is one of the early boosting algorithms. It's extremely useful, but there are other algorithms out there. In particular, there's one called gradient boosting, which is slightly more complicated but extremely similar. It's like AdaBoost, but it's useful not just for basic classification, but for other types of loss functions and other types of problems, and it's what most people use. You can think of gradient boosting as a generalization of AdaBoost.

Then there are other related ways to learn ensembles; the most popular one is called random forests. A random forest is a lot like boosting in the sense that it learns an ensemble of classifiers, in this case decision trees, although it could be other types of classifiers. But instead of boosting, it uses an approach called bagging. Very briefly, with bagging you take your dataset, sample different subsets of the data, which is kind of like learning on different sub-datasets, learn a decision tree on each one, and then just average the outputs. So you're not optimizing the coefficients we had in boosting; you're just learning from different subsets of the data. That's easier to parallelize, but it tends not to perform as well as boosting for a fixed number of trees: with 100 trees, or 100 decision stumps, boosting tends to perform better than random forests.
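To make that comparison concrete, here is a minimal sketch (not part of the lecture) that trains gradient boosting and a random forest with the same number of trees using scikit-learn; the synthetic dataset and parameter choices are illustrative assumptions.

```python
# Minimal sketch: boosting vs. random forest with the same budget of 100 trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient boosting with 100 depth-1 trees (decision stumps)
gb = GradientBoostingClassifier(n_estimators=100, max_depth=1).fit(X_train, y_train)

# Random forest (bagging plus random feature selection) with 100 trees
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

print("gradient boosting accuracy:", gb.score(X_test, y_test))
print("random forest accuracy:    ", rf.score(X_test, y_test))
```

The exact accuracies depend on the data, but this is the kind of fixed-tree-budget comparison the lecture has in mind.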
Now let's take a moment to discuss the impact boosting has had in the machine learning world. And, hint hint, it's been huge. It's amongst the most useful machine learning approaches out there.

It's useful in a wide range of fields. For example, in computer vision, boosting is the default algorithm for many tasks, like the face detection algorithms where you take your camera, point it at something, and it tries to detect your face; many of those detectors use boosting, and it's very useful. If you look at machine learning competitions, which have become very popular in the last two or three years at places like Kaggle or the KDD Cup, most winners, and this is more than half, I think it's about 70% of winners, actually use boosting to win their competition. In fact, they use boosted decision trees, and this covers a wide range of tasks like malware detection, fraud detection, ranking web searches, and even interesting physics tasks like detecting the Higgs boson. All those problems and all those challenges have been won by boosted decision trees. This is perhaps one of the most deployed advanced machine learning methods out there, particularly the notion of ensembles.

For example, take Netflix, the online service where you can watch movies. Companies like this recommend what movie you might want to watch next, and that system actually uses an ensemble of classifiers. More interestingly, they held a competition a few years ago where people tried to provide better recommendations, and the winner was a system that combined an ensemble of many, many, many classifiers to create better recommendations. So ensembles, you'll see them everywhere. Sometimes they're optimized with boosting, sometimes with different techniques like bagging, and sometimes people just hand-tune the weights to say, okay, I'm going to give one to this one and half to that one. I don't recommend that last approach; I recommend boosting as the one to use.

Great. So in this module we've explored the notion of an ensemble classifier, and we formalized ensembles as weighted combinations of the votes of different classifiers. We discussed the general boosting algorithm, where the next classifier focuses on the mistakes made so far, as well as AdaBoost, which is a special case for classification where we showed you how to come up with the coefficients of each classifier and the weights on the data points. We've discussed how to implement AdaBoost with decision stumps, which is extremely easy to do. And then we talked a little bit about the convergence properties, how the training error of AdaBoost tends to go to zero, but you have to be a little concerned about overfitting, although AdaBoost tends to be robust to overfitting in practice.
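Since the summary mentions implementing AdaBoost with decision stumps and computing the classifier coefficients and data weights, here is a minimal sketch of that loop, assuming labels in {-1, +1} and using a depth-1 scikit-learn tree as the stump; the function names and defaults are illustrative, not the course's own code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=100):
    """Minimal AdaBoost sketch with decision stumps; labels y must be +1/-1."""
    n = len(y)
    alpha = np.ones(n) / n                      # data point weights, start uniform
    stumps, coefficients = [], []
    for _ in range(n_rounds):
        # Fit a decision stump (depth-1 tree) on the weighted data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=alpha)
        pred = stump.predict(X)
        # Weighted error and classifier coefficient w_t = 1/2 ln((1 - err) / err)
        err = np.sum(alpha * (pred != y)) / np.sum(alpha)
        err = np.clip(err, 1e-10, 1 - 1e-10)    # guard against log of 0 or infinity
        w = 0.5 * np.log((1 - err) / err)
        # Increase weights on mistakes, decrease on correct points, then normalize
        alpha *= np.exp(-w * y * pred)
        alpha /= alpha.sum()
        stumps.append(stump)
        coefficients.append(w)
    return stumps, coefficients

def predict(stumps, coefficients, X):
    # Weighted majority vote: sign of the sum of coefficient * stump prediction
    scores = sum(w * s.predict(X) for s, w in zip(stumps, coefficients))
    return np.sign(scores)
```

In practice you would likely reach for a library implementation such as scikit-learn's AdaBoostClassifier or a gradient boosting package, but the loop above shows where the coefficients and the data weights come from.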