1
00:00:01,030 --> 00:00:05,942
Hello everyone, this is Marios
Michailidis, and this will be the first

2
00:00:05,942 --> 00:00:10,452
video in a series in which we will be
discussing ensemble methods for

3
00:00:10,452 --> 00:00:11,835
machine learning.

4
00:00:11,835 --> 00:00:18,165
To tell you a bit about me, I work as
a Research Data Scientist for H2O.ai.

5
00:00:18,165 --> 00:00:21,976
In fact,
my PhD is about ensemble methods, and

6
00:00:21,976 --> 00:00:25,501
I used to be ranked
number one in Kaggle and

7
00:00:25,501 --> 00:00:30,600
ensemble methods have greatly
helped me to achieve this spot.

8
00:00:30,600 --> 00:00:32,800
So you might find the course interesting.

9
00:00:34,480 --> 00:00:37,077
So what is ensemble modelling?

10
00:00:37,077 --> 00:00:41,947
I think with this term, we refer to
combining many different machine learning

11
00:00:41,947 --> 00:00:45,620
models in order to get
a more powerful prediction.

12
00:00:45,620 --> 00:00:48,997
And later on we will see
examples where this happens,

13
00:00:48,997 --> 00:00:53,386
that we combine different models and
we do get better predictions.

14
00:00:53,386 --> 00:00:56,175
There are various ensemble methods.

15
00:00:56,175 --> 00:01:01,240
Here we'll discuss a few, those that
we encounter quite often in predictive

16
00:01:01,240 --> 00:01:06,471
modelling competitions, and they tend
to be, in general, quite competitive.

17
00:01:06,471 --> 00:01:10,924
We will start with simple averaging
methods, then we'll go to weighted

18
00:01:10,924 --> 00:01:15,311
averaging methods, and we will also
examine conditional averaging.

19
00:01:15,311 --> 00:01:19,950
And then we will move to some more
typical ones like bagging, or

20
00:01:19,950 --> 00:01:24,942
the very, very popular, boosting,
then stacking and StackNet,

21
00:01:24,942 --> 00:01:27,590
which is the result of my research.

22
00:01:30,350 --> 00:01:34,160
But as I said,
this will be a series of videos, and

23
00:01:34,160 --> 00:01:38,163
we will initially start
with the averaging methods.

24
00:01:41,060 --> 00:01:45,366
So, in order to help you understand
a bit more about the averaging methods,

25
00:01:45,366 --> 00:01:46,791
let's take an example.

26
00:01:46,791 --> 00:01:51,622
Let's say we have a variable called age,
as in age in years,

27
00:01:51,622 --> 00:01:54,150
and we try to predict this.

28
00:01:54,150 --> 00:01:57,241
We have a model that yields predictions for
age.

29
00:01:57,241 --> 00:02:01,386
Let's assume that
the relationship between the two,

30
00:02:01,386 --> 00:02:08,010
the actual age and our prediction,
looks as in the graph.

31
00:02:08,010 --> 00:02:15,660
So you can see that the model boasts
quite a high R-squared value of 0.91,

32
00:02:15,660 --> 00:02:19,980
but it doesn't do so
well across the whole range of values.

33
00:02:19,980 --> 00:02:25,680
So when age is less than 50,
the model actually does quite well.

34
00:02:25,680 --> 00:02:28,856
But when age is more than 50,

35
00:02:28,856 --> 00:02:33,505
you can see that the average
error is higher.

36
00:02:33,505 --> 00:02:35,960
Now let's take another example.

37
00:02:35,960 --> 00:02:40,962
Let's assume we have a second model
that also tries to predict age,

38
00:02:40,962 --> 00:02:43,167
but this one looks like that.

39
00:02:43,167 --> 00:02:48,988
As you can see, this model does quite
well when age is higher than 50,

40
00:02:48,988 --> 00:02:56,020
but not so well when age is less than 50,
nevertheless, it scores again 0.91.

41
00:02:56,020 --> 00:03:01,200
So we have two models that have
a similar predictive power,

42
00:03:01,200 --> 00:03:04,007
but they look quite different.

43
00:03:04,007 --> 00:03:08,682
It's quite obvious that they do
better in different parts of

44
00:03:08,682 --> 00:03:10,707
the distribution of age.

45
00:03:10,707 --> 00:03:14,394
So what will happen if we
were to try to combine

46
00:03:14,394 --> 00:03:19,148
these two with a simple averaging method,
in other words,

47
00:03:19,148 --> 00:03:25,540
just say (model 1 + model 2) / 2,
so a simple averaging method.
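
To make the (model 1 + model 2) / 2 idea concrete, here is a minimal sketch in plain Python. The numbers are invented for illustration and are not the data behind the graphs, and the function names `simple_average` and `mean_abs_error` are hypothetical helpers. It also measures mean absolute error instead of the R-squared quoted in the video, simply because that is easier to check by hand.

```python
def simple_average(pred_1, pred_2):
    """Element-wise mean of two equally long prediction lists."""
    return [(a + b) / 2 for a, b in zip(pred_1, pred_2)]


def mean_abs_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented numbers for illustration only (assumption, not the lecture data).
actual_age = [20, 35, 60, 80]
model_1 = [21, 34, 68, 72]  # close below 50, far off above 50
model_2 = [12, 43, 59, 81]  # far off below 50, close above 50

blended = simple_average(model_1, model_2)

print("model 1 error:", mean_abs_error(actual_age, model_1))
print("model 2 error:", mean_abs_error(actual_age, model_2))
print("blend error:  ", mean_abs_error(actual_age, blended))
```

In this toy setup the blend has a lower average error than either model alone, yet below 50 it is still further from the truth than model 1 is on its own.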

48
00:03:25,540 --> 00:03:28,920
The end result will look
as in the new graph.

49
00:03:28,920 --> 00:03:34,592
So, R-squared has moved to 0.95,
which is a considerable

50
00:03:34,592 --> 00:03:40,692
improvement versus the 0.91 we had before,
and as you can see,

51
00:03:40,692 --> 00:03:46,059
on average, the points tend to
be closer to reality.

52
00:03:46,059 --> 00:03:49,723
So the average error is smaller.

53
00:03:49,723 --> 00:03:56,052
However, as you can see, the model doesn't
do better than the individual models in

54
00:03:56,052 --> 00:03:59,998
the areas where the models
were doing really well,

55
00:03:59,998 --> 00:04:03,410
nevertheless, it does better on average.

56
00:04:03,410 --> 00:04:06,584
This is something we need to understand,

57
00:04:06,584 --> 00:04:12,195
that there is potentially a better
way to combine these models.

58
00:04:12,195 --> 00:04:15,354
We could try to take a weighted average.

59
00:04:15,354 --> 00:04:19,976
So say, I'm going to take 70% of
the first model prediction and

60
00:04:19,976 --> 00:04:22,893
30% of the second model prediction.

61
00:04:22,893 --> 00:04:28,853
In other words,
(model 1 x 0.7 + model 2 x 0.3),

62
00:04:28,853 --> 00:04:33,393
and the end result would
look as in the graph.
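
As a rough sketch of the 70% / 30% blend just described, the plain-Python snippet below applies the weights to two prediction lists. The data is invented for illustration, the helper names `weighted_average` and `mean_abs_error` are hypothetical, and mean absolute error stands in for the R-squared used in the video.

```python
def weighted_average(pred_1, pred_2, weight_1=0.7):
    """Blend two prediction lists; the second weight is 1 - weight_1."""
    weight_2 = 1.0 - weight_1
    return [weight_1 * a + weight_2 * b for a, b in zip(pred_1, pred_2)]


def mean_abs_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented numbers for illustration only (assumption, not the lecture data).
actual_age = [20, 35, 60, 80]
model_1 = [21, 34, 68, 72]  # close below 50, far off above 50
model_2 = [12, 43, 59, 81]  # far off below 50, close above 50

weighted = weighted_average(model_1, model_2, weight_1=0.7)
equal = weighted_average(model_1, model_2, weight_1=0.5)

print("70/30 blend error:", mean_abs_error(actual_age, weighted))
print("50/50 blend error:", mean_abs_error(actual_age, equal))
```

With these toy numbers the 70/30 blend is no better overall than the 50/50 one. It simply moves closer to model 1, so it improves below 50 and gets worse above 50.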

63
00:04:33,393 --> 00:04:38,849
So you can see the R-squared is no better
and that makes sense, because the models

64
00:04:38,849 --> 00:04:44,560
have quite similar predictive power and
it doesn't make sense to rely more on one.

65
00:04:46,280 --> 00:04:51,215
And also it is quite clear that
it looks more like model 1,

66
00:04:51,215 --> 00:04:56,452
because it has better predictions
when age is less than 50,

67
00:04:56,452 --> 00:05:00,699
and worse predictions
when age is more than 50.

68
00:05:00,699 --> 00:05:08,250
As a theoretical exercise, what is the
theoretical best we could get out of this?

69
00:05:08,250 --> 00:05:13,250
We know we have a model that scores
really well when age is less than 50,

70
00:05:13,250 --> 00:05:17,820
and another model that scores really
well when age is more than 50.

71
00:05:17,820 --> 00:05:21,776
So ideally, we would like to
get to something like that.

72
00:05:21,776 --> 00:05:26,420
This is how we leverage the two
models in the best possible way

73
00:05:26,420 --> 00:05:29,891
here by using a simple
conditioning method.

74
00:05:29,891 --> 00:05:35,187
So if age is less than 50, use model 1, else just
use the other, and we will see later

75
00:05:35,187 --> 00:05:40,310
on that there are ensemble methods
that are very good at finding these

76
00:05:40,310 --> 00:05:46,510
relationships between two or more predictions
with respect to the target variable.
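
The conditional rule can be sketched as below, again in plain Python with invented numbers. `conditional_blend` and its `gate` argument are hypothetical names. Gating on the actual age only shows the theoretical best, since the true age is unknown at prediction time; the second call uses the simple average of the two predictions as a stand-in gate, which is an assumption of this sketch and not a method from the video.

```python
def conditional_blend(pred_1, pred_2, gate, threshold=50):
    """Use pred_1 where gate < threshold, otherwise pred_2."""
    return [a if g < threshold else b for a, b, g in zip(pred_1, pred_2, gate)]


def mean_abs_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented numbers for illustration only (assumption, not the lecture data).
actual_age = [20, 35, 60, 80]
model_1 = [21, 34, 68, 72]  # close below 50, far off above 50
model_2 = [12, 43, 59, 81]  # far off below 50, close above 50

# Theoretical best: gate on the true age (not available in practice).
oracle = conditional_blend(model_1, model_2, gate=actual_age)

# Usable stand-in: gate on the simple average of the two predictions.
proxy_gate = [(a + b) / 2 for a, b in zip(model_1, model_2)]
practical = conditional_blend(model_1, model_2, gate=proxy_gate)

print("oracle error:   ", mean_abs_error(actual_age, oracle))
print("practical error:", mean_abs_error(actual_age, practical))
```

On this toy data both gates pick the same model for every row, so the practical version matches the theoretical best. On real data a proxy gate will sometimes choose the wrong model near the threshold.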

77
00:05:46,510 --> 00:05:49,210
But, this will be a topic for
another discussion.

78
00:05:49,210 --> 00:05:53,340
Here we discussed simple averaging methods,

79
00:05:53,340 --> 00:05:58,250
hopefully you found it useful, and
stay tuned for the next session.

80
00:05:58,250 --> 00:05:59,170
Thank you very much.