[MUSIC]

In this video we will see two approaches to statistics: the Frequentist one and the Bayesian one. They differ in many aspects, and the most important one is that Frequentists treat probability as objective, while Bayesians treat it as subjective. Let me explain. Imagine you have a coin and you toss it. A Frequentist would say that with probability one half it will land heads up and with probability one half it will land tails up, and you can do nothing about it. A Bayesian, however, would say that if someone knew the initial velocity and all the other initial parameters of the coin toss, that person would be able to predict the outcome of the experiment, and for them the experiment would not be random at all.

Another difference between Frequentists and Bayesians is how they treat the parameters and the data. Frequentists would say that the parameters are fixed and the data is random, and they would want to find the single optimal point. Bayesians would say it the opposite way: the parameters are random and the data is fixed. This actually makes sense: when you train your model, you already know the data, so it is not random anymore. However, there may be multiple good parameter values, and you would like to define a probability distribution over them.

Also, Bayesian methods work for an arbitrary number of data points, while the Frequentist ones work well only when the number of data points is much bigger than the number of parameters. You may say that this is not a problem when we have big data. However, remember that a neural network can have millions of parameters while the number of data points is often only in the thousands.

Another difference is how Frequentists and Bayesians train their models. Frequentists train using the maximum likelihood principle: they try to find the parameters theta that maximize the likelihood, the probability of the data given the parameters, p(X | theta). Bayesians instead try to compute the posterior, the probability of the parameters given the data, p(theta | X), and they do this using the Bayes formula. This gives us a lot of interesting possibilities.
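To make the contrast concrete, here is a minimal sketch in Python, assuming a Bernoulli coin-toss model with a conjugate Beta prior; the data and prior parameters below are invented for illustration, not taken from the lecture. The Frequentist route maximizes the likelihood and returns a single point estimate, while the Bayesian route applies the Bayes formula and returns a whole posterior distribution over theta.

    import numpy as np

    # Hypothetical coin-toss data: 1 = heads, 0 = tails.
    tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
    heads, n = int(tosses.sum()), len(tosses)

    # Frequentist: maximum likelihood. For a Bernoulli model the
    # theta maximizing p(data | theta) is simply heads / n.
    theta_mle = heads / n

    # Bayesian: posterior via the Bayes formula. With a Beta(a, b)
    # prior (conjugate to the Bernoulli likelihood) the posterior
    # is Beta(a + heads, b + tails) in closed form.
    a, b = 2.0, 2.0  # assumed prior belief: coins tend to be fair
    post_a, post_b = a + heads, b + (n - heads)
    theta_post_mean = post_a / (post_a + post_b)

    print(f"MLE point estimate: {theta_mle:.3f}")
    print(f"Posterior mean:     {theta_post_mean:.3f} (Beta({post_a:.0f}, {post_b:.0f}))")

Note how the prior pulls the posterior mean towards 0.5 relative to the maximum likelihood estimate; this is exactly the regularizing effect of the prior discussed next.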
The Bayes formula computes the posterior distribution, in this case for classification: the probability of the parameters given the training data set, x_train and y_train. From the posterior we can compute the prediction, the probability of a new point y_test given the training data set. This can be done using the marginalization principle: p(y_test | x_test, x_train, y_train) is the integral over theta of p(y_test | x_test, theta) times p(theta | x_train, y_train), where the second factor is the posterior we have already estimated during the training procedure.

This formula can also lead to regularization: we can treat the prior on theta as a regularizer. Imagine you want to estimate the probability that your coin lands heads up. You already know that most coins land heads up with probability 0.5, so you can use a prior that says most coins are fair. However, if you know that in your experiment the probability of heads can either be fair, that is exactly 0.5, or biased towards heads, that is greater than 0.5, you could use a prior that puts probability only on those values.

Also, Bayesian methods are really good for online learning. Imagine that your data comes in small portions. Then you can use each portion to update your parameters, and then use the new posterior as the prior for the next experiment. Let's look at this more closely. The new posterior, the probability of theta indexed by k, equals the posterior we get after observing data point x_k: p_k(theta) = p(theta | x_1, ..., x_k). If we apply the Bayes formula, we see that as the prior we can use the probability of theta indexed by k-1, that is, the previous posterior. Let's see how this works on an example. Imagine you want to estimate some parameter theta, and here is your prior. Then you get, for example, ten points and update it using the Bayes formula. Here is what you get: the variance of the distribution is much smaller, and you can see that the mean has changed as well. As you get more and more points, for example 20 points and then 30 points, your estimate becomes more and more precise, so you can recover the true value of the parameter very well.

We have now seen that Bayesian methods have a lot of advantages over the Frequentist ones, and we will learn more about them throughout this course.
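As a sketch of this posterior-becomes-prior loop, here is a small Python example under the same conjugate Beta-Bernoulli assumption as before; the true parameter value, batch sizes, and random seed are invented for illustration. After each portion of data the posterior counts are updated in place, and the shrinking standard deviation mirrors the 10-, 20-, and 30-point plots described above.

    import numpy as np

    rng = np.random.default_rng(0)
    true_theta = 0.7  # the unknown parameter we are estimating

    # Start from a uniform Beta(1, 1) prior over theta.
    a, b = 1.0, 1.0
    for k in range(1, 4):
        # A new small portion of data arrives: 10 Bernoulli observations.
        batch = rng.random(10) < true_theta
        heads = int(batch.sum())
        # Bayes formula: the previous posterior acts as the prior,
        # so the update just accumulates the new counts.
        a, b = a + heads, b + (10 - heads)
        mean = a / (a + b)
        std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
        print(f"after {10 * k} points: mean = {mean:.3f}, std = {std:.3f}")

Each pass through the loop prints a tighter posterior, illustrating why online updating needs no access to the earlier data portions: everything learned so far is summarized in the current prior.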
[MUSIC]