[MUSIC]

In this video we will see two approaches to statistics: the Frequentist one and the Bayesian one. They differ in many aspects, and the most important one is that Frequentists treat probability as objective, while Bayesians treat it as subjective. Let me explain. Imagine you have a coin and you toss it. A Frequentist would say that with probability one half it will land heads up and with probability one half it will land tails up, and you can do nothing about it. A Bayesian, however, would say that if someone knew the initial velocity and all the other initial parameters of the coin toss, that person would be able to predict the outcome of the experiment, and for them the experiment would not be random at all.

Another difference between Frequentists and Bayesians is how they treat the parameters and the data. Frequentists would say that the parameters are fixed and the data is random, and they would want to find the single optimal point. Bayesians would say it the opposite way: the parameters are random and the data is fixed. This actually makes sense: when you train your model, you already know the data, so it is not random anymore. However, there may be multiple good parameter values, and you would like to define a probability distribution over them.

Also, Bayesian methods work for an arbitrary number of data points, while the Frequentist ones work well only when the number of data points is much bigger than the number of parameters. You may say that this is not a problem when we have big data. However, remember that a neural network can have millions of parameters while the number of data points is often only in the thousands.

Another difference is how Frequentists and Bayesians train their models. Frequentists train using the maximum likelihood principle: they try to find the parameters theta that maximize the likelihood, the probability of the data given the parameters, p(X | theta). Bayesians instead try to compute the posterior, the probability of the parameters given the data, p(theta | X), and they do this using the Bayes formula. This gives us a lot of interesting possibilities.
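To make the contrast concrete, here is a minimal sketch in Python, assuming a Bernoulli coin-toss model with a conjugate Beta prior; the data and prior parameters below are invented for illustration, not taken from the lecture. The Frequentist route maximizes the likelihood and returns a single point estimate, while the Bayesian route applies the Bayes formula and returns a whole posterior distribution over theta.

    import numpy as np

    # Hypothetical coin-toss data: 1 = heads, 0 = tails.
    tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
    heads, n = int(tosses.sum()), len(tosses)

    # Frequentist: maximum likelihood. For a Bernoulli model the
    # theta maximizing p(data | theta) is simply heads / n.
    theta_mle = heads / n

    # Bayesian: posterior via the Bayes formula. With a Beta(a, b)
    # prior (conjugate to the Bernoulli likelihood) the posterior
    # is Beta(a + heads, b + tails) in closed form.
    a, b = 2.0, 2.0  # assumed prior belief: coins tend to be fair
    post_a, post_b = a + heads, b + (n - heads)
    theta_post_mean = post_a / (post_a + post_b)

    print(f"MLE point estimate: {theta_mle:.3f}")
    print(f"Posterior mean:     {theta_post_mean:.3f} (Beta({post_a:.0f}, {post_b:.0f}))")

Note how the prior pulls the posterior mean towards 0.5 relative to the maximum likelihood estimate; this is exactly the regularizing effect of the prior discussed next.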
The Bayes formula computes the posterior distribution, in this case for classification: the probability of the parameters given the training data set, x_train and y_train. From the posterior we can compute the prediction, the probability of a new point y_test given the training data set. This can be done using the marginalization principle: p(y_test | x_test, x_train, y_train) is the integral over theta of p(y_test | x_test, theta) times p(theta | x_train, y_train), where the second factor is the posterior we have already estimated during the training procedure.

This formula can also lead to regularization: we can treat the prior on theta as a regularizer. Imagine you want to estimate the probability that your coin lands heads up. You already know that most coins land heads up with probability 0.5, so you can use a prior that says most coins are fair. However, if you know that in your experiment the probability of heads can either be fair, that is exactly 0.5, or biased towards heads, that is greater than 0.5, you could use a prior that puts probability only on those values.

Also, Bayesian methods are really good for online learning. Imagine that your data comes in small portions. Then you can use each portion to update your parameters, and then use the new posterior as the prior for the next experiment. Let's look at this more closely. The new posterior, the probability of theta indexed by k, equals the posterior we get after observing data point x_k: p_k(theta) = p(theta | x_1, ..., x_k). If we apply the Bayes formula, we see that as the prior we can use the probability of theta indexed by k-1, that is, the previous posterior. Let's see how this works on an example. Imagine you want to estimate some parameter theta, and here is your prior. Then you get, for example, ten points and update it using the Bayes formula. Here is what you get: the variance of the distribution is much smaller, and you can see that the mean has changed as well. As you get more and more points, for example 20 points and then 30 points, your estimate becomes more and more precise, so you can recover the true value of the parameter very well.

We have now seen that Bayesian methods have a lot of advantages over the Frequentist ones, and we will learn more about them throughout this course.
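As a sketch of this posterior-becomes-prior loop, here is a small Python example under the same conjugate Beta-Bernoulli assumption as before; the true parameter value, batch sizes, and random seed are invented for illustration. After each portion of data the posterior counts are updated in place, and the shrinking standard deviation mirrors the 10-, 20-, and 30-point plots described above.

    import numpy as np

    rng = np.random.default_rng(0)
    true_theta = 0.7  # the unknown parameter we are estimating

    # Start from a uniform Beta(1, 1) prior over theta.
    a, b = 1.0, 1.0
    for k in range(1, 4):
        # A new small portion of data arrives: 10 Bernoulli observations.
        batch = rng.random(10) < true_theta
        heads = int(batch.sum())
        # Bayes formula: the previous posterior acts as the prior,
        # so the update just accumulates the new counts.
        a, b = a + heads, b + (10 - heads)
        mean = a / (a + b)
        std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
        print(f"after {10 * k} points: mean = {mean:.3f}, std = {std:.3f}")

Each pass through the loop prints a tighter posterior, illustrating why online updating needs no access to the earlier data portions: everything learned so far is summarized in the current prior.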
[MUSIC]