In this video, I'm going to describe the Bayesian approach to fitting models, using a simple coin-tossing example. If you already know about the Bayesian approach, you can skip this video. The main idea behind the Bayesian approach is that instead of looking for the most likely setting of the parameters of the model, we should consider all possible settings of the parameters and try to figure out, for each of those possible settings, how probable it is given the data we observed. The Bayesian framework assumes that we always have a prior distribution for everything. That is, for any event that you might care to mention, I have to have some prior probability that that event might happen. The prior might be very vague. So what happens is, our data gives us a likelihood term. We combine it with our prior and then we get a posterior. The likelihood term favors settings of our parameters that make the data more likely. It can disagree with the prior. And in the limit, if we get enough data, however unlikely the right answer is under your prior, the data can overwhelm the prior. In the end, with enough data, the truth will out. That is, even if your prior's wrong, you'll end up with the right hypothesis. But that may take an awful lot of data if you thought that things were very unlikely under your prior. So let's start with a coin-tossing example. Suppose you don't know anything about coins except that they can be tossed, and when you toss a coin you get either a head or a tail. We're also going to assume you know that each toss is an independent event. So our model of a coin is going to have one parameter, p. This parameter p determines the probability that the coin will produce a head. What happens now if we see 100 tosses and there are 53 heads? What is a good value for p? Well, obviously you're tempted to say 0.53. But what's the justification for that? The frequentist answer, which is also called maximum likelihood, is to pick the value of p that makes the observations most probable. And that value of p is 0.53. It's not obvious that's true, so let's derive it. The probability of a particular sequence that contains 53 heads and 47 tails can be written out by writing down p every time you toss a head, and 1-p every time you toss a tail. Then if we collect all the p's together, and all the (1-p)'s together, we get p^53 times (1-p)^47. If we now ask how the probability of observing that data depends on p, we can differentiate with respect to p, and we get the expression shown here, and if we then set that derivative to zero, we discover that the probability of the data is maximized by setting p to be 0.53. So that's maximum likelihood. But there are some problems with using maximum likelihood to decide on the parameters of a model. Suppose, for example, we only toss the coin once and we get one head. It doesn't really make sense to say we think the probability of the coin coming down heads in future is one. That would mean we'd be willing to bet at infinite odds that it can't come down tails. And that seems ridiculous. It's sort of intuitively obvious that a much better answer is 0.5. But how can we justify that? More importantly, we can ask: is it reasonable to give a single answer at all? We don't know much. We don't have much data, and so we're unsure about what the value of p is. So what we really ought to do is refuse to give a single answer and instead give a whole probability distribution across possible answers. An answer like 0.5 is fairly likely.
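For reference, here is the maximum likelihood calculation the video refers to as "the expression shown here", worked out. Taking the logarithm first is a standard convenience not mentioned in the video (the log is monotonic, so it has the same maximizer):

$$
P(D \mid p) = p^{53}(1-p)^{47},
\qquad
\log P(D \mid p) = 53\log p + 47\log(1-p)
$$
$$
\frac{d}{dp}\log P(D \mid p) = \frac{53}{p} - \frac{47}{1-p} = 0
\;\Longrightarrow\;
53(1-p) = 47p
\;\Longrightarrow\;
p = \frac{53}{100} = 0.53
$$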
An answer like one is maybe still pretty unlikely if we have some prior belief that coins come down heads about half the time. So now I'm going to go through an example where we start with some prior distribution over parameter values, and we'll pick a prior distribution that's easy to work with, not one that necessarily fits what we really believe about coins. Then we'll show how that prior distribution gets modified by data if we adopt the Bayesian approach. We're going to start with a prior distribution that says all the different values of p are equally likely. We believe that coins are biased to various extents, and any amount of bias is equally likely. So some coins come down heads half the time, other coins come down heads all the time, and those two kinds of coins are equally likely. We now observe a coin coming down heads. So what we do now is, for each possible value of p, we take its prior probability and we multiply it by the probability that we would have observed a head, given that that value of p is the correct one. So, for example, if we take the value p = 1, which says coins come down heads every time, then the probability of observing a head would be one; there would be no alternative. If we take the value of p to be zero, the probability of observing a head would be zero. And if we take it to be 0.5, the probability of observing a head is 0.5. So we take that red line, that's our prior, and we multiply each point on it by the probability of observing a head according to that hypothesis. And now we get the sloping line; that's an unnormalized posterior. It's unnormalized because the area under that line doesn't add up to one, and of course for a probability distribution, the probabilities of all the alternative events have to add to one. So the last thing we do is re-normalize it: we scale everything up so the area under the curve is one. And now, if we started with the uniform prior distribution over p, we end up with this triangular posterior distribution over p, having observed one head. Now let's do it again, and this time let's suppose we get a tail. So the prior distribution we start with now is the posterior distribution we had after observing our one head. And now the green line shows the probability that we will get a tail according to each of those hypotheses that correspond to a value of p. So, for example, if p is one, the probability we would observe a tail is zero. We multiply our prior by our likelihood term, and now we get a curve like that. Then we have to re-normalize to make the area be one, and that's now a posterior distribution, after having observed one head and one tail. Notice it's a pretty sensible distribution. After observing one of each, we know that p can't be either zero or one, and it also seems very sensible that the most likely thing is now in the middle. If we do this another 98 times, and keep applying the same strategy of multiplying the posterior we had after the last toss by the likelihood of observing that event given the various different settings of the parameter p, and let's suppose we get 53 heads and 47 tails in all, then we'll end up with a curve that looks like this. It'll have its peak at 0.53, because we started with the uniform prior, and it'll be fairly sharply peaked around 0.53. But it'll allow other values: 0.49 is a perfectly reasonable value under this curve, not quite as likely as 0.53, but quite reasonable. Whereas a value of 0.25 is extremely unlikely under this curve.
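Here is a minimal numerical sketch of the updating procedure just described. It is not from the lecture itself; the 101-point grid over p and the particular ordering of the tosses are illustrative assumptions.

```python
import numpy as np

# Grid-based Bayesian updating for the coin's head probability p.
p_grid = np.linspace(0.0, 1.0, 101)   # candidate values of p
posterior = np.ones_like(p_grid)      # uniform prior over p
posterior /= posterior.sum()          # normalize so it sums to one

def update(posterior, heads):
    """Multiply the current distribution by the likelihood of one toss, then renormalize."""
    likelihood = p_grid if heads else (1.0 - p_grid)
    unnormalized = posterior * likelihood
    return unnormalized / unnormalized.sum()

posterior = update(posterior, heads=True)    # one head: the triangular posterior
posterior = update(posterior, heads=False)   # one tail: zero at p=0 and p=1, peak in the middle

# 52 more heads and 46 more tails, for 53 heads and 47 tails in total.
for _ in range(52):
    posterior = update(posterior, heads=True)
for _ in range(46):
    posterior = update(posterior, heads=False)

print("posterior peak at p =", p_grid[np.argmax(posterior)])   # 0.53
```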
So we can summarize all that with Bayes' theorem. The term in the middle of this equation is the joint probability of a set of parameters W and some data D. For supervised learning, the data is going to consist of the target values. That is, we assume we are given the inputs, and the data consists of the target values associated with those inputs; that's what we observe. That joint probability can be re-written as the product of an unconditional probability and a conditional probability. On the right, we've written it as p of W times p of D given W, and on the left, I've written it as p of D times p of W given D. Now we can divide both sides by p of D, and this gives us Bayes' theorem in its usual form. Bayes' theorem says that the posterior probability of a particular value of W, given the data D, is just the prior probability of that particular value of W, times the probability, given that particular value of W, that you would have observed the data you observed. And that has to be normalized by p of D, the probability of the data, which is simply the integral over all possible values of W of p of W times p of D given W. The denominator needs to be the sum of the numerator over all possible values of W in order for this to be a probability distribution that adds to one. Because that p of D has been integrated over all possible values of W, it's not affected by picking a particular value of W on the left-hand side. So when we're looking for the best value of W, for example, we can ignore p of D; it doesn't depend on W. The other two terms on the right-hand side, however, do depend on W.
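In symbols, the derivation walked through above is:

$$
p(W, D) = p(W)\, p(D \mid W) = p(D)\, p(W \mid D)
$$
$$
p(W \mid D) = \frac{p(W)\, p(D \mid W)}{p(D)},
\qquad
p(D) = \int p(W)\, p(D \mid W)\, dW
$$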