In this video, I'm going to describe the Bayesian approach to fitting models, using a simple coin-tossing example. If you already know about the Bayesian approach, you can skip this video. The main idea behind the Bayesian approach is that instead of looking for the most likely setting of the parameters of the model, we should consider all possible settings of the parameters and try to figure out, for each of those possible settings, how probable it is given the data we observed. The Bayesian framework assumes that we always have a prior distribution for everything. That is, for any event that you might care to mention, I have to have some prior probability that that event might happen. The prior might be very vague. So what happens is, our data gives us a likelihood term. We combine it with our prior and then we get a posterior. The likelihood term favors settings of our parameters that make the data more likely. It can disagree with the prior. And in the limit, if we get enough data, however unlikely the right answer is under your prior, the data can overwhelm the prior. In the end, with enough data, the truth will out. That is, even if your prior's wrong, you'll end up with the right hypothesis. But that may take an awful lot of data if you thought that things were very unlikely under your prior. So let's start with a coin-tossing example. Suppose you don't know anything about coins except that they can be tossed, and when you toss a coin you get either a head or a tail. We're also going to assume you know that each toss is an independent event. So our model of a coin is going to have one parameter, p. This parameter p determines the probability that the coin will produce a head. What happens now if we see 100 tosses and there are 53 heads? What is a good value for p? Well, obviously you're tempted to say 0.53. But what's the justification for that? The frequentist answer, which is also called maximum likelihood, is to pick the value of p that makes the observations most probable. And that value of p is 0.53. It's not obvious that's true, so let's derive it. The probability of a particular sequence that contains 53 heads and 47 tails can be written out by writing down p every time you toss a head, and 1-p every time you toss a tail. Then if we collect all the p's together, and all the (1-p)'s together, we get p^53 times (1-p)^47. If we now ask how the probability of observing that data depends on p, we can differentiate with respect to p, and we get the expression shown here, and if we then set that derivative to zero, we discover that the probability of the data is maximized by setting p to be 0.53. So that's maximum likelihood. But there are some problems with using maximum likelihood to decide on the parameters of a model. Suppose, for example, we only toss the coin once and we get one head. It doesn't really make sense to say we think the probability of the coin coming down heads in future is one. That would mean we'd be willing to bet at infinite odds that it can't come down tails. And that seems ridiculous. It's sort of intuitively obvious that a much better answer is 0.5. But how can we justify that? More importantly, we can ask: is it reasonable to give a single answer at all? We don't know much. We don't have much data, and so we're unsure about what the value of p is. So what we really ought to do is refuse to give a single answer and instead give a whole probability distribution across possible answers. An answer like 0.5 is fairly likely.
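For reference, here is the maximum likelihood calculation the video refers to as "the expression shown here", worked out. Taking the logarithm first is a standard convenience not mentioned in the video (the log is monotonic, so it has the same maximizer):

$$
P(D \mid p) = p^{53}(1-p)^{47},
\qquad
\log P(D \mid p) = 53\log p + 47\log(1-p)
$$
$$
\frac{d}{dp}\log P(D \mid p) = \frac{53}{p} - \frac{47}{1-p} = 0
\;\Longrightarrow\;
53(1-p) = 47p
\;\Longrightarrow\;
p = \frac{53}{100} = 0.53
$$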
An answer like one is maybe still pretty unlikely if we have some prior belief that coins come down heads about half the time. So now I'm going to go through an example where we start with some prior distribution over parameter values, and we'll pick a prior distribution that's easy to work with, not one that necessarily fits what we really believe about coins. Then we'll show how that prior distribution gets modified by data if we adopt the Bayesian approach. We're going to start with a prior distribution that says all the different values of p are equally likely. We believe that coins are biased to various extents, and any amount of bias is equally likely. So some coins come down heads half the time, other coins come down heads all the time, and those two kinds of coins are equally likely. We now observe a coin coming down heads. So what we do now is, for each possible value of p, we take its prior probability and we multiply it by the probability that we would have observed a head, given that that value of p is the correct one. So, for example, if we take the value p = 1, which says coins come down heads every time, then the probability of observing a head would be one; there would be no alternative. If we take the value of p to be zero, the probability of observing a head would be zero. And if we take it to be 0.5, the probability of observing a head is 0.5. So we take that red line, that's our prior, and we multiply each point on it by the probability of observing a head according to that hypothesis. And now we get the sloping line; that's an unnormalized posterior. It's unnormalized because the area under that line doesn't add up to one, and of course for a probability distribution, the probabilities of all the alternative events have to add to one. So the last thing we do is re-normalize it: we scale everything up so the area under the curve is one. And now, if we started with the uniform prior distribution over p, we end up with this triangular posterior distribution over p, having observed one head. Now let's do it again, and this time let's suppose we get a tail. So the prior distribution we start with now is the posterior distribution we had after observing our one head. And now the green line shows the probability that we will get a tail according to each of those hypotheses that correspond to a value of p. So, for example, if p is one, the probability we would observe a tail is zero. We multiply our prior by our likelihood term, and now we get a curve like that. Then we have to re-normalize to make the area be one, and that's now a posterior distribution, after having observed one head and one tail. Notice it's a pretty sensible distribution. After observing one of each, we know that p can't be either zero or one, and it also seems very sensible that the most likely thing is now in the middle. If we do this another 98 times, and keep applying the same strategy of multiplying the posterior we had after the last toss by the likelihood of observing that event given the various different settings of the parameter p, and let's suppose we get 53 heads and 47 tails in all, then we'll end up with a curve that looks like this. It'll have its peak at 0.53, because we started with the uniform prior, and it'll be fairly sharply peaked around 0.53. But it'll allow other values: 0.49 is a perfectly reasonable value under this curve, not quite as likely as 0.53, but quite reasonable. Whereas a value of 0.25 is extremely unlikely under this curve.
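Here is a minimal numerical sketch of the updating procedure just described. It is not from the lecture itself; the 101-point grid over p and the particular ordering of the tosses are illustrative assumptions.

```python
import numpy as np

# Grid-based Bayesian updating for the coin's head probability p.
p_grid = np.linspace(0.0, 1.0, 101)   # candidate values of p
posterior = np.ones_like(p_grid)      # uniform prior over p
posterior /= posterior.sum()          # normalize so it sums to one

def update(posterior, heads):
    """Multiply the current distribution by the likelihood of one toss, then renormalize."""
    likelihood = p_grid if heads else (1.0 - p_grid)
    unnormalized = posterior * likelihood
    return unnormalized / unnormalized.sum()

posterior = update(posterior, heads=True)    # one head: the triangular posterior
posterior = update(posterior, heads=False)   # one tail: zero at p=0 and p=1, peak in the middle

# 52 more heads and 46 more tails, for 53 heads and 47 tails in total.
for _ in range(52):
    posterior = update(posterior, heads=True)
for _ in range(46):
    posterior = update(posterior, heads=False)

print("posterior peak at p =", p_grid[np.argmax(posterior)])   # 0.53
```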
So we can summarize all that with Bayes' theorem. The term in the middle of this equation is the joint probability of a set of parameters W and some data D. For supervised learning, the data is going to consist of the target values. That is, we assume we are given the inputs, and the data consists of the target values associated with those inputs; that's what we observe. That joint probability can be re-written as the product of an unconditional probability and a conditional probability. On the right, we've written it as p of W times p of D given W, and on the left, I've written it as p of D times p of W given D. Now we can divide both sides by p of D, and this gives us Bayes' theorem in its usual form. Bayes' theorem says that the posterior probability of a particular value of W, given the data D, is just the prior probability of that particular value of W, times the probability, given that particular value of W, that you would have observed the data you observed. And that has to be normalized by p of D, the probability of the data, which is simply the integral over all possible values of W of p of W times p of D given W. The denominator needs to be the sum of the numerator over all possible values of W in order for this to be a probability distribution that adds to one. Because that p of D has been integrated over all possible values of W, it's not affected by picking a particular value of W on the left-hand side. So when we're looking for the best value of W, for example, we can ignore p of D; it doesn't depend on W. The other two terms on the right-hand side, however, do depend on W.
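In symbols, the derivation walked through above is:

$$
p(W, D) = p(W)\, p(D \mid W) = p(D)\, p(W \mid D)
$$
$$
p(W \mid D) = \frac{p(W)\, p(D \mid W)}{p(D)},
\qquad
p(D) = \int p(W)\, p(D \mid W)\, dW
$$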