In this video, I'm going to describe the Bayesian approach to fitting models, using a simple coin-tossing example. If you already know about the Bayesian approach, you can skip this video.

The main idea behind the Bayesian approach is that instead of looking for the most likely setting of the parameters of the model, we should consider all possible settings of the parameters and try to figure out, for each of those possible settings, how probable it is given the data we observed. The Bayesian framework assumes that we always have a prior distribution for everything. That is, for any event that you might care to mention, I have to have some prior probability that that event might happen. The prior might be very vague.

So what happens is, our data gives us a likelihood term. We combine it with our prior, and then we get a posterior. The likelihood term favors settings of our parameters that make the data more likely. It can disagree with the prior, and in the limit, if we get enough data, however unlikely something is under the prior, the data can overwhelm it. In the end, with enough data, the truth will out. That is, even if your prior is wrong, you'll end up with the right hypothesis. But that may take an awful lot of data if the truth was something you thought very unlikely under your prior.

So let's start with a coin-tossing example. Suppose you don't know anything about coins except that they can be tossed, and when you toss a coin you get either a head or a tail. We're also going to assume you know that each toss is an independent event.

So our model of a coin is going to have one parameter, p. This parameter p determines the probability that the coin will produce a head. What happens now if we see 100 tosses and there are 53 heads? What is a good value for p? Well, obviously you're tempted to say 0.53, but what's the justification for that?

The frequentist answer, which is also called maximum likelihood, is to pick the value of p that makes the observations most probable, and that value of p is 0.53. It's not obvious that's true, so let's derive it. The probability of a particular sequence that contains 53 heads and 47 tails can be written out by writing down p every time you toss a head and 1 - p every time you toss a tail. And then if we collect all the p's together and all the (1 - p)'s together, we get p^53 times (1 - p)^47.
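For reference, that likelihood can be written out as a formula, together with the derivative that the next few sentences describe setting to zero (the algebra here is my own, factored so the maximizer can be read off directly):

$$ P(D \mid p) = p^{53}(1-p)^{47} $$

$$ \frac{d}{dp}\,P(D \mid p) = p^{52}(1-p)^{46}\bigl(53(1-p) - 47p\bigr) = 0 \quad\Rightarrow\quad p = \frac{53}{100} = 0.53 \quad \text{(for } 0 < p < 1\text{)} $$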
If we now ask how the probability of observing that data depends on p, we can differentiate with respect to p, and we get the expression shown here. If we then set that derivative to zero, we discover that the probability of the data is maximized by setting p to 0.53. So that's maximum likelihood.

But there are some problems with using maximum likelihood to decide on the parameters of a model. Suppose, for example, we only toss the coin once and we get one head. It doesn't really make sense to say we think the probability of the coin coming down heads in future is one. That would mean we'd be willing to bet at infinite odds that it can't come down tails, and that seems ridiculous. It's sort of intuitively obvious that a much better answer is 0.5, but how can we justify that? More importantly, we can ask: is it reasonable to give a single answer at all?

We don't know much. We don't have much data, and so we're unsure about what the value of p is. So what we really ought to do is refuse to give a single answer, and instead give a whole probability distribution across possible answers. An answer like 0.5 is fairly likely. An answer like one is maybe still pretty unlikely, if we have some prior belief that coins come down heads about half the time.

So now I'm going to go through an example where we start with some prior distribution over parameter values. We'll pick a prior distribution that's easy to work with, not one that necessarily fits what we really believe about coins, and then we'll show how that prior distribution gets modified by data if we adopt the Bayesian approach.

We're going to start with a prior distribution that says all the different values of p are equally likely. We believe that coins can be biased to various extents, and any amount of bias is equally likely. So some coins come down heads half the time, other coins come down heads all the time, and those two kinds of coins are equally likely.

We now observe a coin coming down heads. So what we do now is, for each possible value of p, we take its prior probability and multiply it by the probability that we would have observed a head, given that that value of p is the correct one. So, for example, if we take the value p = 1, which says coins come down heads every time, then the probability of observing a head would be one. There would be no alternative.
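In symbols, the update being carried out for each value of p is the following (my notation, not the lecture's):

$$ \text{posterior}(p) \;\propto\; \text{prior}(p)\times \Pr(\text{head}\mid p) \;=\; \text{prior}(p)\times p $$

With the uniform prior, renormalizing this gives a posterior of 2p, which is exactly the triangular shape the narration arrives at below.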
Similarly, if we take the value of p to be zero, the probability of observing a head would be zero, and if we take it to be 0.5, the probability of observing a head is 0.5. So we take that red line, which is our prior, and we multiply each point on it by the probability of observing a head according to that hypothesis. And now we get this sloping line, which is an unnormalized posterior. It's unnormalized because the area under that line doesn't add up to one, and of course for a probability distribution the probabilities of all the alternative events have to add to one. So the last thing we do is renormalize it: we scale everything so the area under the curve is one. And now, having started with the uniform prior distribution over p, we end up with this triangular posterior distribution over p, having observed one head.

Now let's do it again, and this time let's suppose we get a tail. The prior distribution we start with now is the posterior distribution we had after observing our one head. And now the green line shows the probability that we would get a tail according to each of those hypotheses corresponding to a value of p. So, for example, if p is one, the probability that we would observe a tail is zero. We multiply our prior by our likelihood term, and we get a curve like that. Then we have to renormalize to make the area be one, and that's now the posterior distribution after having observed one head and one tail. Notice it's a pretty sensible distribution. After observing one of each, we know that p can't be either zero or one, and it also seems very sensible that the most likely value is now in the middle.

Now suppose we do this another 98 times, keeping to the same strategy of multiplying the posterior we had after the last toss by the likelihood of observing that event given the various different settings of the parameter p, and let's suppose we get 53 heads and 47 tails in all. Then we'll end up with a curve that looks like this. It will be centered at 0.53, because we started with the uniform prior, and it will be fairly sharply peaked around 0.53. But it allows other values: 0.49 is a perfectly reasonable value under this curve, not quite as likely as 0.53, but very reasonable, whereas a value like 0.25 is extremely unlikely under this curve.

So we can summarize all that with Bayes' theorem.
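Before writing that down, here is a minimal numerical sketch of the sequential updating just described, on a discrete grid of values of p (the grid size and the names are my own choices, not anything from the lecture):

```python
import numpy as np

# Discretize p into a grid of candidate values for the probability of heads.
p_grid = np.linspace(0.0, 1.0, 1001)

# Uniform prior: every value of p starts out equally likely.
posterior = np.ones_like(p_grid)
posterior /= posterior.sum()

def update(posterior, heads):
    """Multiply by the likelihood of one toss, then renormalize so it sums to one."""
    likelihood = p_grid if heads else (1.0 - p_grid)
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# 53 heads and 47 tails; the order of the tosses doesn't change the final posterior.
for outcome in [True] * 53 + [False] * 47:
    posterior = update(posterior, outcome)

print(p_grid[np.argmax(posterior)])  # peaks at 0.53, sharply, as described above
```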
The term in the middle of this equation is the joint probability of a set of parameters, W, and some data, D. For supervised learning, the data is going to consist of the target values: we assume we are given the inputs, and the data consists of the target values associated with those inputs. That's what we observe.

That joint probability can be rewritten as the product of a marginal probability and a conditional probability. On the right I've written it as p of W times p of D given W, and on the left I've written it as p of D times p of W given D. Now we can divide both sides by p of D, and this gives us Bayes' theorem in its usual form.

Bayes' theorem says that the posterior probability of a particular value of W, given the data D, is just the prior probability of that particular value of W, times the probability, given that particular value of W, that you would have observed the data you did observe. And that has to be normalized by p of D, the probability of the data, which is simply the integral over all possible values of W of p of W times p of D given W. The denominator needs to be the sum of the numerator over all possible values of W in order for this to be a probability distribution that adds to one. Because p of D is integrated over all possible values of W, it's not affected by picking a particular value of W on the left-hand side. So when we're looking for the best value of W, for example, we can ignore p of D; it doesn't depend on W. The other two terms on the right-hand side, however, do depend on W.
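Putting that verbal statement into symbols, with W for the parameters and D for the data as in the narration:

$$ p(W \mid D) \;=\; \frac{p(W)\,p(D \mid W)}{p(D)}, \qquad p(D) \;=\; \int p(W)\,p(D \mid W)\,dW $$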