But, as an alternative, we can perform Bayesian inference using a totally different algorithm called Gibbs sampling. In Gibbs sampling, we're going to iteratively provide hard assignments, just like we did in k-means, but these hard assignments are going to be drawn randomly from a specific distribution, whereas, remember, in k-means we just solved an equation and got a single solution. But the point here is that as we look across these random samples of our model parameters and assignment variables that are being produced by Gibbs sampling, they capture the uncertainty in the values that these variables can take. And the reason we're presenting Gibbs sampling, in contrast to variational methods, is that the updates of our Gibbs sampling steps tend to be very intuitive and very straightforward to implement.

Okay, so let's look at a couple of samples from a Gibbs sampler to get a little intuition about what's going on here. Remember that in our LDA model, these are all the assignment variables and model parameters that we want to do inference over, and this is iteration 10,000 of a Gibbs sampler. The reason I'm looking so far out, at iteration 10,000, is that in very large models like LDA, it can take quite a few iterations before we get out of a potentially bad initialization and start getting some reasonable assignments. So let's imagine this is iteration 10,000, and these are the values that were randomly assigned to each of these model parameters and assignment variables. Then maybe this is sample 10,001, and this is sample 10,002, and if I just flip back and forth through these three samples a few times, we see how the values change. This is the randomness inherent in this algorithm, and so we can talk about what we expect to happen as we keep sampling over iterations.

Well, one thing we can look at here is something called the joint model probability. This is like the likelihood that we monitored before, but in a Bayesian model, not only do we have to account for the likelihood of our data given a specific set of model parameters, we also have to account for the probability of those model parameters themselves. And together, these two terms form the joint model probability.
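To make the contrast with k-means concrete, here is a minimal sketch of one Gibbs sweep, plus the corresponding log joint probability, for a toy Gaussian mixture with known means `mu` and mixing weights `pi`. This is not the lecture's actual LDA sampler, and all of the names here are illustrative assumptions. The key point is that each hard assignment is drawn at random from its conditional distribution rather than set to the single best cluster, and the joint probability adds a prior term to the data likelihood.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def gibbs_sweep(x, z, mu, pi, sigma=1.0):
    """One pass that re-samples each hard assignment z[i] at random."""
    for i in range(len(x)):
        # Conditional probability of each cluster for point i (up to a constant)
        log_p = np.log(pi) + norm.logpdf(x[i], loc=mu, scale=sigma)
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        # Unlike k-means' argmax, draw the assignment from this distribution
        z[i] = rng.choice(len(mu), p=p)
    return z

def log_joint(x, z, mu, pi, sigma=1.0):
    """Joint model probability = log-likelihood of the data + log-prior of the assignments."""
    return (norm.logpdf(x, loc=mu[z], scale=sigma).sum()  # likelihood term
            + np.log(pi[z]).sum())                        # prior term
```

Running `gibbs_sweep` repeatedly and recording `log_joint` after each sweep would give exactly the kind of trace over iterations discussed next.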
And if we look at the joint model probability over Gibbs sampling iterations, we tend to get plots that look like the following, where there's an initial burn-in period in which we're just moving from our initialization to some reasonable part of the space, and then we wander around. Throughout this whole process, the joint model probability can go up and down, and the reason is that this is a randomized algorithm. There's no guarantee that we're going to increase our joint probability; this is not an optimization algorithm. And that's actually the point: we're intending to explore this space of possible solutions, rather than converge to just a single point. This is very similar to what we saw for stochastic gradient descent, or ascent, where there was randomness too, because we were computing a noisy estimate of the gradient, and we showed that our objective function wasn't going to simply monotonically increase, or at least not decrease; it might actually go up and down.

But the important thing here is that eventually, meaning after enough iterations, the random samples that are produced by our Gibbs sampling algorithm allow us to form what are called correct Bayesian estimates. To provide an example of what this means, if we're thinking of a predictive task, we said that Bayesians average over uncertainty in the parameters. Well, what we can do from the output of our Gibbs sampler is look at a given sample, with its hard assignments of the model parameters and variables, and form our prediction from that set of assignments. Then we can look at our next sample and form our prediction, and our next sample and form our prediction. Then we can average over these predictions across the different samples, and naturally, because of how the Gibbs sampler is exploring the uncertainty, spending more time in high-probability regions and less time in low-probability regions, this forms a correct prediction, a correct Bayesian average over the uncertainty in our parameters.
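As a rough sketch of that averaging step, assuming we have kept the post-burn-in samples and have some per-sample prediction function (all names here, such as `predict_fn`, are illustrative rather than from the course):

```python
import numpy as np

def bayesian_average_prediction(samples, predict_fn, x_new):
    """Average per-sample predictions to approximate the posterior predictive.

    `samples` is a list of (parameters, assignments) tuples kept after burn-in,
    and `predict_fn` maps one such sample plus a new input to a prediction
    (e.g., a vector of topic or class probabilities).
    """
    preds = [predict_fn(params, assignments, x_new)
             for params, assignments in samples]
    # Because the sampler visits high-probability configurations more often,
    # a plain average over the retained samples weights them appropriately.
    return np.mean(preds, axis=0)
```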
Another thing we can think about doing is grabbing out what I'll call a good solution. What I mean by a good solution is the set of model parameters and assignment variables that, out of the space of everything we've explored, maximizes this joint model probability. This is an estimate of what we call the maximum a posteriori (MAP) parameter estimate. So if we monitor the joint probability over iterations, we just grab out the hard assignments made at the iteration that maximizes the joint probability.
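A minimal sketch of grabbing out that good solution, assuming we stored each iteration's hard assignments together with its log joint probability (again, the names are illustrative):

```python
import numpy as np

def approximate_map_sample(samples, log_joint_values):
    """Return the stored sample whose joint model probability was highest.

    `samples` holds the hard assignments from each Gibbs iteration and
    `log_joint_values` the corresponding log joint probabilities; taking
    the argmax gives an approximate maximum a posteriori (MAP) estimate.
    """
    best_iteration = int(np.argmax(log_joint_values))
    return samples[best_iteration]
```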