[MUSIC]

Hi, I am Dmitry Vetrov, research professor at the Higher School of Economics, head of the Bayesian Methods Research Group, and scientific advisor of Alex and Daniel. In this lecture, I would like to tell you about one successful example of how deep learning can be combined with Bayesian inference. We will briefly review how Bayesian methods can be scaled to big data.

So, suppose we are given a machine learning problem with data (X, Y), where X contains the observed variables and Y the hidden variables to be predicted. We have a probabilistic classifier that gives us the probabilities of the hidden components given the observed ones, p(y | x, W), parameterized by weights W. Since we are Bayesians, we also establish a reasonable prior, p(W). From the Bayesian point of view, at the training stage we need to compute the posterior distribution, p(W | X, Y). This posterior distribution contains all the information about W that we could extract from our training data, and it is the result of Bayesian training.

At the test stage, we need to perform averaging with respect to this posterior distribution. So we are not applying just a single classifier; we are applying an ensemble, and the weight of each classifier is given by our posterior distribution p(W | X, Y).

This is how it should work in theory, but in practice it does not, and the problem is in these two integrals. They are usually intractable, since they are integrals over huge-dimensional spaces. For example, in the case of deep learning, the dimensionality of W can be tens of millions of parameters. And since the integrals are intractable, we cannot even approximate them roughly. This was the reason why, until very recently, Bayesian methods were considered not scalable.

The situation has changed with the development of so-called stochastic variational inference. Instead of trying to solve the Bayesian inference problem exactly, that is, to find the true posterior distribution p(W | X, Y), we approximate it with a distribution from some parametric family, q(W | phi). The approximation is found by minimizing some kind of distance measure between the two distributions: our variational approximation and the true posterior. The setup is summarized in the formulas below.
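To make this concrete, here is a compact restatement in formulas; x^* and y^* denote a new test object, notation introduced here only for clarity:

p(W \mid X, Y) = \frac{p(Y \mid X, W)\, p(W)}{\int p(Y \mid X, W')\, p(W')\, dW'}

p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, W)\, p(W \mid X, Y)\, dW

\phi^* = \arg\min_{\phi}\, \mathrm{KL}\bigl( q(W \mid \phi) \,\|\, p(W \mid X, Y) \bigr)

The first line is the posterior computed at the training stage, the second is the ensemble prediction at the test stage, and the third is the variational approximation that replaces the intractable posterior.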
There can be different distance measures, but one of the most popular is the so-called KL divergence between q and p. As was mentioned in previous lectures, in this case the optimization problem is exactly equivalent to maximizing the so-called ELBO, or evidence lower bound. The ELBO itself is, again, an integral over a huge-dimensional space, and this integral is still intractable. But we do not need to compute the integral exactly; all we need to do is optimize it with respect to the variational parameters. And surprisingly, it appears that this is possible using the stochastic optimization framework.

Our ELBO has several very nice properties. One of them is that the likelihood p(Y | X, W) appears inside a logarithm. This means it can be split into a sum of the individual log-likelihoods of the training objects. So at each iteration of our optimization we do not need to compute the full training log-likelihood; we can compute only its unbiased estimate given by a tiny mini-batch of data. In other words, the ELBO supports mini-batching.

Another good property is that the ELBO is an expectation with respect to our variational approximation. This means that, when we need an unbiased stochastic gradient, we can replace this integral with its unbiased Monte Carlo estimate. For this purpose, we also perform the reparameterization trick in order to reduce the variance of the stochastic gradient. We then simply sample from a distribution that is parameter-free and use the Monte Carlo estimate to compute gradients.

Another good property is that the richer the variational family, the better we approximate the true posterior distribution, so we do not have a risk of overfitting: the more variational parameters we have, the closer we are to the true posterior distribution.

And finally, we can split the ELBO into two parts by splitting the logarithm inside the integral, so that there are two terms. The first term is called the data term, and it is simply the expectation, with respect to our variational approximation, of the training log-likelihood. The second term is the negative KL divergence between our variational approximation and the prior distribution, as written out and sketched in code below.
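Written out, the decomposition just mentioned is

\mathcal{L}(\phi) = \mathbb{E}_{q(W \mid \phi)} \log p(Y \mid X, W) \;-\; \mathrm{KL}\bigl( q(W \mid \phi) \,\|\, p(W) \bigr),

where the first term is the data term and the second is the regularizer.

Here is a minimal sketch of one possible implementation of these ideas, assuming a fully factorized Gaussian q(W | mu, sigma), a standard normal prior, and a Bayesian logistic regression likelihood on synthetic data; it is built with PyTorch, an implementation choice of this illustration rather than something fixed by the lecture.

# A minimal sketch of stochastic variational inference for Bayesian logistic
# regression: fully factorized Gaussian q(W | mu, sigma), standard normal prior.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, batch_size = 10_000, 5, 64            # dataset size, weight dimension, mini-batch size

# Synthetic data standing in for (X, Y); in practice these come from the problem at hand.
X_full = torch.randn(N, D)
Y_full = (X_full @ torch.randn(D) + 0.1 * torch.randn(N) > 0).float()

# Variational parameters phi = (mu, rho); sigma = softplus(rho) keeps the scale positive.
mu = torch.zeros(D, requires_grad=True)
rho = torch.full((D,), -3.0, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=1e-2)

prior = torch.distributions.Normal(torch.zeros(D), torch.ones(D))

for step in range(1000):
    idx = torch.randint(0, N, (batch_size,))     # mini-batch gives an unbiased estimate of the data term
    x, y = X_full[idx], Y_full[idx]

    sigma = F.softplus(rho)
    q = torch.distributions.Normal(mu, sigma)
    w = q.rsample()                              # reparameterization trick: w = mu + sigma * eps

    logits = x @ w
    # Data term: the N / batch_size factor rescales the mini-batch log-likelihood to the full dataset.
    data_term = -(N / batch_size) * F.binary_cross_entropy_with_logits(logits, y, reduction="sum")
    kl = torch.distributions.kl_divergence(q, prior).sum()
    elbo = data_term - kl                        # ELBO = E_q[log p(Y|X,W)] - KL(q || p)

    opt.zero_grad()
    (-elbo).backward()                           # maximize the ELBO by minimizing its negative
    opt.step()

Note how the N / batch_size factor keeps the mini-batch data term an unbiased estimate of the full-data log-likelihood, and how rsample() implements the reparameterization trick so that gradients flow back into mu and rho.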
Note that if we ignore the second term and optimize just the first term with respect to all possible distributions, we will end up with a delta function at the maximum likelihood point, that is, a delta function at W_ML. The second term, the regularizer, prevents us from collapsing to a delta function: it penalizes large deviations from the prior distribution. If we optimize both terms with respect to all possible distributions, we will end up with the true posterior distribution. But since the true posterior is intractable, we restrict the set of possible variational approximations, and then we end up with the variational distribution that is closest, in terms of KL divergence, to the true posterior distribution.

So, to conclude, this is stochastic variational inference. It is a highly scalable technique that provides us with approximate Bayesian inference. The use of stochastic optimization and the reparameterization trick makes SVI applicable to very large datasets. In the next section, I will tell you about dropout and how it can be interpreted from a Bayesian point of view.

[MUSIC]