[MUSIC] So let's return to our problem of estimating the gradient of the objective with respect to the parameters phi. In the previous video, we discussed that if we use something called the log-derivative trick, we can build a stochastic approximation of this gradient, but the variance of this stochastic approximation will be really high. Therefore, it would be really inefficient to use this approximation to train the model. So let's look at a really nice, simple, and brilliant idea for making this approximation much better.

So let's make a change. First of all, recall that t_i is a sample from the distribution q(t_i | x_i, phi). Let's make a change of variables: instead of sampling t_i directly, we'll sample some new variable epsilon_i from the standard normal distribution, and then we'll build t_i from this epsilon_i by multiplying it element-wise by the standard deviation s_i and adding the mean m_i. This way, the distribution of the expression epsilon_i * s_i + m_i is exactly the same as q(t_i | x_i, phi). So instead of sampling t_i from the distribution q, we can sample epsilon_i and then apply a deterministic function g, which multiplies by s_i and adds m_i, to get a sample from the actual distribution of t_i. So we're doing a change of variables: instead of sampling t_i, we're sampling epsilon_i and then converting it into a sample from t_i.

And now we can change our objective. Instead of computing the expected value with respect to the distribution q, we can compute the expected value with respect to the distribution of epsilon_i, and then use this function of epsilon_i everywhere in place of t_i. And this is an exact expression; we didn't lose anything, we just changed the variables. So instead of considering a distribution over t_i, we're considering a distribution over epsilon_i and then converting these epsilon_i samples into samples of t_i. And this function g, which converts epsilon_i into t_i, depends on x_i and on phi.
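Here is a minimal sketch of this change of variables in PyTorch; the tensors m and s stand for the mean and standard deviation produced by the encoder for one object, and their values below are made up purely for illustration:

```python
import torch

# Reparameterization: instead of sampling t_i ~ N(m_i, s_i^2) directly,
# sample epsilon_i ~ N(0, I) and set t_i = m_i + s_i * epsilon_i.
m = torch.tensor([0.5, -1.0], requires_grad=True)   # mean (illustrative values)
s = torch.tensor([0.3,  2.0], requires_grad=True)   # standard deviation (illustrative values)

eps = torch.randn_like(m)    # epsilon_i ~ N(0, I); no parameters involved here
t = m + s * eps              # deterministic function g(epsilon_i; x_i, phi)

# t has exactly the distribution N(m, s^2), but the randomness now sits in eps,
# so gradients of any downstream loss flow through m and s.
loss = (t ** 2).sum()        # stand-in for the real objective
loss.backward()
print(m.grad, s.grad)        # well-defined gradients despite the sampling
```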
To convert epsilon_i, it passes the image x_i through a convolutional neural network with parameters phi, which outputs s_i and m_i, and then it multiplies epsilon_i by s_i and adds m_i. So it is just a deterministic function. And now we can push the gradient sign inside the expected value, past the probability of epsilon_i, because this distribution doesn't depend on phi, the parameters we are differentiating with respect to. And this means that now we have an expected value of some expression, without ever introducing artificial distributions like in the previous video; we obtained the expected value naturally. And this expected value is with respect to the distribution of epsilon_i, which is just a standard normal without any parameters, so we can approximate it with a sample from the standard normal.

So ultimately we have rewritten the gradient of our objective with respect to phi as a sum over objects of an expected value, with respect to the standard normal, of the gradient of some function, which is just the standard gradient of the whole neural network that defines the whole operation.

And now you can redraw this picture as follows. You have an input image x. You pass it through a convolutional neural network with parameters phi. You compute the variational parameters m and s, then you sample one vector epsilon from the standard normal distribution. And then you use all these three values, m, s, and epsilon, to deterministically compute t_i. And then you put this t_i into the second convolutional neural network. So when you define your model like this, you have only one place with stochastic units: this epsilon_i from the standard normal distribution. And this way, you can differentiate your whole network structure with respect to phi and w without trouble. So you can just use TensorFlow and it will find you the gradients with respect to all the parameters, because you no longer have to differentiate through the sampling; the sampling is kind of an outside procedure, and everything else is just deterministic functions.
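Here is a rough sketch of this picture in PyTorch, assuming small fully connected encoder and decoder networks in place of the convolutional networks from the lecture, with made-up dimensions; the only stochastic node is the sample epsilon:

```python
import torch
import torch.nn as nn

x_dim, t_dim = 784, 16   # assumed toy dimensions

# Encoder (parameters phi) and decoder (parameters w); fully connected here
# only to keep the sketch short.
encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 2 * t_dim))
decoder = nn.Sequential(nn.Linear(t_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

x = torch.rand(32, x_dim)            # a minibatch of (fake) images

h = encoder(x)                       # the first network with parameters phi
m, log_s = h.chunk(2, dim=1)         # variational parameters m and s (log s for positivity)
s = torch.exp(log_s)

eps = torch.randn_like(s)            # the ONLY stochastic unit: epsilon ~ N(0, I)
t = m + s * eps                      # t computed deterministically from m, s, epsilon

x_recon = decoder(t)                 # the second network with parameters w

# Everything from x to the loss is an ordinary differentiable graph, so autograd
# returns gradients with respect to both phi (encoder) and w (decoder).
loss = ((x_recon - x) ** 2).mean()   # stand-in reconstruction term
loss.backward()
```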
And this is basically the implementation of the theory we have just discussed, with this reparameterization. And now we're going to approximate our gradients by sampling just one point, and then using the gradient of the log of this complex function. And this complex function, log p(x_i | t_i, w), is just the full neural network with both the encoder and the decoder.

So to summarize, we have just built a model that allows you to fit a probability distribution, like p(x), to complicated structured data, for example to images. And it uses a model of an infinite mixture of Gaussians, but to define the parameters of these Gaussians, it uses a neural network whose parameters are trained with variational inference. And for learning, we can't use the usual expectation maximization, because the posterior is intractable and we would have to approximate it. And we also can't use variational expectation maximization, because it also requires intractable computations. So we derived a kind of stochastic version of variational inference that is applicable, first of all, to large data sets, because we can use mini-batches. And second of all, it's applicable to this model: you couldn't have used the usual variational inference for this complicated model, because it has neural networks inside and every integral is intractable.

And the model we built is called the variational autoencoder. It's like the plain, usual autoencoder, but it has noise inside and uses regularization to make sure that the noise stays, so that the model chooses the right amount of noise to use. And it can be used, for example, to generate nice images, to handle missing data, or to find anomalies in the data, and so on. [MUSIC]
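As a final sketch, assuming the toy encoder outputs m and log_s and the decoder outputs Bernoulli logits x_recon as in the previous sketch, the per-minibatch objective with a single sample of epsilon per object might look like this; the KL term is the regularizer that controls the amount of noise:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, m, log_s, x_recon):
    """Single-sample Monte Carlo estimate of the negative ELBO for one minibatch.

    x        : inputs in [0, 1],       shape (batch, x_dim)
    m, log_s : encoder outputs,        shape (batch, t_dim)
    x_recon  : decoder output logits,  shape (batch, x_dim)
    """
    # Reconstruction term: one-sample estimate of E_q[-log p(x | t, w)],
    # assuming a Bernoulli likelihood for the pixels.
    recon = F.binary_cross_entropy_with_logits(x_recon, x, reduction='sum')
    # Regularizer: KL(N(m, s^2) || N(0, I)) in closed form; this is what keeps
    # the encoder from simply switching the noise off.
    kl = 0.5 * torch.sum(torch.exp(2 * log_s) + m ** 2 - 1.0 - 2 * log_s)
    return (recon + kl) / x.size(0)
```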