In this section we will review dropout and its connections with the Bayesian framework.

Dropout was invented in 2011 and became a popular regularization technique. We know that it works, and we know that it prevents overfitting. The essence of dropout is simply the injection of noise into the weights or the activations at each iteration of training. The magnitude of this noise is defined by the user and is usually called the dropout rate. The noise can be of different kinds: it can be Bernoulli noise, in which case we talk about binary dropout, or it can be Gaussian noise, in which case we talk about Gaussian dropout.

Let us review Gaussian dropout in detail. At each iteration of training, we generate Gaussian noise epsilon_ij with a mean of 1 and variance alpha, multiply each weight theta_ij by epsilon_ij, and obtain noisified versions of the weights, w_ij. Finally, we compute a stochastic gradient of the log-likelihood given these noisified weights w.

But this is exactly the same stochastic gradient as we would obtain if we optimized, with respect to theta, the expectation of the log-likelihood under a Gaussian distribution over w with mean theta and variance alpha theta squared. This distribution is fully factorized.

To show it, let us first perform a little reparameterization trick: we change the distribution over w to a distribution over epsilon. Epsilon now has a mean of 1 and a variance of alpha, it is still fully factorized, and the log-likelihood is computed at the point theta times epsilon. Now the probability density does not depend on theta, so we may move the differentiation inside the integral. Then we may replace the integral with its Monte Carlo estimate and obtain exactly the same expression as on the previous slide.
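To make this reparameterization concrete, here is a minimal sketch in PyTorch. This is my own illustration, not code from the lecture; the layer shapes, the squared-error loss standing in for the negative log-likelihood, and alpha = 0.5 are all arbitrary assumptions. It draws epsilon from N(1, alpha), forms the noisified weights w = theta * epsilon, and backpropagates a one-sample stochastic gradient to theta through the noise.

```python
# Minimal sketch of Gaussian dropout via the reparameterization trick (illustrative only).
import torch

torch.manual_seed(0)

alpha = 0.5                                      # dropout rate: variance of the multiplicative noise (assumed value)
theta = torch.randn(10, 3, requires_grad=True)   # weights theta_ij to be learned
x = torch.randn(32, 10)                          # a toy batch of inputs
y = torch.randn(32, 3)                           # toy regression targets

# Reparameterization: eps ~ N(1, alpha), so w = theta * eps is distributed as N(theta, alpha * theta^2).
eps = 1.0 + alpha ** 0.5 * torch.randn_like(theta)
w = theta * eps                                  # noisified weights w_ij used at this iteration

# One-sample Monte Carlo estimate of the expected loss (squared error here, as a stand-in
# for the negative log-likelihood); its gradient w.r.t. theta is the Gaussian-dropout gradient.
loss = ((x @ w - y) ** 2).mean()
loss.backward()
print(theta.grad.shape)                          # gradient reached theta through w = theta * eps
```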
So now we know that Gaussian dropout optimizes the following objective: the expectation of the log-likelihood with respect to the distribution over w, where the distribution is a fully factorized Gaussian with mean theta_ij and variance alpha theta_ij squared.

This looks very much like the first term of the ELBO, where as the variational approximation we use a fully factorized Gaussian distribution. But where is the second term? Where is the KL divergence? Remember that the ELBO consists of two terms: the data term and the negative KL divergence, which is our regularizer. In Gaussian dropout, we have shown that we optimize just the first term with respect to theta.

So if we manage to find a prior distribution p(W) such that the second term depends only on alpha and not on theta, then we will have proven that these two procedures are exactly equivalent: remember that in Gaussian dropout alpha is assumed to be fixed, and if alpha is fixed, then optimization of the ELBO with such a prior is equivalent to optimization of just the first term with respect to theta.

Surprisingly, such a prior distribution exists, and it is known from information theory. It is the so-called improper log-uniform prior. It is again fully factorized, and each of its factors is proportional to 1 over the absolute value of w_ij. This is an improper distribution, so it cannot be normalized. Nevertheless, it has several quite nice properties. For example, if we consider the logarithm of the absolute value of w_ij, it is easy to show that it is uniformly distributed from minus to plus infinity, which again gives an improper probability distribution. For us it is important that this prior, roughly speaking, penalizes the precision with which we are trying to find w_ij.

One can show that the KL divergence between our Gaussian variational approximation and this prior depends only on alpha and does not depend on theta. The KL divergence is still an intractable function, but now it is a function of just the one-dimensional parameter alpha, and it can be easily approximated by a smooth, differentiable function.
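Written out as formulas (my transcription of what is said above, not slides from the lecture), the prior, the variational approximation, and one published polynomial approximation of the KL term, due to Kingma, Salimans and Welling (2015), look as follows; the figure discussed next may use this or a similar fit.

```latex
% Log-uniform prior, Gaussian variational approximation, and one known smooth
% approximation of the (negative) KL term; the constants come from
% Kingma, Salimans and Welling (2015) and are not stated explicitly in the lecture.
\begin{align*}
  p(w_{ij}) &\propto \frac{1}{|w_{ij}|},
  \qquad
  q(w_{ij}\mid\theta_{ij},\alpha) = \mathcal{N}\!\bigl(w_{ij}\mid\theta_{ij},\,\alpha\,\theta_{ij}^{2}\bigr),\\[4pt]
  -\mathrm{KL}\bigl[q(w_{ij})\,\|\,p(w_{ij})\bigr]
  &\approx \tfrac{1}{2}\log\alpha + c_{1}\alpha + c_{2}\alpha^{2} + c_{3}\alpha^{3} + \mathrm{const},\\
  c_{1} &\approx 1.1615,\qquad c_{2}\approx -1.5020,\qquad c_{3}\approx 0.5863,
\end{align*}
% so the KL term depends only on alpha and not on theta, as claimed above.
```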
In the figure you see black dots: these are the exact values of the KL divergence for different values of alpha. The red curve is our smooth, differentiable approximation. The existence of this approximation means that potentially we may optimize the KL divergence with respect to alpha, and hence optimize the ELBO with respect to both theta and alpha. This is what we are going to do in the next lecture.

So, to conclude: dropout is a popular regularization technique, and its essence is simply the injection of noise into the weights or activations at each iteration of training. In this lecture we have shown that one popular kind of dropout, so-called Gaussian dropout, is exactly equivalent to a special kind of variational Bayesian inference procedure. This understanding, that dropout is a particular case of Bayesian inference, allows us to construct various generalizations of dropout that may possess quite interesting properties. We will review one of them in the next lecture.