Okay. So, we decided to model our distribution p of x by using a continuous mixture of Gaussians. So, let's develop this idea. To define this model fully, we have to define the prior and the likelihood. And let's define the prior to be just standard normal, because, why not. It will just force the latent variables t to be around zero and with some unit variance. And for the likelihood, we decided that we will use Gaussians, right? With parameters that depend on t somehow.

So, how can we define these parameters, this parametric way to convert t into the parameters of the Gaussian? Well, if we use a linear function for mu of t, with some parameters w and b, and a constant for sigma of t (this Sigma zero can be a parameter or maybe just the identity matrix, it doesn't matter that much), we get the usual PPCA model. And this probabilistic PCA model is really nice, but it's not powerful enough for our kind of data, natural images. So, let's think about what we can change to make this model more powerful. If a linear function is not powerful enough for our purposes, let's use a convolutional neural network, because it works nicely for image data. Right? So, let's say that mu of t is some convolutional neural network applied to the latent code t. It takes the latent t as input and outputs an image, or rather a mean vector for an image. And Sigma of t is also a convolutional neural network, which takes the latent code as input and outputs a covariance matrix Sigma. This defines our model in some kind of parametric form.

So we have the model like this. And let's emphasize that we have some weights of the neural network, w. Let's put them in all parts of our model definition, so we do not forget about them; we are going to train the model with respect to these weights. So p of x given the weights of the neural network w is a mixture of Gaussians, where the parameters of the Gaussians depend on the latent variable t through a convolutional neural network.
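To make the linear special case concrete, here is a minimal NumPy sketch of sampling from the PPCA model just described: the prior p(t) = N(0, I) and the likelihood p(x | t) = N(Wt + b, sigma0^2 I). All sizes and parameter values below are made up for illustration; the convolutional variant simply replaces the linear map and the constant noise scale with network outputs mu_w(t) and Sigma_w(t).

```python
import numpy as np

# A minimal sketch of the linear (PPCA) special case of the model:
# prior p(t) = N(0, I), likelihood p(x | t) = N(W t + b, sigma0^2 I).
# Dimensions and parameter values are hypothetical.
latent_dim, data_dim = 50, 10_000                  # e.g. a flattened 100x100 image
W = 0.01 * np.random.randn(data_dim, latent_dim)   # hypothetical linear weights
b = np.zeros(data_dim)                             # hypothetical bias
sigma0 = 0.1                                       # constant noise scale

t = np.random.randn(latent_dim)                     # sample the latent from the standard normal prior
x = W @ t + b + sigma0 * np.random.randn(data_dim)  # sample an observation from the Gaussian likelihood
```

In the convolutional version, the line computing W @ t + b would be replaced by a CNN mapping t to a mean image, and sigma0 by a second network output.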
One problem here is that if, for example, your images are 100 by 100, then you have 10,000 pixels in each image, and that's pretty low resolution, not high-end in any way. But even in this case, your covariance matrix will be 10,000 by 10,000, and that's a lot. So we want to avoid that, and it's not so reasonable to ask our neural network to output a 10,000 by 10,000 matrix.

To get rid of this problem, let's just say that our covariance matrix will be diagonal. Instead of outputting the whole large matrix Sigma, we'll ask our neural network to produce just the values on the diagonal of this covariance matrix. So we will have 10,000 sigmas here, for example, and we will put these numbers on the diagonal of the covariance matrix to define the actual normal distribution, conditioned on the latent variable t. Now our conditional distributions are factorized: they are Gaussians with zero off-diagonal elements in the covariance matrix. But that's okay, because a mixture of factorized Gaussians is not itself a factorized distribution, so we don't have much of a problem here.

We have our model fully defined; now we have to train it somehow. The natural way to do it is to use maximum likelihood estimation, so to maximize the density of our data set given the parameters, the parameters of the convolutional neural network. This can be rewritten as an integral where we marginalize out the latent variable t. Since we have a latent variable, let's use the expectation maximization algorithm; it was specifically invented for this kind of model.

And in the expectation maximization algorithm, if you recall from week two, we're building a lower bound on the logarithm of this marginal likelihood, p of x given w, and we are lower-bounding this value by something which depends on w and some new variational parameters q. And then we maximize this lower bound with respect to both w and q, to get this lower bound as high as possible, so as close to the actual marginal log-likelihood as possible. And the problem here is that on the E-step of the expectation maximization algorithm we have to find the posterior distribution of the latent variables, and this is intractable in this case, because you have to compute some integrals, and these integrals contain convolutional neural networks inside them. This is just too hard to do analytically. So EM is actually not the way to go here. So what else can we do?
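As an illustration of the diagonal-covariance decoder just described, here is a small PyTorch sketch. The architecture, layer sizes, and the class name DiagonalGaussianDecoder are hypothetical choices, not taken from the lecture; the point is only that a CNN maps a 50-dimensional latent code to 10,000 per-pixel means and 10,000 per-pixel standard deviations for a 100 by 100 image, instead of a full 10,000 by 10,000 covariance matrix.

```python
import torch
import torch.nn as nn

class DiagonalGaussianDecoder(nn.Module):
    """p(x | t, w): a CNN maps a 50-dim latent t to a per-pixel mean and a
    per-pixel standard deviation for a 100x100 image (diagonal covariance)."""
    def __init__(self, latent_dim=50):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 25 * 25)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # 25x25 -> 50x50
            nn.ReLU(),
            nn.ConvTranspose2d(32, 2, kernel_size=4, stride=2, padding=1),   # 50x50 -> 100x100
        )

    def forward(self, t):
        h = self.fc(t).view(-1, 64, 25, 25)
        out = self.deconv(h)                     # (batch, 2, 100, 100): mean and log-std channels
        mu = out[:, 0].flatten(1)                # 10,000 per-pixel means
        sigma = torch.exp(out[:, 1]).flatten(1)  # 10,000 positive std devs: the diagonal of Sigma
        return mu, sigma

decoder = DiagonalGaussianDecoder()
t = torch.randn(8, 50)        # a batch of latent codes drawn from the standard normal prior
mu, sigma = decoder(t)        # each of shape (8, 10000); no 10,000 x 10,000 matrix is ever formed
```

Exponentiating the second output channel is just one common way to keep the standard deviations positive.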
Well, in the previous week we discussed Markov chain Monte Carlo, and we can use this MCMC to approximate the M-step of expectation maximization. Right. Well, this way, on the M-step, instead of using the expected value with respect to q, which is the posterior distribution of the latent variables from the previous iteration, we will approximate this expected value with samples, with an average, and then we'll maximize this approximation instead of the exact expected value. It's an option; we can do that. But it's going to be kind of slow, because this way, on each iteration of expectation maximization, you have to run, like, hundreds of iterations of a Markov chain, wait until it has converged, and then start to collect samples. So this way you will have a kind of nested loop: the outer iterations of expectation maximization and the inner iterations of Markov chain Monte Carlo, and this will probably not be very fast.

So let's see what else we can do. Well, we can try variational inference. The idea of variational inference is to maximize the same lower bound, but to restrict the distribution q to be factorized. So, for example, if the latent variable for each data object is 50-dimensional, then this q_i of t_i will be just a product of 50 one-dimensional distributions. So it's a nice way to go, it's a nice approach: it approximates your expectation maximization, but it usually works, and pretty fast. But it turns out that in this case even this is intractable. So this approximation is not enough to get an efficient method for training our latent variable model, and we have to approximate even further. We have to derive an even less accurate approximation to be able to build an efficient method for training this kind of model.
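To show what the factorization restriction on q means in practice, here is a small NumPy/SciPy sketch with made-up parameter values. A factorized q_i(t_i) over a 50-dimensional latent is just a product of 50 one-dimensional Gaussians, so it is described by one mean and one standard deviation per dimension, and its log-density is a sum of one-dimensional log-densities.

```python
import numpy as np
from scipy.stats import norm

latent_dim = 50
m = np.random.randn(latent_dim)                 # per-dimension variational means (hypothetical values)
s = np.exp(0.1 * np.random.randn(latent_dim))   # per-dimension std devs, kept positive

def log_q(t, m, s):
    # The log of a factorized q is a sum of 1-D Gaussian log-densities, one per latent dimension.
    return norm.logpdf(t, loc=m, scale=s).sum()

t = np.random.randn(latent_dim)   # a point at which to evaluate q
print(log_q(t, m, s))             # only 2 * 50 variational parameters instead of a full 50x50 covariance
```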