Let's start by discussing the problem of fitting a distribution p(x) to a data set of points. Why? Well, we have already discussed this problem in week one, when we discussed how to fit a Gaussian to a data set of points; we discussed it in week two, when we covered the clustering problem and how to solve it by fitting a Gaussian mixture model to our data; and we also discussed probabilistic PCA, which is a kind of infinite mixture of Gaussians. But now we want to return to this question, because it turns out that the methods we covered, like the Gaussian, the Gaussian mixture model, and probabilistic PCA, are not enough to capture complicated objects like natural images.

So, you may want to fit a probability distribution to your data set of natural images, for example, to generate new data. If you try to do that with a Gaussian mixture model, it will work, but not as well as the more sophisticated models we will discuss this week. In this example, for instance, some fake celebrity faces were generated with a generative model, and you can do these kinds of things if you have a probability distribution of your training data, because you can sample new images from it. Also, if you have such a model p(x), you can build a kind of "Photoshop of the future" application, like here: with a few brush strokes you change a few pixels in your image, and the program recolors everything else so that the picture stays realistic; it will change the color of the hair, and so on.

One more reason to fit a distribution p(x) to complicated structured data like images is to detect anomalies. For example, you run a bank and you have a sequence of transactions. If you fit your probabilistic model to this sequence of transactions, then for a new transaction you can estimate how probable it is according to the model trained on your current data set, and if this particular transaction is not very probable, you may flag it as suspicious and ask a human to check it.
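As a rough sketch of that thresholding idea (not code from the lecture; the features and the threshold are made up for illustration), you could fit a single Gaussian to historical transactions and flag new ones whose log-density is unusually low:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Historical transactions as feature vectors (amount, hour of day);
# random placeholder data standing in for a real data set.
rng = np.random.default_rng(0)
train = rng.normal(loc=[50.0, 14.0], scale=[20.0, 3.0], size=(10_000, 2))

# Fit a Gaussian p(x) by maximum likelihood: sample mean and covariance.
mu = train.mean(axis=0)
cov = np.cov(train, rowvar=False)
model = multivariate_normal(mean=mu, cov=cov)

# Flag transactions whose log-density falls below the 1st percentile of the
# training log-densities (the threshold choice here is arbitrary).
threshold = np.percentile(model.logpdf(train), 1)

def is_suspicious(x):
    return model.logpdf(x) < threshold

print(is_suspicious([55.0, 15.0]))   # typical transaction -> False
print(is_suspicious([5000.0, 3.0]))  # unusual amount and time -> True
```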
Also, if you have security camera footage, you can train the model on footage from a normal day, and then, if something suspicious happens, you can detect it by noticing that some images from your cameras have a low probability p(x) according to your model. So you can detect anomalies and suspicious behavior.

One more reason is that you may want to handle missing data. For example, you have some images with obscured parts and you want to make predictions. In this case, having p(x), the probability distribution of your data, helps you greatly to deal with it. And finally, sometimes people try to represent highly structured data with low-dimensional embeddings. This is not an inherent property of modeling data with p(x), but, as we will see, in the models we will cover it comes naturally: the model assigns a latent code to any object it sees, and then we can use this latent code to explore the space of our objects quite conveniently. For example, people sometimes build these kinds of latent codes for molecules and then try to discover new drugs by exploring the space of molecules in this latent space.

Okay, so let's say we're convinced: we want to model p(x) for natural images or some other type of structured data. How can we do it? Probably the most natural approach is to use a convolutional neural network, because that is something that works really well for images. Let's say our convolutional neural network looks at the image and returns the probability of this image; it's the simplest possible parametric model of something that returns a probability for any image. And, to make things more stable, let's say the CNN actually returns the logarithm of the probability. The problem with this approach is that you have to normalize your distribution: you have to make it sum to one over all possible images in the world, and there are billions of them. So this normalization constant is very expensive to compute, and you have to compute it to do training or inference properly.
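In symbols (generic notation, since the slide's exact symbols aren't shown here): if the CNN computes a score \( f_\theta(x) \) interpreted as an unnormalized log-probability, then

\[ p_\theta(x) = \frac{\exp(f_\theta(x))}{Z(\theta)}, \qquad Z(\theta) = \sum_{\text{all possible images } x'} \exp(f_\theta(x')), \]

and it is the sum over all possible images inside \( Z(\theta) \) that is prohibitively expensive.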
So this approach is infeasible; you can't do it because of the normalization. What else can you do? Well, you can use the chain rule. If you recall from week one, any probability distribution can be decomposed into a product of conditional distributions, and we can apply this to natural images, for example, like this. We have an image; in this case it's a three-by-three-pixel image, but of course in a practical situation you would use 100 by 100, or an even higher-resolution image. You can enumerate the pixels of this image somehow, for example row by row, and then say that the distribution of the whole image is the same as the joint distribution of its pixels. This joint distribution decomposes into a product of conditional distributions by the chain rule: the distribution of the whole image equals the marginal probability of the first pixel, times the probability of the second pixel given the first one, and so on.

Now you can try to build models of these conditional probabilities to model the overall joint probability. If your model for the conditional probabilities is flexible enough, you lose nothing, because any probability distribution can be represented this way. A natural idea for representing these conditional probabilities is a recurrent neural network, which reads your image pixel by pixel and outputs a prediction for the next pixel, for example a prediction of its brightness. This approach makes modeling much easier, because now the normalization constant only has to deal with a one-dimensional distribution. If, for example, your image is grayscale, then each pixel can be encoded with a number from 0 to 255, the brightness level, and the normalization constant can be computed by just summing over these 256 values, so it's easy. It's a really nice approach, check it out if you have time, but a downside is that you have to generate new images one pixel at a time.
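Written out for an image with pixels \( x_1, \dots, x_n \) enumerated in some fixed order (row by row, say), the chain-rule factorization the RNN is modeling is

\[ p(x_1, \dots, x_n) = p(x_1) \prod_{k=2}^{n} p(x_k \mid x_1, \dots, x_{k-1}), \]

where each conditional is just a distribution over the 256 possible brightness values of a single pixel, so it is easy to normalize.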
So, if you want to generate a new image, you first have to generate x1 from its marginal distribution, then feed this just-generated x1 into the RNN, which outputs the distribution of the next pixel, and so on. So, no matter how many computers you have, one high-resolution image can take minutes to generate, which is really long. So we may want to look at something else.

One more thing you can do is to say that your distribution over pixels is factorized, so each pixel is independent of the others. In this case you can easily fit this kind of distribution to your data, but it turns out to be too restrictive an assumption. Even in the simple example of a data set of handwritten digits, if you have around 10,000 of these small images and you train this kind of factorized model on them, you get samples that look really bad, like these. That's because the assumption that each pixel is independent of the others simply does not hold on real data: if you see one half of an image, you can probably restore the other half quite accurately, which means the halves are not independent. So this assumption is too restrictive.

One more thing you can do is use a Gaussian mixture model. This model is really flexible in theory, it can represent any probability distribution, but in practice, for complicated data like natural images, it can be really inefficient: we would have to use maybe thousands of Gaussian components, and in that case the overall method will fail to capture the structure, because it will be too hard to train.

One more thing we can try is an infinite mixture of Gaussians, like the probabilistic PCA method we covered in week two. Here the idea is that each object, each image x, has a corresponding latent variable t, and the image x is caused by this t, so we can marginalize t out. The conditional distribution of x given t is a Gaussian, so we effectively have a mixture of infinitely many Gaussians: for each value of t there is one Gaussian, and we mix them with weights.
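In formulas (again in generic notation), the model is

\[ p(x) = \int p(x \mid t)\, p(t)\, dt, \qquad p(x \mid t) = \mathcal{N}\!\left(x \mid \mu(t), \Sigma(t)\right), \]

so for every value of the latent variable \( t \) there is one Gaussian, and the prior \( p(t) \) plays the role of the mixture weights.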
Note here that even if the Gaussians are factorized, so they have independent components for each dimension, the mixture is not. So this is a slightly more powerful model than the Gaussian mixture model, and we will discuss in the next videos how we can make it even more powerful by using neural networks inside this model.
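A quick way to see that a mixture of factorized Gaussians is not itself factorized (just an illustration, not from the lecture): mix two diagonal-covariance Gaussians with different means and check that the resulting coordinates are correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 2-D Gaussians with diagonal (factorized) covariance, mixed with equal weights.
means = np.array([[-2.0, -2.0], [2.0, 2.0]])
component = rng.integers(0, 2, size=50_000)             # which Gaussian each sample comes from
samples = rng.normal(loc=means[component], scale=1.0)   # independent coordinates within a component

# Within each component the two coordinates are independent,
# but in the mixture they are clearly correlated.
print(np.corrcoef(samples, rowvar=False)[0, 1])  # roughly 0.8
```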