1
00:00:00,500 --> 00:00:01,550
In this and the next video,

2
00:00:02,040 --> 00:00:03,470
I'd like to tell you about one

3
00:00:03,760 --> 00:00:05,880
possible extension to the

4
00:00:06,140 --> 00:00:08,270
anomaly detection algorithm that we've developed so far.

5
00:00:09,020 --> 00:00:11,970
This extension uses something called the multivariate

6
00:00:12,100 --> 00:00:13,480
Gaussian distribution, and it

7
00:00:13,770 --> 00:00:14,970
has some advantages, and some

8
00:00:15,160 --> 00:00:16,790
disadvantages, and it can

9
00:00:17,070 --> 00:00:20,610
sometimes catch some anomalies that the earlier algorithm didn't.

10
00:00:21,740 --> 00:00:23,730
To motivate this, let's start with an example.

11
00:00:25,620 --> 00:00:28,410
Let's say that our unlabeled data looks like what I have plotted here.

12
00:00:29,060 --> 00:00:30,190
And I'm going to use

13
00:00:30,340 --> 00:00:32,320
the example of monitoring machines

14
00:00:32,890 --> 00:00:34,890
in the data center, monitoring computers in the data center.

15
00:00:35,290 --> 00:00:36,170
So my two features are x1

16
00:00:36,220 --> 00:00:37,070
which is the CPU load and x2

17
00:00:37,250 --> 00:00:39,280
which is maybe the memory use.

18
00:00:41,160 --> 00:00:42,160
So if I take

19
00:00:42,340 --> 00:00:43,330
my two features, x1 and x2,

20
00:00:43,580 --> 00:00:45,960
and I model them as Gaussians then

21
00:00:46,200 --> 00:00:47,430
here's a plot of

22
00:00:47,610 --> 00:00:49,040
my X1 features, here's a

23
00:00:49,210 --> 00:00:50,370
plot of my X2 features,

24
00:00:50,980 --> 00:00:51,880
and so if I fit a

25
00:00:51,910 --> 00:00:52,640
Gaussian to that, maybe I'll

26
00:00:52,760 --> 00:00:56,050
get a Gaussian like this, so

27
00:00:56,730 --> 00:00:57,750
here's P of X 1,

28
00:00:57,860 --> 00:01:00,350
which depends

29
00:01:00,690 --> 00:01:02,130
on the parameters mu 1, and

30
00:01:02,440 --> 00:01:04,740
sigma squared 1,

31
00:01:04,880 --> 00:01:06,120
and here's my memory used, and,

32
00:01:06,240 --> 00:01:07,020
you know, maybe I'll get a Gaussian

33
00:01:07,560 --> 00:01:09,910
that looks like this, and this is my P of X 2,

34
00:01:10,760 --> 00:01:12,500
which depends on mu 2 and sigma squared 2.

35
00:01:12,590 --> 00:01:14,660
And so this is

36
00:01:14,870 --> 00:01:16,340
how the anomaly detection algorithm

37
00:01:16,790 --> 00:01:17,850
models X1 and X2.

38
00:01:19,900 --> 00:01:21,160
Now let's say that in the

39
00:01:21,260 --> 00:01:22,330
test set I have an

40
00:01:22,410 --> 00:01:24,010
example that looks like this.

41
00:01:25,540 --> 00:01:26,600
The location of that green

42
00:01:27,310 --> 00:01:29,160
cross, so the value of

43
00:01:29,360 --> 00:01:31,220
X 1 is about 0.4, and the value of X 2 is about 1.5.

44
00:01:31,300 --> 00:01:34,430
Now, if you look at

45
00:01:34,660 --> 00:01:35,780
the data, it looks like,

46
00:01:35,960 --> 00:01:36,780
yeah, most of the data

47
00:01:37,140 --> 00:01:38,800
lies in this region, and

48
00:01:38,940 --> 00:01:40,400
so that green cross

49
00:01:41,110 --> 00:01:43,510
is pretty far away from any of the data I've seen.

50
00:01:43,840 --> 00:01:44,870
It looks like that should be raised

51
00:01:45,210 --> 00:01:46,790
as an anomaly. So, in my

52
00:01:46,970 --> 00:01:48,660
data, in my, in the

53
00:01:48,790 --> 00:01:49,930
data of my good examples,

54
00:01:50,320 --> 00:01:51,430
it looks like, you know, the

55
00:01:51,510 --> 00:01:52,680
CPU load, and the

56
00:01:52,770 --> 00:01:54,330
memory use, they sort

57
00:01:54,680 --> 00:01:56,100
of grow linearly with each other.

58
00:01:56,560 --> 00:01:57,720
So if I have a

59
00:01:57,940 --> 00:01:59,000
machine using lots of CPU,

60
00:01:59,150 --> 00:02:00,460
you know memory use

61
00:02:00,830 --> 00:02:02,930
will also be high, whereas this

62
00:02:03,320 --> 00:02:05,910
example, this green example it looks like

63
00:02:06,040 --> 00:02:07,140
here, the CPU load is

64
00:02:07,280 --> 00:02:08,280
very low, but the memory use

65
00:02:08,490 --> 00:02:09,310
is very high, and I just

66
00:02:09,430 --> 00:02:10,820
have not seen that before in my training set.

67
00:02:10,980 --> 00:02:12,150
It looks like that should be an anomaly.

68
00:02:13,190 --> 00:02:15,300
But let's see what the anomaly detection algorithm will do.

69
00:02:15,570 --> 00:02:16,750
Well, for the CPU load, it

70
00:02:16,850 --> 00:02:17,990
puts it at around there

71
00:02:18,280 --> 00:02:20,700
0.5, and this has reasonably high

72
00:02:20,900 --> 00:02:21,910
probability; it's not that

73
00:02:22,120 --> 00:02:23,350
far from other examples we've seen,

74
00:02:23,650 --> 00:02:25,230
maybe, whereas, for the

75
00:02:26,160 --> 00:02:28,320
memory use, this point, 0.5,

76
00:02:29,030 --> 00:02:29,900
whereas for the memory

77
00:02:30,030 --> 00:02:32,340
use, it's about 1.5, which is there. Again,

78
00:02:32,680 --> 00:02:34,600
you know, it's out toward

79
00:02:34,730 --> 00:02:35,850
the tail of the Gaussian, but

80
00:02:35,980 --> 00:02:37,310
the value here and the value

81
00:02:37,550 --> 00:02:38,830
here is not that different

82
00:02:39,210 --> 00:02:41,180
from many other examples we've

83
00:02:41,430 --> 00:02:43,020
seen, and so P of

84
00:02:43,210 --> 00:02:44,530
X 1, will be pretty high,

85
00:02:45,550 --> 00:02:46,030
reasonably high.

86
00:02:46,290 --> 00:02:47,730
P of X 2 reasonably high.

87
00:02:47,980 --> 00:02:49,030
I mean, if you look at this

88
00:02:49,910 --> 00:02:51,230
plot right, this point here,

89
00:02:51,410 --> 00:02:52,530
it doesn't look that bad, and

90
00:02:52,830 --> 00:02:54,440
if you look at this plot, you

91
00:02:54,720 --> 00:02:56,690
know, this cross here doesn't look that bad.

92
00:02:57,050 --> 00:02:58,780
I mean, I have had examples with

93
00:02:58,980 --> 00:03:00,730
even greater memory use, or

94
00:03:01,030 --> 00:03:02,270
with even less CPU use,

95
00:03:02,860 --> 00:03:04,780
and so this example doesn't look that anomalous.

96
00:03:05,940 --> 00:03:07,380
And so, an anomaly detection algorithm

97
00:03:07,680 --> 00:03:10,090
will fail to flag this point as an anomaly.

98
00:03:10,550 --> 00:03:12,220
And it turns out what

99
00:03:12,360 --> 00:03:13,610
our anomaly detection algorithm is

100
00:03:13,880 --> 00:03:15,070
doing is that it is

101
00:03:15,200 --> 00:03:16,700
not realizing that this blue

102
00:03:16,900 --> 00:03:18,060
ellipse shows the high

103
00:03:18,210 --> 00:03:19,380
probability region. Instead, it

104
00:03:19,490 --> 00:03:21,290
thinks that examples here

105
00:03:21,720 --> 00:03:23,430
have high probability, and the

106
00:03:23,680 --> 00:03:24,980
examples in the next circle

107
00:03:26,170 --> 00:03:27,280
out have a lower probability, and

108
00:03:27,370 --> 00:03:28,950
examples here have even

109
00:03:29,220 --> 00:03:31,040
lower probability, and somehow the

110
00:03:31,150 --> 00:03:32,070
green cross

111
00:03:32,420 --> 00:03:33,430
there has pretty high probability,

112
00:03:34,490 --> 00:03:35,510
and in particular, it tends to think

113
00:03:35,990 --> 00:03:37,740
that, you know, everything in this

114
00:03:38,000 --> 00:03:40,400
region, everything on the

115
00:03:40,580 --> 00:03:43,390
line that I'm circling over, has, you know, about equal probability,

116
00:03:44,160 --> 00:03:45,810
and it doesn't realize that something

117
00:03:46,790 --> 00:03:50,910
out here actually has

118
00:03:51,080 --> 00:03:53,130
much lower probability than something over there.
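A minimal Python sketch of the problem just described, assuming made-up parameters (mu 1 = mu 2 = 1.0 and sigma squared 1 = sigma squared 2 = 0.25 are hypothetical, not values read off the lecture's plot). Because the original model multiplies P of X 1 by P of X 2, a point off the diagonal trend gets the same density as a point on the trend whenever both sit equally far from the means along each axis.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def independent_model(x1, x2, mu1, sigma2_1, mu2, sigma2_2):
    """Original anomaly detection model: p(x) = p(x1) * p(x2)."""
    return gaussian_pdf(x1, mu1, sigma2_1) * gaussian_pdf(x2, mu2, sigma2_2)


# Hypothetical parameters, chosen only for illustration.
MU1, S1, MU2, S2 = 1.0, 0.25, 1.0, 0.25

# On the trend: CPU load and memory use are both above average.
p_on_trend = independent_model(1.5, 1.5, MU1, S1, MU2, S2)

# Off the trend: low CPU load but high memory use, like the green cross.
p_off_trend = independent_model(0.5, 1.5, MU1, S1, MU2, S2)

# The two densities coincide, so no threshold can separate the two points.
print(p_on_trend, p_off_trend)
```

Under these assumptions the model cannot tell the two points apart, which is the failure the multivariate Gaussian is meant to address.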

119
00:03:55,060 --> 00:03:56,080
So, in order to fix

120
00:03:56,270 --> 00:03:57,300
this, we can, we're going to

121
00:03:57,580 --> 00:03:58,930
develop a modified version of

122
00:03:58,990 --> 00:04:01,030
the anomaly detection algorithm, using

123
00:04:01,430 --> 00:04:02,520
something called the multivariate

124
00:04:02,580 --> 00:04:05,880
Gaussian distribution also called the multivariate normal distribution.

125
00:04:07,330 --> 00:04:08,120
So here's what we're going to

126
00:04:08,810 --> 00:04:10,270
do. We have features x

127
00:04:10,470 --> 00:04:11,680
which are in Rn and

128
00:04:11,910 --> 00:04:14,180
instead of P of X 1, P of X 2, separately,

129
00:04:14,570 --> 00:04:15,630
we're going to model P of

130
00:04:15,800 --> 00:04:16,840
X, all in one go,

131
00:04:17,010 --> 00:04:18,970
so model P of X, you know, all at the same time.

132
00:04:20,300 --> 00:04:21,550
So the parameters of the

133
00:04:21,830 --> 00:04:24,140
multivariate Gaussian distribution are mu,

134
00:04:24,630 --> 00:04:25,770
which is a vector, and sigma,

135
00:04:26,490 --> 00:04:28,450
which is an n by n matrix, called a covariance matrix,

136
00:04:29,640 --> 00:04:30,870
and this is similar to the

137
00:04:31,010 --> 00:04:32,220
covariance matrix that we

138
00:04:32,430 --> 00:04:33,560
saw when we were working

139
00:04:34,080 --> 00:04:35,200
with the PCA, with the

140
00:04:35,280 --> 00:04:36,700
principal components analysis algorithm.

141
00:04:37,860 --> 00:04:38,970
For the sake of completeness, let

142
00:04:39,070 --> 00:04:39,880
me just write out the formula

143
00:04:40,930 --> 00:04:42,390
for the multivariate Gaussian distribution.

144
00:04:42,820 --> 00:04:44,030
So we say that probability of

145
00:04:44,140 --> 00:04:45,100
X, and this is parameterized

146
00:04:46,090 --> 00:04:47,500
by my parameters mu and

147
00:04:47,640 --> 00:04:49,280
sigma that the

148
00:04:49,360 --> 00:04:50,100
probability of x is equal

149
00:04:50,430 --> 00:04:52,260
to once again

150
00:04:52,580 --> 00:04:54,810
there's absolutely no need to memorize this formula.

151
00:04:56,030 --> 00:04:56,780
You know, you can look it up

152
00:04:57,010 --> 00:04:58,160
whenever you need to use

153
00:04:58,340 --> 00:04:59,130
it, but this is what

154
00:04:59,690 --> 00:05:01,230
the probability of X looks like.

155
00:05:03,000 --> 00:05:04,680
transpose, sigma inverse, X

156
00:05:05,220 --> 00:05:06,300
minus mu.

157
00:05:07,400 --> 00:05:08,850
And this thing here,

158
00:05:10,390 --> 00:05:11,510
the absolute value of sigma, this

159
00:05:11,680 --> 00:05:13,140
thing here when you write

160
00:05:13,410 --> 00:05:14,430
this symbol, this is called

161
00:05:14,600 --> 00:05:17,220
the determinant of sigma

162
00:05:18,150 --> 00:05:19,620
and this is a mathematical function

163
00:05:20,210 --> 00:05:21,740
of a matrix and you really

164
00:05:21,960 --> 00:05:22,820
don't need to know what the

165
00:05:23,240 --> 00:05:24,250
determinant of a matrix is,

166
00:05:24,780 --> 00:05:25,770
but really all you need to

167
00:05:25,860 --> 00:05:27,180
know is that you can

168
00:05:27,320 --> 00:05:29,380
compute it in Octave by using

169
00:05:29,760 --> 00:05:31,820
the Octave command det of

170
00:05:33,570 --> 00:05:33,570
sigma.

171
00:05:34,010 --> 00:05:36,210
Okay, and again, just be clear, alright?

172
00:05:36,300 --> 00:05:38,240
In this expression, these sigmas

173
00:05:38,730 --> 00:05:41,250
here, these are just n by n matrices.

174
00:05:41,850 --> 00:05:43,150
This is not a summation and

175
00:05:43,260 --> 00:05:45,680
you know, the sigma there is an n by n matrix.
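The formula is written on the slide rather than spoken. In its standard form it is p(x; mu, Sigma) = 1 / ((2 pi)^(n/2) |Sigma|^(1/2)) * exp(-1/2 (x - mu)^T Sigma^(-1) (x - mu)). Below is a minimal Python sketch of it for the two-feature case, using only the standard library. The helper names det2, inv2 and mvn_pdf2 are made up for this illustration, and the closed-form 2 by 2 determinant and inverse stand in for Octave's det command.

```python
import math


def det2(s):
    """Determinant |Sigma| of a 2x2 matrix given as nested lists."""
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]


def inv2(s):
    """Inverse of a 2x2 matrix (assumes the determinant is nonzero)."""
    d = det2(s)
    return [[s[1][1] / d, -s[0][1] / d],
            [-s[1][0] / d, s[0][0] / d]]


def mvn_pdf2(x, mu, sigma):
    """Multivariate Gaussian density for n = 2 features."""
    n = 2
    dx = [x[0] - mu[0], x[1] - mu[1]]
    si = inv2(sigma)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (dx[0] * (si[0][0] * dx[0] + si[0][1] * dx[1])
            + dx[1] * (si[1][0] * dx[0] + si[1][1] * dx[1]))
    # Normalizer (2 pi)^(n/2) * |Sigma|^(1/2).
    norm = (2.0 * math.pi) ** (n / 2.0) * math.sqrt(det2(sigma))
    return math.exp(-0.5 * quad) / norm


# mu = 0 and Sigma = identity, the first example in the lecture.
identity = [[1.0, 0.0], [0.0, 1.0]]
peak = mvn_pdf2([0.0, 0.0], [0.0, 0.0], identity)
away = mvn_pdf2([1.0, 1.0], [0.0, 0.0], identity)
print(peak, away)
```

For general n one would normally rely on a linear algebra library instead of hand-written 2 by 2 helpers; the sketch only aims to make each term of the formula visible.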

176
00:05:46,710 --> 00:05:47,780
So that's the formula for P

177
00:05:48,010 --> 00:05:50,500
of X, but,

178
00:05:50,820 --> 00:05:52,030
more interestingly, or more importantly,

179
00:05:53,940 --> 00:05:55,610
what does P of X actually look like?

180
00:05:56,190 --> 00:05:57,450
Let's look at some examples of

181
00:05:58,020 --> 00:06:00,690
multivariate Gaussian distributions.

182
00:06:02,350 --> 00:06:03,380
So let's take a

183
00:06:03,500 --> 00:06:04,700
two dimensional example, say if

184
00:06:04,820 --> 00:06:06,550
I have N equals 2, I

185
00:06:06,710 --> 00:06:08,160
have two features, X 1 and X 2.

186
00:06:09,250 --> 00:06:10,540
Let's say I set mu to

187
00:06:10,650 --> 00:06:11,800
be equal to 0 and sigma

188
00:06:12,330 --> 00:06:14,030
to be equal to this matrix here.

189
00:06:14,200 --> 00:06:16,710
With 1s on the diagonals and 0s on the off-diagonals,

190
00:06:17,600 --> 00:06:19,980
this matrix is sometimes also called the identity matrix.

191
00:06:21,350 --> 00:06:22,470
In that case, p of

192
00:06:22,590 --> 00:06:24,950
x will look like

193
00:06:25,240 --> 00:06:27,430
this, and what

194
00:06:27,600 --> 00:06:29,380
I'm showing in this figure is, you know,

195
00:06:29,500 --> 00:06:30,900
for a specific value of X1

196
00:06:31,240 --> 00:06:32,860
and for a specific value of

197
00:06:32,970 --> 00:06:34,680
X2, the height of

198
00:06:34,810 --> 00:06:36,470
this surface is the value

199
00:06:36,970 --> 00:06:38,330
of p of x. And

200
00:06:38,470 --> 00:06:39,520
so with this setting of the parameters,

201
00:06:40,610 --> 00:06:42,100
p of x is highest when

202
00:06:42,300 --> 00:06:43,620
X1 and X2 both equal 0,

203
00:06:44,010 --> 00:06:45,710
so that's the peak of this Gaussian distribution,

204
00:06:46,950 --> 00:06:48,760
and the probability falls off with this

205
00:06:48,970 --> 00:06:51,330
sort of two dimensional Gaussian or

206
00:06:51,510 --> 00:06:53,590
this two-dimensional bell-shaped surface.

207
00:06:55,080 --> 00:06:56,400
Down below is the same

208
00:06:56,610 --> 00:06:58,230
thing but plotted using a

209
00:06:58,330 --> 00:07:00,970
contour plot instead, or using different colors,

210
00:07:01,150 --> 00:07:02,020
and so this

211
00:07:02,530 --> 00:07:04,210
heavy intense red in the

212
00:07:04,280 --> 00:07:06,260
middle, corresponds to the highest values,

213
00:07:06,850 --> 00:07:08,230
and then the values decrease

214
00:07:08,790 --> 00:07:10,470
with the yellow being slightly lower

215
00:07:10,700 --> 00:07:11,830
values the cyan being

216
00:07:12,060 --> 00:07:13,230
lower values and this deep

217
00:07:14,000 --> 00:07:15,440
blue being the lowest

218
00:07:15,450 --> 00:07:17,010
values. So this is really the same figure, but

219
00:07:17,240 --> 00:07:19,410
viewed from the top, using colors instead.

220
00:07:21,390 --> 00:07:22,510
And so, with this distribution,

221
00:07:23,830 --> 00:07:25,010
you see that it places most

222
00:07:25,300 --> 00:07:27,440
of the probability near 0,0

223
00:07:27,600 --> 00:07:28,630
and then as you go out

224
00:07:28,710 --> 00:07:32,450
from 0,0 the probability of X1 and X2 goes down.

225
00:07:36,000 --> 00:07:37,220
Now let's try varying some

226
00:07:37,310 --> 00:07:38,630
of the parameters and see

227
00:07:38,770 --> 00:07:40,150
what happens. So let's

228
00:07:40,940 --> 00:07:42,420
take sigma and change it

229
00:07:42,590 --> 00:07:44,720
so let's say sigma shrinks a

230
00:07:44,870 --> 00:07:46,350
little bit. Sigma is a

231
00:07:46,580 --> 00:07:47,710
covariance matrix and so it

232
00:07:47,820 --> 00:07:49,030
measures the variance or the

233
00:07:49,120 --> 00:07:50,640
variability of the features X1 X2.

234
00:07:50,720 --> 00:07:52,080
So if you shrink

235
00:07:52,400 --> 00:07:53,430
sigma then what you get

236
00:07:53,780 --> 00:07:54,290
is that the

237
00:07:54,400 --> 00:07:56,320
width of this bump diminishes

238
00:07:57,760 --> 00:07:59,310
and the height also

239
00:07:59,550 --> 00:08:00,620
increases a bit, because the

240
00:08:01,090 --> 00:08:03,080
area under the surface is equal to 1.

241
00:08:03,130 --> 00:08:04,400
So the integral of the

242
00:08:04,950 --> 00:08:06,230
volume under the surface is

243
00:08:06,580 --> 00:08:08,000
equal to 1, because probability

244
00:08:08,690 --> 00:08:10,080
distribution must integrate to one.

245
00:08:10,800 --> 00:08:11,650
But, if you shrink the variance,

246
00:08:12,660 --> 00:08:14,290
it's kinda like shrinking

247
00:08:14,810 --> 00:08:15,870
sigma squared,

248
00:08:16,740 --> 00:08:20,080
you end up with a narrower distribution, and one that's a little bit taller.

249
00:08:20,860 --> 00:08:22,150
And so you see here also the

250
00:08:22,580 --> 00:08:27,200
concentric ellipses have shrunk a little bit.

251
00:08:27,340 --> 00:08:28,730
Whereas in contrast if you were to increase sigma

252
00:08:29,770 --> 00:08:31,000
to 2, 2 on the

253
00:08:31,110 --> 00:08:32,020
diagonals, so it is now two

254
00:08:32,220 --> 00:08:34,370
times the identity then you end up with a

255
00:08:34,510 --> 00:08:35,880
much wider and much flatter Gaussian.

256
00:08:36,150 --> 00:08:38,190
And so the width of this is much wider.

257
00:08:38,930 --> 00:08:39,800
This is hard to see but this

258
00:08:40,020 --> 00:08:41,090
is still a bell shaped bump,

259
00:08:41,210 --> 00:08:42,540
it's just flattened down a lot,

260
00:08:42,620 --> 00:08:44,470
it has become much wider and

261
00:08:44,590 --> 00:08:45,720
so the variance or the

262
00:08:45,830 --> 00:08:48,690
variability of X1 and X2 just becomes wider.
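A small Python sketch of the narrower-and-taller versus wider-and-flatter effect. For two features, the peak of the density sits at x = mu and has height 1 / (2 pi sqrt(|Sigma|)), which follows from the density formula with n = 2. The scale 0.6 for the shrunken case is an assumption made for illustration; the scale 2 matches the two times the identity example above.

```python
import math


def peak_height(sigma):
    """Height of a 2-D Gaussian at its mean: 1 / (2 pi sqrt(|Sigma|))."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    return 1.0 / (2.0 * math.pi * math.sqrt(det))


def scaled_identity(c):
    """Covariance matrix c * I for two features."""
    return [[c, 0.0], [0.0, c]]


# 0.6 is an assumed value for the shrunken case; 1.0 is the identity; 2.0 is 2 * I.
heights = {c: peak_height(scaled_identity(c)) for c in (0.6, 1.0, 2.0)}

for c, h in heights.items():
    print(f"Sigma = {c} * I -> peak height {h:.4f}")
```

Since the volume under the surface stays at 1, a smaller sigma forces a taller peak and a larger sigma forces a flatter one.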

263
00:08:50,520 --> 00:08:50,980
Here are a few more examples.

264
00:08:51,670 --> 00:08:53,930
Now let's try varying

265
00:08:54,070 --> 00:08:55,490
one of the elements of sigma at a time.

266
00:08:55,840 --> 00:08:58,080
Let's say I set sigma to

267
00:08:58,140 --> 00:09:00,020
0.6 there, and 1 over there.

268
00:09:01,340 --> 00:09:02,380
What this does, is this

269
00:09:02,610 --> 00:09:04,240
reduces the variance of

270
00:09:05,780 --> 00:09:06,960
the first feature, X 1, while

271
00:09:07,770 --> 00:09:08,890
keeping the variance of the

272
00:09:08,960 --> 00:09:11,530
second feature X 2, the same.

273
00:09:12,160 --> 00:09:15,150
And so with this setting of parameters, you can model things like

274
00:09:15,670 --> 00:09:16,910
X 1 has smaller variance, and

275
00:09:17,580 --> 00:09:19,120
X 2 has larger variance.

276
00:09:20,080 --> 00:09:20,800
Whereas if I do this,

277
00:09:21,120 --> 00:09:22,900
if I set this

278
00:09:23,090 --> 00:09:24,390
matrix to 2, 1

279
00:09:24,560 --> 00:09:25,900
then you can also model

280
00:09:26,230 --> 00:09:27,470
examples where you know here

281
00:09:28,850 --> 00:09:30,590
we'll say X1 can take

282
00:09:30,830 --> 00:09:31,930
on a large range of values

283
00:09:32,220 --> 00:09:34,870
whereas X2 takes on a relatively narrower range of values.

284
00:09:35,070 --> 00:09:37,060
And that's reflected in this

285
00:09:37,270 --> 00:09:38,040
figure as well, you know where,

286
00:09:38,750 --> 00:09:40,530
the distribution falls off

287
00:09:40,830 --> 00:09:42,670
more slowly as X 1

288
00:09:42,820 --> 00:09:43,940
moves away from 0,

289
00:09:44,180 --> 00:09:45,380
and falls off very

290
00:09:45,640 --> 00:09:48,080
rapidly as X 2 moves away from 0.

291
00:09:49,190 --> 00:09:50,710
And similarly if

292
00:09:50,800 --> 00:09:52,320
we were to modify

293
00:09:53,010 --> 00:09:54,490
this element of the

294
00:09:54,660 --> 00:09:55,570
matrix instead, then similar to the previous

295
00:09:57,390 --> 00:09:58,860
slide, except that here we're

296
00:09:59,450 --> 00:10:00,900
you know playing around here saying

297
00:10:01,240 --> 00:10:03,010
that X2 can take on

298
00:10:03,170 --> 00:10:04,460
a very small range of values

299
00:10:05,190 --> 00:10:06,840
and so here if this

300
00:10:07,200 --> 00:10:08,740
is 0.6, we notice now X2

301
00:10:09,810 --> 00:10:10,610
tends to take on a much

302
00:10:10,760 --> 00:10:12,930
smaller range of values than the original example,

303
00:10:14,010 --> 00:10:15,310
whereas if we were to

304
00:10:15,680 --> 00:10:17,320
set this entry of sigma to be equal to 2, then

305
00:10:17,410 --> 00:10:20,580
that's like saying X2 you know, has a much larger range of values.

306
00:10:22,780 --> 00:10:23,570
Now, one of the cool

307
00:10:23,880 --> 00:10:24,950
things about the multivariate

308
00:10:25,190 --> 00:10:26,690
Gaussian distribution is that

309
00:10:26,880 --> 00:10:28,050
you can also use it to

310
00:10:28,330 --> 00:10:30,230
model correlations between the data.

311
00:10:30,410 --> 00:10:31,930
That is we can use it to

312
00:10:32,060 --> 00:10:33,510
model the fact that

313
00:10:33,610 --> 00:10:34,940
X1 and X2 tend to be

314
00:10:35,070 --> 00:10:36,760
highly correlated with each other for example.

315
00:10:37,640 --> 00:10:38,880
So specifically if you start

316
00:10:39,540 --> 00:10:40,720
to change the off diagonal

317
00:10:41,340 --> 00:10:42,390
entries of this covariance

318
00:10:42,950 --> 00:10:45,250
matrix you can get a different type of Gaussian distribution.

319
00:10:46,610 --> 00:10:48,250
And so as I

320
00:10:48,340 --> 00:10:49,590
increase the off-diagonal entries

321
00:10:50,090 --> 00:10:51,300
from .5 to .8, what

322
00:10:51,580 --> 00:10:53,080
I get is this distribution that

323
00:10:53,380 --> 00:10:54,590
is more and more thinly peaked

324
00:10:55,100 --> 00:10:57,480
along this sort of x equals y line.

325
00:10:57,700 --> 00:10:59,100
And so here the

326
00:10:59,160 --> 00:11:00,610
contour says that x and

327
00:11:00,730 --> 00:11:03,010
y tend to grow together and

328
00:11:03,290 --> 00:11:04,500
the things that are with

329
00:11:04,640 --> 00:11:06,550
large probability are if

330
00:11:06,790 --> 00:11:08,140
either X1 is large and

331
00:11:08,260 --> 00:11:09,560
X2 is large or X1

332
00:11:09,890 --> 00:11:11,160
is small and X2 is small.

333
00:11:11,490 --> 00:11:12,480
Or somewhere in between.

334
00:11:13,110 --> 00:11:14,700
And as this entry,

335
00:11:15,130 --> 00:11:16,280
0.8 gets large, you get

336
00:11:16,490 --> 00:11:18,410
a Gaussian distribution, that's sort of

337
00:11:18,660 --> 00:11:20,570
where all the probability lies on

338
00:11:20,770 --> 00:11:22,870
this sort of narrow region,

339
00:11:24,350 --> 00:11:26,200
where x is approximately equal to

340
00:11:26,420 --> 00:11:27,530
y. This is a very

341
00:11:28,020 --> 00:11:30,290
tall, thin distribution you know

342
00:11:30,670 --> 00:11:32,570
lying mostly along this line

343
00:11:33,860 --> 00:11:34,940
central region where x is

344
00:11:35,010 --> 00:11:36,860
close to y. So this

345
00:11:37,130 --> 00:11:38,350
is if we set these

346
00:11:38,810 --> 00:11:40,360
entries to be positive entries.

347
00:11:40,970 --> 00:11:42,120
In contrast if we set

348
00:11:42,460 --> 00:11:43,530
these to negative values, as

349
00:11:44,350 --> 00:11:46,340
I decrease them from -.5

350
00:11:46,380 --> 00:11:47,920
down to -.8, then

351
00:11:48,060 --> 00:11:49,360
what we get is a model where

352
00:11:49,870 --> 00:11:50,930
we put most of the probability

353
00:11:51,620 --> 00:11:53,930
in this sort of negative X1

354
00:11:54,010 --> 00:11:55,420
and X2 correlation region,

355
00:11:55,710 --> 00:11:57,330
and so, most of the

356
00:11:57,480 --> 00:11:58,420
probability now lies in this region,

357
00:11:58,810 --> 00:11:59,910
where X 1 is about equal

358
00:12:00,190 --> 00:12:01,700
to -X 2, rather than X

359
00:12:01,890 --> 00:12:03,370
1 equals X 2.

360
00:12:04,180 --> 00:12:05,460
And so this captures a sort

361
00:12:05,610 --> 00:12:08,050
of negative correlation between x1

362
00:12:10,300 --> 00:12:10,650
and x2.
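A minimal Python sketch of how the off-diagonal entries encode correlation, using the values 0.5, 0.8 and -0.8 mentioned above with 1s on the diagonal and mu = 0. The test points (1, 1) and (1, -1) and the helper names are made up for illustration. With a positive off-diagonal entry, the point where X1 and X2 are both large gets more density than the point where they disagree; with a negative entry it is the other way around.

```python
import math


def mvn_pdf2(x, mu, sigma):
    """Two-feature multivariate Gaussian density (2x2 covariance as nested lists)."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse.
    quad = (sigma[1][1] * dx0 * dx0
            - (sigma[0][1] + sigma[1][0]) * dx0 * dx1
            + sigma[0][0] * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def density(point, r):
    """Density at point for mu = 0 and Sigma = [[1, r], [r, 1]]."""
    return mvn_pdf2(point, [0.0, 0.0], [[1.0, r], [r, 1.0]])


ALONG = [1.0, 1.0]    # X1 and X2 both large
ACROSS = [1.0, -1.0]  # X1 large, X2 small

for r in (0.0, 0.5, 0.8, -0.8):
    print(f"off-diagonal {r:+.1f}: p(along) = {density(ALONG, r):.4f}, "
          f"p(across) = {density(ACROSS, r):.4f}")
```

This is the behavior the data center example needs: a machine with low CPU load but high memory use lands in the low-density direction once the positive correlation is modeled.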

363
00:12:11,010 --> 00:12:12,550
And so this is

364
00:12:12,750 --> 00:12:13,640
hopefully this gives you a sense of the

365
00:12:13,750 --> 00:12:15,230
different distributions that the

366
00:12:15,650 --> 00:12:17,400
multivariate Gaussian distribution can capture.

367
00:12:18,680 --> 00:12:20,430
So besides varying the

368
00:12:20,730 --> 00:12:22,200
covariance matrix sigma, the other

369
00:12:22,910 --> 00:12:23,880
thing you can do is

370
00:12:24,030 --> 00:12:26,090
also, vary the mean

371
00:12:26,300 --> 00:12:27,730
parameter mu, and so

372
00:12:28,370 --> 00:12:29,740
originally, we had mu

373
00:12:30,270 --> 00:12:31,190
equal to 0, 0, and so the

374
00:12:31,250 --> 00:12:32,820
distribution was centered around

375
00:12:33,270 --> 00:12:34,650
X 1 equals 0, X2 equals 0,

376
00:12:35,050 --> 00:12:35,980
so the peak of the

377
00:12:36,070 --> 00:12:38,530
distribution is here, whereas,

378
00:12:38,950 --> 00:12:40,430
if we vary the values of

379
00:12:40,610 --> 00:12:42,120
mu, then that varies the

380
00:12:42,360 --> 00:12:43,700
peak of the distribution and so,

381
00:12:43,910 --> 00:12:45,770
if mu equals 0, 0.5,

382
00:12:45,920 --> 00:12:47,100
the peak is at, you know,

383
00:12:47,270 --> 00:12:49,470
X1 equals zero, and X2

384
00:12:49,810 --> 00:12:51,430
equals 0.5, and so the

385
00:12:51,980 --> 00:12:53,400
peak or the center of

386
00:12:53,710 --> 00:12:55,260
this distribution has shifted,

387
00:12:56,470 --> 00:12:57,770
and if mu was 1.5

388
00:12:58,340 --> 00:13:00,050
minus 0.5 then OK,

389
00:13:01,170 --> 00:13:03,350
and similarly the peak

390
00:13:03,890 --> 00:13:05,490
of the distribution has now

391
00:13:05,620 --> 00:13:06,750
shifted to a different location,

392
00:13:07,670 --> 00:13:09,710
corresponding to where, you know,

393
00:13:09,910 --> 00:13:11,020
X1 is 1.5 and X2

394
00:13:11,350 --> 00:13:12,710
is -0.5, and so

395
00:13:13,290 --> 00:13:15,180
varying the mu parameter, just shifts

396
00:13:15,730 --> 00:13:17,840
around the center of this whole distribution.

397
00:13:18,450 --> 00:13:19,670
So, hopefully, looking at

398
00:13:19,780 --> 00:13:21,270
all these different pictures gives you

399
00:13:21,410 --> 00:13:22,440
a sense of the sort

400
00:13:22,700 --> 00:13:24,850
of probability distributions that

401
00:13:25,070 --> 00:13:28,000
the Multivariate Gaussian Distribution allows you to capture.

402
00:13:28,800 --> 00:13:29,800
And the key advantage of it

403
00:13:29,990 --> 00:13:30,930
is it allows you to

404
00:13:31,130 --> 00:13:32,240
capture, when you'd expect

405
00:13:32,750 --> 00:13:33,840
two different features to be

406
00:13:33,970 --> 00:13:36,560
positively correlated, or maybe negatively correlated.

407
00:13:37,790 --> 00:13:39,030
In the next video, we'll take

408
00:13:39,260 --> 00:13:40,760
this multivariate Gaussian distribution

409
00:13:41,670 --> 00:13:43,290
and apply it to anomaly detection.