1 00:00:00,500 --> 00:00:01,550 In this and the next video, 2 00:00:02,040 --> 00:00:03,470 I'd like to tell you about one 3 00:00:03,760 --> 00:00:05,880 possible extension to the 4 00:00:06,140 --> 00:00:08,270 anomaly detection algorithm that we've developed so far. 5 00:00:09,020 --> 00:00:11,970 This extension uses something called the multivariate 6 00:00:12,100 --> 00:00:13,480 Gaussian distribution, and it 7 00:00:13,770 --> 00:00:14,970 has some advantages, and some 8 00:00:15,160 --> 00:00:16,790 disadvantages, and it can 9 00:00:17,070 --> 00:00:20,610 sometimes catch some anomalies that the earlier algorithm didn't. 10 00:00:21,740 --> 00:00:23,730 To motivate this, let's start with an example. 11 00:00:25,620 --> 00:00:28,410 Let's say that so our unlabeled data looks like what I have plotted here. 12 00:00:29,060 --> 00:00:30,190 And I'm going to use 13 00:00:30,340 --> 00:00:32,320 the example of monitoring machines 14 00:00:32,890 --> 00:00:34,890 in the data center, monitoring computers in the data center. 15 00:00:35,290 --> 00:00:36,170 So my two features are x1 16 00:00:36,220 --> 00:00:37,070 which is the CPU load and x2 17 00:00:37,250 --> 00:00:39,280 which is maybe the memory use. 18 00:00:41,160 --> 00:00:42,160 So if I take 19 00:00:42,340 --> 00:00:43,330 my two features, x1 and x2, 20 00:00:43,580 --> 00:00:45,960 and I model them as Gaussians then 21 00:00:46,200 --> 00:00:47,430 here's a plot of 22 00:00:47,610 --> 00:00:49,040 my X1 features, here's a 23 00:00:49,210 --> 00:00:50,370 plot of my X2 features, 24 00:00:50,980 --> 00:00:51,880 and so if I fit a 25 00:00:51,910 --> 00:00:52,640 Gaussian to that, maybe I'll 26 00:00:52,760 --> 00:00:56,050 get a Gaussian like this, so 27 00:00:56,730 --> 00:00:57,750 here's P of X 1, 28 00:00:57,860 --> 00:01:00,350 which depends 29 00:01:00,690 --> 00:01:02,130 on the parameters mu 1, and 30 00:01:02,440 --> 00:01:04,740 sigma squared 1, 31 00:01:04,880 --> 00:01:06,120 and here's my memory used, and, 32 00:01:06,240 --> 00:01:07,020 you know, maybe I'll get a Gaussian 33 00:01:07,560 --> 00:01:09,910 that looks like this, and this is my P of X 2, 34 00:01:10,760 --> 00:01:12,500 which depends on mu 2 and sigma squared 2. 35 00:01:12,590 --> 00:01:14,660 And so this is 36 00:01:14,870 --> 00:01:16,340 how the anomaly detection algorithm 37 00:01:16,790 --> 00:01:17,850 models X1 and X2. 38 00:01:19,900 --> 00:01:21,160 Now let's say that in the 39 00:01:21,260 --> 00:01:22,330 test sets I have an 40 00:01:22,410 --> 00:01:24,010 example that looks like this. 41 00:01:25,540 --> 00:01:26,600 The location of that green 42 00:01:27,310 --> 00:01:29,160 cross, so the value of 43 00:01:29,360 --> 00:01:31,220 X 1 is about 0.4, and the value of X 2 is about 1.5. 44 00:01:31,300 --> 00:01:34,430 Now, if you look at 45 00:01:34,660 --> 00:01:35,780 the data, it looks like, 46 00:01:35,960 --> 00:01:36,780 yeah, most of the data data 47 00:01:37,140 --> 00:01:38,800 lies in this region, and 48 00:01:38,940 --> 00:01:40,400 so that green cross 49 00:01:41,110 --> 00:01:43,510 is pretty far away from any of the data I've seen. 50 00:01:43,840 --> 00:01:44,870 It looks like that should be raised 51 00:01:45,210 --> 00:01:46,790 as an anomaly. So, in my 52 00:01:46,970 --> 00:01:48,660 data, in my, in the 53 00:01:48,790 --> 00:01:49,930 data of my good examples, 54 00:01:50,320 --> 00:01:51,430 it looks like, you know, the 55 00:01:51,510 --> 00:01:52,680 CPU load, and the 56 00:01:52,770 --> 00:01:54,330 memory use, they sort 57 00:01:54,680 --> 00:01:56,100 of grow linearly with each other. 58 00:01:56,560 --> 00:01:57,720 So if I have a 59 00:01:57,940 --> 00:01:59,000 machine using lots of CPU, 60 00:01:59,150 --> 00:02:00,460 you know memory use 61 00:02:00,830 --> 00:02:02,930 will also be high, whereas this 62 00:02:03,320 --> 00:02:05,910 example, this green example it looks like 63 00:02:06,040 --> 00:02:07,140 here, the CPU load is 64 00:02:07,280 --> 00:02:08,280 very low, but the memory use 65 00:02:08,490 --> 00:02:09,310 is very high, and I just 66 00:02:09,430 --> 00:02:10,820 have not seen that before in my training set. 67 00:02:10,980 --> 00:02:12,150 It looks like that should be an anomaly. 68 00:02:13,190 --> 00:02:15,300 But let's see what the anomaly detection algorithm will do. 69 00:02:15,570 --> 00:02:16,750 Well, for the CPU load, it 70 00:02:16,850 --> 00:02:17,990 puts it at around there 71 00:02:18,280 --> 00:02:20,700 0.5 and this reasonably high 72 00:02:20,900 --> 00:02:21,910 probability is not that 73 00:02:22,120 --> 00:02:23,350 far from other examples we've seen, 74 00:02:23,650 --> 00:02:25,230 maybe, whereas, for the 75 00:02:26,160 --> 00:02:28,320 memory use, this appointment, 0.5, 76 00:02:29,030 --> 00:02:29,900 whereas for the memory 77 00:02:30,030 --> 00:02:32,340 use, it's about 1.5, which is there. Again, 78 00:02:32,680 --> 00:02:34,600 you know, it's all to 79 00:02:34,730 --> 00:02:35,850 us, it's not terribly Gaussian, but 80 00:02:35,980 --> 00:02:37,310 the value here and the value 81 00:02:37,550 --> 00:02:38,830 here is not that different 82 00:02:39,210 --> 00:02:41,180 from many other examples we've 83 00:02:41,430 --> 00:02:43,020 seen, and so P of 84 00:02:43,210 --> 00:02:44,530 X 1, will be pretty high, 85 00:02:45,550 --> 00:02:46,030 reasonably high. 86 00:02:46,290 --> 00:02:47,730 P of X 2 reasonably high. 87 00:02:47,980 --> 00:02:49,030 I mean, if you look at this 88 00:02:49,910 --> 00:02:51,230 plot right, this point here, 89 00:02:51,410 --> 00:02:52,530 it doesn't look that bad, and 90 00:02:52,830 --> 00:02:54,440 if you look at this plot, you 91 00:02:54,720 --> 00:02:56,690 know across here, doesn't look that bad. 92 00:02:57,050 --> 00:02:58,780 I mean, I have had examples with 93 00:02:58,980 --> 00:03:00,730 even greater memory used, or 94 00:03:01,030 --> 00:03:02,270 with even less CPU use, 95 00:03:02,860 --> 00:03:04,780 and so this example doesn't look that anomalous. 96 00:03:05,940 --> 00:03:07,380 And so, an anomaly detection algorithm 97 00:03:07,680 --> 00:03:10,090 will fail to flag this point as an anomaly. 98 00:03:10,550 --> 00:03:12,220 And it turns out what 99 00:03:12,360 --> 00:03:13,610 our anomaly detection algorithm is 100 00:03:13,880 --> 00:03:15,070 doing is that it is 101 00:03:15,200 --> 00:03:16,700 not realizing that this blue 102 00:03:16,900 --> 00:03:18,060 ellipse shows the high 103 00:03:18,210 --> 00:03:19,380 probability region, is that, one 104 00:03:19,490 --> 00:03:21,290 of the thing is that, examples here, 105 00:03:21,720 --> 00:03:23,430 a high probability, and the 106 00:03:23,680 --> 00:03:24,980 examples, the next circle 107 00:03:26,170 --> 00:03:27,280 of from a lower probably, and 108 00:03:27,370 --> 00:03:28,950 examples here are even 109 00:03:29,220 --> 00:03:31,040 lower probability, and somehow, here 110 00:03:31,150 --> 00:03:32,070 are things that are, green cross 111 00:03:32,420 --> 00:03:33,430 there, it's pretty high probability, 112 00:03:34,490 --> 00:03:35,510 and in particular, it tends to think 113 00:03:35,990 --> 00:03:37,740 that, you know, everything in this 114 00:03:38,000 --> 00:03:40,400 region, everything on the 115 00:03:40,580 --> 00:03:43,390 line that I'm circling over, has, you know, about equal probability, 116 00:03:44,160 --> 00:03:45,810 and it doesn't realize that something 117 00:03:46,790 --> 00:03:50,910 out here actually has 118 00:03:51,080 --> 00:03:53,130 much lower probability than something over there. 119 00:03:55,060 --> 00:03:56,080 So, in order to fix 120 00:03:56,270 --> 00:03:57,300 this, we can, we're going to 121 00:03:57,580 --> 00:03:58,930 develop a modified version of 122 00:03:58,990 --> 00:04:01,030 the anomaly detection algorithm, using 123 00:04:01,430 --> 00:04:02,520 something called the multivariate 124 00:04:02,580 --> 00:04:05,880 Gaussian distribution also called the multivariate normal distribution. 125 00:04:07,330 --> 00:04:08,120 So here's what we're going to 126 00:04:08,810 --> 00:04:10,270 do. We have features x 127 00:04:10,470 --> 00:04:11,680 which are in Rn and 128 00:04:11,910 --> 00:04:14,180 instead of P of X 1, P of X 2, separately, 129 00:04:14,570 --> 00:04:15,630 we're going to model P of 130 00:04:15,800 --> 00:04:16,840 X, all in one go, 131 00:04:17,010 --> 00:04:18,970 so model P of X, you know, all at the same time. 132 00:04:20,300 --> 00:04:21,550 So the parameters of the 133 00:04:21,830 --> 00:04:24,140 multivariate Gaussian distribution are mu, 134 00:04:24,630 --> 00:04:25,770 which is a vector, and sigma, 135 00:04:26,490 --> 00:04:28,450 which is an n by n matrix, called a covariance matrix, 136 00:04:29,640 --> 00:04:30,870 and this is similar to the 137 00:04:31,010 --> 00:04:32,220 covariance matrix that we 138 00:04:32,430 --> 00:04:33,560 saw when we were working 139 00:04:34,080 --> 00:04:35,200 with the PCA, with the 140 00:04:35,280 --> 00:04:36,700 principal components analysis algorithm. 141 00:04:37,860 --> 00:04:38,970 For the second complete is, let 142 00:04:39,070 --> 00:04:39,880 me just write out the formula 143 00:04:40,930 --> 00:04:42,390 for the multivariate Gaussian distribution. 144 00:04:42,820 --> 00:04:44,030 So we say that probability of 145 00:04:44,140 --> 00:04:45,100 X, and this is parameterized 146 00:04:46,090 --> 00:04:47,500 by my parameters mu and 147 00:04:47,640 --> 00:04:49,280 sigma that the 148 00:04:49,360 --> 00:04:50,100 probability of x is equal 149 00:04:50,430 --> 00:04:52,260 to once again 150 00:04:52,580 --> 00:04:54,810 there's absolutely no need to memorize this formula. 151 00:04:56,030 --> 00:04:56,780 You know, you can look it up 152 00:04:57,010 --> 00:04:58,160 whenever you need to use 153 00:04:58,340 --> 00:04:59,130 it, but this is what 154 00:04:59,690 --> 00:05:01,230 the probability of X looks like. 155 00:05:03,000 --> 00:05:04,680 Transverse, 2nd inverse, X 156 00:05:05,220 --> 00:05:06,300 minus mu. 157 00:05:07,400 --> 00:05:08,850 And this thing here, 158 00:05:10,390 --> 00:05:11,510 the absolute value of sigma, this 159 00:05:11,680 --> 00:05:13,140 thing here when you write 160 00:05:13,410 --> 00:05:14,430 this symbol, this is called 161 00:05:14,600 --> 00:05:17,220 the determent of sigma 162 00:05:18,150 --> 00:05:19,620 and this is a mathematical function 163 00:05:20,210 --> 00:05:21,740 of a matrix and you really 164 00:05:21,960 --> 00:05:22,820 don't need to know what the 165 00:05:23,240 --> 00:05:24,250 determinant of a matrix is, 166 00:05:24,780 --> 00:05:25,770 but really all you need to 167 00:05:25,860 --> 00:05:27,180 know is that you can 168 00:05:27,320 --> 00:05:29,380 compute it in octave by using 169 00:05:29,760 --> 00:05:31,820 the octave command DET of 170 00:05:33,570 --> 00:05:33,570 sigma. 171 00:05:34,010 --> 00:05:36,210 Okay, and again, just be clear, alright? 172 00:05:36,300 --> 00:05:38,240 In this expression, these sigmas 173 00:05:38,730 --> 00:05:41,250 here, these are just n by n matrix. 174 00:05:41,850 --> 00:05:43,150 This is not a summation and 175 00:05:43,260 --> 00:05:45,680 you know, the sigma there is an n by n matrix. 176 00:05:46,710 --> 00:05:47,780 So that's the formula for P 177 00:05:48,010 --> 00:05:50,500 of X, but it's 178 00:05:50,820 --> 00:05:52,030 more interestingly, or more importantly, 179 00:05:53,940 --> 00:05:55,610 what does P of X actually looks like? 180 00:05:56,190 --> 00:05:57,450 Lets look at some examples of 181 00:05:58,020 --> 00:06:00,690 multivariate Gaussian distributions. 182 00:06:02,350 --> 00:06:03,380 So let's take a 183 00:06:03,500 --> 00:06:04,700 two dimensional example, say if 184 00:06:04,820 --> 00:06:06,550 I have N equals 2, I 185 00:06:06,710 --> 00:06:08,160 have two features, X 1 and X 2. 186 00:06:09,250 --> 00:06:10,540 Lets say I set MU to 187 00:06:10,650 --> 00:06:11,800 be equal to 0 and sigma 188 00:06:12,330 --> 00:06:14,030 to be equal to this matrix here. 189 00:06:14,200 --> 00:06:16,710 With 1s on the diagonals and 0s on the off-diagonals, 190 00:06:17,600 --> 00:06:19,980 this matrix is sometimes also called the identity matrix. 191 00:06:21,350 --> 00:06:22,470 In that case, p of 192 00:06:22,590 --> 00:06:24,950 x will look like 193 00:06:25,240 --> 00:06:27,430 this, and what 194 00:06:27,600 --> 00:06:29,380 I'm showing in this figure is, you know, 195 00:06:29,500 --> 00:06:30,900 for a specific value of X1 196 00:06:31,240 --> 00:06:32,860 and for a specific value of 197 00:06:32,970 --> 00:06:34,680 X2, the height of 198 00:06:34,810 --> 00:06:36,470 this surface the value 199 00:06:36,970 --> 00:06:38,330 of p of x. And 200 00:06:38,470 --> 00:06:39,520 so with this setting the parameters 201 00:06:40,610 --> 00:06:42,100 p of x is highest when 202 00:06:42,300 --> 00:06:43,620 X1 and X2 equal zero 0, 203 00:06:44,010 --> 00:06:45,710 so that's the peak of this Gaussian distribution, 204 00:06:46,950 --> 00:06:48,760 and the probability falls off with this 205 00:06:48,970 --> 00:06:51,330 sort of two dimensional Gaussian or 206 00:06:51,510 --> 00:06:53,590 this bell shaped two dimensional bell-shaped surface. 207 00:06:55,080 --> 00:06:56,400 Down below is the same 208 00:06:56,610 --> 00:06:58,230 thing but plotted using a 209 00:06:58,330 --> 00:07:00,970 contour plot instead, or using different colors, 210 00:07:01,150 --> 00:07:02,020 and so this 211 00:07:02,530 --> 00:07:04,210 heavy intense red in the 212 00:07:04,280 --> 00:07:06,260 middle, corresponds to the highest values, 213 00:07:06,850 --> 00:07:08,230 and then the values decrease 214 00:07:08,790 --> 00:07:10,470 with the yellow being slightly lower 215 00:07:10,700 --> 00:07:11,830 values the cyan being 216 00:07:12,060 --> 00:07:13,230 lower values and this deep 217 00:07:14,000 --> 00:07:15,440 blue being the lowest 218 00:07:15,450 --> 00:07:17,010 values so this is really the same figure but plotted 219 00:07:17,240 --> 00:07:19,410 viewed from the top instead, using colors instead. 220 00:07:21,390 --> 00:07:22,510 And so, with this distribution, 221 00:07:23,830 --> 00:07:25,010 you see that it faces most 222 00:07:25,300 --> 00:07:27,440 of the probability near 0,0 223 00:07:27,600 --> 00:07:28,630 and then as you go out 224 00:07:28,710 --> 00:07:32,450 from 0,0 the probability of X1 and X2 goes down. 225 00:07:36,000 --> 00:07:37,220 Now lets try varying some 226 00:07:37,310 --> 00:07:38,630 of the parameters and see 227 00:07:38,770 --> 00:07:40,150 what happens. So let's 228 00:07:40,940 --> 00:07:42,420 take sigma and change it 229 00:07:42,590 --> 00:07:44,720 so let's say sigma shrinks a 230 00:07:44,870 --> 00:07:46,350 little bit. Sigma is a 231 00:07:46,580 --> 00:07:47,710 covariance matrix and so it 232 00:07:47,820 --> 00:07:49,030 measures the variance or the 233 00:07:49,120 --> 00:07:50,640 variability of the features X1 X2. 234 00:07:50,720 --> 00:07:52,080 So if the shrink 235 00:07:52,400 --> 00:07:53,430 sigma then what you get 236 00:07:53,780 --> 00:07:54,290 is what you get is that the 237 00:07:54,400 --> 00:07:56,320 width of this bump diminishes 238 00:07:57,760 --> 00:07:59,310 and the height also 239 00:07:59,550 --> 00:08:00,620 increases a bit, because the 240 00:08:01,090 --> 00:08:03,080 area under the surface is equal to 1. 241 00:08:03,130 --> 00:08:04,400 So the integral of the 242 00:08:04,950 --> 00:08:06,230 volume under the surface is 243 00:08:06,580 --> 00:08:08,000 equal to 1, because probability 244 00:08:08,690 --> 00:08:10,080 distribution must integrate to one. 245 00:08:10,800 --> 00:08:11,650 But, if you shrink the variance, 246 00:08:12,660 --> 00:08:14,290 it's kinda like shrinking 247 00:08:14,810 --> 00:08:15,870 sigma squared, 248 00:08:16,740 --> 00:08:20,080 you end up with a narrower distribution, and one that's a little bit taller. 249 00:08:20,860 --> 00:08:22,150 And so you see here also the 250 00:08:22,580 --> 00:08:27,200 concentric ellipsis has shrunk a little bit. 251 00:08:27,340 --> 00:08:28,730 Whereas in contrast if you were to increase sigma 252 00:08:29,770 --> 00:08:31,000 to 2 2 on the 253 00:08:31,110 --> 00:08:32,020 diagonals, so it is now two 254 00:08:32,220 --> 00:08:34,370 times the identity then you end up with a 255 00:08:34,510 --> 00:08:35,880 much wider and much flatter Gaussian. 256 00:08:36,150 --> 00:08:38,190 And so the width of this is much wider. 257 00:08:38,930 --> 00:08:39,800 This is hard to see but this 258 00:08:40,020 --> 00:08:41,090 is still a bell shaped bump, 259 00:08:41,210 --> 00:08:42,540 it's just flattened down a lot, 260 00:08:42,620 --> 00:08:44,470 it has become much wider and 261 00:08:44,590 --> 00:08:45,720 so the variance or the 262 00:08:45,830 --> 00:08:48,690 variability of X1 and X2 just becomes wider. 263 00:08:50,520 --> 00:08:50,980 Here are a few more examples. 264 00:08:51,670 --> 00:08:53,930 Now lets try varying 265 00:08:54,070 --> 00:08:55,490 one of the elements of sigma at the time. 266 00:08:55,840 --> 00:08:58,080 Let's say I send sigma to 267 00:08:58,140 --> 00:09:00,020 0.6 there, and 1 over there. 268 00:09:01,340 --> 00:09:02,380 What this does, is this 269 00:09:02,610 --> 00:09:04,240 reduces the variance of 270 00:09:05,780 --> 00:09:06,960 the first feature, X 1, while 271 00:09:07,770 --> 00:09:08,890 keeping the variance of the 272 00:09:08,960 --> 00:09:11,530 second feature X 2, the same. 273 00:09:12,160 --> 00:09:15,150 And so with this setting of parameters, you can model things like that. 274 00:09:15,670 --> 00:09:16,910 X 1 has smaller variance, and 275 00:09:17,580 --> 00:09:19,120 X 2 has larger variance. 276 00:09:20,080 --> 00:09:20,800 Whereas if I do this, 277 00:09:21,120 --> 00:09:22,900 if I set this 278 00:09:23,090 --> 00:09:24,390 matrix to 2, 1 279 00:09:24,560 --> 00:09:25,900 then you can also model 280 00:09:26,230 --> 00:09:27,470 examples where you know here 281 00:09:28,850 --> 00:09:30,590 we'll say X1 can have take 282 00:09:30,830 --> 00:09:31,930 on a large range of values 283 00:09:32,220 --> 00:09:34,870 whereas X2 takes on a relatively narrower range of values. 284 00:09:35,070 --> 00:09:37,060 And that's reflected in this 285 00:09:37,270 --> 00:09:38,040 figure as well, you know where, 286 00:09:38,750 --> 00:09:40,530 the distribution falls off 287 00:09:40,830 --> 00:09:42,670 more slowly as X 1 288 00:09:42,820 --> 00:09:43,940 moves away from 0, 289 00:09:44,180 --> 00:09:45,380 and falls off very 290 00:09:45,640 --> 00:09:48,080 rapidly as X 2 moves away from 0. 291 00:09:49,190 --> 00:09:50,710 And similarly if 292 00:09:50,800 --> 00:09:52,320 we were to modify 293 00:09:53,010 --> 00:09:54,490 this element of the 294 00:09:54,660 --> 00:09:55,570 matrix instead, then similar to the previous 295 00:09:57,390 --> 00:09:58,860 slide, except that here where 296 00:09:59,450 --> 00:10:00,900 you know playing around here saying 297 00:10:01,240 --> 00:10:03,010 that X2 can take on 298 00:10:03,170 --> 00:10:04,460 a very small range of values 299 00:10:05,190 --> 00:10:06,840 and so here if this 300 00:10:07,200 --> 00:10:08,740 is 0.6, we notice now X2 301 00:10:09,810 --> 00:10:10,610 tends to take on a much 302 00:10:10,760 --> 00:10:12,930 smaller range of values than the original example, 303 00:10:14,010 --> 00:10:15,310 whereas if we were to 304 00:10:15,680 --> 00:10:17,320 set sigma to be equal to 2 then 305 00:10:17,410 --> 00:10:20,580 that's like saying X2 you know, has a much larger range of values. 306 00:10:22,780 --> 00:10:23,570 Now, one of the cool 307 00:10:23,880 --> 00:10:24,950 things about the multivariate 308 00:10:25,190 --> 00:10:26,690 Gaussian distribution is that 309 00:10:26,880 --> 00:10:28,050 you can also use it to 310 00:10:28,330 --> 00:10:30,230 model correlations between the data. 311 00:10:30,410 --> 00:10:31,930 That is we can use it to 312 00:10:32,060 --> 00:10:33,510 model the fact that 313 00:10:33,610 --> 00:10:34,940 X1 and X2 tend to be 314 00:10:35,070 --> 00:10:36,760 highly correlated with each other for example. 315 00:10:37,640 --> 00:10:38,880 So specifically if you start 316 00:10:39,540 --> 00:10:40,720 to change the off diagonal 317 00:10:41,340 --> 00:10:42,390 entries of this covariance 318 00:10:42,950 --> 00:10:45,250 matrix you can get a different type of Gaussian distribution. 319 00:10:46,610 --> 00:10:48,250 And so as I 320 00:10:48,340 --> 00:10:49,590 increase the off-diagonal entries 321 00:10:50,090 --> 00:10:51,300 from .5 to .8, what 322 00:10:51,580 --> 00:10:53,080 I get is this distribution that 323 00:10:53,380 --> 00:10:54,590 is more and more thinly peaked 324 00:10:55,100 --> 00:10:57,480 along this sort of x equals y line. 325 00:10:57,700 --> 00:10:59,100 And so here the 326 00:10:59,160 --> 00:11:00,610 contour says that x and 327 00:11:00,730 --> 00:11:03,010 y tend to grow together and 328 00:11:03,290 --> 00:11:04,500 the things that are with 329 00:11:04,640 --> 00:11:06,550 large probability are if 330 00:11:06,790 --> 00:11:08,140 either X1 is large and 331 00:11:08,260 --> 00:11:09,560 Y2 is large or X1 332 00:11:09,890 --> 00:11:11,160 is small and Y2 is small. 333 00:11:11,490 --> 00:11:12,480 Or somewhere in between. 334 00:11:13,110 --> 00:11:14,700 And as this entry, 335 00:11:15,130 --> 00:11:16,280 0.8 gets large, you get 336 00:11:16,490 --> 00:11:18,410 a Gaussian distribution, that's sort of 337 00:11:18,660 --> 00:11:20,570 where all the probability lies on 338 00:11:20,770 --> 00:11:22,870 this sort of narrow region, 339 00:11:24,350 --> 00:11:26,200 where x is approximately equal to 340 00:11:26,420 --> 00:11:27,530 y. This is a very 341 00:11:28,020 --> 00:11:30,290 tall, thin distribution you know 342 00:11:30,670 --> 00:11:32,570 line mostly along this line 343 00:11:33,860 --> 00:11:34,940 central region where x is 344 00:11:35,010 --> 00:11:36,860 close to y. So this 345 00:11:37,130 --> 00:11:38,350 is if we set these 346 00:11:38,810 --> 00:11:40,360 entries to be positive entries. 347 00:11:40,970 --> 00:11:42,120 In contrast if we set 348 00:11:42,460 --> 00:11:43,530 these to negative values, as 349 00:11:44,350 --> 00:11:46,340 I decreases it to -.5 350 00:11:46,380 --> 00:11:47,920 down to -.8, then 351 00:11:48,060 --> 00:11:49,360 what we get is a model where 352 00:11:49,870 --> 00:11:50,930 we put most of the probability 353 00:11:51,620 --> 00:11:53,930 in this sort of negative X 354 00:11:54,010 --> 00:11:55,420 one in the next 2 correlation region, 355 00:11:55,710 --> 00:11:57,330 and so, most of the 356 00:11:57,480 --> 00:11:58,420 probability now lies in this region, 357 00:11:58,810 --> 00:11:59,910 where X 1 is about equal 358 00:12:00,190 --> 00:12:01,700 to -X 2, rather than X 359 00:12:01,890 --> 00:12:03,370 1 equals X 2. 360 00:12:04,180 --> 00:12:05,460 And so this captures a sort 361 00:12:05,610 --> 00:12:08,050 of negative correlation between x1 362 00:12:10,300 --> 00:12:10,650 and x2. 363 00:12:11,010 --> 00:12:12,550 And so this is 364 00:12:12,750 --> 00:12:13,640 a hopefully this gives you a sense of the 365 00:12:13,750 --> 00:12:15,230 different distributions that the 366 00:12:15,650 --> 00:12:17,400 multivariate Gaussian distribution can capture. 367 00:12:18,680 --> 00:12:20,430 So follow up in varying, the 368 00:12:20,730 --> 00:12:22,200 covariance matrix sigma, the other 369 00:12:22,910 --> 00:12:23,880 thing you can do is 370 00:12:24,030 --> 00:12:26,090 also, vary the mean 371 00:12:26,300 --> 00:12:27,730 parameter mu, and so 372 00:12:28,370 --> 00:12:29,740 operationally, we have mu 373 00:12:30,270 --> 00:12:31,190 equal 0 0, and so the 374 00:12:31,250 --> 00:12:32,820 distribution was centered around 375 00:12:33,270 --> 00:12:34,650 X 1 equals 0, X2 equals 0, 376 00:12:35,050 --> 00:12:35,980 so the peak of the 377 00:12:36,070 --> 00:12:38,530 distribution is here, whereas, 378 00:12:38,950 --> 00:12:40,430 if we vary the values of 379 00:12:40,610 --> 00:12:42,120 mu, then that varies the 380 00:12:42,360 --> 00:12:43,700 peak of the distribution and so, 381 00:12:43,910 --> 00:12:45,770 if mu equals 0, 0.5, 382 00:12:45,920 --> 00:12:47,100 the peak is at, you know, 383 00:12:47,270 --> 00:12:49,470 X1 equals zero, and X2 384 00:12:49,810 --> 00:12:51,430 equals 0.5, and so the 385 00:12:51,980 --> 00:12:53,400 peak or the center of 386 00:12:53,710 --> 00:12:55,260 this distribution has shifted, 387 00:12:56,470 --> 00:12:57,770 and if mu was 1.5 388 00:12:58,340 --> 00:13:00,050 minus 0.5 then OK, 389 00:13:01,170 --> 00:13:03,350 and similarly the peak 390 00:13:03,890 --> 00:13:05,490 of the distribution has now 391 00:13:05,620 --> 00:13:06,750 shifted to a different location, 392 00:13:07,670 --> 00:13:09,710 corresponding to where, you know, 393 00:13:09,910 --> 00:13:11,020 X1 is 1.5 and X2 394 00:13:11,350 --> 00:13:12,710 is -0.5, and so 395 00:13:13,290 --> 00:13:15,180 varying the mu parameter, just shifts 396 00:13:15,730 --> 00:13:17,840 around the center of this whole distribution. 397 00:13:18,450 --> 00:13:19,670 So, hopefully, looking at 398 00:13:19,780 --> 00:13:21,270 all these different pictures gives you 399 00:13:21,410 --> 00:13:22,440 a sense of the sort 400 00:13:22,700 --> 00:13:24,850 of probability distributions that 401 00:13:25,070 --> 00:13:28,000 the Multivariate Gaussian Distribution allows you to capture. 402 00:13:28,800 --> 00:13:29,800 And the key advantage of it 403 00:13:29,990 --> 00:13:30,930 is it allows you to 404 00:13:31,130 --> 00:13:32,240 capture, when you'd expect 405 00:13:32,750 --> 00:13:33,840 two different features to be 406 00:13:33,970 --> 00:13:36,560 positively correlated, or maybe negatively correlated. 407 00:13:37,790 --> 00:13:39,030 In the next video, we'll take 408 00:13:39,260 --> 00:13:40,760 this multivariate Gaussian distribution 409 00:13:41,670 --> 00:13:43,290 and apply it to anomaly detection.