1 00:00:00,000 --> 00:00:04,336 [MUSIC] 2 00:00:04,336 --> 00:00:08,524 And when we're thinking about using mixture models to do clustering, 3 00:00:08,524 --> 00:00:13,150 note that they can also be used just to do what's called density estimation. 4 00:00:13,150 --> 00:00:18,780 So estimate those types of curves over the histograms that we drew earlier. 5 00:00:18,780 --> 00:00:22,490 But in our case, we're going to focus in on the clustering application. 6 00:00:22,490 --> 00:00:27,060 And there, there's a really important other variable that we're going to 7 00:00:27,060 --> 00:00:30,210 introduce, and that is the cluster indicator, 8 00:00:30,210 --> 00:00:33,240 the assignment variable for every one of our observations. 9 00:00:33,240 --> 00:00:39,272 So we have, this is the cluster 10 00:00:39,272 --> 00:00:45,780 assignment for observation Xi. 11 00:00:45,780 --> 00:00:50,060 So this is exactly the same variable that we had in k-means that was assigning 12 00:00:50,060 --> 00:00:55,580 observations to clusters, but in that case just using the cluster center. 13 00:00:55,580 --> 00:00:58,940 Okay, so let's step back and think about what our model is saying. 14 00:00:58,940 --> 00:01:04,170 And the first question we can think about is, what's the probability that the ith 15 00:01:04,170 --> 00:01:09,550 data point in our data set is associated with the kth cluster? 16 00:01:09,550 --> 00:01:12,980 So for example, when we're talking about our images we could say, 17 00:01:12,980 --> 00:01:16,450 what's the probability that the ith image I see is, 18 00:01:16,450 --> 00:01:20,620 let's say, in the cluster of clouds images? 19 00:01:20,620 --> 00:01:24,950 And let's talk about this before we've actually observed what the image is. 20 00:01:24,950 --> 00:01:27,530 All we know is it's the ith index in our data set.
21 00:01:29,140 --> 00:01:34,520 Okay, well this is fully specified by the mixture weight pi k, 22 00:01:34,520 --> 00:01:39,540 because that tells us how prevalent cloud images are in our data set. 23 00:01:41,290 --> 00:01:43,250 So that's given right here. 24 00:01:43,250 --> 00:01:50,170 And if we don't observe the content of the image, then we just are caring about how 25 00:01:50,170 --> 00:01:55,180 many cloud images do we have relative to forest images relative to sunset images. 26 00:01:55,180 --> 00:01:58,310 So we say that the prior probability, 27 00:01:58,310 --> 00:02:02,940 that the ith image is assigned to cluster k, is given by pi k. 28 00:02:05,730 --> 00:02:10,870 Another question is, what if I know that an image comes from cluster k? 29 00:02:10,870 --> 00:02:12,080 So I'm going to fix that. 30 00:02:12,080 --> 00:02:14,260 I already know that it's a clouds image. 31 00:02:15,600 --> 00:02:16,680 Now I can say, 32 00:02:16,680 --> 00:02:22,166 what's the likelihood of observing the RGB vector associated with this image? 33 00:02:22,166 --> 00:02:28,040 So Xi, given that the image came from the kth cluster, 34 00:02:28,040 --> 00:02:31,370 this cluster of cloud images. 35 00:02:33,140 --> 00:02:40,370 And in this case what we do is we simply go to, this is the distribution of 36 00:02:42,280 --> 00:02:46,633 cloud images, or distribution of blue for 37 00:02:46,633 --> 00:02:51,810 cloud images. 38 00:02:51,810 --> 00:02:54,600 And we say okay, let's take this one image I have. 39 00:02:54,600 --> 00:02:57,490 This is my Xi image. 40 00:02:57,490 --> 00:03:02,466 And I look at its blue intensity and I say, under this distribution for clouds, 41 00:03:02,466 --> 00:03:04,580 how likely is it? 42 00:03:04,580 --> 00:03:06,735 Well, it's pretty likely. 43 00:03:06,735 --> 00:03:14,050 So it's reasonable to say that this was a clouds image. 
44 00:03:14,050 --> 00:03:17,420 But I can also look at this probability under, 45 00:03:17,420 --> 00:03:19,540 remember this was the forest category. 46 00:03:21,460 --> 00:03:22,120 And I could say, 47 00:03:22,120 --> 00:03:25,190 well what's the likelihood of this image under the forest category? 48 00:03:25,190 --> 00:03:27,910 Well, it's not that high. 49 00:03:27,910 --> 00:03:31,580 But on the other hand, what we know is that 50 00:03:31,580 --> 00:03:36,970 there are many more forest images in our data set than cloud images. 51 00:03:36,970 --> 00:03:40,600 So what we're going to be doing when we're going to form our soft assignments, 52 00:03:40,600 --> 00:03:43,360 which we'll talk about in the next section, 53 00:03:43,360 --> 00:03:46,130 is we're going to be thinking about weighting these two terms. 54 00:03:46,130 --> 00:03:50,380 Saying, well what's the prior probability that this image 55 00:03:50,380 --> 00:03:53,010 is from any one of these different classes? 56 00:03:53,010 --> 00:03:58,380 So in this case, it's most likely a forest image. 57 00:03:58,380 --> 00:04:02,240 But then I say, okay, well now I've observed the content of this image, 58 00:04:02,240 --> 00:04:08,320 the RGB vector for this image, and I want to say, I need to weight that in. 59 00:04:08,320 --> 00:04:16,332 And under the sunset category, it's extremely, extremely unlikely. 60 00:04:16,332 --> 00:04:19,320 There is basically zero probability of observing 61 00:04:19,320 --> 00:04:21,410 this blue intensity value under that category. 62 00:04:21,410 --> 00:04:26,470 So I can rule it out regardless of what the weight is on that category. 63 00:04:26,470 --> 00:04:29,630 But for these other categories, these other clusters, 64 00:04:29,630 --> 00:04:35,750 there's going to be some competition between how much I'm likely to just see 65 00:04:35,750 --> 00:04:40,950 images of that type versus how likely it is under that category.
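To make the competition between the two terms concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the lecture's own code: the mixture weights, the per-cluster means and standard deviations, and the blue intensity value are all made-up numbers, and each cluster's blue-intensity distribution is assumed to be a 1-D Gaussian. The function names are hypothetical too. Dividing by the total is just one convenient way to compare the weighted terms; the soft assignments themselves are covered in the next section.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical mixture weights (prior terms): forest images are the most common.
priors = {"clouds": 0.2, "forest": 0.6, "sunset": 0.2}

# Hypothetical (mean, std) of blue intensity in [0, 1] for each cluster.
params = {
    "clouds": (0.80, 0.10),
    "forest": (0.50, 0.15),
    "sunset": (0.10, 0.05),
}


def weighted_scores(x):
    """Prior times likelihood for each cluster, evaluated at blue intensity x."""
    return {k: priors[k] * gaussian_pdf(x, *params[k]) for k in priors}


def normalize(scores):
    """Rescale the scores so they sum to one and are easy to compare."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


blue = 0.72  # made-up blue intensity for one image
scores = weighted_scores(blue)
shares = normalize(scores)

for name in priors:
    print(f"{name}: prior={priors[name]:.2f}, share={shares[name]:.4f}")
```

With these made-up numbers the clouds cluster has the higher likelihood but the lower prior, and the forest cluster the reverse, so the two end up close to each other, while the sunset cluster is effectively ruled out whatever its weight.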
66 00:04:40,950 --> 00:04:44,010 And we're going to use both of these things to 67 00:04:44,010 --> 00:04:47,069 represent our uncertainty about the cluster assignment. 68 00:04:48,170 --> 00:04:51,500 So just to circle back and make sure we're very clear when we're looking at 69 00:04:51,500 --> 00:04:57,300 the probability of an observed RGB vector. 70 00:04:57,300 --> 00:05:02,050 RGB for image i, given that it's in cluster k, 71 00:05:03,430 --> 00:05:09,720 then this is just a single Gaussian with mean mu k and covariance sigma k. 72 00:05:09,720 --> 00:05:14,300 And this is referred to as the likelihood term, 73 00:05:14,300 --> 00:05:17,150 whereas before we called this the prior term. 74 00:05:19,104 --> 00:05:23,435 And I want to point out that this image, indeed there should be 75 00:05:23,435 --> 00:05:28,104 uncertainty about whether it's assigned to the clouds cluster or 76 00:05:28,104 --> 00:05:31,879 the forest cluster, because here we see some trees. 77 00:05:34,114 --> 00:05:36,620 And here we see some clouds. 78 00:05:38,270 --> 00:05:42,761 So it'd be natural to have uncertainty on the assignment of this image. 79 00:05:42,761 --> 00:05:46,979 [MUSIC]
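As a compact summary, the two terms from this segment can be written as follows. The symbol z_i for the cluster assignment of observation x_i is an assumed name, since the transcript does not say what the variable is called. The rest follows the spoken description: mixture weight pi k, and a Gaussian with mean mu k and covariance sigma k.

```latex
\begin{align}
  p(z_i = k) &= \pi_k
    && \text{(prior term)} \\
  p(x_i \mid z_i = k, \mu_k, \Sigma_k) &= \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
    && \text{(likelihood term)}
\end{align}
```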