1 00:00:00,000 --> 00:00:04,709 [MUSIC] 2 00:00:04,709 --> 00:00:07,437 Let's spend some time describing mixture models. 3 00:00:07,437 --> 00:00:12,146 And here, as a motivating application, we're going to use this idea of trying to 4 00:00:12,146 --> 00:00:16,380 discover groups of related images, that is, to cluster images. 5 00:00:16,380 --> 00:00:19,070 And the reason we're going to use this application is because it's really 6 00:00:19,070 --> 00:00:23,570 visually appealing and very intuitive because of that structure. 7 00:00:23,570 --> 00:00:28,580 So remember here our goal is to discover, for example, this group of all 8 00:00:28,580 --> 00:00:33,490 images relating to clouds, and all images about sunsets, and images about dogs, and 9 00:00:33,490 --> 00:00:39,680 images about pink flowers, and images about the ocean, and things like this. 10 00:00:39,680 --> 00:00:43,070 But remember, because we're in an unsupervised setting, 11 00:00:43,070 --> 00:00:46,320 the output of our algorithm is just going to be these groupings or 12 00:00:46,320 --> 00:00:49,380 cluster assignments, rather than labels. 13 00:00:49,380 --> 00:00:53,020 So we're not actually going to label the group as clouds, but 14 00:00:53,020 --> 00:00:56,250 you could go through and do that post facto. 15 00:00:56,250 --> 00:01:00,390 Okay, so to start with, let's discuss our image representation. 16 00:01:00,390 --> 00:01:04,890 And for the sake of this module, and for the assignment that you're going to be 17 00:01:04,890 --> 00:01:09,830 working with, we're going to use a really, really simple image representation, 18 00:01:09,830 --> 00:01:14,070 where we just simply average the RGB, red, green, 19 00:01:14,070 --> 00:01:19,300 blue values, so the pixel intensities across every pixel in the image, 20 00:01:19,300 --> 00:01:22,640 to get a single RGB vector per image. 
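The averaging step just described can be sketched in a few lines of Python. This is only an illustration under assumptions that are not from the lecture: the image is a flat list of 8-bit `(r, g, b)` tuples, the helper name `mean_rgb` is hypothetical, and dividing by 255 is one way to scale the result to the range zero to one.

```python
def mean_rgb(pixels):
    """Average 8-bit (r, g, b) pixels into one vector scaled to [0, 1].

    Hypothetical helper: the pixel format and the division by 255
    are assumptions made for this sketch.
    """
    n = len(pixels)
    if n == 0:
        raise ValueError("image has no pixels")
    totals = [0, 0, 0]
    for r, g, b in pixels:
        totals[0] += r
        totals[1] += g
        totals[2] += b
    # One number per channel: the mean intensity over all pixels.
    return tuple(t / (255 * n) for t in totals)


# Toy four-pixel "image": half pure red, half pure blue.
toy_image = [(255, 0, 0), (0, 0, 255), (255, 0, 0), (0, 0, 255)]
features = mean_rgb(toy_image)
print(features)
```

Whatever the size of the image, the result is always a vector of three numbers, which is what makes this representation so simple to cluster.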
21 00:01:24,220 --> 00:01:30,957 So for example, maybe this image of clouds has an RGB vector 0.05, 22 00:01:30,957 --> 00:01:35,674 0.7, 0.9, and here we've normalized all 23 00:01:35,674 --> 00:01:39,846 these RGB values to be between zero and one. 24 00:01:39,846 --> 00:01:45,397 And then maybe this sunset image has an R value of 0.85, 25 00:01:45,397 --> 00:01:52,282 green value of 0.05, blue value 0.35, and finally this trees or 26 00:01:52,282 --> 00:01:58,186 forest image has an RGB vector 0.02, 0.95, 0.4. 27 00:01:58,186 --> 00:02:01,905 Now that we have a quantitative representation of our images, 28 00:02:01,905 --> 00:02:05,840 we can turn to our data analysis and our modeling. 29 00:02:05,840 --> 00:02:10,130 So to start with, let's imagine that we take all of the cloud images. 30 00:02:10,130 --> 00:02:14,240 So for now let's imagine that these images have labels. They don't; 31 00:02:14,240 --> 00:02:16,790 that's the point of this clustering task. 32 00:02:16,790 --> 00:02:19,160 But for the sake of building up the model, 33 00:02:19,160 --> 00:02:23,410 let's imagine that we could grab out just the images of clouds. 34 00:02:24,530 --> 00:02:27,030 And then let's just look at the blue 35 00:02:28,870 --> 00:02:32,720 index in this RGB vector, and histogram 36 00:02:32,720 --> 00:02:38,400 what that blue intensity is across all these cloud images in our dataset. 37 00:02:38,400 --> 00:02:43,180 So maybe the histogram would look something like this, where the average 38 00:02:43,180 --> 00:02:48,050 value is pretty high, 0.8, but there's some spread around that. 39 00:02:48,050 --> 00:02:52,380 And the spread might look somewhat like a bell curve, 40 00:02:52,380 --> 00:02:58,360 where maybe there are lots of images that have blue intensities around 0.8 and 41 00:02:58,360 --> 00:03:03,840 very few with extremely high values and very few with extremely low blue values. 
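To make the bell-shaped histogram concrete, here is a rough Python sketch that builds one from synthetic data. Nothing here comes from a real image collection: the blue values are drawn from a normal distribution centered at the 0.8 mentioned above, and the spread of 0.08, the sample size, and the ten bins are all assumptions picked for illustration.

```python
import random

# Synthetic stand-in for "blue intensity of every cloud image".
# The mean 0.8 is from the lecture; the spread 0.08 is an assumption.
rng = random.Random(0)
N_IMAGES = 1000
blue = [min(1.0, max(0.0, rng.gauss(0.8, 0.08))) for _ in range(N_IMAGES)]

# Count how many images fall into each of ten equal-width bins on [0, 1].
N_BINS = 10
counts = [0] * N_BINS
for b in blue:
    counts[min(int(b * N_BINS), N_BINS - 1)] += 1

# Text histogram: one '#' per ten images.
for i, c in enumerate(counts):
    print(f"{i / N_BINS:.1f}-{(i + 1) / N_BINS:.1f} | {'#' * (c // 10)}")

sample_mean = sum(blue) / N_IMAGES
```

Most of the mass lands in the bins next to 0.8, with only a handful of images in the far bins, which is the shape described for the cloud images.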
42 00:03:05,320 --> 00:03:08,760 Then we could look at all the sunset images and 43 00:03:08,760 --> 00:03:10,730 make the same type of histogram. 44 00:03:10,730 --> 00:03:13,470 But now, because we're looking at these sunset images and 45 00:03:13,470 --> 00:03:18,330 you tend not to get lots of blues in these images, maybe the average value is much 46 00:03:18,330 --> 00:03:24,060 lower like 0.3 and perhaps there's also a lot less spread in 47 00:03:24,060 --> 00:03:28,630 terms of the range of blue values we see across these sunset images. 48 00:03:29,670 --> 00:03:34,400 And then we can do the same thing for forest images, but 49 00:03:34,400 --> 00:03:39,070 maybe here the blue intensity is a bit higher than in sunsets because you can get 50 00:03:39,070 --> 00:03:43,170 parts of the image that are really about the sky, so maybe it's 51 00:03:43,170 --> 00:03:48,160 a bit higher at 0.42 and a little bit more spread than we saw for sunset images. 52 00:03:49,300 --> 00:03:55,140 Okay, but remember that we don't actually have these labels of sunset, 53 00:03:55,140 --> 00:04:00,940 forest, cloud, we just have a whole bunch of jumbled up images and 54 00:04:00,940 --> 00:04:03,780 for each one we have its blue intensity and 55 00:04:03,780 --> 00:04:07,930 we can make a histogram over all images in our dataset. 56 00:04:07,930 --> 00:04:10,910 And maybe the histogram would look something like this where there 57 00:04:10,910 --> 00:04:13,760 are these multiple humps, three different humps. 58 00:04:13,760 --> 00:04:18,504 And what that would really correspond to is the fact that there are three different 59 00:04:18,504 --> 00:04:21,941 categories of images that we're looking at, the sunset, 60 00:04:21,941 --> 00:04:25,052 forest and cloud images that we talked about before. 
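One way to picture the three-hump histogram is as a sum of three bell curves, one per hidden category. The sketch below is only a rough illustration: the centers 0.3, 0.42, and 0.8 are the values mentioned above, while the standard deviations and the equal weights of one third are assumptions chosen to match the qualitative description (sunsets tightest, forests a bit wider).

```python
import math


def normal_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


# (name, weight, mean blue, std blue); weights and stds are assumed.
COMPONENTS = [
    ("sunset", 1 / 3, 0.30, 0.03),
    ("forest", 1 / 3, 0.42, 0.05),
    ("cloud", 1 / 3, 0.80, 0.08),
]


def mixture_density(x):
    """Weighted sum of the three bell curves at blue intensity x."""
    return sum(w * normal_pdf(x, m, s) for _, w, m, s in COMPONENTS)


# Evaluate on a fine grid and report the local maxima, i.e. the humps.
grid = [i / 1000 for i in range(1001)]
density = [mixture_density(x) for x in grid]
humps = [
    grid[i]
    for i in range(1, 1000)
    if density[i - 1] < density[i] >= density[i + 1]
]
print(humps)
```

With these assumed spreads the curve has three humps, but the two lower ones sit close together, so an image whose blue value falls between them is hard to place from blue alone.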
61 00:04:25,052 --> 00:04:28,689 And if we look at any single image, for example, 62 00:04:28,689 --> 00:04:33,572 maybe this forest image shown here, where I've placed the image 63 00:04:33,572 --> 00:04:38,569 with what its blue intensity might be so something around 0.4. 64 00:04:38,569 --> 00:04:43,871 Well maybe from this histogram here we could say, well there's some group 65 00:04:43,871 --> 00:04:49,010 of images with high blue intensity and I'm going to call that one cluster, 66 00:04:49,010 --> 00:04:53,334 one group, and this forest image is clearly not in that group, 67 00:04:53,334 --> 00:04:57,420 but now I'm in this position where I don't really know. 68 00:04:57,420 --> 00:05:02,280 There seem to be maybe two other groups and I don't know which one this goes into. 69 00:05:03,440 --> 00:05:06,590 Well in this case one thing we can actually do is we can look at another 70 00:05:06,590 --> 00:05:08,430 dimension of our observation vector. 71 00:05:08,430 --> 00:05:13,050 We can look at the red intensity in the set of images. 72 00:05:13,050 --> 00:05:16,890 And if we look at the histogram 73 00:05:16,890 --> 00:05:20,844 over red intensities over all forest images in our dataset, 74 00:05:20,844 --> 00:05:26,780 well this intensity is probably pretty low in most of these images. 75 00:05:26,780 --> 00:05:30,080 Not many reds appear in images of forests. 76 00:05:30,080 --> 00:05:34,220 But if we look at the sunset images, then the red 77 00:05:34,220 --> 00:05:39,540 intensities are going to be really high, maybe something centered around 0.9. 78 00:05:39,540 --> 00:05:42,330 So whereas when we looked at the blue dimension, 79 00:05:42,330 --> 00:05:46,900 it was pretty hard to distinguish between forests and sunsets. 80 00:05:46,900 --> 00:05:49,032 When we look at the red dimension, 81 00:05:49,032 --> 00:05:54,270 maybe along that dimension things are much more separable. 
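The effect of adding the red dimension can be sketched numerically. This is a minimal illustration under several assumptions that are not from the lecture: each group is modeled as a bell curve per dimension with the two dimensions treated as independent, both groups are equally likely beforehand, the standard deviations and the forest red center of 0.1 are made up, and the image itself is hypothetical. Clouds are left out because blue alone already separates them.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


# (mean, std) per dimension. The centers 0.9 (sunset red), 0.3 and 0.42
# (blue) follow the lecture; every other number is an assumption.
CLUSTERS = {
    "sunset": {"red": (0.90, 0.08), "blue": (0.30, 0.03)},
    "forest": {"red": (0.10, 0.08), "blue": (0.42, 0.05)},
}


def soft_assignment(image, dims):
    """Probability of each group given only the listed dimensions."""
    scores = {}
    for name, params in CLUSTERS.items():
        likelihood = 1.0
        for d in dims:
            mean, std = params[d]
            likelihood *= normal_pdf(image[d], mean, std)
        scores[name] = likelihood
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


# Hypothetical image: blue falls between the two groups, red is low.
image = {"red": 0.05, "blue": 0.35}
blue_only = soft_assignment(image, ["blue"])
red_and_blue = soft_assignment(image, ["red", "blue"])
print(blue_only)
print(red_and_blue)
```

Using blue alone, the two groups come out roughly equally plausible for this image; once red is included, nearly all of the probability goes to the forest group.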
82 00:05:54,270 --> 00:05:58,677 So the point I want to make here is that actually sometimes when we're thinking 83 00:05:58,677 --> 00:06:02,955 about doing clustering, and thinking about probabilities of assignments 84 00:06:02,955 --> 00:06:07,029 of observations to clusters, the thing that can allow us to distinguish 85 00:06:07,029 --> 00:06:11,439 between the different clusters really appears when we're looking at a higher 86 00:06:11,439 --> 00:06:14,555 dimensional space than just one dimension like blue. 87 00:06:14,555 --> 00:06:18,929 Okay, but to this point really all we've talked about is our data, 88 00:06:18,929 --> 00:06:23,845 thinking about histogramming these values, these intensity values, and 89 00:06:23,845 --> 00:06:28,067 this idea that there might be some structure there but what we want 90 00:06:28,067 --> 00:06:32,596 to turn to is a model to actually capture this clustering structure and 91 00:06:32,596 --> 00:06:36,623 to do these types of soft assignments that we've described. 92 00:06:36,623 --> 00:06:37,123 [MUSIC]