[MUSIC] Let's spend some time
describing mixture models. And here, as a motivating application,
we're going to use this idea of trying to discover groups of related images,
that is, to cluster images. And the reason we're going to use this
application is because it's really visually appealing and
very intuitive because of that structure. So remember here our goal is to discover,
for example, this group of all images relating to clouds, and all images
about sunsets, and images about dogs, and images about pink flowers, and images
about the ocean, and things like this. But remember,
because we're in an unsupervised setting, the output of our algorithm is just
going to be these groupings or cluster assignments, rather than labels. So we're not actually going to
label the group as clouds, but you could go through and
do that post facto. Okay, so to start with let's
discuss our image representation. And for the sake of this module, and for
the assignment that you're going to be working with, we're going to use a really
simple image representation, where we simply average the RGB
(red, green, blue) values, that is, the pixel intensities,
across every pixel in the image, to get a single RGB vector per image. So for example, maybe this image
of clouds has an RGB vector [0.05, 0.7, 0.9], where we've normalized all these RGB values to be between zero and
one. And then maybe this sunset
image has a red value of 0.85, a green value of 0.05, and a blue value of 0.35,
and finally this trees or forest image has an RGB vector [0.02,
0.95, 0.4]. Now that we have a quantitative
representation of our images, we can turn to our data analysis and
our modeling. So to start with, let's imagine that
we take all of the cloud images. So for now let's imagine that these
images have labels. They don't; that's the point of this clustering task. But for the sake of building up the model, let's imagine that we could grab
out just the images of clouds. And then let's just look at the blue component of this RGB vector, and histogram that blue intensity across all
these cloud images in our dataset. So maybe the histogram would look
something like this, where the average value is pretty high, 0.8, but
there's some spread around that. And the spread might look
somewhat like a bell curve, where maybe there are lots of images that
have blue intensities around 0.8 and very few with extremely high values and
very few with extremely low blue values. Then we could look at all
the sunset images and make the same type of histogram. But now, because we're looking
at these sunset images and you tend not to get lots of blues in these
images, maybe the average value is much lower like 0.3 and
perhaps there's also a lot less spread in terms of the range of blue values
we see across these sunset images. And then we can do the same thing for
forest images, but maybe here the blue intensity is a bit
higher than in sunsets because you can get parts of the image that are really
about the sky, so maybe it's a bit higher at 0.42 and a little bit more
spread than we saw for sunset images. Okay, but remember that we don't
actually have these labels of sunset, forest, cloud, we just have a whole
bunch of jumbled-up images, and for each one we have
its blue intensity, and we can make a histogram over
all images in our dataset. And maybe the histogram would look
something like this where there are these multiple humps,
three different humps. And what that would really correspond to
is the fact that there are three different categories of images that
we're looking at, the sunset, forest and
cloud images that we talked about before. And if we look at any single image,
for example, maybe this forest image shown here,
where I've placed the image at what its blue intensity might be, so
something around 0.4. Well maybe from this histogram here
we could say, well there's some group of images with high blue intensity and
I'm going to call that one cluster, one group, and this forest image
is clearly not in that group, but now I'm in this position
where I don't really know. There seem to be maybe two other groups
and I don't know which one this goes into. Well in this case one thing we can
actually do is look at another dimension of our observation vector. We can look at the red
intensity in the set of images. And if we look at the histogram of red intensities over all
forest images in our dataset, well, this intensity is probably
pretty low in most of these images. Not many reds appear in images of forests. But if we look at the sunset images,
then the red intensities are going to be really high,
maybe something centered around 0.9. So whereas when we looked
at the blue dimension, it was pretty hard to distinguish
between forests and sunsets, when we look at the red dimension, maybe along that dimension
things are much more separable. So the point I want to make here is that
actually sometimes when we're thinking about doing clustering, and thinking
about probabilities of assignments of observations to clusters,
the thing that can allow us to distinguish between the different clusters really
appears when we're looking at a higher dimensional space than just
one dimension like blue. Okay, but to this point really all
we've talked about is our data, thinking about histogramming these values,
these intensity values, and this idea that there might be some
structure there. What we want to turn to now is a model to actually
capture this clustering structure and to do these types of soft
assignments that we've described. [MUSIC]
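To make the ideas in this segment concrete, here is a minimal sketch in Python with NumPy. It is not the course's assignment code. The helper names (`mean_rgb`, `separation`, `draw`) are made up for illustration, the toy image is a flat color rather than a real photo, and the pixel values are assumed to be 8-bit. The blue means (0.8, 0.3, 0.42) and the sunset red mean (0.9) follow the numbers mentioned above, but every spread, and the "low" forest red mean of 0.05, is an assumption.

```python
import numpy as np


def mean_rgb(image):
    """Average an H x W x 3 array of 8-bit pixels into one RGB vector in [0, 1]."""
    pixels = np.asarray(image, dtype=np.float64).reshape(-1, 3)
    return pixels.mean(axis=0) / 255.0  # assumes 8-bit channels (0..255)


def separation(a, b):
    """Gap between two group means, in units of their pooled standard deviation."""
    pooled_std = np.sqrt((np.var(a) + np.var(b)) / 2.0)
    return abs(np.mean(a) - np.mean(b)) / pooled_std


# A flat-colored toy image stands in for a real photo.
toy_image = np.zeros((4, 4, 3), dtype=np.uint8)
toy_image[...] = [51, 102, 255]
toy_vector = mean_rgb(toy_image)

rng = np.random.default_rng(0)
n = 500  # hypothetical number of images per group


def draw(mean, std):
    """Sample n bell-curve-shaped intensities, clipped to the valid range [0, 1]."""
    return np.clip(rng.normal(mean, std, n), 0.0, 1.0)


# Blue means follow the lecture (0.8, 0.3, 0.42); every spread is an assumption.
blue = {"cloud": draw(0.80, 0.08), "sunset": draw(0.30, 0.05), "forest": draw(0.42, 0.07)}
# Sunset red mean follows the lecture (0.9); the forest value is an assumed "low" mean.
red = {"sunset": draw(0.90, 0.04), "forest": draw(0.05, 0.03)}

# Without labels, all we can histogram is the jumbled-up pool of blue intensities.
all_blue = np.concatenate(list(blue.values()))
counts, edges = np.histogram(all_blue, bins=20, range=(0.0, 1.0))

# Compare how well each single dimension separates forests from sunsets.
blue_gap = separation(blue["forest"], blue["sunset"])
red_gap = separation(red["forest"], red["sunset"])
print(f"forest vs. sunset separation: blue={blue_gap:.1f}, red={red_gap:.1f}")
```

Under these assumed spreads, the forest and sunset groups sit close together along blue but far apart along red, which is the point made above about looking at more than one dimension. Plotting `counts` against `edges` would give the multi-hump histogram over the unlabeled images.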