1
00:00:00,000 --> 00:00:04,709
[MUSIC]

2
00:00:04,709 --> 00:00:07,437
Let's spend some time
describing mixture models.

3
00:00:07,437 --> 00:00:12,146
And here, as a motivating application,
we're going to use this idea of trying to

4
00:00:12,146 --> 00:00:16,380
discover groups of related images,
so that is cluster images.

5
00:00:16,380 --> 00:00:19,070
And the reason we're going to use this
application is because it's really

6
00:00:19,070 --> 00:00:23,570
visually appealing and
very intuitive because of that structure.

7
00:00:23,570 --> 00:00:28,580
So remember here our goal is to discover,
for example, this group of all

8
00:00:28,580 --> 00:00:33,490
images relating to clouds, and all images
about sunsets, and images about dogs, and

9
00:00:33,490 --> 00:00:39,680
images about pink flowers, and images
about the ocean, and things like this.

10
00:00:39,680 --> 00:00:43,070
But remember,
because we're in an unsupervised setting,

11
00:00:43,070 --> 00:00:46,320
the output of our algorithm is just
going to be these groupings or

12
00:00:46,320 --> 00:00:49,380
cluster assignments, rather than labels.

13
00:00:49,380 --> 00:00:53,020
So we're not actually going to
label the group as clouds, but

14
00:00:53,020 --> 00:00:56,250
you could go through and
do that post facto.

15
00:00:56,250 --> 00:01:00,390
Okay, so to start with, let's
discuss our image representation.

16
00:01:00,390 --> 00:01:04,890
And for the sake of this module, and for
the assignment that you're going to be

17
00:01:04,890 --> 00:01:09,830
working with, we're going to use a really,
really simple image representation,

18
00:01:09,830 --> 00:01:14,070
where we just simply average the RGB,
red, green,

19
00:01:14,070 --> 00:01:19,300
blue values, so the pixel intensities
across every pixel in the image,

20
00:01:19,300 --> 00:01:22,640
to get a single RGB vector per image.

21
00:01:24,220 --> 00:01:30,957
So for example, maybe this image
of clouds has an RGB vector 0.05,

22
00:01:30,957 --> 00:01:35,674
0.7, 0.9, and here we've normalized all

23
00:01:35,674 --> 00:01:39,846
these RGB values to be between zero and
one.

24
00:01:39,846 --> 00:01:45,397
And then maybe this sunset
image has an R value of 0.85,

25
00:01:45,397 --> 00:01:52,282
green value of 0.05, blue value 0.35,
and finally this trees or

26
00:01:52,282 --> 00:01:58,186
forest image has an RGB vector 0.02,
0.95, 0.4.
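The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the assignment's code: the function name `mean_rgb`, the nested-list input format, and the 8-bit pixel range are all assumptions made for the example.

```python
def mean_rgb(pixels):
    """Average normalized R, G, B over every pixel of one image.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples
    with 8-bit values in 0..255 (an assumed input format).
    """
    totals = [0.0, 0.0, 0.0]
    count = 0
    for row in pixels:
        for r, g, b in row:
            totals[0] += r
            totals[1] += g
            totals[2] += b
            count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    # Divide by 255 so each channel average lies between 0 and 1.
    return tuple(t / (count * 255.0) for t in totals)


# A hypothetical 2x2 "sky" image: no red, some green, full blue.
sky = [[(0, 128, 255), (0, 128, 255)],
       [(0, 255, 255), (0, 255, 255)]]
feature = mean_rgb(sky)
print(feature)
```

Each image, whatever its size, is reduced to one three-number vector, which is what makes the later histograms possible.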

27
00:01:58,186 --> 00:02:01,905
Now that we have a quantitative
representation of our images,

28
00:02:01,905 --> 00:02:05,840
we can turn to our data analysis and
our modeling.

29
00:02:05,840 --> 00:02:10,130
So to start with, let's imagine that
we take all of the cloud images.

30
00:02:10,130 --> 00:02:14,240
So for now let's imagine that these
images have labels. They don't, and

31
00:02:14,240 --> 00:02:16,790
that's the point of this clustering task.

32
00:02:16,790 --> 00:02:19,160
But for the sake of building up the model,

33
00:02:19,160 --> 00:02:23,410
let's imagine that we could grab
out just the images of clouds.

34
00:02:24,530 --> 00:02:27,030
And then let's just look at the blue

35
00:02:28,870 --> 00:02:32,720
index in this RGB vector, and histogram

36
00:02:32,720 --> 00:02:38,400
what that blue intensity is across all
these cloud images in our dataset.

37
00:02:38,400 --> 00:02:43,180
So maybe the histogram would look
something like this, where the average

38
00:02:43,180 --> 00:02:48,050
value is pretty high, 0.8, but
there's some spread around that.

39
00:02:48,050 --> 00:02:52,380
And the spread might look
somewhat like a bell curve,

40
00:02:52,380 --> 00:02:58,360
where maybe there are lots of images that
have blue intensities around 0.8 and

41
00:02:58,360 --> 00:03:03,840
very few with extremely high values and
very few with extremely low blue values.

42
00:03:05,320 --> 00:03:08,760
Then we could look at all
the sunset images and

43
00:03:08,760 --> 00:03:10,730
make the same type of histogram.

44
00:03:10,730 --> 00:03:13,470
But now, because we're looking
at these sunset images and

45
00:03:13,470 --> 00:03:18,330
you tend not to get lots of blues in these
images, maybe the average value is much

46
00:03:18,330 --> 00:03:24,060
lower, like 0.3, and
perhaps there's also a lot less spread in

47
00:03:24,060 --> 00:03:28,630
terms of the range of blue values
we see across these sunset images.

48
00:03:29,670 --> 00:03:34,400
And then we can do the same thing for
forest images, but

49
00:03:34,400 --> 00:03:39,070
maybe here the blue intensity is a bit
higher than in sunsets because you can get

50
00:03:39,070 --> 00:03:43,170
parts of the image that are really
about the sky, so maybe it's

51
00:03:43,170 --> 00:03:48,160
a bit higher at 0.42 and a little bit more
spread than we saw for sunset images.

52
00:03:49,300 --> 00:03:55,140
Okay, but remember that we don't
actually have these labels of sunset,

53
00:03:55,140 --> 00:04:00,940
forest, cloud, we just have a whole
bunch of jumbled up images and

54
00:04:00,940 --> 00:04:03,780
for each one we have
its blue intensity and

55
00:04:03,780 --> 00:04:07,930
we can make a histogram over
all images in our dataset.

56
00:04:07,930 --> 00:04:10,910
And maybe the histogram would look
something like this where there

57
00:04:10,910 --> 00:04:13,760
are these multiple humps,
three different humps.

58
00:04:13,760 --> 00:04:18,504
And what that would really correspond to
is the fact that there are three different

59
00:04:18,504 --> 00:04:21,941
categories of images that
we're looking at, the sunset,

60
00:04:21,941 --> 00:04:25,052
forest and
cloud images that we talked about before.
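One way to get a feel for the pooled histogram is to simulate it. The sketch below draws blue intensities from three bell curves and then throws the labels away. The means (0.8, 0.3, 0.42) come from the lecture; the standard deviations, the group sizes, the bin count, and the use of Gaussian draws are assumptions chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# (mean, standard deviation, number of images) per hidden group.
# Means follow the lecture; spreads and counts are made up.
groups = {
    "cloud":  (0.80, 0.08, 300),
    "sunset": (0.30, 0.03, 300),
    "forest": (0.42, 0.05, 300),
}

# Pool the blue intensities and discard the labels, as in clustering.
blue = []
for mean, std, n in groups.values():
    for _ in range(n):
        value = random.gauss(mean, std)
        blue.append(min(1.0, max(0.0, value)))  # keep within [0, 1]


def histogram(values, bins=20):
    """Count values falling into equal-width bins over [0, 1]."""
    counts = [0] * bins
    for v in values:
        index = min(int(v * bins), bins - 1)  # put 1.0 in the last bin
        counts[index] += 1
    return counts


counts = histogram(blue)
for i, c in enumerate(counts):
    # One '#' per five images in the bin.
    print(f"{i / 20:.2f} {'#' * (c // 5)}")
```

With these assumed spreads, the sunset and forest humps sit close together while the cloud hump stands apart, which is the ambiguity the next part of the lecture addresses.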

61
00:04:25,052 --> 00:04:28,689
And if we look at any single image,
for example,

62
00:04:28,689 --> 00:04:33,572
maybe this forest image shown here,
where I've placed the image

63
00:04:33,572 --> 00:04:38,569
with what its blue intensity might be, so
something around 0.4.

64
00:04:38,569 --> 00:04:43,871
Well maybe from this histogram here
we could say, well there's some group

65
00:04:43,871 --> 00:04:49,010
of images with high blue intensity and
I'm going to call that one cluster,

66
00:04:49,010 --> 00:04:53,334
one group, and this forest image
is clearly not in that group,

67
00:04:53,334 --> 00:04:57,420
but now I'm in this position
where I don't really know.

68
00:04:57,420 --> 00:05:02,280
There seem to be maybe two other groups
and I don't know which one this goes into.

69
00:05:03,440 --> 00:05:06,590
Well in this case one thing we can
actually do is we can look at another

70
00:05:06,590 --> 00:05:08,430
dimension of our observation vector.

71
00:05:08,430 --> 00:05:13,050
We can look at the red
intensity in the set of images.

72
00:05:13,050 --> 00:05:16,890
And if we look at the histogram

73
00:05:16,890 --> 00:05:20,844
of red intensities over all
forest images in our dataset,

74
00:05:20,844 --> 00:05:26,780
well this intensity is probably
pretty low in most of these images.

75
00:05:26,780 --> 00:05:30,080
Not many reds appear in images of forests.

76
00:05:30,080 --> 00:05:34,220
But if we look at the sunset images,
then the red

77
00:05:34,220 --> 00:05:39,540
intensities are going to be really high,
maybe something centered around 0.9.

78
00:05:39,540 --> 00:05:42,330
So whereas when we looked
at the blue dimension,

79
00:05:42,330 --> 00:05:46,900
it was pretty hard to distinguish
between forests and sunsets,

80
00:05:46,900 --> 00:05:49,032
when we look at the red dimension,

81
00:05:49,032 --> 00:05:54,270
maybe along that dimension
things are much more separable.

82
00:05:54,270 --> 00:05:58,677
So the point I want to make here is that
actually sometimes when we're thinking

83
00:05:58,677 --> 00:06:02,955
about doing clustering, and thinking
about probabilities of assignments

84
00:06:02,955 --> 00:06:07,029
of observations to clusters,
the thing that can allow us to distinguish

85
00:06:07,029 --> 00:06:11,439
between the different clusters really
appears when we're looking at a higher

86
00:06:11,439 --> 00:06:14,555
dimensional space than just
one dimension like blue.
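A small numeric check makes the point concrete. It uses the two example RGB vectors quoted earlier in the lecture for the sunset and forest images and compares how far apart they are along blue alone, along red alone, and in the red-blue plane. These are single example images rather than cluster averages, and real separability also depends on the spread within each group, so treat this as a rough sketch.

```python
import math

# Average (R, G, B) vectors quoted in the lecture for two example images.
sunset = (0.85, 0.05, 0.35)
forest = (0.02, 0.95, 0.40)

RED, BLUE = 0, 2  # positions of the red and blue entries

blue_gap = abs(sunset[BLUE] - forest[BLUE])
red_gap = abs(sunset[RED] - forest[RED])
# Euclidean distance when red and blue are considered together.
red_blue_distance = math.hypot(red_gap, blue_gap)

print(f"blue only: {blue_gap:.2f}")
print(f"red only: {red_gap:.2f}")
print(f"red and blue together: {red_blue_distance:.2f}")
```

The gap along blue is small, while the gap along red is large, so adding the red dimension is what pulls the two images apart.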

87
00:06:14,555 --> 00:06:18,929
Okay, but to this point really all
we've talked about is our data,

88
00:06:18,929 --> 00:06:23,845
thinking about histogramming these values,
these intensity values, and

89
00:06:23,845 --> 00:06:28,067
this idea that there might be some
structure there. But what we want

90
00:06:28,067 --> 00:06:32,596
to turn to is a model to actually
capture this clustering structure and

91
00:06:32,596 --> 00:06:36,623
to do these types of soft
assignments that we've described.

92
00:06:36,623 --> 00:06:37,123
[MUSIC]