The algorithm we're going to use to infer the cluster parameters as well as soft assignments is something called expectation maximization, or EM. The idea is we start with a set of unlabeled, observed inputs, such as the data shown here, and the goal is to output a set of soft assignments per data point, meaning an allocation of that data point to each one of the clusters. In this example, there are three different clusters: a fuchsia cluster, a green cluster, and a blue cluster, and the shading of the individual data points indicates the uncertainty of our assignment of that data point to each of these different clusters. A key question here is how we're going to output these soft assignments just from this set of unlabeled data points.

Well, to begin with, let's just assume we actually know the cluster parameters, and we just want to compute the soft assignments having fixed the values of these cluster parameters. The soft assignments are quantified by something called the responsibility vector. For each observation $i$, we form a responsibility vector with elements $r_{i1}, r_{i2}, \ldots, r_{iK}$, where capital $K$ is the total number of clusters in our mixture model. Each element $r_{ik}$ represents the responsibility that cluster $k$ takes for observation $i$. In particular, it's the probability that observation $i$ is assigned to cluster $k$ (that's this term here), given (remember, this bar here means given) the set of cluster weights and shapes. So that's here; remember, the $\pi$'s are the weights, and the $\mu$'s and the $\Sigma$'s specify the shapes of each of the clusters. And we're going to condition on the observed value of the $i$th data point, so that's $x_i$ here.

Just to make sure this notation is very clear: when we write this probability, the first thing is the random variable that the distribution is over, and then on the right-hand side of this given sign, this bar, is a set of fixed values that define the probability distribution.

But before we get to specifying this probability in equations, let's gain some intuition by looking at a set of pictures.
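To pin down the notation, here is a minimal sketch assuming the standard Gaussian mixture setup (the symbol $z_i$ for the hidden cluster assignment is my shorthand, not necessarily the slide's):

$$r_{ik} = p(z_i = k \mid \pi, \mu, \Sigma, x_i),$$

where $z_i$ is the unobserved cluster assignment of observation $i$, $\pi = (\pi_1, \ldots, \pi_K)$ are the cluster weights, and $\mu_k, \Sigma_k$ give the shape of cluster $k$.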
In this example, we're going to assume that there are just two clusters, this green cluster and blue cluster, with spherically symmetric shapes, and we're going to assume that there are basically equal weightings between the two clusters, which we see from the fact that they have very similar numbers of data points, represented by these little stars. Then we're going to home in on a single observation, which is outlined with this pink color, and we're going to look at the soft assignment of this observation to these two clusters. What we see in this case is that the highlighted observation is closer to the center of the green cluster than it is to the blue cluster. As a result, the green cluster is going to take more responsibility for this observation than the blue cluster, though the two clusters still split responsibility for this observation.

In contrast, in this situation here, notice that we've shifted our highlighted data point towards the blue cluster. Now this key observation is closer to the center of the blue cluster, and the blue cluster, as a result, is going to take more responsibility for this data point. Finally, if the key observation were somewhere in between the two clusters, then responsibility would be basically split between the green and blue clusters, representing a lot of uncertainty in the cluster membership for that data point.

But what if one cluster weighed much more heavily in the mixture model than the other cluster? For example, what if this green cluster had a lot more mass than the blue one? We see this now by the green cluster having lots and lots of data points and the blue cluster having just a few. In this case, if we look at a highlighted observation that sits on the boundary of the green and blue clusters, we're still, of course, uncertain about the cluster membership, but the green cluster starts to seem like a much more likely explanation for this observed point. So what happens is, because of this imbalance in the proportion of the clusters in our mixture model, the green cluster starts to take more responsibility for that point than the blue cluster, whereas previously responsibility was exactly split between the two clusters.
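To attach rough numbers to this intuition (the 90/10 weighting below is an illustrative assumption, not a value from the lecture): for a point exactly between the two clusters, the likelihood $L$ is the same under each, so with $\pi_{\text{green}} = 0.9$ and $\pi_{\text{blue}} = 0.1$,

$$r_{\text{green}} = \frac{0.9\,L}{0.9\,L + 0.1\,L} = 0.9,$$

instead of the 50/50 split we'd get with equal weights.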
In equations, we're saying that we need to weigh this initial, or prior, probability that a given observation comes from any one of our clusters with how likely the observed value is under each of these clusters. That's the second term here. For example (let me switch colors here; I'll choose a red color), if we look at a data point here, maybe we would say that this data point is very unlikely under the green cluster, even though our prior probability on the green cluster is much higher. So we really have to consider both of these two terms together. Then, in order to make this a valid probability, we need to normalize over all possible cluster assignments.

These equations quantify the discussion we had in the last section, where we talked about how to weigh the prior term with the likelihood term. The prior term represented, for example, the prior probability that out of our set of images we're going to grab out an image and that image is going to be a cloud image. The likelihood term represented, for a given image, based on the values we're observing of that image, how likely those values are under a given cluster assignment. As we see now, it is indeed the combination of these two terms that determines our soft assignments.

So, in summary, if we know the cluster parameters, then computing these soft assignments is really trivial. All we have to do is, for each one of our possible clusters, compute the prior times the likelihood to form our responsibility, and then just normalize this vector so that it sums to one over all the possible clusters. But the story isn't over, because we don't actually know our cluster parameters. That's something we also have to somehow infer just from our unlabeled data.
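Putting the summary into one formula (the standard responsibility computation for Gaussian mixtures, consistent with the prior-times-likelihood-then-normalize recipe above):

$$r_{ik} = \frac{\pi_k \, N(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, N(x_i \mid \mu_j, \Sigma_j)}.$$

And as a minimal code sketch of the same computation, assuming the cluster parameters are known and using NumPy/SciPy (all parameter values below are illustrative, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def compute_responsibilities(X, weights, means, covariances):
    """Return an (N, K) matrix whose entry [i, k] is r_ik, the
    responsibility cluster k takes for observation i."""
    N, K = X.shape[0], len(weights)
    resp = np.zeros((N, K))
    for k in range(K):
        # prior (cluster weight) times likelihood under cluster k
        resp[:, k] = weights[k] * multivariate_normal.pdf(
            X, mean=means[k], cov=covariances[k])
    # normalize each row so responsibilities sum to one over clusters
    resp /= resp.sum(axis=1, keepdims=True)
    return resp

# Two spherically symmetric 2-D clusters with equal weights (illustrative)
weights = np.array([0.5, 0.5])
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covariances = [np.eye(2), np.eye(2)]

X = np.array([[1.0, 0.0],   # closer to cluster 0's center
              [2.0, 0.0]])  # exactly halfway between the two centers
print(compute_responsibilities(X, weights, means, covariances))
```

On these two example points, cluster 0 takes most of the responsibility for the first point, which sits closer to its center, while the halfway point gets a roughly 50/50 split, matching the pictures above.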