The algorithm we're going to use to infer the cluster parameters as well as soft assignments is something called expectation maximization, or EM. The idea is we start with a set of unlabeled, observed inputs, such as the data shown here, and the goal is to output a set of soft assignments per data point, meaning an allocation of that data point to each one of the clusters. In this example, there are three different clusters: a fuchsia cluster, a green cluster, and a blue cluster, and the shading of the individual data points indicates the uncertainty of our assignment of that data point to each of these different clusters. A key question here is how we're going to output these soft assignments just from this set of unlabeled data points.

Well, to begin with, let's just assume we actually know the cluster parameters, and we just want to compute the soft assignments having fixed the values of these cluster parameters. The soft assignments are quantified by something called the responsibility vector. For each observation $i$, we form a responsibility vector with elements $r_{i1}, r_{i2}, \ldots, r_{iK}$, where capital $K$ is the total number of clusters in our mixture model. Each element $r_{ik}$ represents the responsibility that cluster $k$ takes for observation $i$. In particular, it's the probability that observation $i$ is assigned to cluster $k$ (that's this term here), given (remember, this bar here means given) the set of cluster weights and shapes. So that's here; remember, the $\pi$'s are the weights, and the $\mu$'s and the $\Sigma$'s specify the shapes of each of the clusters. And we're going to condition on the observed value of the $i$th data point, so that's $x_i$ here.

Just to make sure this notation is very clear: when we write this probability, the first thing is the random variable that the distribution is over, and then on the right-hand side of this given sign, this bar, is a set of fixed values that define the probability distribution.

But before we get to specifying this probability in equations, let's gain some intuition by looking at a set of pictures.
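To pin down the notation, here is a minimal sketch assuming the standard Gaussian mixture setup (the symbol $z_i$ for the hidden cluster assignment is my shorthand, not necessarily the slide's):

$$r_{ik} = p(z_i = k \mid \pi, \mu, \Sigma, x_i),$$

where $z_i$ is the unobserved cluster assignment of observation $i$, $\pi = (\pi_1, \ldots, \pi_K)$ are the cluster weights, and $\mu_k, \Sigma_k$ give the shape of cluster $k$.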
In this example, we're going to assume that there are just two clusters, this green cluster and blue cluster, with spherically symmetric shapes, and we're going to assume that there are basically equal weightings between the two clusters, which we see from the fact that they have very similar numbers of data points, represented by these little stars. Then we're going to home in on a single observation, which is outlined with this pink color, and we're going to look at the soft assignment of this observation to these two clusters. What we see in this case is that the highlighted observation is closer to the center of the green cluster than it is to the blue cluster. As a result, the green cluster is going to take more responsibility for this observation than the blue cluster, though the two clusters still split responsibility for this observation.

In contrast, in this situation here, notice that we've shifted our highlighted data point towards the blue cluster. Now this key observation is closer to the center of the blue cluster, and the blue cluster, as a result, is going to take more responsibility for this data point. Finally, if the key observation were somewhere in between the two clusters, then responsibility would be basically split between the green and blue clusters, representing a lot of uncertainty in the cluster membership for that data point.

But what if one cluster weighed much more heavily in the mixture model than the other cluster? For example, what if this green cluster had a lot more mass than the blue one? We see this now by the green cluster having lots and lots of data points and the blue cluster having just a few. In this case, if we look at a highlighted observation that sits on the boundary of the green and blue clusters, we're still, of course, uncertain about the cluster membership, but the green cluster starts to seem like a much more likely explanation for this observed point. So what happens is, because of this imbalance in the proportion of the clusters in our mixture model, the green cluster starts to take more responsibility for that point than the blue cluster, whereas previously responsibility was exactly split between the two clusters.
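To attach rough numbers to this intuition (the 90/10 weighting below is an illustrative assumption, not a value from the lecture): for a point exactly between the two clusters, the likelihood $L$ is the same under each, so with $\pi_{\text{green}} = 0.9$ and $\pi_{\text{blue}} = 0.1$,

$$r_{\text{green}} = \frac{0.9\,L}{0.9\,L + 0.1\,L} = 0.9,$$

instead of the 50/50 split we'd get with equal weights.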
In equations, we're saying that we need to weigh this initial, or prior, probability that a given observation comes from any one of our clusters with how likely the observed value is under each of these clusters. That's the second term here. For example (let me switch colors here; I'll choose a red color), if we look at a data point here, maybe we would say that this data point is very unlikely under the green cluster, even though our prior probability on the green cluster is much higher. So we really have to consider both of these two terms together. Then, in order to make this a valid probability, we need to normalize over all possible cluster assignments.

These equations quantify the discussion we had in the last section, where we talked about how to weigh the prior term with the likelihood term. The prior term represented, for example, the prior probability that out of our set of images we're going to grab out an image and that image is going to be a cloud image. The likelihood term represented, for a given image, based on the values we're observing of that image, how likely those values are under a given cluster assignment. As we see now, it is indeed the combination of these two terms that determines our soft assignments.

So, in summary, if we know the cluster parameters, then computing these soft assignments is really trivial. All we have to do is, for each one of our possible clusters, compute the prior times the likelihood to form our responsibility, and then just normalize this vector so that it sums to one over all the possible clusters. But the story isn't over, because we don't actually know our cluster parameters. That's something we also have to somehow infer just from our unlabeled data.
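Putting the summary into one formula (the standard responsibility computation for Gaussian mixtures, consistent with the prior-times-likelihood-then-normalize recipe above):

$$r_{ik} = \frac{\pi_k \, N(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, N(x_i \mid \mu_j, \Sigma_j)}.$$

And as a minimal code sketch of the same computation, assuming the cluster parameters are known and using NumPy/SciPy (all parameter values below are illustrative, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def compute_responsibilities(X, weights, means, covariances):
    """Return an (N, K) matrix whose entry [i, k] is r_ik, the
    responsibility cluster k takes for observation i."""
    N, K = X.shape[0], len(weights)
    resp = np.zeros((N, K))
    for k in range(K):
        # prior (cluster weight) times likelihood under cluster k
        resp[:, k] = weights[k] * multivariate_normal.pdf(
            X, mean=means[k], cov=covariances[k])
    # normalize each row so responsibilities sum to one over clusters
    resp /= resp.sum(axis=1, keepdims=True)
    return resp

# Two spherically symmetric 2-D clusters with equal weights (illustrative)
weights = np.array([0.5, 0.5])
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covariances = [np.eye(2), np.eye(2)]

X = np.array([[1.0, 0.0],   # closer to cluster 0's center
              [2.0, 0.0]])  # exactly halfway between the two centers
print(compute_responsibilities(X, weights, means, covariances))
```

On these two example points, cluster 0 takes most of the responsibility for the first point, which sits closer to its center, while the halfway point gets a roughly 50/50 split, matching the pictures above.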