1 00:00:00,000 --> 00:00:04,065 [MUSIC] 2 00:00:04,065 --> 00:00:08,000 So this leads directly into the discussion of clustering as 3 00:00:08,000 --> 00:00:09,840 an unsupervised task. 4 00:00:09,840 --> 00:00:12,610 So let's talk a little bit about what this means. 5 00:00:12,610 --> 00:00:16,760 Because so far in this specialization we've assumed that we've been presented 6 00:00:16,760 --> 00:00:20,350 with a set of outputs associated with our inputs. 7 00:00:20,350 --> 00:00:24,319 So for example, in the regression course, when we were talking about predicting 8 00:00:24,319 --> 00:00:26,455 house prices, in our training data set, 9 00:00:26,455 --> 00:00:30,482 we assumed that not only did we have the features of the house, which were the input, 10 00:00:30,482 --> 00:00:34,356 but we also had the price, the sales price of the house, which was the output. 11 00:00:34,356 --> 00:00:38,721 In the classification course, we assumed that we had restaurant reviews: 12 00:00:38,721 --> 00:00:42,464 we had the text of those restaurant reviews, which was our input, and 13 00:00:42,464 --> 00:00:46,828 we also had the output, whether it was a positive or a negative review, for 14 00:00:46,828 --> 00:00:48,797 the reviews in our training set. 15 00:00:51,066 --> 00:00:54,980 So, if we were to imagine that we had labels in this application, 16 00:00:54,980 --> 00:00:59,610 maybe we would have labels of articles in terms of topics. 17 00:00:59,610 --> 00:01:04,410 So for these articles, the input again would be the text of the article, 18 00:01:04,410 --> 00:01:09,960 just like with the reviews, but the output would be a topic like sports or 19 00:01:09,960 --> 00:01:13,412 world news or entertainment or science. 20 00:01:13,412 --> 00:01:17,260 And so in this case, we can think of the problem as a multiclass 21 00:01:17,260 --> 00:01:20,450 classification problem. Imagine you get some new article and 22 00:01:20,450 --> 00:01:25,220 you want to label what topic that article is about.
23 00:01:25,220 --> 00:01:31,000 Well, then you just go to your training set, 24 00:01:31,000 --> 00:01:34,620 and based on the classifier that you've learned from that training set, where 25 00:01:34,620 --> 00:01:40,430 labels were provided, you can then think about labeling this document as one of 26 00:01:40,430 --> 00:01:42,450 these five different topics presented here. 27 00:01:44,570 --> 00:01:49,640 So this is an example of a supervised learning problem, because outputs 28 00:01:49,640 --> 00:01:54,891 are provided, or these class labels are provided, on our training examples. 29 00:01:54,891 --> 00:01:58,616 But when we're thinking about doing clustering, we're assuming there 30 00:01:58,616 --> 00:02:03,100 are no labels provided, or that the labels are really, really unreliable. 31 00:02:03,100 --> 00:02:07,641 And the goal here is to uncover this grouping structure 32 00:02:07,641 --> 00:02:11,793 directly from the data, from the input values alone. 33 00:02:11,793 --> 00:02:16,114 So in particular, here we might imagine that we have 34 00:02:16,114 --> 00:02:20,250 documents with just two words in the vocabulary. 35 00:02:20,250 --> 00:02:25,488 So let's say here's the word 1 36 00:02:25,488 --> 00:02:30,726 count, and here's the word 2 37 00:02:30,726 --> 00:02:36,740 count, and each one of these points 38 00:02:36,740 --> 00:02:42,570 is a given document, document Xi. 39 00:02:46,250 --> 00:02:48,900 And so in particular, our input 40 00:02:48,900 --> 00:02:53,536 to the algorithm is going to be documents represented as vectors, Xi. 41 00:02:53,536 --> 00:02:59,210 But then the output is going to be cluster labels. 42 00:03:01,050 --> 00:03:06,075 So in particular, maybe we'll label this red cluster 43 00:03:06,075 --> 00:03:10,989 as cluster 1, this green cluster as cluster 2, and 44 00:03:10,989 --> 00:03:16,579 this blue cluster as cluster 3, then associate it with Xi.
45 00:03:16,579 --> 00:03:23,612 The output might be zi = 1, meaning that Xi is associated with cluster 1, 46 00:03:23,612 --> 00:03:31,300 and we are going to output this for every observation in our data set. 47 00:03:31,300 --> 00:03:37,154 Just to be clear, maybe this red cluster corresponds to articles about world news, 48 00:03:37,154 --> 00:03:41,606 this green cluster corresponds to articles about science, and 49 00:03:41,606 --> 00:03:46,240 this blue cluster corresponds to articles about entertainment. 50 00:03:47,600 --> 00:03:50,230 So we're not provided with those labels, but 51 00:03:50,230 --> 00:03:55,830 we're going to group articles, color them, as we're doing in this picture, 52 00:03:55,830 --> 00:04:00,320 based on similarities in the observed input space. 53 00:04:00,320 --> 00:04:05,710 So in this simple 2D example, this would be similarities between articles, 54 00:04:05,710 --> 00:04:09,000 based on a simple two-word vocabulary and 55 00:04:09,000 --> 00:04:12,910 counting how many times those words appear in each of these articles. 56 00:04:15,030 --> 00:04:19,080 Okay, well, this is an example of an unsupervised learning task because 57 00:04:19,080 --> 00:04:23,870 all we're presenting our algorithm with are the inputs, and 58 00:04:23,870 --> 00:04:27,760 the algorithm is supposed to produce a set of labels as the output. 59 00:04:29,400 --> 00:04:31,520 Well, how can we think about learning such labels? 60 00:04:32,810 --> 00:04:37,020 Well, a key component of that question is thinking about what defines a cluster. 61 00:04:38,420 --> 00:04:41,085 And once we've defined what a cluster means, 62 00:04:41,085 --> 00:04:45,064 then we need to define algorithms for inferring these clusterings. 63 00:04:45,064 --> 00:04:50,465 Okay, so a cluster is defined by its center, 64 00:04:50,465 --> 00:04:54,483 which is often called a centroid, 65 00:04:54,483 --> 00:04:58,650 as well as the shape of the cluster.
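As a rough illustration of the input representation described above, each document can be turned into a vector of word counts over a two-word vocabulary. This is only a sketch: the vocabulary words, the example sentences, and the name `count_vector` are hypothetical, since the lecture does not say which two words are used.

```python
from collections import Counter

# Hypothetical two-word vocabulary; the lecture does not name the actual words.
VOCAB = ("game", "election")


def count_vector(text, vocab=VOCAB):
    """Return how many times each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


# Made-up documents, used only to show the shape of the input Xi.
docs = [
    "the game went to overtime and the game was close",
    "election results and election turnout after the game",
]
vectors = [count_vector(d) for d in docs]
print(vectors)  # [[2, 0], [1, 2]]
```

Each inner list is one point in the 2D picture: the first entry is the word 1 count and the second is the word 2 count.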
66 00:04:58,650 --> 00:05:04,440 And in this application here, the clusters are defined by ellipses. 67 00:05:04,440 --> 00:05:08,972 So here, this red circle is an example of an ellipse, and 68 00:05:08,972 --> 00:05:16,810 this green cluster has this elongated ellipse that's at one angle, and 69 00:05:16,810 --> 00:05:20,980 the blue cluster has another smaller ellipse at a different rotation. 70 00:05:22,270 --> 00:05:26,700 So this is the family of shapes that we're exploring: ellipses, 71 00:05:26,700 --> 00:05:31,940 with different rotations and stretchings of these ellipses, and 72 00:05:31,940 --> 00:05:34,590 that's what we're using to define a cluster. 73 00:05:36,420 --> 00:05:40,032 And then, when we have a given observation, again, 74 00:05:40,032 --> 00:05:45,130 observation Xi, which is a specific document, 75 00:05:45,130 --> 00:05:50,020 the way we think about assigning that document to a given cluster is based on 76 00:05:50,020 --> 00:05:53,770 something we'll call the score under each cluster. 77 00:05:53,770 --> 00:05:59,200 So often we just think about defining this score as the distance to the cluster 78 00:05:59,200 --> 00:06:04,460 center, so we can think about, let me choose another color just for 79 00:06:04,460 --> 00:06:08,660 fun to make this stand out a little bit more on this slide. 80 00:06:09,790 --> 00:06:15,098 So the distances to the cluster centers are these 81 00:06:15,098 --> 00:06:20,570 distances here, and 82 00:06:20,570 --> 00:06:26,910 what we see is that the smallest distance is the distance to this red cluster. 83 00:06:26,910 --> 00:06:32,637 So if the distance to the cluster center alone is what defines the score, 84 00:06:32,637 --> 00:06:37,961 then this document Xi would be assigned to this red cluster. 85 00:06:37,961 --> 00:06:42,329 [MUSIC]
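The assignment rule described above, where the score is just the distance to each cluster center, can be sketched as follows. The center coordinates, the point `x_i`, and the name `assign_cluster` are made up for illustration, and Euclidean distance is an assumption, since the lecture does not say which distance is used.

```python
import math

# Hypothetical cluster centers in (word 1 count, word 2 count) space.
# Labels follow the lecture's picture: 1 = red, 2 = green, 3 = blue.
centers = {1: (2.0, 8.0), 2: (7.0, 6.0), 3: (6.0, 1.0)}


def assign_cluster(x, centers):
    """Return the label of the center with the smallest Euclidean distance to x."""
    return min(centers, key=lambda label: math.dist(x, centers[label]))


x_i = (3.0, 7.0)  # a made-up document vector Xi
z_i = assign_cluster(x_i, centers)
print(z_i)  # 1
```

This sketch uses the centers alone and ignores the elliptical shape of each cluster, which matches the "distance to the cluster center alone" case in the lecture.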