1 00:00:00,164 --> 00:00:05,007 [MUSIC] 2 00:00:05,007 --> 00:00:09,302 So far we've been focusing on a document retrieval task where somebody is reading 3 00:00:09,302 --> 00:00:13,410 an article and we just want to search over all other articles in the corpus to find 4 00:00:13,410 --> 00:00:17,787 one that's most similar or very similar to the article that the person is reading. 5 00:00:17,787 --> 00:00:21,691 But in a lot of cases, we're actually interested in discovering a structured 6 00:00:21,691 --> 00:00:24,720 representation of our data, of all the articles out there. 7 00:00:25,790 --> 00:00:28,690 So, one example of a structured representation we might be 8 00:00:28,690 --> 00:00:32,040 interested in is discovering a clustering of articles, so 9 00:00:32,040 --> 00:00:35,480 discovering groups of articles that are related to one another. 10 00:00:35,480 --> 00:00:38,330 And we're going to discuss both reasons why we would want to 11 00:00:38,330 --> 00:00:42,650 perform such clustering as well as algorithms for performing this clustering. 12 00:00:43,980 --> 00:00:46,240 So let's dig into the objective of clustering, 13 00:00:46,240 --> 00:00:48,000 as well as some motivating applications for 14 00:00:48,000 --> 00:00:52,060 performing clustering within the context of our document application. 15 00:00:52,060 --> 00:00:55,790 So the goal of clustering is to discover groups of related articles. 16 00:00:55,790 --> 00:00:58,730 So, for example, maybe there's a group where all 17 00:00:58,730 --> 00:01:03,190 the articles within this group relate to subjects related to sports. 18 00:01:03,190 --> 00:01:07,960 And maybe there's another group where all the articles are about world news. 19 00:01:07,960 --> 00:01:12,170 If we can discover this type of structure in our data, then if we're doing something 20 00:01:12,170 --> 00:01:15,480 like the document retrieval task that we talked about before. 
21 00:01:15,480 --> 00:01:18,110 Then if a person is reading an article, we can look at 22 00:01:18,110 --> 00:01:20,270 what group that article falls in. 23 00:01:20,270 --> 00:01:22,410 And then when we go to retrieve another article for 24 00:01:22,410 --> 00:01:26,930 that reader, then we can simply search over the articles within that given group. 25 00:01:26,930 --> 00:01:28,950 So if a person's reading a sports article, 26 00:01:28,950 --> 00:01:33,030 maybe we can recommend another sports article just by searching for 27 00:01:33,030 --> 00:01:37,170 the nearest neighbor within this group of sports articles. 28 00:01:37,170 --> 00:01:40,330 But we can actually use clustering to do even fancier things. 29 00:01:40,330 --> 00:01:44,700 So, as an example, maybe we're interested in learning the preferences of a user 30 00:01:44,700 --> 00:01:46,980 over a set of different topics. 31 00:01:46,980 --> 00:01:52,800 In this case, we're assuming that the person isn't just interested in sports. 32 00:01:52,800 --> 00:01:54,850 They might have other interests as well. 33 00:01:54,850 --> 00:01:58,040 And when we're going to present an article to that user, 34 00:01:58,040 --> 00:02:01,300 maybe we'd want to explore some of these other preferences. 35 00:02:01,300 --> 00:02:05,460 So in this case, if we imagine that we have a clustering of our articles into 36 00:02:05,460 --> 00:02:07,650 these groups of related articles, 37 00:02:07,650 --> 00:02:12,780 then a user is going to read some subset of the articles in the corpus, 38 00:02:12,780 --> 00:02:16,110 and so, these are the articles that are not grayed out here. 39 00:02:16,110 --> 00:02:18,550 So these are all the articles that the user read, and 40 00:02:18,550 --> 00:02:21,690 then we can imagine that the user gives us some feedback about the articles, 41 00:02:21,690 --> 00:02:23,170 whether they like the article or not. 
42 00:02:23,170 --> 00:02:26,110 So, plus sign is going to say, yep, they like the article, 43 00:02:26,110 --> 00:02:29,120 minus sign is going to be that they did not like the article. 44 00:02:29,120 --> 00:02:33,920 So this user liked those two articles in Cluster 1, 45 00:02:33,920 --> 00:02:38,470 that green cluster, as well as this article in Cluster 4, the orange cluster, 46 00:02:38,470 --> 00:02:42,582 didn’t like the article that they read in this blue cluster, Cluster 2. 47 00:02:42,582 --> 00:02:46,225 Did not like this other article read in Cluster 4, 48 00:02:46,225 --> 00:02:51,160 liked this article in this blue cluster, Cluster 2, and so on. 49 00:02:52,550 --> 00:02:56,680 And after we get this feedback or as we're getting it over time, 50 00:02:56,680 --> 00:03:03,050 we can use this to learn this type of preference factor over topics. 51 00:03:03,050 --> 00:03:06,230 One thing that we're going to discuss is that we don't actually have topic 52 00:03:06,230 --> 00:03:09,540 labels like sports, world news and so on. 53 00:03:09,540 --> 00:03:11,970 But we do know that there are groups of articles and 54 00:03:11,970 --> 00:03:16,750 we can use it to select articles from these groups. 55 00:03:16,750 --> 00:03:20,230 Or we can post facto go in and dig into these groups and 56 00:03:20,230 --> 00:03:22,299 put labels on them ourselves. 57 00:03:22,299 --> 00:03:27,009 [MUSIC]