1 00:00:00,000 --> 00:00:04,598 [MUSIC] 2 00:00:04,598 --> 00:00:08,710 Welcome to the Clustering and Retrieval course, part of the Machine Learning Specialization. 3 00:00:08,710 --> 00:00:12,680 And as we're going to see, clustering and retrieval are both widely used and 4 00:00:12,680 --> 00:00:14,330 broadly applicable tools. 5 00:00:14,330 --> 00:00:15,440 These tools can be used for 6 00:00:15,440 --> 00:00:19,920 anything from finding similar products to the one that you're currently browsing 7 00:00:19,920 --> 00:00:24,100 to discovering groups of patients with related medical conditions. 8 00:00:24,100 --> 00:00:25,540 And in the Foundations course, 9 00:00:25,540 --> 00:00:29,630 we provided a high-level introduction to the concepts of clustering and retrieval. 10 00:00:29,630 --> 00:00:32,600 But now in this course, we're going to take a deep dive into the models and 11 00:00:32,600 --> 00:00:34,280 algorithms. 12 00:00:34,280 --> 00:00:37,420 This course is part of a specialization that was designed to be taken 13 00:00:37,420 --> 00:00:38,910 in a specific sequence. 14 00:00:38,910 --> 00:00:41,890 So, although you can take this course as a standalone course, 15 00:00:41,890 --> 00:00:43,790 to get the experience that we intended, 16 00:00:43,790 --> 00:00:47,370 we strongly suggest that you take the entire sequence. 17 00:00:47,370 --> 00:00:51,570 In particular, in the Foundations course, we provided a high-level overview of 18 00:00:51,570 --> 00:00:54,899 the concepts that we're going to cover throughout the entire specialization. 19 00:00:56,020 --> 00:01:01,860 Then we went into much more detailed courses on regression and classification. 20 00:01:01,860 --> 00:01:05,298 And in those courses, we learned general machine learning concepts that 21 00:01:05,298 --> 00:01:08,472 are going to appear again in this course on clustering and retrieval. 
22 00:01:08,472 --> 00:01:12,307 And we'll go through a specific list of the concepts that we expect you to know 23 00:01:12,307 --> 00:01:15,520 later on in this introduction to the course. 24 00:01:15,520 --> 00:01:19,170 But first, let's provide an overview of what this course is about. 25 00:01:19,170 --> 00:01:22,190 And remember that in the Machine Learning Foundations course, 26 00:01:22,190 --> 00:01:26,800 we said that machine learning was about extracting intelligence from data. 27 00:01:26,800 --> 00:01:30,040 Well, in the first part of this course, the machine learning method that 28 00:01:30,040 --> 00:01:35,080 we're going to be going through is nearest neighbor search for retrieval tasks. 29 00:01:35,080 --> 00:01:37,160 So in particular here, 30 00:01:37,160 --> 00:01:41,800 the inputs are going to be the features of a specific query point, 31 00:01:41,800 --> 00:01:45,415 as well as the features of all other datapoints in a specific dataset. 32 00:01:46,750 --> 00:01:50,680 And then the output is going to be the nearest neighbor to the query point, 33 00:01:50,680 --> 00:01:57,330 being the point within this entire dataset that's most similar to this query point. 34 00:01:57,330 --> 00:01:58,900 So, what's needed for 35 00:01:58,900 --> 00:02:02,540 our nearest neighbor search is to search over all these other datapoints 36 00:02:02,540 --> 00:02:05,980 and compute the similarity to these datapoints, or 37 00:02:05,980 --> 00:02:10,220 their distances, and then return the one that's closest. 38 00:02:10,220 --> 00:02:12,970 So as a running example that we're going to use throughout this course, 39 00:02:12,970 --> 00:02:16,290 let's imagine that we had a whole bunch of documents and 40 00:02:16,290 --> 00:02:18,650 somebody's reading a specific document. 41 00:02:18,650 --> 00:02:20,950 So this is the query article, in gray here, and 42 00:02:20,950 --> 00:02:23,321 let's say that the user likes this article. 
43 00:02:23,321 --> 00:02:28,010 And we want to retrieve another article that's similar that they might also like. 44 00:02:28,010 --> 00:02:31,260 So this would be the task of finding the nearest neighbor. 45 00:02:31,260 --> 00:02:34,520 But maybe we also want to retrieve a set of nearest neighbors, so 46 00:02:34,520 --> 00:02:37,370 a set of similar documents to display to this user. 47 00:02:39,038 --> 00:02:42,940 And we're going to be performing this search over all these other documents to 48 00:02:42,940 --> 00:02:47,030 retrieve these nearest neighbors just based on the text of the document. 49 00:02:47,030 --> 00:02:49,295 So just based on the features of our datapoints. 50 00:02:50,360 --> 00:02:53,590 And we see retrieval applications almost everywhere. 51 00:02:53,590 --> 00:02:58,180 So perhaps we have some image and want to find a set of similar images. 52 00:02:58,180 --> 00:03:02,090 Or maybe we're shopping for some product and we want to be provided with a set of 53 00:03:02,090 --> 00:03:04,788 other similar items that we might want to consider instead. 54 00:03:04,788 --> 00:03:08,650 Or, like we've discussed, 55 00:03:08,650 --> 00:03:13,960 maybe we're reading some article and want to read a related article on a topic. 56 00:03:13,960 --> 00:03:18,570 And this kind of idea appears in a lot of streaming contexts where you might be 57 00:03:18,570 --> 00:03:21,738 listening to a song, watching a movie, or watching a TV show, 58 00:03:21,738 --> 00:03:25,200 and you want to be provided with a set of other songs or movies or 59 00:03:25,200 --> 00:03:27,470 TV shows that you might also be interested in. 60 00:03:29,440 --> 00:03:32,420 Or perhaps you're a user on some social network and, 61 00:03:32,420 --> 00:03:35,040 based on features of you as a user, 62 00:03:35,040 --> 00:03:38,040 you want to be presented with other people that you might want to connect with. 
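The nearest neighbor search described above (compute the distance from the query point to every other datapoint, then return the closest one, or the k closest) can be sketched roughly as follows. This is only a minimal brute-force illustration: the Euclidean distance, the function names, and the tiny word-count vectors are all assumptions made for the example, not something specified in the lecture.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest_neighbors(query, dataset, k=1):
    """Return the indices of the k datapoints closest to the query.

    Brute-force search: compute the distance to every datapoint,
    sort by distance, and keep the first k.
    """
    distances = [(euclidean_distance(query, point), index)
                 for index, point in enumerate(dataset)]
    distances.sort()
    return [index for _, index in distances[:k]]

# Hypothetical word-count features for four documents (illustrative only).
documents = [
    [3, 0, 1],
    [0, 4, 2],
    [2, 2, 1],
    [0, 5, 3],
]
query_article = [3, 1, 1]

nearest = k_nearest_neighbors(query_article, documents, k=1)      # single nearest neighbor
two_nearest = k_nearest_neighbors(query_article, documents, k=2)  # a set of nearest neighbors
print(nearest, two_nearest)
```

Scanning every datapoint like this costs time proportional to the size of the dataset for each query, which is why doing this kind of search at scale needs more care.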
63 00:03:39,170 --> 00:03:44,760 So in general, this idea of retrieving similar items can be a really, 64 00:03:44,760 --> 00:03:46,460 really powerful tool. 65 00:03:46,460 --> 00:03:50,070 And throughout this course, we're going to discuss ways to provide this type of 66 00:03:50,070 --> 00:03:50,570 search. 67 00:03:51,690 --> 00:03:54,490 Then in the second part of the course, we're going to turn to 68 00:03:54,490 --> 00:03:56,740 clustering as our machine learning method. 69 00:03:56,740 --> 00:03:57,540 And here, 70 00:03:57,540 --> 00:04:02,000 our inputs are features associated with each of our datapoints within a dataset. 71 00:04:02,000 --> 00:04:05,720 And the outputs are going to be cluster labels associated with 72 00:04:05,720 --> 00:04:07,310 each of these datapoints, 73 00:04:07,310 --> 00:04:12,310 where the goal of the clustering method is to discover these disjoint sets 74 00:04:12,310 --> 00:04:13,075 of datapoints. 75 00:04:14,390 --> 00:04:16,980 And we're going to look at a couple of case studies when we're talking about 76 00:04:16,980 --> 00:04:17,780 clustering, 77 00:04:17,780 --> 00:04:21,650 but one of them is going to be the same idea of a document retrieval task 78 00:04:21,650 --> 00:04:25,060 where we have an entire corpus of documents. 79 00:04:25,060 --> 00:04:29,060 And here, though, as opposed to our simple retrieval 80 00:04:29,060 --> 00:04:31,710 setting where we're just performing nearest neighbor search, 81 00:04:31,710 --> 00:04:36,340 our goal is to discover a structured representation of this input space. 82 00:04:36,340 --> 00:04:41,210 So a structured grouping of our corpus into groups that 83 00:04:41,210 --> 00:04:45,410 might relate to things like topics present in this corpus. 84 00:04:46,610 --> 00:04:48,810 And just like we saw with retrieval, 85 00:04:48,810 --> 00:04:51,560 clustering has applications almost everywhere as well. 
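To make the clustering setup above concrete (features in, one cluster label per datapoint out, with the labels defining disjoint groups), here is a rough sketch. The passage does not name a particular algorithm, so k-means is used here purely as an assumed example of a clustering method; the initialization from the first k points, the function names, and the toy 2-D points are likewise illustrative choices. It relies on math.dist, which requires Python 3.8 or later.

```python
import math

def assign_labels(points, centers):
    """Label each point with the index of its closest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def update_centers(points, labels, centers):
    """Move each center to the mean of the points currently assigned to it."""
    new_centers = []
    for c, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centers.append(old_center)  # empty cluster: keep its center unchanged
            continue
        new_centers.append([sum(dim) / len(members) for dim in zip(*members)])
    return new_centers

def k_means(points, k, max_iterations=100):
    """Return one cluster label per point (assumed k-means sketch)."""
    centers = [list(p) for p in points[:k]]  # simplistic init: first k points (assumption)
    labels = assign_labels(points, centers)
    for _ in range(max_iterations):
        centers = update_centers(points, labels, centers)
        new_labels = assign_labels(points, centers)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels

# Hypothetical 2-D features for five datapoints (illustrative only).
points = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.2, 4.9], [0.9, 1.1]]
labels = k_means(points, k=2)
print(labels)
```

Each datapoint receives exactly one label, so the resulting groups are disjoint, matching the output described above.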
86 00:04:52,630 --> 00:04:56,970 So maybe when we're doing image search, we don't just want to take a given image and 87 00:04:56,970 --> 00:04:59,010 provide a set of related images, 88 00:04:59,010 --> 00:05:03,335 but we might want to actually structure our dataset where we're going to discover 89 00:05:03,335 --> 00:05:05,600 groups of related images. 90 00:05:05,600 --> 00:05:09,440 Or maybe we want to discover groups of Coursera learners like yourselves, 91 00:05:09,440 --> 00:05:13,610 where, based on features of the user and past behavior on Coursera, 92 00:05:13,610 --> 00:05:16,290 we can find users who are related to one another and 93 00:05:16,290 --> 00:05:20,770 might have similar interests and use this to better target courses to the learners. 94 00:05:22,250 --> 00:05:25,830 So these very simple ideas of retrieval and clustering 95 00:05:25,830 --> 00:05:29,250 actually have a very, very significant impact in the world and 96 00:05:29,250 --> 00:05:31,120 are very much taken for granted. 97 00:05:31,120 --> 00:05:35,240 You just go on to your app or your device and you search for something. 98 00:05:35,240 --> 00:05:38,720 And you just assume, okay, of course it's going to be able to provide a set of 99 00:05:38,720 --> 00:05:42,600 other things that might be related to the given query that you've made. 100 00:05:43,650 --> 00:05:48,670 Or maybe we can very easily just discover from data that we've seen, 101 00:05:48,670 --> 00:05:51,820 what are related groups of people or items. 102 00:05:51,820 --> 00:05:53,720 Well, how do we actually do this? 103 00:05:53,720 --> 00:05:55,680 So this is what this course is about, and 104 00:05:55,680 --> 00:05:58,230 we're also going to touch upon how we do this at scale. 105 00:05:59,450 --> 00:06:03,370 And throughout this course, like always, we're going to learn general 106 00:06:03,370 --> 00:06:06,490 machine learning concepts that are very useful. 
107 00:06:06,490 --> 00:06:09,260 For example, in this course we're going to talk about the task of 108 00:06:09,260 --> 00:06:11,080 unsupervised learning. 109 00:06:11,080 --> 00:06:15,913 And we're also going to talk about things like the MapReduce framework for 110 00:06:15,913 --> 00:06:20,341 scaling up algorithms by parallelizing their computations. 111 00:06:20,341 --> 00:06:24,409 [MUSIC]