Let's now dig into a more detailed description of this course content.

The course overview that we just talked about is going to be presented in four different modules. So let's step through each of these modules and discuss, at a high level, what we will present.

In our first module, we're going to discuss the document retrieval task, where there is some document that a person is reading and we want to search over all the other documents to find a similar one that they might also be interested in reading. To do this, we're going to talk about nearest neighbor search, where we compute the distances between our query article and every other article out there, and then retrieve the document that's closest, the one with the smallest distance.

There are a couple of key components to doing this nearest neighbor search. One is how we're going to represent our documents, and the second is how we're going to compute the similarity between documents, or the distance between them. Both how we represent our data and how we compute these distances really influence what we think of as a nearest neighbor. We're going to go through various choices for each of these components and talk about some of the tradeoffs that are made.

Then we're going to turn to how to scale up nearest neighbor search to really large data sets, because our brute-force approach of computing distances to every single data point in our data set is computationally really, really intensive for just performing one nearest neighbor query. So instead we're going to talk about a really cool data structure, called KD-trees, that allows us, in the context of nearest neighbor search, to prune out a large portion of the search space when we compute our nearest neighbor. Well, at least, it can often do that, though you can get unlucky.

But KD-trees, although very cool, don't work very well in high-dimensional spaces. So if you think of our documents as having a very large vocabulary, and of our document representation as being based on the size of that vocabulary, KD-trees no longer seem like that attractive an approach for doing our nearest neighbor search.
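To make the brute-force baseline described above concrete, here is a minimal sketch, assuming documents have already been turned into word-count vectors over a shared vocabulary; the function names, the toy data, and the use of NumPy are illustrative choices, not part of the course materials. It compares two of the distance choices mentioned, Euclidean and cosine, to show how that choice can change which document comes back as the nearest neighbor.

```python
import numpy as np

def euclidean_distance(x, y):
    # Straight-line distance between two word-count vectors.
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_distance(x, y):
    # 1 - cosine similarity; insensitive to overall document length.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def brute_force_nearest_neighbor(query, corpus, distance):
    # Compute the distance from the query to every document (O(N) per query)
    # and return the index of the closest one along with its distance.
    distances = [distance(query, doc) for doc in corpus]
    return int(np.argmin(distances)), min(distances)

# Toy word-count vectors over a 4-word vocabulary (made-up data).
corpus = np.array([
    [10.0, 0.0, 2.0, 1.0],
    [1.0, 8.0, 0.0, 3.0],
    [90.0, 10.0, 20.0, 10.0],  # same word proportions as the query, but 10x longer
])
query = np.array([9.0, 1.0, 2.0, 1.0])

# Euclidean distance picks document 0; cosine distance picks document 2,
# because it ignores how long the document is.
print(brute_force_nearest_neighbor(query, corpus, euclidean_distance))
print(brute_force_nearest_neighbor(query, corpus, cosine_distance))
```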
So instead, we're going to talk about ways to do approximate nearest neighbor search using something called locality sensitive hashing. One of the concepts we're going to teach in this course is that in a lot of applications we don't actually care about finding the exact nearest neighbor; one that is pretty close to it is good enough. And when we're willing to accept that type of approximation, we can do things much more efficiently in some cases.

Then, in the second module, we're going to turn to this idea of clustering, where our goal is to discover groups of related articles based just on the content of the articles themselves. So this is going to be an example of an unsupervised learning task.

The first algorithm we're going to present for performing this type of clustering is k-means, where the goal is to minimize the sum of the squared distances between data points and the set of learned cluster centers. So we can think of this as a partitioning of our data set, where we're making hard assignments of data points to specific clusters. And again, this is an example of an unsupervised algorithm, because we're just taking the inputs themselves and trying to learn this structure from them without having any output labels; we don't know the cluster labels for any of the data points in this data set.

Then, later in the second module, we're going to turn to scaling up k-means to large data sets using something called MapReduce. MapReduce is a framework for parallelizing operations over a number of different machines. We're going to discuss the general MapReduce framework, which can be used in a wide range of different settings, as well as its specific application to k-means.
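Here is a minimal single-machine sketch of how one k-means iteration splits into a map phase and a reduce phase; the function names and the toy data are illustrative assumptions, not the course's actual implementation, and a real MapReduce system would run the map calls in parallel across machines.

```python
import numpy as np
from collections import defaultdict

# k-means tries to minimize the sum of squared distances
#   sum_i || x_i - mu_{z_i} ||^2,
# where z_i is the hard cluster assignment of data point x_i.

def kmeans_map(point, centers):
    # Map step: assign a single data point to its closest center and emit a
    # (cluster_id, point) pair. Each point is handled independently, which is
    # what makes this step easy to parallelize.
    distances = [np.sum((point - c) ** 2) for c in centers]
    return int(np.argmin(distances)), point

def kmeans_reduce(assigned_points):
    # Reduce step: for each cluster id, average the points assigned to it
    # to produce the updated cluster center.
    grouped = defaultdict(list)
    for cluster_id, point in assigned_points:
        grouped[cluster_id].append(point)
    return {cid: np.mean(points, axis=0) for cid, points in grouped.items()}

def kmeans_iteration(data, centers):
    # One full iteration = one map phase followed by one reduce phase.
    assigned = [kmeans_map(x, centers) for x in data]
    return kmeans_reduce(assigned)

# Toy 2-D data and two initial centers (made up for illustration).
data = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans_iteration(data, centers))
```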
Then, in our third module, we're going to turn to probabilistic models for performing clustering. The goal here is to capture uncertainty in the cluster assignments that we're making. So, for example, maybe there's an article where it's not really clear: is it about science, or is it about world news? Or really, since we don't have these labels, is it more related to the group of articles that have content about science, or to the group of articles that have content about world news? There are some articles where it might not be clear exactly what assignment to make, and we want the algorithm to present us with this uncertainty instead of just providing a hard assignment of a document to a given cluster.

One thing we can do with this type of output is use it to help learn a user's preferences over a set of different topics. If we have feedback from the user about articles that they liked or didn't like, and we also have the uncertainty about which cluster each article was associated with, we can put both of these pieces of information together to get a better description of what impact the user's feedback should have on our learned topic preferences for that user.

The probabilistic model that we're going to use for clustering is called a mixture model. Here, when we're thinking about assigning data points to clusters, it's not just the cluster center that comes into play, like in k-means, but also the shape of the cluster. And, again unlike in k-means, we're not going to make hard assignments of data points to clusters; we're going to make what are called soft assignments, where data points can have weights in multiple clusters, and those weights are based on our uncertainty about the assignment of the data point to each cluster.

So the goal here is to go from a set of unlabeled data points, like the gray cloud of points shown here, to a set of colored points. But now, instead of making hard assignments, there's going to be an entire spectrum of colors representing the allocation of data points to clusters and the uncertainty in that allocation. For example, if you look at the points lying between the blue and green clusters, there are points whose shading is somewhere in between those two colors, meaning there's uncertainty about whether those observations are assigned to the blue cluster or the green cluster. And likewise between the green and fuchsia clusters.

The algorithm we're going to go through for inferring these soft assignments of data points to clusters is referred to as expectation maximization, or the EM algorithm.
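To give a feel for what a soft assignment looks like, here is a minimal sketch of the responsibility computation (the E-step of EM) for a Gaussian mixture, assuming SciPy is available; the means, covariances, and mixture weights are made up for illustration and are not from the course.

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_assignments(X, weights, means, covariances):
    # For each data point x_i, compute the responsibility of each cluster k:
    #   r_ik = pi_k * N(x_i | mu_k, Sigma_k) / sum_j pi_j * N(x_i | mu_j, Sigma_j).
    # Unlike k-means, both the center (mean) and the shape (covariance) matter.
    likelihoods = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=S)
        for w, m, S in zip(weights, means, covariances)
    ])
    return likelihoods / likelihoods.sum(axis=1, keepdims=True)

# Two made-up 2-D clusters with different shapes.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covariances = [np.eye(2), np.array([[2.0, 0.0], [0.0, 0.5]])]
weights = [0.5, 0.5]

X = np.array([[0.0, 0.0],   # clearly closest to the first cluster
              [2.0, 0.0],   # in between: the assignment is genuinely uncertain
              [4.0, 0.0]])  # clearly closest to the second cluster
print(np.round(soft_assignments(X, weights, means, covariances), 3))
```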
And finally, in our fourth module, we're going to describe a more intricate probabilistic model for analyzing documents, called Latent Dirichlet Allocation, or LDA.

To motivate the ideas behind LDA, let's look at the specific document shown here. What we see is that this document has a bunch of science-related words. So maybe, based on its content, we might group this article with other documents that are also about science-related topics. But this article also has a lot of tech-related words, so maybe a better description is to group it with the tech articles. Really, though, this document is about both science and technology, and that's exactly what LDA allows us to capture. It provides what's called mixed membership, where a given document is allowed to belong to multiple different topics. And not only is a document associated with multiple topics, but LDA provides the relative proportions of those topics in the document.

In particular, every topic within the LDA model is described as a distribution over the words in the vocabulary. So what we see is that maybe the science topic puts more weight on science-related words, whereas the tech topic puts more weight on words related to technology, and likewise for sports and so on.

But the crazy thing is that we're going to be doing this in a completely unsupervised manner. We're just going to provide the words of each document, and do this for all documents in the corpus. So we have a whole bunch of words for every document in the corpus, and just from that alone we're going to infer these topic-specific distributions over the vocabulary, as well as the document-specific proportions of topics in each document. We're going to go through exactly how we can extract this kind of information, but what we see is that this is another example of discovering structure in data in an unsupervised manner.
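As a concrete illustration of those two outputs, here is a minimal sketch using scikit-learn's LatentDirichletAllocation on a tiny made-up corpus, assuming a recent version of scikit-learn; the corpus, the number of topics, and all parameter choices are illustrative only and do not come from the course assignments, which develop their own inference procedure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus: two rough themes (science-ish and tech-ish words).
docs = [
    "experiment data hypothesis lab measurement",
    "software code compiler hardware chip",
    "experiment lab data chip hardware code",   # mixes both themes
]

# Bag-of-words representation: each document becomes a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit LDA with 2 topics (a deliberately small, illustrative choice).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)   # document-specific topic proportions

# Topic-specific word distributions: normalize the learned topic-word weights.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

vocab = vectorizer.get_feature_names_out()
for k, dist in enumerate(topic_word):
    top_words = [vocab[i] for i in dist.argsort()[::-1][:3]]
    print(f"topic {k}: {top_words}")
print("per-document topic proportions:\n", doc_topic.round(2))
```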