So that was our deep dive into retrieval, where a lot of our focus was not only on how we think about representing our data and computing distances, which are critical to the performance of the nearest neighbor algorithm, but also, very heavily, on how we scale up nearest neighbor search to really large data sets, using KD-trees and locality-sensitive hashing, and using both of these to perform approximate nearest neighbor search.

But for the remainder of the course, we turn to this idea of clustering. In particular, in our second module, we first discussed what clustering means, and we returned to our document analysis, where the goal was to discover groups of related articles. The first algorithm we presented for performing clustering was something called k-means, which is probably the most widely used clustering algorithm. Remember that in this algorithm we somehow initialize by placing a set of cluster centers, and then the algorithm iterates between making hard assignments of data points to the nearest cluster center and updating the cluster centers. This procedure repeats until convergence. In particular, we cast k-means as a coordinate descent algorithm, and through this we know that it's going to converge to a local mode of the objective. So importantly, we know that we're not necessarily going to converge to the global mode; the algorithm only converges to a local mode.

We went through this in pictures, showing that for different initializations, we can converge to quite different solutions. So in practice, people often run multiple initializations of the algorithm and choose the best one. And we discussed how initialization matters really significantly both for the quality of the solution found and for the convergence rate.
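To make that iteration concrete, here is a minimal sketch of the k-means updates described above, written in plain NumPy. The function name kmeans, the random-point initialization, and the convergence check on the assignments are choices made for illustration, not code taken from the course.

```python
import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Plain k-means: classify, recenter, repeat until convergence."""
    rng = np.random.default_rng(seed)
    # Initialize by picking k data points at random as the starting cluster centers.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    assignments = np.full(len(data), -1)

    for _ in range(max_iters):
        # Classify step: hard-assign each point to its nearest cluster center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)

        # Converged when no assignment changes; this is a local mode of the objective,
        # not necessarily the global one.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments

        # Recenter step: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = data[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    return centers, assignments
```

Because the result depends on initialization, one would typically run this several times with different seeds and keep the run with the lowest within-cluster sum of squared distances.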
But, just like in our first module on retrieval, one of our big emphases in this second module was on the scalability of the procedure. And for this, we introduced a generic framework called MapReduce for parallelizing computations across multiple machines under certain specific setups.

In particular, we talked about the MapReduce abstraction in terms of a map step, which has to be data-parallel over elements, and a reduce step, which performs some aggregation and has to be data-parallel over keys. If your algorithm has this structure, then you can think about using MapReduce to parallelize and distribute the computations. And we showed that k-means indeed has this type of structure: we can think of the classify step, making the hard assignments of data points to cluster centers, as data-parallel over elements, and the recenter step, where we recompute the center of each cluster, as data-parallel over keys. So we have a map and a reduce step here, but in k-means we mentioned that we have to iterate between these steps again and again, which is a non-standard use of MapReduce. Still, it can definitely be applied in this context and gives us really good scaling properties for k-means.
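As a rough illustration of that structure, here is how the two k-means steps could be phrased as a map function and a reduce function. This is only a single-machine sketch of the abstraction; the function names and the (key, value) layout are assumptions for illustration and are not tied to any particular MapReduce system.

```python
import numpy as np
from collections import defaultdict

def classify_map(point, centers):
    """Map step (data-parallel over elements): emit (nearest-center id, point)."""
    dists = np.linalg.norm(centers - point, axis=1)
    return int(dists.argmin()), point

def recenter_reduce(cluster_id, points):
    """Reduce step (data-parallel over keys): new center = mean of the cluster's points."""
    return cluster_id, np.mean(points, axis=0)

def kmeans_iteration(data, centers):
    """One k-means iteration written as a map phase followed by a reduce phase."""
    # Map phase: every point is classified independently of the others.
    grouped = defaultdict(list)
    for point in data:
        cluster_id, value = classify_map(point, centers)
        grouped[cluster_id].append(value)

    # Reduce phase: each cluster's center is recomputed independently per key.
    new_centers = np.array(centers, dtype=float)
    for cluster_id, points in grouped.items():
        _, new_centers[cluster_id] = recenter_reduce(cluster_id, points)
    return new_centers
```

In an actual distributed setting, each iteration would be one MapReduce job, with the updated centers shipped back to the mappers before the next pass, which is the "iterate again and again" pattern mentioned above.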