1 00:00:00,164 --> 00:00:04,551 [MUSIC] 2 00:00:04,551 --> 00:00:07,186 So far in this module, we've discussed clustering in 3 00:00:07,186 --> 00:00:11,440 the context of document analysis, but we see applications of clustering everywhere. 4 00:00:12,840 --> 00:00:18,305 As an example, maybe you want to cluster images, so in this case that we're 5 00:00:18,305 --> 00:00:23,512 showing here you might want to be able to group all images associated with 6 00:00:23,512 --> 00:00:29,020 pictures of the ocean or a pink flower, dog, sunset, clouds, and so on. 7 00:00:29,020 --> 00:00:33,220 And this can be useful for things like Google Image search, 8 00:00:33,220 --> 00:00:38,512 where a person types a word into Google Images, performs some query on that 9 00:00:38,512 --> 00:00:43,810 image dataset, and Google wants to display a set of images for that word. 10 00:00:43,810 --> 00:00:48,730 But maybe not all of our images in the data set have been labeled by a human. 11 00:00:48,730 --> 00:00:51,370 So if we can discover groups of related images, 12 00:00:51,370 --> 00:00:55,230 this can really help in image search, and 13 00:00:55,230 --> 00:00:59,230 there are other reasons you might want to perform clustering on images, too. 14 00:00:59,230 --> 00:01:05,490 So for example, when, again, using Google Images somebody types in a term, 15 00:01:05,490 --> 00:01:10,638 like the term cardinal. Well, that term might have multiple meanings and 16 00:01:10,638 --> 00:01:15,048 so if we take all images that have already been labeled, again, 17 00:01:15,048 --> 00:01:17,677 maybe this was by a human, as cardinal, 18 00:01:17,677 --> 00:01:23,277 then maybe we want to perform a clustering on the images associated 19 00:01:23,277 --> 00:01:28,975 with this word because maybe some of the images correspond to a bird, 20 00:01:28,975 --> 00:01:34,200 but others to a baseball team and others to religious figures. 
21 00:01:35,480 --> 00:01:37,980 So when a person is going and searching for the word cardinal, 22 00:01:37,980 --> 00:01:42,510 we don't actually know which of these meanings that person meant, so 23 00:01:42,510 --> 00:01:45,690 when we're displaying images instead of having all the images 24 00:01:45,690 --> 00:01:49,800 be images of birds but just different variants of that image, 25 00:01:49,800 --> 00:01:52,090 maybe we want to have a diverse set of images, 26 00:01:52,090 --> 00:01:55,320 that cover this set of possible meanings. 27 00:01:55,320 --> 00:01:59,670 So if we can cluster images all labeled with the same label, 28 00:01:59,670 --> 00:02:04,620 we can present results that might better explore what the person is 29 00:02:04,620 --> 00:02:07,710 actually intending when they do a Google Image search. 30 00:02:07,710 --> 00:02:12,070 So in this case we're using clustering to structure our output. 31 00:02:12,070 --> 00:02:15,240 We can also think about applying clustering in applications that are very 32 00:02:15,240 --> 00:02:17,220 different than just for the sake of retrieval. 33 00:02:18,330 --> 00:02:19,202 For example, 34 00:02:19,202 --> 00:02:24,377 maybe we can think about grouping patients based on exhibited medical conditions. 35 00:02:24,377 --> 00:02:28,489 If we can form this grouping then we can do things like better 36 00:02:28,489 --> 00:02:32,190 study subpopulations as well as diseases. 37 00:02:32,190 --> 00:02:33,770 So as an example of this, 38 00:02:33,770 --> 00:02:38,680 imagine we have intracranial EEG recordings from a set of patients. 39 00:02:38,680 --> 00:02:41,320 So here, we're showing three different brains and 40 00:02:41,320 --> 00:02:44,020 some electrodes placed on these brains. 41 00:02:44,020 --> 00:02:46,810 And from each one of these patients, 42 00:02:47,990 --> 00:02:52,250 we get a set of recordings of different seizure events. 
43 00:02:52,250 --> 00:02:55,360 Then, we can take each one of these seizure events 44 00:02:55,360 --> 00:03:00,230 and think about clustering based on similarities between the events. 45 00:03:00,230 --> 00:03:02,580 And then, based on these types of clusterings, 46 00:03:02,580 --> 00:03:05,940 we can learn more about different types of seizures. 47 00:03:05,940 --> 00:03:10,431 We can characterize how many different types of seizures we think there are as 48 00:03:10,431 --> 00:03:14,195 well as what the properties of each one of these seizure types are. 49 00:03:14,195 --> 00:03:18,005 We can also think about clustering patients based on the types of seizures 50 00:03:18,005 --> 00:03:19,530 that they exhibit, 51 00:03:19,530 --> 00:03:23,820 and better understand different subpopulations within the class of people 52 00:03:23,820 --> 00:03:25,220 who have seizure disorders. 53 00:03:26,730 --> 00:03:30,600 Another very different application is looking at products on Amazon. 54 00:03:30,600 --> 00:03:35,520 And maybe here we would like to group products based on having similar 55 00:03:35,520 --> 00:03:36,380 purchase histories. 56 00:03:36,380 --> 00:03:39,740 And when I talk about purchase history of a product what I mean is, 57 00:03:39,740 --> 00:03:43,560 we look at all users that go and purchase a given item and 58 00:03:43,560 --> 00:03:47,770 look at other items that they purchase either in that same shopping or 59 00:03:47,770 --> 00:03:53,180 in the recent history and we look at that as a description of a product. 60 00:03:53,180 --> 00:03:57,140 So this would be aggregated over all users that purchased this product. 61 00:03:57,140 --> 00:04:01,290 Then we can think about clustering similar products. 62 00:04:01,290 --> 00:04:06,796 So for example, maybe when people go and buy this crib, 63 00:04:06,796 --> 00:04:13,023 we also see that they are buying a car seat and other baby items. 
64 00:04:13,023 --> 00:04:16,968 And so, from this, maybe we can lump this crib 65 00:04:16,968 --> 00:04:21,451 in with the set of baby items that are available on Amazon and 66 00:04:21,451 --> 00:04:27,010 even though let's say a seller labeled this under the category furniture, 67 00:04:27,010 --> 00:04:32,211 one thing we can do with this clustering is actually either correct or 68 00:04:32,211 --> 00:04:36,360 add a label which is baby, that this is a baby product. 69 00:04:38,250 --> 00:04:43,330 And of course, we could also use this type of structure to do product recommendation: 70 00:04:43,330 --> 00:04:47,226 somebody goes, they have certain things in their shopping cart and 71 00:04:47,226 --> 00:04:51,076 you want to recommend other products they might be interested in. 72 00:04:51,076 --> 00:04:53,896 And we can also think about discovering groups of 73 00:04:53,896 --> 00:04:58,519 users that have similar purchase behaviors, purchase or viewing behaviors. 74 00:04:59,620 --> 00:05:03,950 So there are lots of interesting things we can do through this clustering structure 75 00:05:03,950 --> 00:05:06,650 extracted both on users and products. 76 00:05:07,750 --> 00:05:12,440 A less obvious example where clustering can be useful is in the following task. 77 00:05:12,440 --> 00:05:17,149 So imagine that we want to estimate housing value at very small 78 00:05:17,149 --> 00:05:18,849 spatial locations. 79 00:05:18,849 --> 00:05:23,130 So here, in this image, we're showing the City of Seattle broken into census tracts. 80 00:05:23,130 --> 00:05:26,180 So these are geographically small regions and 81 00:05:26,180 --> 00:05:29,880 we want to assess value within each one of these neighborhoods. 82 00:05:31,560 --> 00:05:36,780 But an issue here is the fact that there are very few house sales 83 00:05:36,780 --> 00:05:39,090 per region at any given point in time. 84 00:05:39,090 --> 00:05:42,420 And we want to be able to assess the value in that region. 
85 00:05:42,420 --> 00:05:44,750 So how do we do it if there isn't much data? 86 00:05:44,750 --> 00:05:49,170 Well, one thing we can think about doing is discovering clusters of regions 87 00:05:49,170 --> 00:05:51,990 that historically have behaved similarly. 88 00:05:51,990 --> 00:05:54,520 And if we discover this cluster of regions, 89 00:05:54,520 --> 00:05:57,740 we can think about sharing information between these regions, 90 00:05:57,740 --> 00:06:02,280 pooling our observations to form better estimates of value locally. 91 00:06:03,420 --> 00:06:05,270 A structurally related task but 92 00:06:05,270 --> 00:06:09,460 with a very different application is in forecasting crimes. 93 00:06:09,460 --> 00:06:13,620 So now what this image is showing is Washington, DC, 94 00:06:13,620 --> 00:06:15,600 again broken down into census tracts, and 95 00:06:15,600 --> 00:06:20,450 the goal is to be able to forecast how many crimes are going to 96 00:06:20,450 --> 00:06:24,100 occur in each one of these census tracts at the next point in time. 97 00:06:25,120 --> 00:06:30,100 And again, here, we can think about clustering regions to share information 98 00:06:30,100 --> 00:06:35,750 and you can show that forming predictions based on a discovered 99 00:06:35,750 --> 00:06:41,460 group of regions that behave similarly leads to better forecasts of crime rates 100 00:06:41,460 --> 00:06:44,060 than if we were to treat each region independently. 101 00:06:46,130 --> 00:06:49,920 So as we see, there's a wide range of applications for 102 00:06:49,920 --> 00:06:54,720 clustering and the methods that we described in this module 103 00:06:54,720 --> 00:06:58,220 extend to any one of these applications, 104 00:06:58,220 --> 00:07:03,216 obviously with some application-specific tweaks to capture what 105 00:07:03,216 --> 00:07:06,754 it means to be a data object within a cluster and 106 00:07:06,754 --> 00:07:10,949 what the notions of distance between these objects are. 
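As a rough illustration of the pooling idea from the housing example, here is a minimal Python sketch. It assumes the clusters of regions have already been discovered by some clustering method, and it uses a plain average of pooled sale prices as the estimate. That estimator is an assumption made for illustration; the lecture does not specify one. The function name `pooled_estimates`, the tract names, and the prices are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pooled_estimates(sales_by_region, cluster_of_region):
    """Estimate a value for each region from all sales in its cluster.

    sales_by_region: dict mapping region name -> list of observed sale prices
    cluster_of_region: dict mapping region name -> cluster id (assumed given)
    """
    # Pool the observations of every region that shares a cluster.
    pooled = defaultdict(list)
    for region, prices in sales_by_region.items():
        pooled[cluster_of_region[region]].extend(prices)

    # Each region's estimate is the mean of its cluster's pooled sales
    # (None if the whole cluster has no sales at all).
    estimates = {}
    for region in sales_by_region:
        cluster_prices = pooled[cluster_of_region[region]]
        estimates[region] = mean(cluster_prices) if cluster_prices else None
    return estimates


# Hypothetical data: tract_c has no sales of its own, but it shares a
# cluster with tract_a and tract_b, so it still receives an estimate.
sales_by_region = {
    "tract_a": [500_000],
    "tract_b": [520_000, 480_000],
    "tract_c": [],
    "tract_d": [900_000, 1_100_000],
}
cluster_of_region = {"tract_a": 0, "tract_b": 0, "tract_c": 0, "tract_d": 1}

estimates = pooled_estimates(sales_by_region, cluster_of_region)
print(estimates)
```

The point of the sketch is only that a region with few or no sales can borrow strength from regions that have historically behaved similarly. A real model would likely weight a region's own observations more heavily than its neighbors' rather than averaging them equally.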
107 00:07:10,949 --> 00:07:15,229 [MUSIC]