1 00:00:00,000 --> 00:00:04,145 [MUSIC] 2 00:00:04,145 --> 00:00:08,822 Okay, well we've talked quite exhaustively about this notion of clustering for 3 00:00:08,822 --> 00:00:12,208 the sake of doing document retrieval, but there are lots, 4 00:00:12,208 --> 00:00:15,595 and lots of other examples where clustering is useful, and 5 00:00:15,595 --> 00:00:19,110 I wanna take some time just to describe a few of them. 6 00:00:19,110 --> 00:00:23,212 So one application is for image search. 7 00:00:23,212 --> 00:00:27,068 So imagine you're going and you're searching, 8 00:00:27,068 --> 00:00:31,666 you go on Google Image search and you type in the word ocean. 9 00:00:31,666 --> 00:00:37,116 Well it would be really helpful if we could structure all the images we have by 10 00:00:37,116 --> 00:00:42,500 some set of categories like ocean, pink flower, dog, sunset, clouds. 11 00:00:43,760 --> 00:00:47,740 So clustering is very helpful for doing structured search. 12 00:00:49,280 --> 00:00:53,580 Another very different application is maybe we wanna group patients by 13 00:00:53,580 --> 00:00:55,670 their medical condition. 14 00:00:55,670 --> 00:00:58,610 So here a goal might be to better 15 00:00:58,610 --> 00:01:03,350 characterize subpopulations as well as different diseases. 16 00:01:03,350 --> 00:01:10,550 So as an example, we can look at a whole bunch of patients that have seizures. 17 00:01:10,550 --> 00:01:14,840 So these three brains represent three different patients, and 18 00:01:14,840 --> 00:01:19,830 they have different recording setups that are measuring their seizure activity. 19 00:01:19,830 --> 00:01:24,500 And so for each of these patients we get a collection of recordings 20 00:01:24,500 --> 00:01:27,250 of different seizures that they exhibit over time. 21 00:01:27,250 --> 00:01:30,070 So each one of these colored squares represents 22 00:01:30,070 --> 00:01:31,650 a different recording of a seizure. 
23 00:01:33,500 --> 00:01:36,750 And between these different patients there might be 24 00:01:36,750 --> 00:01:40,780 similar types of seizures that appear in these different patients. 25 00:01:42,310 --> 00:01:46,090 And so what we can do is we can take all of these seizure recordings from these 26 00:01:46,090 --> 00:01:49,730 three different patients and think about clustering them. 27 00:01:49,730 --> 00:01:54,070 And if we identify different types of seizures in this way, 28 00:01:54,070 --> 00:01:58,430 this can allow us to better treat the types of patients that 29 00:01:58,430 --> 00:02:03,620 we're observing based on understanding what types of seizures they exhibit. 30 00:02:03,620 --> 00:02:08,860 Well, another application is thinking about doing product recommendation on Amazon. 31 00:02:08,860 --> 00:02:14,030 So, for example, on Amazon there are a lot of third parties that come and 32 00:02:14,030 --> 00:02:17,480 they post some product to be sold. 33 00:02:17,480 --> 00:02:20,310 And they provide a label of what that product is. 34 00:02:20,310 --> 00:02:25,050 So, for example, maybe a person wants to sell a crib and 35 00:02:25,050 --> 00:02:30,180 they label the crib, fairly reasonably, as being a furniture item. 36 00:02:30,180 --> 00:02:33,975 So maybe it gets posted under the furniture category. 37 00:02:33,975 --> 00:02:38,813 But if, instead, we look at who purchases this item, 38 00:02:38,813 --> 00:02:41,945 and we look at their purchase history, and 39 00:02:41,945 --> 00:02:46,382 we look at other people with similar purchase histories, so 40 00:02:46,382 --> 00:02:51,950 maybe the person who purchased this item also purchased a baby car seat, well 41 00:02:51,950 --> 00:02:57,692 then maybe what we can do is maybe we can infer that a better label for this crib, 42 00:02:57,692 --> 00:03:03,528 which had been labeled furniture, is really to have labeled it as a baby product. 
43 00:03:03,528 --> 00:03:09,340 So in addition to discovering groups of products that are related, 44 00:03:09,340 --> 00:03:13,350 based on purchase histories of these items, we can also 45 00:03:13,350 --> 00:03:16,990 use that to discover groups of related users on Amazon. 46 00:03:16,990 --> 00:03:20,019 And that can be used for targeting products to those users. 47 00:03:22,385 --> 00:03:26,150 And finally we can think about structuring web search results. 48 00:03:26,150 --> 00:03:27,480 So, for example, 49 00:03:27,480 --> 00:03:31,760 search terms can have multiple meanings like the word "cardinal". 50 00:03:31,760 --> 00:03:38,440 If I type this into Google, maybe I mean I want an article about a cardinal, 51 00:03:38,440 --> 00:03:45,060 the bird, maybe about the baseball team, or about a cardinal, a religious figure. 52 00:03:46,420 --> 00:03:51,450 So if we can structure our articles based on their content, 53 00:03:51,450 --> 00:03:54,820 using the same types of ideas we've talked about in this module, 54 00:03:54,820 --> 00:04:00,680 then I can improve the search results that I provide to people. 55 00:04:00,680 --> 00:04:03,080 And the list of applications goes on and on. 56 00:04:03,080 --> 00:04:06,629 Another one that's quite interesting is thinking about 57 00:04:07,880 --> 00:04:10,020 collections of neighborhoods, and 58 00:04:10,020 --> 00:04:15,480 there are a few applications where you want to discover similar neighborhoods. 59 00:04:15,480 --> 00:04:20,020 One is if we wanna estimate the price of a house at a very small local 60 00:04:20,020 --> 00:04:21,730 regional level. 61 00:04:21,730 --> 00:04:26,390 So in this case, the challenge is the fact that we only have a few, or 62 00:04:26,390 --> 00:04:30,820 very often, no house sale observations within a very small neighborhood. 
63 00:04:31,860 --> 00:04:35,100 So if we wanna estimate the value of a house in that neighborhood 64 00:04:35,100 --> 00:04:37,790 at a point in time, it's very hard to do that 65 00:04:37,790 --> 00:04:42,540 because we have no other houses to base our estimate off of in that neighborhood. 66 00:04:42,540 --> 00:04:46,310 However, if we can discover other neighborhoods that have 67 00:04:47,360 --> 00:04:52,590 similar types of house dynamics, house price dynamics, then 68 00:04:53,630 --> 00:04:58,630 we can come up with a good estimate for the house in the neighborhood with few or 69 00:04:58,630 --> 00:05:03,290 no sales by leveraging information from this other neighborhood 70 00:05:03,290 --> 00:05:09,060 that was discovered to be related to the current neighborhood. 71 00:05:09,060 --> 00:05:13,480 So, the idea is to discover clusters of neighborhoods, and 72 00:05:13,480 --> 00:05:17,595 then within those clusters we can share information like this house sales 73 00:05:17,595 --> 00:05:20,030 information to form better estimates. 74 00:05:21,280 --> 00:05:25,090 So the solution that I'm describing here is to cluster regions 75 00:05:25,090 --> 00:05:27,980 with similar trends, and then share information within a cluster. 76 00:05:29,380 --> 00:05:35,580 And the same idea of discovering related regions can be used for helping 77 00:05:35,580 --> 00:05:40,930 to forecast violent crimes, to better task police forces to different regions. 78 00:05:40,930 --> 00:05:45,665 So again, once we discover different neighborhoods that have very 79 00:05:45,665 --> 00:05:50,660 similar crime dynamics, we can form better predictions of the rates 80 00:05:50,660 --> 00:05:53,933 of violent crimes in those neighborhoods and 81 00:05:53,933 --> 00:05:58,257 then use that information to task police to those regions. 82 00:05:58,257 --> 00:06:01,929 [MUSIC]