[MUSIC]

So this idea of performing unsupervised learning sounds really crazy. We're given no labels ever, and somehow we're supposed to output labels. Well, there are two things that sometimes allow us to do this. One is the definition of what a cluster is: what is the structure we're trying to extract from the data? For example, maybe we're looking for elliptical clusters. The second thing that sometimes allows us to do this is the structure of the data itself.

In some cases, the problem is fairly easy. Here's an example of an easy problem where the data are pretty well separated. I could go in and say, okay, I'm going to call this one cluster, this another one, and this another one. And so you could imagine an algorithm that could uncover this structure from the data.

But in other cases, it's basically impossible. Here, I really don't know how I would want to cluster this data if I had to choose three different clusters. Maybe I would do something like this if I had to, question mark; it turns out that would be very wrong. In this case, we can look at the true labels, shown by the colors of the data points here.
They're all mixed together, and it's just not clear how we could extract that structure from the data without some other information telling us what distinguishes the blue cluster from the green and the pink.

But then there are cases where it's a little more plausible. Maybe my guess would have been fairly reasonable, something like this. And indeed, that's what the data actually tell us. In most applications we're interested in, we're in this in-between scenario, where the answer isn't immediately clear just from visualizing the data. But we can't hope to successfully perform unsupervised learning in scenarios that look like the impossible case.

Regardless of the data's structure, though, the thing that's really crucial to performance is how we define a cluster. For example, here are six very popular toy examples of challenging clustering problems, where the grouping visually jumps out at us, especially since in this case the points are colored. But even without the colors, you could probably draw out which observations belong together in one group versus another. These clusters, though, sure don't look like the simple ellipses we saw in the previous few slides.
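To make the elliptical, distance-to-center notion of a cluster concrete, here is a minimal sketch of Lloyd's k-means loop in Python with NumPy. The three Gaussian blobs and the one-center-per-blob initialization are illustrative assumptions, not the lecture's actual data; on well-separated data like this, the nearest-center objective recovers the groups easily:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three well-separated Gaussian blobs: the "easy" case, with
# made-up locations chosen far apart relative to their spread.
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

def kmeans(X, init_centers, n_iter=20):
    """Plain Lloyd's algorithm: alternate nearest-center assignment
    and recomputing each center as the mean of its assigned points."""
    centers = init_centers.astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every center, then assign
        # each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its cluster.
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels, centers

# For simplicity, seed one center inside each blob; a real
# implementation would use k-means++ or multiple random restarts.
labels, centers = kmeans(X, X[[0, 50, 100]])
```

Because the blobs are compact and far apart, every point in a blob ends up with the same label, which is exactly the structure an algorithm could plausibly uncover here.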
And if you provide this data to an algorithm that just measures the distance from each observation to a cluster center, you get results that don't match the groupings that jumped out at you visually. For each of these examples, we're showing what a clustering might look like under a nearest-cluster-center type of objective.

So the point of these slides is that clustering can be really challenging. Even when the data look clean, nice, and well separated, it can be hard to define what it means to be a cluster and to devise algorithms that discover that structure. You have to think very carefully about the implications of the data you're providing and the model or algorithm you're specifying. As we go through the next set of algorithms in this module and the coming modules, it's useful to keep in mind the challenges we might face based on the structure of our data.

[MUSIC]
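To see that failure mode concretely, here is the same nearest-center loop applied to two concentric rings, a stand-in for the non-elliptical toy examples above (the ring data and the deterministic initialization are made up for illustration). The objective carves the plane into half-planes around the two centers, mixing the rings:

```python
import numpy as np

# Two concentric rings: a classic non-elliptical shape where
# clustering by distance to a cluster center breaks down.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]          # radius 1
outer = 5.0 * np.c_[np.cos(theta), np.sin(theta)]    # radius 5
X = np.vstack([inner, outer])
y_true = np.repeat([0, 1], 100)  # true ring membership

# Plain Lloyd's k-means with k = 2 and a fixed initialization
# (two points on opposite sides, so the run is deterministic).
centers = np.array([[5.0, 0.0], [-5.0, 0.0]])
for _ in range(20):
    # Assign each point to its nearest center, then update centers.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

# The recovered "clusters" are the left and right halves of the
# plane: each one contains points from both rings, so the ring
# structure is completely lost.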