1 00:00:00,090 --> 00:00:02,320 In this video, I'd like to start to talk about clustering. 2 00:00:03,420 --> 00:00:04,850 This will be exciting because this 3 00:00:04,930 --> 00:00:06,910 is our first unsupervised learning algorithm 4 00:00:07,350 --> 00:00:09,080 where we learn from unlabeled data 5 00:00:09,840 --> 00:00:10,730 instead of labeled data. 6 00:00:11,900 --> 00:00:13,300 So, what is unsupervised learning? 7 00:00:14,390 --> 00:00:15,630 I briefly talked about unsupervised 8 00:00:16,350 --> 00:00:17,470 learning at the beginning of 9 00:00:17,550 --> 00:00:18,560 the class but it's useful 10 00:00:19,030 --> 00:00:20,320 to contrast it with supervised learning. So, here's 11 00:00:21,760 --> 00:00:23,750 a typical supervised learning problem where 12 00:00:24,030 --> 00:00:25,470 we are given a labeled training set 13 00:00:25,770 --> 00:00:27,470 and the goal is to find 14 00:00:27,980 --> 00:00:29,420 the decision boundary that separates the 15 00:00:29,530 --> 00:00:31,310 positively labeled examples and the negatively labeled examples. 16 00:00:33,100 --> 00:00:34,400 The supervised learning problem in 17 00:00:34,460 --> 00:00:35,710 this case is: given a 18 00:00:35,850 --> 00:00:38,270 set of labeled examples, fit a hypothesis to it. 19 00:00:39,160 --> 00:00:40,560 In contrast, in the unsupervised 20 00:00:41,080 --> 00:00:42,420 learning problem, we're given 21 00:00:42,710 --> 00:00:43,740 data that does not have 22 00:00:43,890 --> 00:00:45,270 any labels associated with it. 23 00:00:46,730 --> 00:00:47,940 So we're given data that looks 24 00:00:48,100 --> 00:00:49,090 like this, here's a set 25 00:00:49,180 --> 00:00:50,470 of points with no labels. 26 00:00:51,800 --> 00:00:52,860 And so our training set is written 27 00:00:53,220 --> 00:00:54,720 just x(1), x(2) and 28 00:00:55,210 --> 00:00:56,890 so on up to x(m) 29 00:00:57,450 --> 00:00:58,720 and we don't get any labels y.
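For reference, the two kinds of training set contrasted here can be written side by side. This is only a sketch of the notation: the superscript indexing and the pairing of each x with a label y in the supervised case follow the usual convention and are an assumption about how the slide writes them.

```latex
\[
\text{Supervised: } \{(x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \ldots,\ (x^{(m)}, y^{(m)})\}
\qquad
\text{Unsupervised: } \{x^{(1)},\ x^{(2)},\ \ldots,\ x^{(m)}\}
\]
```

The only difference is that the labels y are absent in the unsupervised case, so an algorithm has to work from the x values alone.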
30 00:00:59,540 --> 00:01:00,800 And that's why the points plotted 31 00:01:01,160 --> 00:01:02,300 up on the figure don't have 32 00:01:02,430 --> 00:01:04,330 any labels on them. 33 00:01:04,490 --> 00:01:05,510 So in unsupervised learning, what 34 00:01:05,710 --> 00:01:06,860 we do is, we give this sort of 35 00:01:07,280 --> 00:01:09,150 unlabeled training set to 36 00:01:09,250 --> 00:01:10,510 an algorithm and we just 37 00:01:10,600 --> 00:01:12,220 ask the algorithm: find some 38 00:01:12,430 --> 00:01:14,130 structure in the data for us. 39 00:01:15,420 --> 00:01:16,490 Given this data set, one 40 00:01:16,650 --> 00:01:17,810 type of structure we might 41 00:01:18,010 --> 00:01:19,540 have an algorithm find is that 42 00:01:19,810 --> 00:01:21,440 it looks like this data set has 43 00:01:21,620 --> 00:01:23,740 points grouped into two 44 00:01:24,030 --> 00:01:25,500 separate clusters and so 45 00:01:25,800 --> 00:01:28,230 an algorithm that finds 46 00:01:28,360 --> 00:01:29,230 clusters like the ones I just 47 00:01:29,450 --> 00:01:30,610 circled is called a clustering 48 00:01:32,440 --> 00:01:32,440 algorithm. 49 00:01:33,160 --> 00:01:34,620 And this will be our first type of 50 00:01:34,720 --> 00:01:36,590 unsupervised learning, although there 51 00:01:36,790 --> 00:01:38,320 will be other types of unsupervised 52 00:01:39,020 --> 00:01:40,200 learning algorithms that we'll talk 53 00:01:40,320 --> 00:01:41,710 about later that find other 54 00:01:42,130 --> 00:01:43,710 types of structure or other 55 00:01:43,920 --> 00:01:46,000 types of patterns in the data other than clusters. 56 00:01:46,900 --> 00:01:48,360 We'll talk about those later, but for now, we will talk about clustering. 57 00:01:50,020 --> 00:01:51,210 So what is clustering good for? 58 00:01:51,380 --> 00:01:54,350 Early in this class I had already mentioned a few applications.
59 00:01:54,950 --> 00:01:56,540 One is market segmentation, where 60 00:01:56,670 --> 00:01:57,690 you may have a database of 61 00:01:57,770 --> 00:01:58,840 customers and want to group 62 00:01:59,070 --> 00:02:00,380 them into different market segments, 63 00:02:00,950 --> 00:02:02,590 so you can sell to 64 00:02:02,720 --> 00:02:05,570 them separately or serve your different market segments better. 65 00:02:06,730 --> 00:02:08,370 Another is social network analysis. There are 66 00:02:08,580 --> 00:02:10,090 actually, you know, groups that have done 67 00:02:10,320 --> 00:02:12,590 this, things like looking at a 68 00:02:12,730 --> 00:02:14,540 group of people's social networks, 69 00:02:15,070 --> 00:02:16,390 so things like Facebook, Google+ 70 00:02:16,710 --> 00:02:18,260 or maybe information about who 71 00:02:18,430 --> 00:02:19,710 are the people that you 72 00:02:20,030 --> 00:02:21,110 email the most frequently and who 73 00:02:21,230 --> 00:02:22,170 are the people that they email 74 00:02:22,310 --> 00:02:23,600 the most frequently, and using that 75 00:02:23,750 --> 00:02:25,400 to find coherent groups of people. 76 00:02:26,500 --> 00:02:27,600 So, this would be another maybe 77 00:02:28,180 --> 00:02:28,850 clustering application where, you know, you'd want 78 00:02:29,080 --> 00:02:32,200 to find who are the coherent groups of friends in a social network. 79 00:02:33,140 --> 00:02:33,990 Here's something that one of my 80 00:02:34,140 --> 00:02:35,170 friends actually worked on, which is 81 00:02:35,920 --> 00:02:37,200 to use clustering to organize compute 82 00:02:37,670 --> 00:02:39,220 clusters or to organize data 83 00:02:39,440 --> 00:02:40,600 centers better because, if you 84 00:02:40,800 --> 00:02:42,450 know which computers in the 85 00:02:42,520 --> 00:02:44,990 data center tend to work together,
86 00:02:45,400 --> 00:02:46,270 you can use that to reorganize 87 00:02:46,950 --> 00:02:48,390 your resources and how you 88 00:02:48,570 --> 00:02:50,120 lay out the network and 89 00:02:50,260 --> 00:02:52,040 how you design your data center and communications. 90 00:02:53,140 --> 00:02:54,540 And lastly, something that, actually 91 00:02:54,850 --> 00:02:55,910 another thing I worked on, is using 92 00:02:56,130 --> 00:02:57,810 clustering algorithms to understand 93 00:02:58,400 --> 00:03:00,030 galaxy formation and using 94 00:03:00,280 --> 00:03:02,260 that to understand how, to 95 00:03:02,600 --> 00:03:03,860 understand astronomical data. 96 00:03:06,550 --> 00:03:08,580 So, that's clustering, which 97 00:03:08,890 --> 00:03:10,450 is our first example of 98 00:03:10,530 --> 00:03:12,650 an unsupervised learning algorithm. 99 00:03:13,090 --> 00:03:14,200 In the next video, we'll start to 100 00:03:14,370 --> 00:03:16,250 talk about a specific clustering algorithm.