[MUSIC] From here, we're going to explore one of the most fundamental things that happens in practice with data: the issue of missing data. So here I have a dataset, but unlike what we've done so far in this course and specialization, where we assumed that all data is observed, here if you look at the second row, you have a question mark for that person. You don't know if the loan they took was a three-year loan or a five-year loan. So what do we do when we have missing data? We're going to talk about many extremely practical ways to address it, including a modification of decision trees where the tree can be learned to take missing data into account, and make decisions that depend not only on observed values, like whether your credit was excellent, fair, or poor, but also on what to do if your credit is unobserved. Those techniques are going to be extremely useful in practice and widely applicable.

The seventh module is going to be amazing. We're going to look at a question that was asked by Kearns and Valiant in 1988. In fact, Valiant is a Turing Award winner, so this is a fundamental question. The question was: can you combine simple classifiers in a way that gives you the performance of a really complex classifier? And that question, which was purely theoretical, was answered a couple of years later by Schapire in the positive, using something called boosting, which is an amazing algorithm that has had an incredible impact in practice. In fact, if you know what a Kaggle competition is, which is one of those online machine learning competitions, more than half of the winners use boosting in their solution. Boosting is a simple technique that has really changed the world, and we're going to learn the fundamentals of the technique, and you're going to be able to implement it yourself. We're going to talk about one kind of boosting algorithm called AdaBoost, where you take the output of a classifier. For example, this decision tree here might say that a loan is likely to be okay: safe, +1. But you might have another one that says no, it's risky, -1. And you have others that might say +1 or -1, and you have the vote of many classifiers in what's called an ensemble.
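As a quick illustration of that voting idea, here is a minimal sketch of an ensemble vote in Python. It assumes each simple classifier returns +1 (safe) or -1 (risky) and has already been assigned a weight; in AdaBoost those weights come from each classifier's weighted training error, which the course covers later. The stumps, weights, and loan record below are made up for illustration, not from the course.

```python
def ensemble_predict(weak_classifiers, weights, x):
    """Combine the +1/-1 votes of many simple classifiers into one prediction."""
    score = sum(w * clf(x) for clf, w in zip(weak_classifiers, weights))
    return +1 if score >= 0 else -1

# Example: three hypothetical "decision stumps" voting on a loan application.
loan = {"credit": "fair", "term": "3 years", "income": "high"}
stumps = [
    lambda x: +1 if x["credit"] == "excellent" else -1,
    lambda x: +1 if x["term"] == "3 years" else -1,
    lambda x: +1 if x["income"] == "high" else -1,
]
weights = [0.9, 0.8, 0.6]  # stronger classifiers get a larger say in the vote
print(ensemble_predict(stumps, weights, loan))  # +1 -> the weighted vote says "safe"
```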
So boosting is about building these ensembles, where many classifiers vote, and what we want to learn is how to combine those votes to get the best possible prediction. By learning those combinations with a boosting algorithm, we're going to be able to start from very simple classifiers but come up with very complex decision boundaries. And this is exactly the technique that wins most of those Kaggle competitions.

In the eighth module we're going to step back and look at fundamental concepts in machine learning. One is called precision and recall. Let me give you an example. Say I own a restaurant and I want to increase the number of guests that I have, the number of customers, by 30%. How do I do that? Well, I'm going to start a marketing campaign, but I don't want to be just like everybody else; I want it to be an authentic, nice marketing campaign. So I want to use the reviews of my restaurant to find great things to say, and every time somebody enters a review on our website, I get some sentence shown on my website saying how great my restaurant is. So given the reviews, I want to predict which sentences are most positive, like "easily the best sushi in Seattle," and also which people we should showcase because they say great things about the restaurant. When you have a setting like this, accuracy is not a good metric. What you really care about is what's called precision and recall. Precision asks: if I pick out a few sentences from the reviews and show them on my website, how likely is it that I'm going to show a really negative sentence? Because if I show a bad sentence like "the sushi was terrible," that's really bad for my website. So precision makes sure that I show only positive sentences. Recall, on the other hand, is about finding all the great positive things that people are saying. So if a classifier has good precision and recall, it means that I find all the great sentences, and I only show great sentences about my restaurant. We're going to talk about that in quite a lot of detail, because precision and recall are what you will most likely use if you build a classifier in practice. It's what basically every company that builds classifiers uses as its core metric.
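To make those two quantities concrete, here is a minimal sketch of computing precision and recall for the restaurant-review example, assuming labels are +1 (positive sentence) and -1 (negative sentence). The tiny label arrays below are made up purely for illustration.

```python
def precision_recall(true_labels, predicted_labels):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == +1 and t == +1)  # shown and truly positive
    fp = sum(1 for t, p in pairs if p == +1 and t == -1)  # shown but actually negative
    fn = sum(1 for t, p in pairs if p == -1 and t == +1)  # positive sentence we missed
    precision = tp / (tp + fp) if tp + fp else 0.0  # of the sentences I show, how many are truly positive?
    recall = tp / (tp + fn) if tp + fn else 0.0     # of all the positive sentences, how many did I find?
    return precision, recall

true_labels      = [+1, +1, -1, +1, -1]
predicted_labels = [+1, -1, -1, +1, +1]
print(precision_recall(true_labels, predicted_labels))  # (0.666..., 0.666...)
```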
And in the final module we're going to address the issue of scalability. How do we scale to really massive datasets? As you can see, the number of web pages on the web is growing tremendously; there are about 4.8 billion pages today. There are about 500 million tweets per day. Hey, follow me on Twitter, by the way. I send about one a day, maybe less. And if you think about YouTube, there are 5 billion video views on YouTube every day. So there's tons of data out there, and gradient-type methods don't tend to scale very well when you have massive amounts of data. So what we're going to show is a technique called stochastic gradient, which converges much faster than gradient to the solution. It's just a very small modification to gradient that gives you amazing performance; there's a small sketch of the difference at the end of this overview. In this simple example from sentiment analysis, we see over 100 times faster performance on the same dataset. However, stochastic gradient is an extremely finicky technique to get to work right. There are many practical problems that you need to address to make it work. So we're going to talk about the technique, explain why it works, and also explain those practical issues that you must address in order to get it to work well.

So as you can see, it's going to be an action-packed course that's going to cover a wide range of topics in machine learning. [MUSIC]
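Coming back to the stochastic gradient discussion above, here is a minimal sketch of the difference between a full gradient step and a stochastic gradient step for logistic regression. This is an illustrative toy under assumed conventions (labels in {0, 1}, a fixed learning rate, NumPy arrays), not the course's implementation, and the generated data is synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gradient_step(w, X, y, lr=0.1):
    # Uses the entire dataset for one update: accurate but expensive at scale.
    grad = X.T @ (y - sigmoid(X @ w)) / len(y)
    return w + lr * grad

def stochastic_gradient_step(w, x_i, y_i, lr=0.1):
    # Uses a single example: a noisy estimate of the gradient, but far cheaper,
    # so many more updates fit into the same amount of computation.
    return w + lr * (y_i - sigmoid(x_i @ w)) * x_i

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w = np.zeros(3)
for i in rng.permutation(len(y)):  # one stochastic pass over shuffled data
    w = stochastic_gradient_step(w, X[i], y[i], lr=0.05)
```

The practical issues the module mentions (shuffling the data, choosing and decaying the step size, averaging the parameters) are exactly the knobs this sketch glosses over.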