[MUSIC] In the regression module, we talked about the relationship between error, or accuracy, and the complexity of the model. Let's talk a little bit about that relationship in terms of the amount of data you have to learn from, and explore the question of how much data we need to learn. This is a really difficult and complex question in machine learning. Of course, the more data you have, the better, as long as the quality of the data is good. Lots of bad data is much worse than having far fewer, but really clean, high-quality data points.

Now, there are some theoretical techniques to analyze how much data you need. Many of those help you understand the overall trends, but they tend to be too loose to use in practice. In practice, there are some empirical techniques to really try to understand how much error we're making and what that kind of error looks like. In the follow-up courses, we're going to explore those techniques much further, but let me give you a little bit of guidance and a little insight into what they can do on the classification side.

Now, an important representation of this relationship between data and quality is what's called the learning curve. A learning curve relates the amount of data that we have for training with the error that we're making, and here we're talking about test error. If you have very little data for training, then your test error is going to be high. But if you have a lot of data for training, your test error is going to be low. And the curve is going to get better and better as you get more and more data.

Whoops, didn't go through that point, so I'm going to just erase it. Now here we go. This is an example learning curve, where the quality is getting better as we add more data. Now you may ask, is there a limit? Is the quality just going to get better and better forever as you add more data? We know that the test error is going to decrease as we add more data. However, there is some gap here, and the question is whether that gap can go to zero. The answer, in general, is no. This gap is called the bias.
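To make the learning curve concrete, here is a minimal sketch, not from the lecture, that plots test error against the number of training examples; the synthetic dataset and the logistic regression model are illustrative assumptions.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy classification data (an assumption for illustration).
X, y = make_classification(n_samples=20000, n_features=20, flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sizes = [50, 100, 250, 500, 1000, 2500, 5000, 10000]
errors = []
for n in sizes:
    # Train on the first n examples, then measure error on the held-out test set.
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    errors.append(1.0 - model.score(X_test, y_test))  # test error = 1 - accuracy

# Error drops as training data grows, but flattens out above zero:
# that remaining gap is the bias the lecture describes.
plt.plot(sizes, errors, marker="o")
plt.xscale("log")
plt.xlabel("# training examples")
plt.ylabel("test error")
plt.show()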
So let's discuss a little bit what this bias, or this gap, is. Intuitively, it says that even with infinite data, the test error will not go to zero. So let's understand why a little bit.

More complex models tend to have less bias. If you look at a sentiment analysis classifier that we may be building, and you just use single words like awesome, good, great, terrible, awful, it can do okay. Maybe it does really well, maybe it just does okay. But even if you have infinite data, even with all the data in the world, you're never going to get this sentence right: "The sushi was not good." That's because you're not looking at pairs of words; you're just looking at the words "good" and "not" individually.

And so more complex models, for example ones that deal with combinations of words, like what's simply called the bigram model, where you look at pairs of adjacent words like "not good", require more parameters, because there are more possibilities. They can do better. They may have a parameter for "good", say 1.5, but for "not good", say -2.1, and actually get that sentence, "The sushi was not good", right. So they have less bias. They can represent sentences that couldn't be represented with single words, so they're potentially more accurate. But they need more data to learn, because there are more parameters. There's not just a parameter for "good"; there's now a parameter for "not good", and for all possible combinations of words. And the more parameters your model has, in general, the more data you need to learn.

So let's go back to our example. We talked about the effect of the amount of training data on the test error. Let's say that I'm building a classifier using single words. The question is, how does that compare to a classifier based on pairs of words? Now, for a classifier based on bigrams, when you have less data, it's not going to do as well, because it has more parameters to fit. But when you have more data, it's going to do better, because it's going to be able to capture settings like "The sushi was not good." And so the behavior you're going to get is something like this. At some point, there's a crossover, where it starts doing better than the classifier with single words.
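Here is a tiny sketch of that "not good" point, using the illustrative weights from the lecture (1.5 for "good", -2.1 for "not good"); treating every other word's weight as zero is my assumption.

def score(sentence, weights):
    """Sum feature weights over the unigrams and bigrams in the sentence."""
    tokens = sentence.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return sum(weights.get(g, 0.0) for g in grams)

unigram_model = {"good": 1.5}                    # single-word features only
bigram_model = {"good": 1.5, "not good": -2.1}   # adds a pair-of-words feature

s = "the sushi was not good"
print(score(s, unigram_model))  # 1.5  -> positive prediction (wrong)
print(score(s, bigram_model))   # -0.6 -> negative prediction (right)

No amount of data fixes the unigram model here, because no weight on "good" alone can flip its sign when "not" precedes it; the bigram feature is what removes that bias.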
But notice that the bigram model still has some bias here. Although the bias is smaller, it still has some bias.

[MUSIC]
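To see that crossover behavior in code, here is a minimal sketch, again with a synthetic sentiment dataset and model choices of my own rather than the lecture's experiment. The unigram model plateaus well above zero error, its bias: no linear weighting of single words can handle "not good" and "not awful" at the same time. The richer bigram model can, but it has many more parameters to fit, so with very little data it may do worse before crossing over.

import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = random.Random(0)
FILLER = [f"word{i}" for i in range(200)]

def review():
    """Synthetic review: 'not' flips the sentiment, which unigrams can't capture."""
    word, label = rng.choice([("good", 1), ("great", 1), ("awful", 0), ("terrible", 0)])
    negate = rng.random() < 0.4
    text = " ".join(rng.sample(FILLER, 5)) + (" not " if negate else " ") + word
    return text, (1 - label) if negate else label

data = [review() for _ in range(11000)]
texts, labels = [t for t, _ in data], [l for _, l in data]
test_texts, test_labels = texts[10000:], labels[10000:]

for ngram_max, name in [(1, "unigram model"), (2, "bigram model")]:
    for n in [100, 1000, 10000]:
        clf = make_pipeline(CountVectorizer(ngram_range=(1, ngram_max)),
                            LogisticRegression(max_iter=2000))
        clf.fit(texts[:n], labels[:n])
        err = 1.0 - clf.score(test_texts, test_labels)
        print(f"{name}, n={n}: test error = {err:.3f}")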