1 00:00:00,000 --> 00:00:04,555 [MUSIC] 2 00:00:04,555 --> 00:00:05,577 So how are we gonna do this? 3 00:00:05,577 --> 00:00:11,452 How are we gonna use all of our data as our validation set? 4 00:00:11,452 --> 00:00:14,690 We're gonna use something called K-fold cross validation. 5 00:00:14,690 --> 00:00:17,760 Where the first step, it's just a preprocessing step. 6 00:00:17,760 --> 00:00:22,040 Where we're gonna take our data, and divide it into K different blocks. 7 00:00:22,040 --> 00:00:28,800 And, we have N total observations, so, every block of data is gonna have N over 8 00:00:28,800 --> 00:00:33,420 K observations, and these observations are randomly assigned to each block. 9 00:00:34,510 --> 00:00:40,641 Okay, so, this is really key that we're taking our tabulated data. 10 00:00:40,641 --> 00:00:45,141 And in this image, even though it looks like it just might be parceling out 11 00:00:45,141 --> 00:00:50,080 a table of data, the data in each one of these blocks is randomly assigned. 12 00:00:50,080 --> 00:00:53,630 And for all steps of the algorithm that I'm gonna describe now, 13 00:00:53,630 --> 00:00:56,340 we're gonna use exactly the same data split. 14 00:00:56,340 --> 00:00:59,220 So, exactly the same assignments of 15 00:00:59,220 --> 00:01:01,740 observations to each one of these different blocks. 16 00:01:03,070 --> 00:01:06,290 Okay, then for each one of these K different blocks, 17 00:01:06,290 --> 00:01:10,830 we're gonna cycle through treating each block as the validation set. 18 00:01:10,830 --> 00:01:14,820 And using all the remaining observations to fit the model for 19 00:01:14,820 --> 00:01:15,810 every value of lambda. 20 00:01:15,810 --> 00:01:20,919 So in particular, we're gonna start by saying for a specific value of lambda and 21 00:01:20,919 --> 00:01:24,298 we're gonna do a procedure for each of the K blocks and 22 00:01:24,298 --> 00:01:28,068 at the end we're gonna cycle through all values of lambda. 
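The preprocessing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `make_folds` and the `seed` parameter are assumptions introduced here. It shuffles the N observation indices once and deals them into K blocks of roughly N/K observations each, so the same assignment can be reused for every later step.

```python
import random

def make_folds(n, k, seed=0):
    """Randomly assign observation indices 0..n-1 to k blocks of near-equal size."""
    indices = list(range(n))
    # Shuffle once; the fixed seed (an assumption of this sketch) makes the
    # assignment reproducible, so every lambda sees exactly the same split.
    random.Random(seed).shuffle(indices)
    # Block j takes every k-th shuffled index, so block sizes differ by at most one.
    return [indices[j::k] for j in range(k)]

# N = 10 observations, K = 5 blocks -> N / K = 2 observations per block.
folds = make_folds(n=10, k=5)
```

Calling `make_folds` again with the same arguments returns the same blocks, which is what lets the split stay fixed across all values of lambda.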
23 00:01:28,068 --> 00:01:32,423 So for right now, assume that we're looking at a specific lambda value out of 24 00:01:32,423 --> 00:01:35,540 a set of possible values we might look at. 25 00:01:35,540 --> 00:01:39,080 And now we're gonna cycle through each one of our blocks where 26 00:01:39,080 --> 00:01:44,532 at the first iteration we're gonna fit our model using all the remaining data. 27 00:01:44,532 --> 00:01:48,910 That's gonna produce something that I'm calling w hat lambda, so 28 00:01:48,910 --> 00:01:50,720 indexed by this lambda that we're looking at. 29 00:01:52,220 --> 00:01:56,480 So we're considering the first block as our validation set. 30 00:01:56,480 --> 00:01:58,334 Then we're gonna take that fitted model and 31 00:01:58,334 --> 00:02:01,102 we're gonna assess its performance on this validation set. 32 00:02:01,102 --> 00:02:07,470 That's gonna result in some error which I'm calling error sub one. 33 00:02:07,470 --> 00:02:12,580 Meaning the error on the first block of data for this value of lambda. 34 00:02:12,580 --> 00:02:17,240 Okay, so I'm gonna keep track of the error for the value of lambda for 35 00:02:17,240 --> 00:02:20,790 each block, and then I'm gonna do this for every value of lambda. 36 00:02:20,790 --> 00:02:23,877 Okay, so I'm gonna move on to the next block, 37 00:02:23,877 --> 00:02:28,783 treat that as my validation set, fit the model on all the remaining data, 38 00:02:28,783 --> 00:02:33,317 compute the error of that fitted model on that second block of data. 39 00:02:33,317 --> 00:02:37,364 Do this on a third block, fit the model on all the remaining data, assess 40 00:02:37,364 --> 00:02:42,990 the performance on the third block, and cycle through each of my blocks like this. 41 00:02:42,990 --> 00:02:45,700 And at the end, I've tabulated my error 42 00:02:45,700 --> 00:02:49,230 across each of these K different blocks for this value of lambda.
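The per-block loop for one fixed lambda might look like the sketch below. Several things here are assumptions made only to keep the example small: the model is a one-feature ridge regression with no intercept (so w hat lambda has a closed form), the error on a block is its mean squared error, and the names `fit_ridge_1d` and `block_errors` and the toy data are hypothetical.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def block_errors(xs, ys, folds, lam):
    """Return [error_1, ..., error_K] for one fixed value of lambda."""
    errors = []
    for block in folds:
        held_out = set(block)
        # Fit on all the remaining data, i.e. everything outside this block.
        train = [i for i in range(len(xs)) if i not in held_out]
        w_hat = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        # Assess the fitted model on the held-out block (mean squared error).
        errors.append(sum((ys[i] - w_hat * xs[i]) ** 2 for i in block) / len(block))
    return errors

# Toy data (assumed for illustration): y is roughly 3 * x plus noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(20)]
ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]

# Random assignment of the 20 observations to K = 4 blocks.
indices = list(range(20))
rng.shuffle(indices)
folds = [indices[j::4] for j in range(4)]

errors = block_errors(xs, ys, folds, lam=0.1)  # one error per block
```

The list `errors` is the tabulated error across the K blocks for this one value of lambda; the same `folds` would be passed in again for every other lambda.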
43 00:02:50,350 --> 00:02:55,280 And what I'm gonna do is I'm gonna compute what's called the cross validation error 44 00:02:55,280 --> 00:02:56,350 of lambda, 45 00:02:56,350 --> 00:03:00,940 which is simply an average of the error that I had on each of the K different blocks. 46 00:03:00,940 --> 00:03:06,710 So now I explicitly see how my measure of error, my summary of error for 47 00:03:06,710 --> 00:03:10,800 the specific value of lambda uses all of the data. 48 00:03:10,800 --> 00:03:15,160 It's an average across the validation sets in each of the different blocks. 49 00:03:17,110 --> 00:03:24,190 Then, I'm gonna repeat this procedure for every value that I'm considering of lambda 50 00:03:24,190 --> 00:03:28,550 and I'm gonna choose the lambda that minimizes this cross validation error. 51 00:03:28,550 --> 00:03:33,282 So I had to divide my data into K different blocks in order to run 52 00:03:33,282 --> 00:03:36,475 this K-fold cross validation algorithm. 53 00:03:36,475 --> 00:03:40,070 So a natural question is what value of K should I use? 54 00:03:40,070 --> 00:03:44,860 Well you can show that the best approximation to the generalization error 55 00:03:44,860 --> 00:03:49,850 of the model is given when you take K to be equal to N. 56 00:03:49,850 --> 00:03:55,740 And what that means is that every block has just one observation. 57 00:03:55,740 --> 00:03:58,820 So this is called leave-one-out cross validation. 58 00:04:00,590 --> 00:04:06,080 So although it has the best approximation of what you're trying to estimate, 59 00:04:06,080 --> 00:04:09,510 it tends to be very computationally intensive, 60 00:04:09,510 --> 00:04:13,570 because what do we have to do for every value of lambda? 61 00:04:13,570 --> 00:04:17,520 We have to do N fits of our model. 62 00:04:17,520 --> 00:04:19,590 And if N is even reasonably large, and 63 00:04:19,590 --> 00:04:24,760 if it's complicated to fit our model each time, that can be quite intensive.
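Averaging the block errors and picking the minimizing lambda could be sketched as follows. As before, this is only an illustration under stated assumptions: a one-feature ridge model with no intercept, mean squared error as the per-block error, and hypothetical names (`cv_error`, `choose_lambda`), toy data, and candidate lambda grid. Setting K = N, with one observation per block, gives leave-one-out cross validation, at the cost of N fits per lambda instead of K.

```python
import random

def cv_error(xs, ys, folds, lam):
    """Cross validation error of lambda: the average error over the K blocks."""
    total = 0.0
    for block in folds:
        held_out = set(block)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Closed-form one-feature ridge fit on the remaining data.
        sxy = sum(xs[i] * ys[i] for i in train)
        sxx = sum(xs[i] ** 2 for i in train)
        w_hat = sxy / (sxx + lam)
        # Mean squared error on the held-out block.
        total += sum((ys[i] - w_hat * xs[i]) ** 2 for i in block) / len(block)
    return total / len(folds)

def choose_lambda(xs, ys, folds, lambdas):
    """Pick the candidate lambda with the smallest cross validation error."""
    # The same folds are reused for every candidate lambda.
    return min(lambdas, key=lambda lam: cv_error(xs, ys, folds, lam))

# Toy data (assumed for illustration): y is roughly 2 * x plus noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.2) for x in xs]

indices = list(range(30))
rng.shuffle(indices)
five_folds = [indices[j::5] for j in range(5)]  # K = 5: 5 fits per lambda
loo_folds = [[i] for i in indices]              # K = N: 30 fits per lambda

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]         # hypothetical candidate grid
best_lam = choose_lambda(xs, ys, five_folds, lambdas)
best_lam_loo = choose_lambda(xs, ys, loo_folds, lambdas)
```

With this tiny closed-form model both variants run instantly; the difference in cost only matters when each fit is expensive or N is large.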
64 00:04:26,730 --> 00:04:30,748 So, instead what people tend to do is use K = 5 or 10, 65 00:04:30,748 --> 00:04:34,953 this is called 5-fold or 10-fold cross validation. 66 00:04:36,640 --> 00:04:40,340 Okay, so this summarizes our cross validation algorithm, which is a really, 67 00:04:40,340 --> 00:04:43,550 really important algorithm for choosing tuning parameters. 68 00:04:45,180 --> 00:04:49,160 And even though we discussed this option of forming a training, validation, and 69 00:04:49,160 --> 00:04:53,000 test set, typically you're in a situation where you don't have enough data 70 00:04:54,010 --> 00:04:56,330 to form each one of those. 71 00:04:56,330 --> 00:04:59,340 Or at least you don't know if you have enough data to have an accurate 72 00:04:59,340 --> 00:05:03,920 approximation of generalization error as well as assessing the difference between 73 00:05:03,920 --> 00:05:08,550 different models, so typically what people do is cross validation. 74 00:05:08,550 --> 00:05:13,265 They hold out some test set and then they do either leave-one-out, 5-fold, or 75 00:05:13,265 --> 00:05:17,260 10-fold cross validation to choose their tuning parameter lambda. 76 00:05:17,260 --> 00:05:21,731 And this is a really critical step in the machine learning workflow is 77 00:05:21,731 --> 00:05:26,586 choosing these tuning parameters in order to select a model and use that for 78 00:05:26,586 --> 00:05:30,690 the predictions or various tasks that you're interested in. 79 00:05:30,690 --> 00:05:35,019 [MUSIC]