1
00:00:00,000 --> 00:00:04,555
[MUSIC]

2
00:00:04,555 --> 00:00:05,577
So how are we gonna do this?

3
00:00:05,577 --> 00:00:11,452
How are we gonna use all of our
data as our validation set?

4
00:00:11,452 --> 00:00:14,690
We're gonna use something
called K-fold cross validation.

5
00:00:14,690 --> 00:00:17,760
Where the first step,
it's just a preprocessing step.

6
00:00:17,760 --> 00:00:22,040
Where we're gonna take our data, and
divide it into K different blocks.

7
00:00:22,040 --> 00:00:28,800
And, we have N total observations, so,
every block of data is gonna have N over

8
00:00:28,800 --> 00:00:33,420
K observations, and these observations
are randomly assigned to each block.

9
00:00:34,510 --> 00:00:40,641
Okay, so, this is really key that
we're taking our tabulated data.

10
00:00:40,641 --> 00:00:45,141
And in this image, even though it looks
like it just might be parceling out

11
00:00:45,141 --> 00:00:50,080
a table of data, the data in each one
of these blocks is randomly assigned.

12
00:00:50,080 --> 00:00:53,630
And for all steps of the algorithm
that I'm gonna describe now,

13
00:00:53,630 --> 00:00:56,340
we're gonna use exactly
the same data split.

14
00:00:56,340 --> 00:00:59,220
So, exactly the same assignments of

15
00:00:59,220 --> 00:01:01,740
observations to each one
of these different blocks.

16
00:01:03,070 --> 00:01:06,290
Okay, then for
each one of these K different blocks,

17
00:01:06,290 --> 00:01:10,830
we're gonna cycle through treating
each block as the validation set.

18
00:01:10,830 --> 00:01:14,820
And using all the remaining
observations to fit the model for

19
00:01:14,820 --> 00:01:15,810
every value of lambda.

20
00:01:15,810 --> 00:01:20,919
So in particular, we're gonna start by
saying for a specific value of lambda and

21
00:01:20,919 --> 00:01:24,298
we're gonna do a procedure for
each of the K blocks and

22
00:01:24,298 --> 00:01:28,068
at the end we're gonna cycle
through all values of lambda.

23
00:01:28,068 --> 00:01:32,423
So for right now, assume that we're
looking at a specific lambda value out of

24
00:01:32,423 --> 00:01:35,540
a set of possible values we might look at.

25
00:01:35,540 --> 00:01:39,080
And now we're gonna cycle through
each one of our blocks where

26
00:01:39,080 --> 00:01:44,532
at the first iteration we're gonna fit
our model using all the remaining data.

27
00:01:44,532 --> 00:01:48,910
That's gonna produce something
that I'm calling w hat lambda, so

28
00:01:48,910 --> 00:01:50,720
indexed by this lambda
that we're looking at.

29
00:01:52,220 --> 00:01:56,480
So we're considering the first
block as our validation set.

30
00:01:56,480 --> 00:01:58,334
Then we're gonna take
that fitted model and

31
00:01:58,334 --> 00:02:01,102
we're gonna assess its performance
on this validation set.

32
00:02:01,102 --> 00:02:07,470
That's gonna result in some error
which I'm calling error sub one.

33
00:02:07,470 --> 00:02:12,580
Meaning the error on the first block
of data for this value of lambda.

34
00:02:12,580 --> 00:02:17,240
Okay, so I'm gonna keep track
of the error for the value of lambda for

35
00:02:17,240 --> 00:02:20,790
each block, and then I'm gonna do this for
every value of lambda.

36
00:02:20,790 --> 00:02:23,877
Okay, so
I'm gonna move on to the next block,

37
00:02:23,877 --> 00:02:28,783
treat that as my validation set,
fit the model on all the remaining data,

38
00:02:28,783 --> 00:02:33,317
compute the error of that fitted
model on that second block of data.

39
00:02:33,317 --> 00:02:37,364
Do this on a third block,
fit the model on all the remaining data, assess

40
00:02:37,364 --> 00:02:42,990
the performance on the third block, and
cycle through each of my blocks like this.

41
00:02:42,990 --> 00:02:45,700
And at the end, I've tabulated my error

42
00:02:45,700 --> 00:02:49,230
across each of these K different
blocks for this value of lambda.

43
00:02:50,350 --> 00:02:55,280
And what I'm gonna do is I'm gonna compute
what's called the cross validation error

44
00:02:55,280 --> 00:02:56,350
of lambda,

45
00:02:56,350 --> 00:03:00,940
which is simply an average of the error that
I had on each of the K different blocks.
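In symbols, the average just described can be written as follows. The notation is an assumption for illustration: error_k(lambda) stands for the error on block k when the model is fit on all the other blocks.

```latex
\mathrm{CV}(\lambda) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{error}_k(\lambda)
```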

46
00:03:00,940 --> 00:03:06,710
So now I explicitly see how my
measure of error, my summary of error for

47
00:03:06,710 --> 00:03:10,800
the specific value of lambda
uses all of the data.

48
00:03:10,800 --> 00:03:15,160
It's an average across the validation
sets in each of the different blocks.

49
00:03:17,110 --> 00:03:24,190
Then, I'm gonna repeat this procedure for
every value that I'm considering of lambda

50
00:03:24,190 --> 00:03:28,550
and I'm gonna choose the lambda that
minimizes this cross validation error.
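The whole procedure can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the model (one-feature ridge regression with no intercept), the mean squared error measure, and all function names are assumptions chosen to keep the example self-contained.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Hypothetical model: one-feature ridge regression, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def make_blocks(n, k, seed=0):
    """Randomly assign observation indices 0..n-1 to k blocks, once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(xs, ys, lam, blocks):
    """Average of the per-block validation errors for one value of lambda."""
    errors = []
    for block in blocks:
        held_out = set(block)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit on all the remaining data, then assess on the held-out block.
        w_hat = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        mse = sum((ys[i] - w_hat * xs[i]) ** 2 for i in block) / len(block)
        errors.append(mse)
    return sum(errors) / len(errors)

def choose_lambda(xs, ys, lambdas, k=5, seed=0):
    """Return the lambda with the smallest cross validation error."""
    blocks = make_blocks(len(xs), k, seed)  # same split reused for every lambda
    scores = {lam: cv_error(xs, ys, lam, blocks) for lam in lambdas}
    return min(scores, key=scores.get), scores
```

Note that the blocks are built once and passed to every call of cv_error, which matches the requirement that the same data split is used for all values of lambda. The sketch assumes N is at least K so that no block is empty.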

51
00:03:28,550 --> 00:03:33,282
So I had to divide my data into K
different blocks in order to run

52
00:03:33,282 --> 00:03:36,475
this K-fold cross validation algorithm.

53
00:03:36,475 --> 00:03:40,070
So a natural question is what
value of K should I use?

54
00:03:40,070 --> 00:03:44,860
Well, you can show that the best
approximation to the generalization error

55
00:03:44,860 --> 00:03:49,850
of the model is given when
you take K to be equal to N.

56
00:03:49,850 --> 00:03:55,740
And what that means is that every
block has just one observation.

57
00:03:55,740 --> 00:03:58,820
So this is called leave-one-out
cross validation.

58
00:04:00,590 --> 00:04:06,080
So although it has the best approximation
of what you're trying to estimate,

59
00:04:06,080 --> 00:04:09,510
it tends to be very
computationally intensive,

60
00:04:09,510 --> 00:04:13,570
because what do we have to do for
every value of lambda?

61
00:04:13,570 --> 00:04:17,520
We have to do N fits of our model.

62
00:04:17,520 --> 00:04:19,590
And if N is even reasonably large, and

63
00:04:19,590 --> 00:04:24,760
if it's complicated to fit our model
each time, that can be quite intensive.

64
00:04:26,730 --> 00:04:30,748
So, instead what people tend
to do is use K = 5 or 10,

65
00:04:30,748 --> 00:04:34,953
this is called 5-fold or
10-fold cross validation.
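To make the cost difference concrete, here is a small count of model fits, assuming one fit per block per value of lambda. The dataset size and the number of lambda values are hypothetical.

```python
def total_fits(num_blocks, num_lambdas):
    """Number of model fits: one per block for each value of lambda."""
    return num_blocks * num_lambdas

n = 10_000        # hypothetical number of observations
num_lambdas = 20  # hypothetical number of lambda values considered

print(total_fits(n, num_lambdas))   # leave-one-out (K = N): 200000 fits
print(total_fits(10, num_lambdas))  # 10-fold: 200 fits
print(total_fits(5, num_lambdas))   # 5-fold: 100 fits
```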

66
00:04:36,640 --> 00:04:40,340
Okay, so this summarizes our cross
validation algorithm, which is a really,

67
00:04:40,340 --> 00:04:43,550
really important algorithm for
choosing tuning parameters.

68
00:04:45,180 --> 00:04:49,160
And even though we discussed this option
of forming a training, validation, and

69
00:04:49,160 --> 00:04:53,000
test set, typically you're in a situation
where you don't have enough data

70
00:04:54,010 --> 00:04:56,330
to form each one of those.

71
00:04:56,330 --> 00:04:59,340
Or at least you don't know if you
have enough data to have an accurate

72
00:04:59,340 --> 00:05:03,920
approximation of generalization error as
well as assessing the difference between

73
00:05:03,920 --> 00:05:08,550
different models, so typically what
people do is cross validation.

74
00:05:08,550 --> 00:05:13,265
They hold out some test set and
then they do either leave-one-out, 5-fold, or

75
00:05:13,265 --> 00:05:17,260
10-fold cross validation to choose
their tuning parameter lambda.

76
00:05:17,260 --> 00:05:21,731
And a really critical step
in the machine learning workflow is

77
00:05:21,731 --> 00:05:26,586
choosing these tuning parameters in
order to select a model and use that for

78
00:05:26,586 --> 00:05:30,690
the predictions or
various tasks that you're interested in.

79
00:05:30,690 --> 00:05:35,019
[MUSIC]