[MUSIC]

But instead, let's think about the update to just a single feature, think about what that looks like, and give some intuition for why the update has this form. So, let's go through and derive the update for just a single feature. Of course, we could work through the matrix notation and figure out what the update is for the j-th row, but it's a little simpler to go back to this form of our residual sum of squares and derive it directly.

I'm going to rewrite this more explicitly, so that when we start taking derivatives it's easier to see what's going on. We have the sum over i = 1 to N of y_i minus, and now let's write out what this vector inner product is. Well, it's simply our fit: w_0 times h_0(x_i), minus w_1 times h_1(x_i), and so on, all the way to our last feature, w_D times h_D(x_i), all squared.

Okay, so remember, when we take a gradient, what is it? It's just a vector of partials with respect to w_0, then w_1, all the way up to w_D. So let's look at one element, the partial of this residual sum of squares with respect to w_j, and that will give us the update for the j-th entry. To take that partial derivative, we do just what we did in the simple linear regression model: we keep the sum on the outside and take the derivative of the inner function with respect to w_j. The 2 comes down, and the inner function gets repeated, so y_i minus w_0 h_0(x_i) minus w_1 h_1(x_i), dot dot dot, minus w_D h_D(x_i), only now it's no longer squared, it's to the first power. Then we multiply by the coefficient associated with w_j inside that function. What is that coefficient? It's minus h_j(x_i), the negative of our j-th feature.

Okay, so let's write this more compactly: I'll bring the minus 2 out to the outside, and inside the sum we have h_j(x_i), that's this term, multiplied by this function, which I'll write back in vector notation as y_i minus h(x_i) transpose w, and in this case there's no square.
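Putting the pieces together, the residual sum of squares and the partial derivative we just derived look like this in standard notation (this is only a restatement of the steps above, nothing new):

```latex
\mathrm{RSS}(\mathbf{w})
  = \sum_{i=1}^{N} \Big( y_i - \sum_{d=0}^{D} w_d\, h_d(x_i) \Big)^{2}
\qquad\Longrightarrow\qquad
\frac{\partial\,\mathrm{RSS}}{\partial w_j}
  = -2 \sum_{i=1}^{N} h_j(x_i)\,\big( y_i - \mathbf{h}(x_i)^{\mathsf{T}} \mathbf{w} \big)
```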
Okay, so now I'm going to plug this into the update to the j-th feature weight, where I take the previous weight of that feature and subtract off a step size times the partial, which is minus 2 times the sum over i = 1 to N of h_j(x_i) times y_i minus h(x_i) transpose w, this being a vector multiplied by w, also a vector, and specifically we're looking at w from the t-th iteration.

Okay, so again, we can add a little bit of interpretation here. This part, h(x_i) transpose w, takes all the features of my i-th observation and multiplies them by the entire w vector. So this is my predicted value of my i-th observation using w at iteration t.
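To make that per-feature update concrete, here is a minimal NumPy sketch of it. The function name update_single_weight and the arguments H, y, w, j, and step_size are illustrative choices of mine, not names from the course materials:

```python
import numpy as np

def update_single_weight(H, y, w, j, step_size):
    """One gradient descent step for the j-th weight only.

    H         : (N, D+1) array of features, H[i, d] = h_d(x_i)
    y         : length-N array of observed outputs y_i
    w         : current weight vector w^(t)
    j         : index of the weight being updated
    step_size : the step size (learning rate) eta
    """
    predictions = H @ w            # h(x_i)^T w^(t) for every observation i
    residuals = y - predictions    # y_i minus the predicted value
    # partial of RSS w.r.t. w_j: -2 * sum_i h_j(x_i) * (y_i - h(x_i)^T w)
    partial_j = -2.0 * (H[:, j] @ residuals)
    # w_j^(t+1) = w_j^(t) - eta * partial
    return w[j] - step_size * partial_j
```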
So, let's rewrite this. Here I've just written exactly what I had on the previous slide, and now let's think about interpreting this update. In particular, let's assume that w_j corresponds to the coefficient associated with the number of baths, so the j-th feature is the number of bathrooms. What happens if, in general, I'm underestimating the impact of the number of baths on my predicted value of the house? What that means is, if I look along this bathrooms direction and I look at the slope of this hyperplane, I'm saying it's not steep enough: increasing bathrooms actually has more impact on the value of the house than my currently estimated model thinks it does.

Okay, so what's going to happen? If I'm underestimating the impact of the number of bathrooms, that corresponds to w hat j at iteration t being too small. Then my observations will in general be larger than my predicted observations, so this residual term, on average, will be positive. And we're taking that average weighted by the number of bathrooms, where h_j(x_i) is the number of bathrooms for house i, so the weighted sum will be positive too. What's the impact of that? This whole term that we're adding to w_j will be positive, so we're going to increase w hat j: w_j at iteration t plus 1 will be greater than w_j at iteration t. We're increasing the value.

And let's talk very quickly about this weighting by the number of baths. Why do we weight by the number of baths? Well, of course, the observations that have more of the feature, more bathrooms, should weigh more heavily in our assessment of the fit. So that's why, whenever we look at the residual, we weight it by the value of the feature we're considering.

Okay, so this gives us a little bit of intuition behind this gradient descent algorithm, particularly looking feature by feature at what the algorithm looks like.

[MUSIC]
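As a quick numeric illustration of that intuition, here is a tiny check using the hypothetical update_single_weight sketch from above; the house data is made up for the example:

```python
import numpy as np  # assumes update_single_weight from the sketch above is defined

# Made-up data: column 0 is the constant feature, column 1 is "number of bathrooms".
H = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
true_w = np.array([1.0, 5.0])
y = H @ true_w                 # noiseless observations from the "true" model
w = np.array([1.0, 2.0])       # current estimate: the bathroom weight is too small

new_wj = update_single_weight(H, y, w, j=1, step_size=0.01)
print(w[1], "->", new_wj)      # 2.0 -> 2.84: the bathroom coefficient increases
```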