Congratulations on getting through this really challenging course. We covered a lot of ground, discussing many ideas relating to regression, as well as more general ideas that are foundational to machine learning. So let's recap where we've been, and then look ahead to what's next in the specialization.

Okay, I know, I love to start slides with "okay," but that's how it is. So, okay, what have we learned in this regression course?

Well, in the first module we talked about something called simple regression. Let's just remember what simple regression was. This is where we assume that we have just a single input, and we fit a very simple line as the relationship between that input and the output. Then we discussed the cost of using a given line, and for this we defined something called the residual sum of squares. Once we've defined the residual sum of squares, it lets us assess how well different lines fit, and with that we can choose the line that best fits our specific training data set.

In particular, to find the best-fitting line, we talked about our objective being an objective over two variables, which represent our slope and our intercept. The specific algorithm we went through in this module was gradient descent, which we discussed is an iterative algorithm that moves in the direction of the negative gradient and, for convex functions, converges to the optimum.
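To make that recap concrete, here is a minimal sketch (not from the lecture itself) of gradient descent on the residual sum of squares for the two-parameter, single-input case. The function name, step size, and toy data are all assumptions chosen for illustration.

```python
import numpy as np

def simple_regression_gradient_descent(x, y, step_size=1e-3, tolerance=1e-6, max_iterations=10000):
    """Fit y ~ intercept + slope * x by gradient descent on the residual sum of squares."""
    intercept, slope = 0.0, 0.0
    for _ in range(max_iterations):
        errors = (intercept + slope * x) - y       # prediction errors on the training data
        d_intercept = 2.0 * errors.sum()           # partial derivative of RSS w.r.t. the intercept
        d_slope = 2.0 * (errors * x).sum()         # partial derivative of RSS w.r.t. the slope
        intercept -= step_size * d_intercept       # step in the direction of the negative gradient
        slope -= step_size * d_slope
        if np.sqrt(d_intercept ** 2 + d_slope ** 2) < tolerance:
            break                                  # gradient is small enough: treat as converged
    return intercept, slope

# Toy usage: noisy observations of the line y = 1 + 2x
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 100)
print(simple_regression_gradient_descent(x, y))    # roughly (1.0, 2.0)
```

Because the RSS objective is convex in the slope and intercept, a small enough step size makes this procedure converge to the same optimum a closed-form solution would give.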
We then turned to multiple regression. Multiple regression allowed us to fit more complicated relationships between our single input and our output. We talked about polynomial regression, seasonality, and lots of different features of our single input that we could use in a multiple regression model. But then we talked more generically about incorporating different inputs, like, in our housing application, square feet, number of bathrooms, number of bedrooms, lot size, and year built, as well as features of these inputs. So, generically, we wrote our multiple regression model as y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) + epsilon_i: a weighted collection of features h of our inputs x_i for the ith house, plus an epsilon term, which is the error representing the noise in our observations.

So, for this case of multiple regression, we have a more complicated relationship between a whole collection of features and our output. We talked about how to define the residual sum of squares, and then, in order to derive both a closed-form solution and a gradient descent algorithm, we talked about taking the gradient of that residual sum of squares.

It's pretty cool how, even when dealing with a large collection of features, we can derive a closed-form solution: given all of our data, we just have to compute w_hat = (H^T H)^{-1} H^T y, and that gives us our estimated set of coefficients for all of our features. But we also talked about the fact that this can be computationally intensive: the cost of the operations grows cubically in the number of features, and in addition the matrix H^T H that we have to invert might not be invertible. This is where ridge regression can be so useful. In cases where we might not have an invertible matrix, ridge regression makes a very simple modification to this closed-form solution that leads to a matrix that is always invertible, and thus lets us handle cases where we have lots and lots of features, even more features than we have observations.

Instead of this closed-form solution, which we mentioned can be very computationally intensive, we also talked about a gradient descent algorithm, an iterative procedure for solving this optimization objective. We can start at any point in our space, compute the gradient, and take steps. Remember, there was a step size we had to set, and it determined a lot of the properties of how quickly we converged to the optimum. But we showed that, in well-behaved cases, we converge to this optimal solution. And this idea of gradient descent, this optimization algorithm, is a very general-purpose tool that we're going to see again later in this specialization.
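To make the closed-form discussion concrete, here is a minimal sketch (again, not the course's own code) of the normal-equations solution and its ridge-regularized variant. The function name and toy data are assumptions, and for simplicity this sketch penalizes every coefficient, including the constant feature.

```python
import numpy as np

def regression_closed_form(H, y, l2_penalty=0.0):
    """Closed-form least-squares coefficients; l2_penalty > 0 gives the ridge variant.

    l2_penalty = 0:  w_hat = (H^T H)^{-1} H^T y
    l2_penalty > 0:  w_hat = (H^T H + lambda * I)^{-1} H^T y, whose matrix is always invertible
    """
    num_features = H.shape[1]
    A = H.T @ H + l2_penalty * np.eye(num_features)
    # Solving the linear system is more stable than forming the inverse explicitly,
    # but the cost still grows cubically in the number of features.
    return np.linalg.solve(A, H.T @ y)

# Toy usage: a constant feature plus two numeric features for 50 observations
rng = np.random.default_rng(1)
H = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = H @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, 50)
print(regression_closed_form(H, y))                    # plain least squares
print(regression_closed_form(H, y, l2_penalty=1.0))    # ridge-regularized coefficients
```

The ridge call illustrates the point made above: adding lambda times the identity to H^T H keeps the system solvable even when there are more features than observations.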