Okay. So, let's look at our matrix equation once more. We have X, which is simply the matrix whose rows are the sets of features for the data points, and we have the value y_i for each data point, collected in this m-by-1 column vector y. And we want to try to find f such that X times f is almost y. I'm going to use bold y and bold f to indicate that they're vectors. A small mistake: I've used bold y here where it actually should be unbold y; that was a typo, I apologize for that.

Let's go on, though. We want to find the f that approximately satisfies this equation, and we're going to do that by minimizing the difference between x_i^T f and y_i for each of these data points, by taking the sum of the squares; that's what we decided earlier. If we write the sum of squares in matrix form, it turns out to be simply the difference between Xf and y, transposed and multiplied by itself, that is, (Xf - y)^T (Xf - y). So essentially it's the norm, or the sum of squares, of this difference vector, which is exactly what we have here. This quantity needs to be minimized.

Now, to minimize a quantity, if you go back to high school, you need to make its derivative zero. But the derivative with respect to what? The unknown quantity is f, which in this case is a vector, so it's a bit more complicated than high school: you want to take the derivative with respect to every element of f. If you have studied vector calculus, it's fairly easy to do; we won't go through it all here, but just think of f as a single variable for the time being. The way you take the derivative is to expand this expression out. You get one term f^T X^T X f, which has two f's in it, so when you take its derivative you get 2 X^T X f. Then you get two terms in which there is a y^T X f or an f^T X^T y, and from each of those you get an X^T y. So if you set the derivative to zero, you get twice of this minus twice of that equal to zero, which gives what are called the normal equations: X^T X f should be equal to X^T y.
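As an aside (not spelled out in the lecture), the skipped vector-calculus step can be written out briefly, using the same X, f, and y as above:

    \min_{f}\ \|Xf - y\|^{2}
        = (Xf - y)^{\mathsf T}(Xf - y)
        = f^{\mathsf T} X^{\mathsf T} X f \;-\; 2\, y^{\mathsf T} X f \;+\; y^{\mathsf T} y,

    \nabla_{f}\,\|Xf - y\|^{2}
        = 2\, X^{\mathsf T} X f \;-\; 2\, X^{\mathsf T} y \;=\; 0
        \quad\Longrightarrow\quad
        X^{\mathsf T} X f = X^{\mathsf T} y.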
Now, X^T X, if you look at it in matrix form, is actually no longer a very large matrix. It's only an n-by-n matrix, because when you take this tall matrix and multiply it by its transpose, you get an n-by-n matrix. You have f, which is a vector of length n, and X^T y again becomes a much smaller column vector, also of length n. So in these equations one side is n by n and the other side has length n, and you should be able to solve them exactly, as long as, of course, the matrix is not singular, so that the system actually has a unique solution. Most of the time it does have a unique solution. And so we solve this to get our f.

Once we have our f, our least squares estimate for y is nothing but x^T f, or rather x'^T f, where x' is the vector [x, 1]. So, given an x, the value of y that is most likely by least squares is [x, 1]^T f, and that's our least squares estimate.

Let's do an example. Let's take a simple four-data-point example with only one feature, so x is just a single variable; x could consist of many variables, of course. And y is a set of four values as well. We saw the normal equations; here they will look something like this. If you take X^T X, each row of X is [x_i, 1], and y is just the vector of y values. The f that you get by solving the normal equations turns out to be this one. And when you plot f^T x', which is simply the line 0.11x - 0.26, on a graph, with the (x, y) coordinates plotted as well, you see that the line actually almost fits the data.

This is all there is to linear regression. Of course, if there are many variables, you will have possibly many such plots, one for each feature x, and f will be much longer in terms of the number of coefficients. So we have an approximation of y for every value of x that we might come across. But we also might want to ask: how good is this fit to the data? A common measure of how good the fit is, is called the R-squared value. You take the sum of squares of the errors of our estimates, that is, f(x_i), which is f^T x'_i for each data point, minus the actual value y_i, squared and summed, and divide that by the variation of y about its mean; R-squared is one minus this ratio. So y has a certain mean, which would be somewhere around here, and you want to see how the sum of all these errors compares with the actual variation of the y values about their mean.
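As a concrete illustration (not from the lecture slides), here is a minimal NumPy sketch of a four-point, one-feature fit and the R-squared computation just described. The data values below are invented, so the coefficients and R-squared it prints are not the ones on the slide.

    import numpy as np

    # Invented four-point, one-feature data set (not the values from the slide).
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([0.3, 0.9, 1.4, 2.2])

    # Design matrix: each row is x' = [x_i, 1]; the extra 1 provides the intercept.
    X = np.column_stack([x, np.ones_like(x)])

    # Normal equations: (X^T X) f = X^T y, an n-by-n system (here 2 x 2).
    f = np.linalg.solve(X.T @ X, X.T @ y)
    print("coefficients [slope, intercept]:", f)

    # Least squares prediction for a new x is [x, 1]^T f.
    x_new = 2.5
    print("prediction at x =", x_new, "is", np.array([x_new, 1.0]) @ f)

    # R-squared: one minus (sum of squared errors) / (variation of y about its mean).
    ss_res = np.sum((X @ f - y) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    print("R^2 =", 1.0 - ss_res / ss_tot)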
Coming back to R-squared: it doesn't sound like a great measure, because if you have a steeper slope, you probably have a higher value up here, but with a steeper slope you're also tolerating more error in your estimate. Still, it is a common and easy-to-calculate measure. There are better measures which tell you how good each of these coefficients is, in terms of how confident one should be in each coefficient, but we will not go into that in detail. For this particular example, the data pretty much fits a line, and the R-squared value comes out to be somewhere around 0.95. As you can see, the R-squared value can be close to one, which means there is a good fit, and close to zero if there isn't a good fit.

Let's take another example, one which is a little bit more realistic. It comes from a book called Super Crunchers, written by Ian Ayres in 2007, which talks about the power of reasoning from data; it's a great book if you want to read it. He tells the story of a wine expert called Orley Ashenfelter, who predicted the quality of wine based on the winter rainfall, the average temperature, and the harvest rainfall in a season. So Ashenfelter could predict how good a particular wine would be from the weather in the growing region for that wine, and he could do this well before the wine hit the market. It turns out that his estimates were simply based on linear regression, and they turned out to be extremely successful. They surprised many experienced wine critics, who would have to taste the wine many times and judge it before predicting whether it would be a high-priced wine or not. The kind of estimate he got was a linear least squares fit, with coefficients f0, f1, f2, and f3, and he got a very nice fit.

The important thing to note is that we can have positive correlations between the output variable y and an input variable, such as the positive coefficients here, here, and here, or we can have negative correlations: in this case, the 0.26 is negative and the 0.00386 is negative. So if the harvest rainfall goes higher, the quality goes down, unlike the positive correlation with respect to, say, winter rainfall.

Let's take a look at a few more examples of the kinds of correlation one might get with a single variable. Of course, all of this applies to multiple variables as well.
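Since everything above carries over to multiple features, here is a hedged sketch of the same normal-equation machinery with three features, in the spirit of the wine example. The feature values and quality scores are invented placeholders, and the coefficients this prints are not Ashenfelter's actual numbers; the point is only that the sign of each fitted coefficient tells you whether the association with that feature is positive or negative.

    import numpy as np

    # Invented data: one row per vintage, columns are
    # [winter rainfall, average growing-season temperature, harvest rainfall].
    features = np.array([
        [600.0, 17.1,  80.0],
        [520.0, 16.5, 130.0],
        [700.0, 17.8,  60.0],
        [480.0, 16.2, 150.0],
        [650.0, 17.4,  90.0],
        [560.0, 16.9, 110.0],
    ])
    quality = np.array([7.9, 6.4, 8.6, 5.9, 8.1, 7.0])  # made-up quality scores

    # Append a column of ones so the solution includes an intercept term.
    X = np.column_stack([features, np.ones(len(features))])

    # Solve the normal equations X^T X f = X^T y (a 4 x 4 system here).
    f = np.linalg.solve(X.T @ X, X.T @ quality)
    print("fitted coefficients [winter rain, temperature, harvest rain, intercept]:", f)

    # A positive fitted coefficient means predicted quality rises with that feature;
    # a negative one means it falls (in the lecture's example, harvest rainfall
    # had a negative coefficient).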
Coming back to the plots: you can always draw graphs of the value with respect to any one of the feature variables. This first one is clearly a very strong correlation; you will get a very high R-squared value for data that looks like this. Of course, if the data looks a bit more scattered, you'll get a lower R-squared value. And similarly, if it's negatively correlated, the slope will be negative, with a lower R-squared if the points are more scattered.

Now suppose your data looks like this. Well, it doesn't look like there is any correlation, and the R-squared will reflect that, because if you look at the variation of y about its mean, it's fairly large, and whatever line one decides to draw, the variation of the y's with respect to that line will also be very large. So this ratio will be close to one, and R-squared will be close to zero.

However, let's look at this case, for example; this is a little bit more subtle. Here y is almost constant as x varies, so there is a line which goes through these points. But what's the R-squared value? y doesn't change at all, so its variation about its mean is tiny, and the line, if you fit it, might be exact or very close to exact, so the error might be very small as well. So this ratio is again very close to one, and the R-squared value is very small. Essentially, it is saying there is no correlation. And that's actually true, because whatever the value of x, y doesn't change. So you don't get non-correlation only when the data is scattered all over the place; even if the data lies on a straight line, if y doesn't change with x, then there is no correlation either.

And lastly, we have another situation, which we'll come to in the next segment: a situation like this. Whatever straight line one draws through these points, one will always make a lot of error. This is an example of non-linear correlation: the data is correlated, so you could draw a curve, something like a parabola, that would certainly work, but a straight line does not. For this and many other reasons, as we will see in the next segment, we have to go beyond linear least squares to more complicated prediction models.
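To make the last two pictures concrete, here is a small sketch with invented numbers showing that both an almost-constant y and a parabolic relationship give an R-squared near zero for a straight-line fit:

    import numpy as np

    def fit_line(x, y):
        """Solve the normal equations X^T X f = X^T y for a one-feature model."""
        X = np.column_stack([x, np.ones_like(x)])
        return X, np.linalg.solve(X.T @ X, X.T @ y)

    def r_squared(X, y, f):
        """R^2 = 1 - (sum of squared errors) / (variation of y about its mean)."""
        ss_res = np.sum((X @ f - y) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    # Case 1: y is almost constant as x varies (invented numbers).
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y1 = np.array([2.00, 2.01, 1.99, 2.00, 2.01])
    X1, f1 = fit_line(x1, y1)
    print("almost-constant y, R^2 =", r_squared(X1, y1, f1))  # close to zero

    # Case 2: a non-linear (parabolic) relationship; a straight line fits poorly.
    x2 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y2 = x2 ** 2
    X2, f2 = fit_line(x2, y2)
    print("parabolic data, R^2 =", r_squared(X2, y2, f2))  # also close to zero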