Okay. So, let's look at our matrix equation once more. We have X, which is simply the matrix whose rows are the sets of features for the data points, and we have the value y_i for each data point, collected in this m-by-1 column vector y. And we want to try to find f such that X times f is almost y. I'm going to use bold y and bold f to indicate that they're vectors. A small mistake: I've used bold y here where it actually should be unbold y; that was a typo, I apologize for that.

Let's go on, though. We want to find the f that approximately satisfies this equation, and we're going to do that by minimizing the difference between x_i^T f and y_i for each of these data points, by taking the sum of the squares; that's what we decided earlier. If we write the sum of squares in matrix form, it turns out to be simply the difference between Xf and y, transposed and multiplied by itself, that is, (Xf - y)^T (Xf - y). So essentially it's the norm, or the sum of squares, of this difference vector, which is exactly what we have here. This quantity needs to be minimized.

Now, to minimize a quantity, if you go back to high school, you need to make its derivative zero. But the derivative with respect to what? The unknown quantity is f, which in this case is a vector, so it's a bit more complicated than high school: you want to take the derivative with respect to every element of f. If you have studied vector calculus, it's fairly easy to do; we won't go through it all here, but just think of f as a single variable for the time being. The way you take the derivative is to expand this expression out. You get one term f^T X^T X f, which has two f's in it, so when you take its derivative you get 2 X^T X f. Then you get two terms in which there is a y^T X f or an f^T X^T y, and from each of those you get an X^T y. So if you set the derivative to zero, you get twice of this minus twice of that equal to zero, which gives what are called the normal equations: X^T X f should be equal to X^T y.
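As an aside (not spelled out in the lecture), the skipped vector-calculus step can be written out briefly, using the same X, f, and y as above:

    \min_{f}\ \|Xf - y\|^{2}
        = (Xf - y)^{\mathsf T}(Xf - y)
        = f^{\mathsf T} X^{\mathsf T} X f \;-\; 2\, y^{\mathsf T} X f \;+\; y^{\mathsf T} y,

    \nabla_{f}\,\|Xf - y\|^{2}
        = 2\, X^{\mathsf T} X f \;-\; 2\, X^{\mathsf T} y \;=\; 0
        \quad\Longrightarrow\quad
        X^{\mathsf T} X f = X^{\mathsf T} y.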
Now, X^T X, if you look at it in matrix form, is actually no longer a very large matrix. It's only an n-by-n matrix, because when you take this tall matrix and multiply it by its transpose, you get an n-by-n matrix. You have f, which is a vector of length n, and X^T y again becomes a much smaller column vector, also of length n. So in these equations one side is n by n and the other side has length n, and you should be able to solve them exactly, as long as, of course, the matrix is not singular, so that the system actually has a unique solution. Most of the time it does have a unique solution. And so we solve this to get our f.

Once we have our f, our least squares estimate for y is nothing but x^T f, or rather x'^T f, where x' is the vector [x, 1]. So, given an x, the value of y that is most likely by least squares is [x, 1]^T f, and that's our least squares estimate.

Let's do an example. Let's take a simple four-data-point example with only one feature, so x is just a single variable; x could consist of many variables, of course. And y is a set of four values as well. We saw the normal equations; here they will look something like this. If you take X^T X, each row of X is [x_i, 1], and y is just the vector of y values. The f that you get by solving the normal equations turns out to be this one. And when you plot f^T x', which is simply the line 0.11x - 0.26, on a graph, with the (x, y) coordinates plotted as well, you see that the line actually almost fits the data.

This is all there is to linear regression. Of course, if there are many variables, you will have possibly many such plots, one for each feature x, and f will be much longer in terms of the number of coefficients. So we have an approximation of y for every value of x that we might come across. But we also might want to ask: how good is this fit to the data? A common measure of how good the fit is, is called the R-squared value. You take the sum of squares of the errors of our estimates, that is, f(x_i), which is f^T x'_i for each data point, minus the actual value y_i, squared and summed, and divide that by the variation of y about its mean; R-squared is one minus this ratio. So y has a certain mean, which would be somewhere around here, and you want to see how the sum of all these errors compares with the actual variation of the y values about their mean.
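As a concrete illustration (not from the lecture slides), here is a minimal NumPy sketch of a four-point, one-feature fit and the R-squared computation just described. The data values below are invented, so the coefficients and R-squared it prints are not the ones on the slide.

    import numpy as np

    # Invented four-point, one-feature data set (not the values from the slide).
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([0.3, 0.9, 1.4, 2.2])

    # Design matrix: each row is x' = [x_i, 1]; the extra 1 provides the intercept.
    X = np.column_stack([x, np.ones_like(x)])

    # Normal equations: (X^T X) f = X^T y, an n-by-n system (here 2 x 2).
    f = np.linalg.solve(X.T @ X, X.T @ y)
    print("coefficients [slope, intercept]:", f)

    # Least squares prediction for a new x is [x, 1]^T f.
    x_new = 2.5
    print("prediction at x =", x_new, "is", np.array([x_new, 1.0]) @ f)

    # R-squared: one minus (sum of squared errors) / (variation of y about its mean).
    ss_res = np.sum((X @ f - y) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    print("R^2 =", 1.0 - ss_res / ss_tot)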
Coming back to R-squared: it doesn't sound like a great measure, because if you have a steeper slope, you probably have a higher value up here, but with a steeper slope you're also tolerating more error in your estimate. Still, it is a common and easy-to-calculate measure. There are better measures which tell you how good each of these coefficients is, in terms of how confident one should be in each coefficient, but we will not go into that in detail. For this particular example, the data pretty much fits a line, and the R-squared value comes out to be somewhere around 0.95. As you can see, the R-squared value can be close to one, which means there is a good fit, and close to zero if there isn't a good fit.

Let's take another example, one which is a little bit more realistic. It comes from a book called Super Crunchers, written by Ian Ayres in 2007, which talks about the power of reasoning from data; it's a great book if you want to read it. He tells the story of a wine expert called Orley Ashenfelter, who predicted the quality of wine based on the winter rainfall, the average temperature, and the harvest rainfall in a season. So Ashenfelter could predict how good a particular wine would be from the weather in the growing region for that wine, and he could do this well before the wine hit the market. It turns out that his estimates were simply based on linear regression, and they turned out to be extremely successful. They surprised many experienced wine critics, who would have to taste the wine many times and judge it before predicting whether it would be a high-priced wine or not. The kind of estimate he got was a linear least squares fit, with coefficients f0, f1, f2, and f3, and he got a very nice fit.

The important thing to note is that we can have positive correlations between the output variable y and an input variable, such as the positive coefficients here, here, and here, or we can have negative correlations: in this case, the 0.26 is negative and the 0.00386 is negative. So if the harvest rainfall goes higher, the quality goes down, unlike the positive correlation with respect to, say, winter rainfall.

Let's take a look at a few more examples of the kinds of correlation one might get with a single variable. Of course, all of this applies to multiple variables as well.
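Since everything above carries over to multiple features, here is a hedged sketch of the same normal-equation machinery with three features, in the spirit of the wine example. The feature values and quality scores are invented placeholders, and the coefficients this prints are not Ashenfelter's actual numbers; the point is only that the sign of each fitted coefficient tells you whether the association with that feature is positive or negative.

    import numpy as np

    # Invented data: one row per vintage, columns are
    # [winter rainfall, average growing-season temperature, harvest rainfall].
    features = np.array([
        [600.0, 17.1,  80.0],
        [520.0, 16.5, 130.0],
        [700.0, 17.8,  60.0],
        [480.0, 16.2, 150.0],
        [650.0, 17.4,  90.0],
        [560.0, 16.9, 110.0],
    ])
    quality = np.array([7.9, 6.4, 8.6, 5.9, 8.1, 7.0])  # made-up quality scores

    # Append a column of ones so the solution includes an intercept term.
    X = np.column_stack([features, np.ones(len(features))])

    # Solve the normal equations X^T X f = X^T y (a 4 x 4 system here).
    f = np.linalg.solve(X.T @ X, X.T @ quality)
    print("fitted coefficients [winter rain, temperature, harvest rain, intercept]:", f)

    # A positive fitted coefficient means predicted quality rises with that feature;
    # a negative one means it falls (in the lecture's example, harvest rainfall
    # had a negative coefficient).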
Coming back to the plots: you can always draw graphs of the value with respect to any one of the feature variables. This first one is clearly a very strong correlation; you will get a very high R-squared value for data that looks like this. Of course, if the data looks a bit more scattered, you'll get a lower R-squared value. And similarly, if it's negatively correlated, the slope will be negative, with a lower R-squared if the points are more scattered.

Now suppose your data looks like this. Well, it doesn't look like there is any correlation, and the R-squared will reflect that, because if you look at the variation of y about its mean, it's fairly large, and whatever line one decides to draw, the variation of the y's with respect to that line will also be very large. So this ratio will be close to one, and R-squared will be close to zero.

However, let's look at this case, for example; this is a little bit more subtle. Here y is almost constant as x varies, so there is a line which goes through these points. But what's the R-squared value? y doesn't change at all, so its variation about its mean is tiny, and the line, if you fit it, might be exact or very close to exact, so the error might be very small as well. So this ratio is again very close to one, and the R-squared value is very small. Essentially, it is saying there is no correlation. And that's actually true, because whatever the value of x, y doesn't change. So you don't get non-correlation only when the data is scattered all over the place; even if the data lies on a straight line, if y doesn't change with x, then there is no correlation either.

And lastly, we have another situation, which we'll come to in the next segment: a situation like this. Whatever straight line one draws through these points, one will always make a lot of error. This is an example of non-linear correlation: the data is correlated, so you could draw a curve, something like a parabola, that would certainly work, but a straight line does not. For this and many other reasons, as we will see in the next segment, we have to go beyond linear least squares to more complicated prediction models.
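To make the last two pictures concrete, here is a small sketch with invented numbers showing that both an almost-constant y and a parabolic relationship give an R-squared near zero for a straight-line fit:

    import numpy as np

    def fit_line(x, y):
        """Solve the normal equations X^T X f = X^T y for a one-feature model."""
        X = np.column_stack([x, np.ones_like(x)])
        return X, np.linalg.solve(X.T @ X, X.T @ y)

    def r_squared(X, y, f):
        """R^2 = 1 - (sum of squared errors) / (variation of y about its mean)."""
        ss_res = np.sum((X @ f - y) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    # Case 1: y is almost constant as x varies (invented numbers).
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y1 = np.array([2.00, 2.01, 1.99, 2.00, 2.01])
    X1, f1 = fit_line(x1, y1)
    print("almost-constant y, R^2 =", r_squared(X1, y1, f1))  # close to zero

    # Case 2: a non-linear (parabolic) relationship; a straight line fits poorly.
    x2 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y2 = x2 ** 2
    X2, f2 = fit_line(x2, y2)
    print("parabolic data, R^2 =", r_squared(X2, y2, f2))  # also close to zero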