Let's look at the problem of prediction from the framework that we defined for learning, way back in the lecture on learning. Once again, we have many data points. This time, we'll use a slightly different notation. We have m data points, each having features, and we'll have n - 1 features for each data point. There's a reason why we're going to have n - 1; it will become clear shortly. And, of course, with each set of features, or instance of features, we have an output variable, possibly more than one, which will be the one we want to predict.

Just as before, we have a large number of data points, each having a number of features, and given a new data point, we would like to know its value. In the case of learning, we had categories that we wanted to predict: whether this was a buyer or a browser, a good comment or a negative comment. But now we have a situation where these are numbers. The y's are numbers, and we would like to predict the actual value of y. For example, the x_i's could be features about certain products, and the numbers y might be the price they fetch in different markets, or the volume of sales that you might expect based on the actions you have taken for each product.

So, instead of trying to decide what class a new instance belongs to, you want to predict something about the instance, such as its price or how much it's going to sell: a value. In this sense, there's a close relationship between learning and prediction, but here we are trying to predict a value rather than a class. To a certain extent, what we studied in learning is also prediction, but we were predicting categories as opposed to actual values. When we talk about prediction in this lecture, we'll be predicting actual values.

There is, of course, a close relationship: if you know the class of something that you observe, you can make predictions about what it would do in the future, based on your understanding of how other instances of that class behave. But that requires a certain degree of reasoning, and it's not coming directly from the data. For example, if you found that somebody was likely to be a buyer, you would naturally predict that they will go and click on an ad or buy something.
That's not something that was learned from the data; it was part of how the data had been coded, and it involved some degree of reasoning and preparation of the data, as well as inference once one has observed that class. It's a subtle but important point that one needs to understand: the distinction between predicting values and learning categories. In that sense, prediction, learning, and reasoning will all come together later in this lecture. But for the moment, let's talk about simply trying to predict the value y. In particular, we'll be talking about just one output variable y, given that you have knowledge about a large number of instances where you have the features and the y values.

The technique that we'll talk about is called linear prediction, or regression, or least squares. It follows fairly naturally from our formulation of the learning problem in terms of finding the expected value of y given an instance x. It turns out, and this is very important for linear prediction, that the function f of x, which we are so familiar with from our previous lectures, also minimizes the error between the actual y and the predicted value f of x. In particular, it minimizes the expected sum of squares of this error. The proof of this is not difficult, but we won't go into it here. Let's take it as given that if we want to find the expected value of y given x, we might as well minimize the error, or rather the sum of squares of the error, between y and f of x.

It's important to clarify the notation here, because it could get a bit confusing. Bold x means one data point, which may consist of n - 1 different features. A bold x with a subscript is not one of the features; it's one particular instance, a whole data point. We don't have many output variables; we are dealing with only one output variable, so bold y is just y in this case. Normal, un-bolded y you can think of as the output value in general, and y sub i is not any of these, but is the output variable for the i-th data point.

Now, let's take a look at what this means. For one particular data point, the difference between the output variable and the predicted output variable, which is f of the data point x sub i, is just this quantity. We square it, and take the average of that square across all the data points that we have, as in the short sketch below.
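To make that average concrete, here is a minimal sketch in Python. It is not from the lecture: the toy numbers, the variable names, and the use of NumPy are all assumptions made purely for illustration. It computes the quantity being described, the sum of squared differences between each y_i and its prediction f(x_i), divided by m.

    import numpy as np

    # Hypothetical toy data: m = 4 data points, one observed output y_i each.
    y_actual    = np.array([10.0, 12.5,  9.0, 11.0])   # observed y_i
    y_predicted = np.array([ 9.5, 12.0, 10.0, 11.5])   # f(x_i) from some predictor f

    # Average squared error: (1/m) * sum over i of (y_i - f(x_i))^2
    m = len(y_actual)
    mean_squared_error = np.sum((y_actual - y_predicted) ** 2) / m
    print(mean_squared_error)   # 0.4375 for these made-up numbers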
There are m different data points, so we take the sum of all these squared errors and divide by m to get the average error. That's nothing but the expected value of the error, as measured by the square of the difference between y and the predicted y. Even if we don't dwell on the relationship between the expected value and the sum of squares, just think of it this way: we have a prediction f, which says that y should be f of x, and we have the actual values y_i, and we want to make sure the difference isn't too large. The best way we can think of to do that is to measure the differences and try to find an f which minimizes the average difference between y and f of x.

Let's see what that looks like if we have a particular form of f. We're dealing with linear prediction, so we're saying that f is not just any function, but a particular type of function: a linear function. What f looks like is x times some set of weights. If we look at a data point as a row vector, it consists of all the feature values in that row, followed by a one, because there will be a constant term. Then [x, 1] transposed times f is our function. So we're essentially taking a linear combination of the features: f1 times x1 plus f2 times x2, and so on for every feature, plus some f0 which is just a constant. That's our function f.

What we really want is that the differences between the predicted values and the actual y's are very small. If you write this as a matrix, what that works out to is a matrix, capital X, where each row is one particular data point: each row has the n - 1 features of that data point in different columns, plus the extra column of ones, making n entries per row, which is why we started with n - 1 features. Multiplying X by f, the one unknown vector that we're trying to find, should give us, as closely as possible, the output for each data point. So x1 transposed f is almost y1, x2 transposed f is almost y2, and so on. The extra one is there so that we don't just take a combination of the features, but also allow for a constant term in this linear combination. Please look through this matrix formulation and convince yourself that what we're really trying to do is minimize the difference between x_i transposed f and y_i, as sketched below.
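Here is a small sketch of that matrix view. The feature values and the weight vector are invented for illustration, and placing the column of ones last (to match the constant term f0) is an assumption; the lecture does not fix the ordering. The point is that X times f gives the predictions for all data points at once.

    import numpy as np

    # Hypothetical example: m = 5 data points, n - 1 = 2 features each.
    features = np.array([[1.0, 2.0],
                         [2.0, 0.5],
                         [3.0, 1.5],
                         [4.0, 3.0],
                         [5.0, 2.5]])
    m = features.shape[0]

    # Design matrix X: each row is one data point, with a column of ones
    # appended so the linear combination can include a constant term f0.
    X = np.hstack([features, np.ones((m, 1))])   # shape (m, n) with n = 3 here

    # A candidate weight vector f = (f1, f2, f0): prediction is f1*x1 + f2*x2 + f0.
    f = np.array([2.0, -1.0, 0.5])

    predictions = X @ f   # entry i is x_i (with the appended 1) dotted with f
    print(predictions)    # we want these to be close to the observed y_i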
If you write it down as a matrix equation, we're really trying to find an f such that X times f is almost equal to y. Some of you might have seen matrix equations before and might wonder why we have this "almost equal to." The reason is that the number of rows in this matrix is much larger than the number of columns: you have a few features, maybe a few dozen, and you may have observed hundreds of thousands of data points. We can't solve this exactly because there are just too many rows and too few columns, so we have to solve it approximately. Of course, if the matrix were square, n by n, we could solve it exactly, which is how you learned to solve systems of linear equations in high school. Instead, we'll have to solve it approximately, using a technique called least squares. Here's how that works.
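As a preview of that, here is a minimal sketch with synthetic data. Everything in it is an assumption for illustration: the made-up numbers, the noise level, and the use of NumPy's built-in least-squares routine rather than whatever derivation the lecture gives next. It shows solving X f ≈ y approximately when there are far more rows than columns, by minimizing the sum of squared errors.

    import numpy as np

    # Hypothetical setup: many more data points (rows) than unknowns (columns),
    # so X f = y generally has no exact solution.
    rng = np.random.default_rng(0)
    m, n = 200, 4                    # 200 data points; 3 features plus a constant column
    X = np.hstack([rng.normal(size=(m, n - 1)), np.ones((m, 1))])
    true_f = np.array([1.5, -2.0, 0.7, 3.0])
    y = X @ true_f + rng.normal(scale=0.1, size=m)   # noisy observations

    # Least squares: find f minimizing ||X f - y||^2, the sum of squared errors.
    f_hat, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    print(f_hat)   # should come out close to true_f

If X were square and invertible, np.linalg.solve(X, y) would give the exact solution, which is the n-by-n case mentioned above; the least-squares routine handles the overdetermined case by minimizing the squared error instead.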