Let's look at the problem of prediction from the framework that we defined for learning, way back in the lecture on learning. Once again, we have many data points. This time, we'll use a slightly different notation. We have m data points, each having features, and we'll have n - 1 features for each data point. There's a reason why we're going to have n - 1; it will become clear shortly. And, of course, with each set of features, or instance of features, we have an output variable, possibly more than one, which will be the one we want to predict.

Just as before, we have a large number of data points, each having a number of features, and given a new data point, we would like to know its value. In the case of learning, we had categories that we wanted to predict: whether this was a buyer or a browser, a good comment or a negative comment. But now we have a situation where these are numbers. The y's are numbers, and we would like to predict the actual value of y. For example, the x_i's could be features about certain products, and the numbers y might be the price they fetch in different markets, or the volume of sales that you might expect based on the actions you have taken for each product.

So, instead of trying to decide what class a new instance belongs to, you want to predict something about the instance, such as its price or how much it's going to sell: a value. In this sense, there's a close relationship between learning and prediction, but here we are trying to predict a value rather than a class. To a certain extent, what we studied in learning is also prediction, but we were predicting categories as opposed to actual values. When we talk about prediction in this lecture, we'll be predicting actual values.

There is, of course, a close relationship: if you know the class of something that you observe, you can make predictions about what it would do in the future, based on your understanding of how other instances of that class behave. But that requires a certain degree of reasoning, and it's not coming directly from the data. For example, if you found that somebody was likely to be a buyer, you would naturally predict that they will go and click on an ad or buy something.
That's not something that was learned from the data; it was part of how the data had been coded, and it involved some degree of reasoning and preparation of the data, as well as inference once one has observed that class. It's a subtle but important point that one needs to understand: the distinction between predicting values and learning categories. In that sense, prediction, learning, and reasoning will all come together later in this lecture. But for the moment, let's talk about simply trying to predict the value y. In particular, we'll be talking about just one output variable y, given that you have knowledge about a large number of instances where you have the features and the y values.

The technique that we'll talk about is called linear prediction, or regression, or least squares. It follows fairly naturally from our formulation of the learning problem in terms of finding the expected value of y given an instance x. It turns out, and this is very important for linear prediction, that the function f of x, which we are so familiar with from our previous lectures, also minimizes the error between the actual y and the predicted value f of x. In particular, it minimizes the expected sum of squares of this error. The proof of this is not difficult, but we won't go into it here. Let's take it as given that if we want to find the expected value of y given x, we might as well minimize the error, or rather the sum of squares of the error, between y and f of x.

It's important to clarify the notation here, because it could get a bit confusing. Bold x means one data point, which may consist of n - 1 different features. A bold x with a subscript is not one of the features; it's one particular instance, a whole data point. We don't have many output variables; we are dealing with only one output variable, so bold y is just y in this case. Normal, un-bolded y you can think of as the output value in general, and y sub i is not any of these, but is the output variable for the i-th data point.

Now, let's take a look at what this means. For one particular data point, the difference between the output variable and the predicted output variable, which is f of the data point x sub i, is just this quantity. We square it, and take the average of that square across all the data points that we have, as in the short sketch below.
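To make that average concrete, here is a minimal sketch in Python. It is not from the lecture: the toy numbers, the variable names, and the use of NumPy are all assumptions made purely for illustration. It computes the quantity being described, the sum of squared differences between each y_i and its prediction f(x_i), divided by m.

    import numpy as np

    # Hypothetical toy data: m = 4 data points, one observed output y_i each.
    y_actual    = np.array([10.0, 12.5,  9.0, 11.0])   # observed y_i
    y_predicted = np.array([ 9.5, 12.0, 10.0, 11.5])   # f(x_i) from some predictor f

    # Average squared error: (1/m) * sum over i of (y_i - f(x_i))^2
    m = len(y_actual)
    mean_squared_error = np.sum((y_actual - y_predicted) ** 2) / m
    print(mean_squared_error)   # 0.4375 for these made-up numbers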
There are m different data points, so we take the sum of all these squared errors and divide by m to get the average error. That's nothing but the expected value of the error, as measured by the square of the difference between y and the predicted y. Even if we don't dwell on the relationship between the expected value and the sum of squares, just think of it this way: we have a prediction f, which says that y should be f of x, and we have the actual values y_i, and we want to make sure the difference isn't too large. The best way we can think of to do that is to measure the differences and try to find an f which minimizes the average difference between y and f of x.

Let's see what that looks like if we have a particular form of f. We're dealing with linear prediction, so we're saying that f is not just any function, but a particular type of function: a linear function. What f looks like is x times some set of weights. If we look at a data point as a row vector, it consists of all the feature values in that row, followed by a one, because there will be a constant term. Then [x, 1] transposed times f is our function. So we're essentially taking a linear combination of the features: f1 times x1 plus f2 times x2, and so on for every feature, plus some f0 which is just a constant. That's our function f.

What we really want is that the differences between the predicted values and the actual y's are very small. If you write this as a matrix, what that works out to is a matrix, capital X, where each row is one particular data point: each row has the n - 1 features of that data point in different columns, plus the extra column of ones, making n entries per row, which is why we started with n - 1 features. Multiplying X by f, the one unknown vector that we're trying to find, should give us, as closely as possible, the output for each data point. So x1 transposed f is almost y1, x2 transposed f is almost y2, and so on. The extra one is there so that we don't just take a combination of the features, but also allow for a constant term in this linear combination. Please look through this matrix formulation and convince yourself that what we're really trying to do is minimize the difference between x_i transposed f and y_i, as sketched below.
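Here is a small sketch of that matrix view. The feature values and the weight vector are invented for illustration, and placing the column of ones last (to match the constant term f0) is an assumption; the lecture does not fix the ordering. The point is that X times f gives the predictions for all data points at once.

    import numpy as np

    # Hypothetical example: m = 5 data points, n - 1 = 2 features each.
    features = np.array([[1.0, 2.0],
                         [2.0, 0.5],
                         [3.0, 1.5],
                         [4.0, 3.0],
                         [5.0, 2.5]])
    m = features.shape[0]

    # Design matrix X: each row is one data point, with a column of ones
    # appended so the linear combination can include a constant term f0.
    X = np.hstack([features, np.ones((m, 1))])   # shape (m, n) with n = 3 here

    # A candidate weight vector f = (f1, f2, f0): prediction is f1*x1 + f2*x2 + f0.
    f = np.array([2.0, -1.0, 0.5])

    predictions = X @ f   # entry i is x_i (with the appended 1) dotted with f
    print(predictions)    # we want these to be close to the observed y_i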
If you write it down as a matrix equation, we're really trying to find an f such that X times f is almost equal to y. Some of you might have seen matrix equations before and might wonder why we have this "almost equal to." The reason is that the number of rows in this matrix is much larger than the number of columns: you have a few features, maybe a few dozen, and you may have observed hundreds of thousands of data points. We can't solve this exactly because there are just too many rows and too few columns, so we have to solve it approximately. Of course, if the matrix were square, n by n, we could solve it exactly, which is how you learned to solve systems of linear equations in high school. Instead, we'll have to solve it approximately, using a technique called least squares. Here's how that works.
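As a preview of that, here is a minimal sketch with synthetic data. Everything in it is an assumption for illustration: the made-up numbers, the noise level, and the use of NumPy's built-in least-squares routine rather than whatever derivation the lecture gives next. It shows solving X f ≈ y approximately when there are far more rows than columns, by minimizing the sum of squared errors.

    import numpy as np

    # Hypothetical setup: many more data points (rows) than unknowns (columns),
    # so X f = y generally has no exact solution.
    rng = np.random.default_rng(0)
    m, n = 200, 4                    # 200 data points; 3 features plus a constant column
    X = np.hstack([rng.normal(size=(m, n - 1)), np.ones((m, 1))])
    true_f = np.array([1.5, -2.0, 0.7, 3.0])
    y = X @ true_f + rng.normal(scale=0.1, size=m)   # noisy observations

    # Least squares: find f minimizing ||X f - y||^2, the sum of squared errors.
    f_hat, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    print(f_hat)   # should come out close to true_f

If X were square and invertible, np.linalg.solve(X, y) would give the exact solution, which is the n-by-n case mentioned above; the least-squares routine handles the overdetermined case by minimizing the squared error instead.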