In the last video, we talked about the recommender system problem, where for example, you may have a set of movies and you may have a set of users, each of whom has rated some subset of the movies, rated the movies 1 to 5 stars or 0 to 5 stars, and what I would like to do, is look at these users and predict how they would have rated other movies that they have not yet rated. In this video, I would like to talk about our first approach to building a recommender system, this approach is called content based recommendations. Here's our data set from before, and just to remind you of a bit of notation, I was using Nu to denote the number of users, and so that's equal to 4, and Nm to denote the number of movies, I have five movies. So, how do I predict what these missing values would be? Let's suppose that for each of these movies, I have a set of features for them. In particular, lets say that for each of the movies I have two features, which I'm going to denote X1 and X2, where X1 measures the degree to which a movie is a romantic movie and X2 measures the degree to which a movie is an action movie. So if you take a movie Love at last, you know, 0.9 rating on the romance scale, it is a highly romantic movie but zero on the action scale, so almost no action in that movie. Romance forever was 1.0, lot of romance and 0.01 action, I don't know maybe there's a minor car crash in that movie or something, so little bit of action. Skipping one let's do Swords vs,. karate, maybe that has a zero romance rating and no romance at all in that but plenty of action and you know, non-stop car chases. Maybe again there is tiny bit of romance in that movie, but mainly action, and and Cute puppies of love again but mainly a romance movie with no action at all. So if we have features like these then each movie can be represented with a feature vector. Let's take movie 1, so just call these movies you know, movies 1 2, 3, 4 and 5. For my first movie, Love at last, I have my two features, 0.9 and 0, and so these are features X1 and X2, and let's add an extra feature as usual, which is my interceptor feature X0, which is equal to 1 and so, putting these together, I would then have a feature X1, the superscript 1 denotes it's the feature vector for my first movie, and this feature vector is equal to one. The first one there is this interceptor, and then my two features 0.9, 0, like so. So, for Love at last, I would have a feature vector X1, for the movie Romance Forever, we have the separate feature vector X2 and so on, and for Swords vs. karate I would have a different feature vector x superscript 5. Also, consistent with our early notation that we were using, we're going to set N to be the number of features, not counting this X zero intercept term so n is equal to two because we have two features x1 and x2 capturing the degree of romance and the degree of action in each movie. Now in order to make predictions, here is one thing we could do, which is that we could treat predicting the ratings of each user as a separate linear regression problem. So specifically lets say that for each user j we are going to learn a parameter vector theta J which would be in R3 in this case, more generally theta j would be in r n+1, where n is the number of features, not counting the intercept term, and we're going to predict user J as rating movie I, with just the inner product between the parameters vector theta and the features "XI". So, let's take a specific example. Let's take user one. So that would be Alice and associated with Alice would be some parameter vector, theta 1 and our second user Bob will be associated, with a different parameter vector theta 2. Carol will be associated with a different parameter vector theta 3 and Dave a different parameter vector, theta 4. So lets say we want to make a prediction for what Alice will think of the movie, Cute puppies of love. Well that movie is going to have some parameter vector X3, where we have that X3 is going to be equal to 1 which is my intercept term, and then 0.99, and then 0. And let's say for this example, let's say that you know we have somehow already gotten a parameter vector theta 1 for Alice--we we will say later exactly how we come up with this parameter vector--but let's just say for now that you know some unspecified learning algorithm has learned the parameter vector theta 1 and it is equal to 0 5 0. And so our prediction for this entry is going to be equal to theta 1, that is Alice's parameter vector, transpose X3, that is the feature vector for the Cute Puppies of Love movie number 3. And so the inner product between these two vectors is going to be 5 x 0.99. Which is equal to 4.95. And so my prediction for value this over here is going to be 4.95. And maybe that seems like a reasonable value, if indeed this is my parameter vector theta 1. So all we doing here is we are applying a different copy of essentially linear regression for each user and we are saying that what Alice does, is Alice has seem some parameter vector theta 1 that she uses, that we use to predict her ratings as a function of how romantic and how action packed the movie is and Bob, and Carol, and Dave each of them have a different linear function of the romantic-ness and action-ness or the degree of romance and the degree of action in a movie, and that that is how we're going to predict their star ratings. More formally here is how we can write down the problem. Our notation is that RIJ is equal to one, if user J has rated movie I, and YIJ is the rating of that movie if that rating exists. That is if that user has actually rated that movie. And on the previous slide we also defined theta J which is a parameter for each user XI which is a feature vector for specific movie and for each user and each movie you would predict that rating, as follows. So let me introduce, just temporarily, introduce one extra bit of notation mj, we are gonna use mj to denote the number of users rated by movie j, we're gonna need this notation only for this slide. Now, in order to learn the parameter vector for theta j, well, how can we do so? This is basically a linear regression problem. So what we can do, is just choose a parameter vector, theta j, so the predicted value here are as close as possible to the values that we observed in our training set, the values we observed in our data. So, let's write that down. In order to learn the parameter vector theta j, let's minimize over my parameter vector theta j, of sum-- and I want to sum over all movies that user j has rated--so we write this as sum over all values of i that is a colon rij equals 1. So the way to read this summation index is this is summation over all the values of i, so that r i j is equal to 1. So this is going to be summing over all the movies that user j has rated. And then I am going to compute theta j transpose xi so that's the prediction of user j's rating on movie i, minus y i j, so that's the actual observed rating squared, and then, let me just divide by the number of movies that user J, has actually rated, so just divide by 1 over 2MJ. And so this is just like the least squares regression, it's just like linear regression where we want to choose the parameter vector theta J, to minimize this type of squared error term. And if you want to, you can also add in a regularization term so plus lambda over 2m, and this is really 2MJ because, this as if we have MJ examples right? Because if user J has rated that many movies, it's sort of like we have that many data points with which to fit the parameters theta J. And then let me add in my usual regularization term here of theta J K squared. As usual this sum is from K equals 1 through N so here theta J is going to be an N plus 1 dimensional vector where, in our early example, n was equal to two, but more generally, n is the number of features we have per movie. And so as usual we don't regularize over theta 0. We don't regularize over the biased term because the sum is from K1 through N. If you minimize this as a function of theta J you get a good solution, you get a pretty good estimate of a parameter vector theta j with which to make the predictions for user J's movie ratings. For recommender systems, we're going to change this notation a little bit. So to simplify the subsequent math, I'm actually going to get rid of this term MJ. So that's just a constant right so I can delete it without changing the value of theta J that I get out of this optimization, so if you imagine taking this whole equation, taking this whole expression and multiplying it by MJ and get rid of that constant, and when I minimize this I should still get the same value of theta J as before. So, just to repeat what we wrote on the previous slide, here is our optimization objective: In order to learn theta J, which is a parameter for user J, we're going to minimize over theta j this optimization objectives. So this is our usual squared error term and then this is our regularization term. Now of course in building a recommender system we don't just want to learn parameters for a single user, we want to learn parameters for all of our users, I have n subscript u users, so I want to learn all of these parameters and so what I'm going to do is take this minimization, take this optimization objective and just add an extra summation there. So, you know, this expression here with the one half on top again, so it's exactly the same as what we have on top except that now, instead of just doing this for a specific user theta J, I'm going to sum my objective over all of my users and then minimize this overall optimization objective. Minimize this overall cost function. And when I minimize this as a function of theta 1, theta 2, up to theta NU, I will get a separate parameter vector each user and I can then use that to make predictions for all of my users for all of my N subscript u users. So putting everything together, this was our optimization objective on top and to give this thing a name, I'll just call this J of theta 1, dot, dot, dot theta NU. So J as usual is my optimization objective which I'm trying to minimize. Next, in order to actually do the minimization, if you were to derive the gradient descent updates, these are the equations you would get, so you would take theta JK and subtract from it alpha, which is the learning rate, times these terms here on the right. So we have slightly different cases so when K equals 0 and when K is not equal to 0, because our regularization term here regularizes only the values of theta JK for K not equal to zero. So we don't regularize theta 0 so the slightly different updates for k equals zero, and k not equal to 0. And this term, over here, for example is just a partial derivative with respect to your parameter, that of your optimization objective, right? And so, this is just gradient descent and I've already computed the derivatives and plugged them into here. If these gradient descent updates look a lot like what we had for linear regression, that's because these are essentially the same as linear regression. The only minor difference is that for linear regression we have these 1 over M terms - it's really 1 over MJ - but because earlier when we were deriving the optimization objective we got rid of this, that's why we don't have this 1 over M term. But otherwise it's really sum over my training examples of, you know, the error times XK plus that regularization term contributes to the derivative. So if you are using gradient descent, here is how you can minimize the cost function j, to learn all the parameters, and using these formulas for the derivatives, if you want, you can also plug them into a more advanced optimization algorithm like cluster gradient or LBFGS or what have you, and use that to try to minimize the cost function J as well. So hopefully you now know how you can apply essentially a variation on linear regression in order to predict different movie ratings by different users. This particular algorithm is called a content based recommendations, or content based approach because we assume that we have available to us, features for the different movies. So we have features that capture what is the content of these movies. How romantic is this movie? How much action is in this movie? And we are really using features of the content of the movies to make our predictions. But for many movies we don't actually have such features, or it may be very difficult to get such features for all of our movies, for all of whatever items we are trying to sell. So in the next video, we'll start to talk about an approach to recommender systems that isn't content based and does not assume that we have someone else giving us all of these features, for all of the movies in our data set.