In this video, I'd like to tell you a bit about the math behind large margin classification. This video is optional, so please feel free to skip it. It may also give you better intuition about how the optimization problem of the support vector machine leads to large margin classifiers.

To get started, let me first remind you of a couple of properties of vector inner products. Let's say I have two vectors U and V that look like this, both two-dimensional. Then let's see what U transpose V looks like. U transpose V is also called the inner product between the vectors U and V. U is a two-dimensional vector, so I can plot it on this figure. So let's say that's the vector U, and what I mean by that is that on the horizontal axis it takes whatever value U1 is, and the height on the vertical axis is whatever U2 is, the second component of the vector U.

Now, one quantity that will be nice to have is the norm of the vector U. These double bars on the left and right denote the norm, or length, of U; this just means the Euclidean length of the vector U. By the Pythagorean theorem, it is equal to the square root of U1 squared plus U2 squared. That is the length of the vector U, and it's a real number: it tells you the length of this vector down here, the length of this arrow that I just drew; that is the norm of U.

Now let's go back and look at the vector V, because we want to compute the inner product. V will be some other vector with components V1 and V2, so the vector V will look like that. Here's how you can compute the inner product between U and V. Take the vector V and project it down onto the vector U; that is, take an orthogonal, 90-degree projection down onto U, like so. Then measure the length of this red line that I just drew, and call that length P. So P is the length, or the magnitude, of the projection of the vector V onto the vector U.

It is possible to show that the inner product U transpose V is equal to P times the norm, or the length, of the vector U. So this is one way to compute the inner product: if you actually do the geometry and figure out what P is and what the norm of U is, this gives you the same answer as the other way of computing the inner product, which is to take U transpose, that is [U1 U2], a one-by-two matrix, times V, which gives you U1 V1 plus U2 V2. It is a theorem of linear algebra that these two formulas give you the same answer. And by the way, U transpose V is also equal to V transpose U, so if you do the same process in reverse, projecting U onto V instead of V onto U, with the roles of U and V reversed, you should get the same number, whatever that number is. Just to clarify what's going on in this equation: the norm of U is a real number and P is also a real number, so U transpose V is just the ordinary multiplication of two real numbers, P times the norm of U.
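If it helps to see this numerically, here is a small sketch, not part of the lecture itself, with made-up vectors u and v, showing that the component formula and the projection formula give the same inner product:

```python
import numpy as np

# Illustrative vectors (made-up values, not from the lecture)
u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

# Inner product computed componentwise: u1*v1 + u2*v2
inner = u @ v

# Signed length p of the projection of v onto u
p = (u @ v) / np.linalg.norm(u)

# Both ways of computing the inner product agree: u^T v = p * ||u||
print(inner)                    # 10.0
print(p * np.linalg.norm(u))    # 10.0
```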
Just one last detail: P is actually a signed quantity, and it can be either positive or negative. Let me say what I mean by that. Suppose U is a vector that looks like this and V is a vector that looks like this, so that the angle between U and V is greater than ninety degrees. Then if I project V onto U, what I get is a projection that looks like this, with length P, and in this case I will still have that U transpose V is equal to P times the norm of U, except that in this example P will be negative. So for inner products, if the angle between U and V is less than ninety degrees, then P is the positive length of that red line, whereas if the angle here is greater than ninety degrees, then P is the negative of the length of that little line segment. So the inner product between two vectors can also be negative if the angle between them is greater than ninety degrees. That's how vector inner products work.

We're going to use these properties of the vector inner product to try to understand the support vector machine optimization objective. Here is the optimization objective for the support vector machine that we worked out earlier. Just for the purposes of this slide, I'm going to make one simplification, just to make the objective easier to analyze: I'm going to ignore the intercept term, so we'll set theta 0 equal to 0. To make things easier to plot, I'm also going to set N, the number of features, equal to 2, so we have only two features, X1 and X2.

Now let's look at the objective function, the optimization objective of the SVM. When we have only two features, when N is equal to 2, this can be written as one half of theta 1 squared plus theta 2 squared, because we only have two parameters, theta 1 and theta 2. What I'm going to do is rewrite this a bit: as one half of the square of the square root of theta 1 squared plus theta 2 squared. The reason I can do that is that for any number W, the square root of W, squared, is just equal to W, so taking the square root and then squaring gives you back the same thing. What you may notice is that the term inside is equal to the norm, or the length, of the vector theta. What I mean by that is that if we write out the vector theta as theta 1, theta 2, then the term that I've just underlined in red is exactly the length, or the norm, of the vector theta, recalling the definition of the norm of a vector from the previous slide. In fact, this is equal to the length of the vector theta whether you write it as theta 0, theta 1, theta 2, with theta 0 equal to zero as I assume here, or just as theta 1, theta 2. For this slide I am going to ignore theta 0, so I'll treat theta, and the norm of theta, as just theta 1, theta 2, but the math works out either way, whether we include theta 0 or not; it's not going to matter for the rest of our derivation. So finally, this means that my optimization objective is equal to one half of the norm of theta squared. So all the support vector machine is doing in the optimization objective is minimizing the squared norm, the squared length, of the parameter vector theta.
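Written out in my notation, the rewriting just described, with n = 2 and theta 0 = 0 as assumed in the lecture, is:

$$
\min_{\theta}\ \frac{1}{2}\left(\theta_1^2 + \theta_2^2\right)
\;=\; \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\,\right)^{2}
\;=\; \frac{1}{2}\,\lVert\theta\rVert^{2}.
$$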
Now what I'd like to do is look at the terms theta transpose X(i) and understand better what they're doing. So given the parameter vector theta and given an example X(i), what is this equal to? On the previous slide we figured out what U transpose V looks like for two vectors U and V, so we're going to take those definitions, with theta and X(i) playing the roles of U and V, and see what that picture looks like.

Let's say I look at just a single training example. Let's say I have a positive example, which I'll draw as a cross there, and let's say that is my example X(i). What that really means is that on the horizontal axis it takes some value X(i)1 and on the vertical axis X(i)2; that's how I plot my training examples. And although we haven't really been thinking of this as a vector, what this really is, is a vector from the origin, from (0, 0), out to the location of this training example. Now let's say we have a parameter vector, and I'm going to plot that as a vector as well: I plot theta 1 here and theta 2 there. So what is the inner product theta transpose X(i)? Well, using our earlier method, the way we compute it is to take my example and project it onto my parameter vector theta, and then look at the length of the segment that I'm coloring in red. I'm going to call that length P superscript (i), to denote that this is the projection of the i-th training example onto the parameter vector theta. So, following what we had on the previous slide, theta transpose X(i) is equal to P(i) times the norm of the vector theta. This is of course also equal to theta 1 times X(i)1 plus theta 2 times X(i)2; each of these is an equally valid way of computing the inner product between theta and X(i).

So where does this leave us? The constraints say that theta transpose X(i) must be greater than or equal to one, or less than or equal to minus one. What this means is that we can replace those constraints with the constraints that P(i) times the norm of theta be greater than or equal to one, or less than or equal to minus one, because theta transpose X(i) is equal to P(i) times the norm of theta. So writing that into our optimization objective, this is what we get: instead of theta transpose X(i), I now have P(i) times the norm of theta. And just to remind you, we worked out earlier that this optimization objective can be written as one half times the norm of theta squared.

So now let's consider the training examples that we have at the bottom and, continuing to use the simplification that theta 0 is equal to 0, let's see what decision boundary the support vector machine will choose. Here's one option: let's say the support vector machine were to choose this decision boundary. This is not a very good choice, because it has very small margins; this decision boundary comes very close to the training examples. Let's see why the support vector machine will not do this. For this choice of parameters, it's possible to show that the parameter vector theta is actually at 90 degrees to the decision boundary, so that green decision boundary corresponds to a parameter vector theta that points in that direction. And by the way, the simplification that theta 0 equals 0 just means that the decision boundary must pass through the origin, (0, 0), over there. So now, let's look at what this implies for the optimization objective.
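As a quick reference for the analysis that follows, here is the problem in my notation, with theta 0 = 0 and C very large as assumed in this video:

$$
\min_{\theta}\ \frac{1}{2}\lVert\theta\rVert^{2}
\quad\text{subject to}\quad
p^{(i)}\lVert\theta\rVert \ge 1 \ \text{for positive examples},\qquad
p^{(i)}\lVert\theta\rVert \le -1 \ \text{for negative examples},
$$

where $p^{(i)}$ is the signed projection of $x^{(i)}$ onto $\theta$, so that $\theta^{T}x^{(i)} = p^{(i)}\lVert\theta\rVert$.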
Let's say this is my first example, X(1). If we look at the projection of this example onto my parameter vector theta, that's the projection, that little red line segment, and its length P(1) is going to be pretty small. Similarly, if this example here happens to be X(2), my second example, and I look at its projection onto theta, let me draw this one in magenta, then this little magenta line segment is P(2), the projection of the second example onto the direction of my parameter vector theta. And this projection line segment is also pretty small; in fact, P(2) will be a negative number, because it points in the opposite direction: this vector makes an angle of more than 90 degrees with my parameter vector theta, so its projection is less than 0.

So what we're finding is that these terms P(i) are going to be pretty small numbers. Now look at the optimization objective: for a positive example we need P(i) times the norm of theta to be greater than or equal to one. But if P(1) here is pretty small, that means we need the norm of theta to be pretty large, right? If P(1) is small and we want P(1) times the norm of theta to be at least one, the only way for the product of those two numbers to be that large when P(1) is small is for the norm of theta to be large. Similarly, for our negative example we need P(2) times the norm of theta to be less than or equal to minus one, and we saw in this example that P(2) is a small negative number, so again the only way for that to happen is for the norm of theta to be large. But what we are doing in the optimization objective is trying to find a setting of parameters where the norm of theta is small, and so this doesn't seem like such a good direction for the parameter vector theta.

In contrast, let's look at a different decision boundary. Let's say the SVM chooses that decision boundary instead. Now the picture is going to be very different. If that is the decision boundary, here is the corresponding direction for theta: with this decision boundary, that vertical line, it is possible to show using linear algebra that the corresponding vector theta is at 90 degrees to it. And now if you look at the projections of your data onto the vector theta: let's say, as before, this example is my example X(1). When I project it onto theta, what I find is that this length there is P(1). For the other example, X(2), I do the same projection and find that this length here is P(2), which again is going to be less than 0. You'll notice that now P(1) and P(2), these lengths of the projections, are going to be much bigger, so if we still need to enforce the constraint that P(1) times the norm of theta be at least one, then because P(1) is so much bigger now, the norm of theta can be smaller. So what this means is that by choosing the decision boundary shown on the right instead of the one on the left, the SVM can make the norm of the parameter vector theta much smaller, and therefore make the squared norm of theta smaller, which is why the SVM would choose the hypothesis on the right instead. And this is how the SVM gives rise to this large margin classification effect.
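To make the comparison concrete, here is a small numerical sketch, not from the lecture, with made-up example points and two made-up candidate directions for theta. The smallest feasible norm of theta along a given direction is one over the smallest projection magnitude, so small projections force a large norm and large projections allow a small one:

```python
import numpy as np

# Made-up 2-D training examples (theta_0 = 0, so the boundary passes through the origin)
x_pos = np.array([2.0, 1.0])    # positive example: needs theta^T x >= 1
x_neg = np.array([-2.0, -1.0])  # negative example: needs theta^T x <= -1

def min_norm_needed(direction):
    """Smallest ||theta|| along this direction that satisfies both
    margin constraints, i.e. 1 / min(|p_pos|, |p_neg|)."""
    u = direction / np.linalg.norm(direction)
    p_pos = x_pos @ u   # signed projection of the positive example onto theta
    p_neg = x_neg @ u   # signed projection of the negative example onto theta
    assert p_pos > 0 and p_neg < 0, "this direction must separate the examples"
    return 1.0 / min(abs(p_pos), abs(p_neg))

# Direction nearly orthogonal to the data: tiny projections, large norm required
print(min_norm_needed(np.array([-1.0, 2.5])))  # ~5.4
# Direction aligned with the data: large projections, small norm suffices
print(min_norm_needed(np.array([2.0, 1.0])))   # ~0.45
```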
Namely, if you look at this green line, at this green hypothesis, we want the projections of my positive and negative examples onto theta to be large, and the only way for that to hold true is if, surrounding the green line, there is a large margin, a large gap that separates the positive and negative examples. The magnitude of this margin is exactly the value of P(1), P(2), P(3), and so on, and so by making the margin large, by making these P(1), P(2), P(3), and so on be large, the SVM can end up with a smaller value for the norm of theta, which is what it is trying to do in the objective. This is why the support vector machine ends up being a large margin classifier: it is trying to make these P(i)'s large, and they are the distances from the training examples to the decision boundary.

Finally, we did this whole derivation using the simplification that the parameter theta 0 is equal to 0. The effect of that, as I mentioned briefly, is that if theta 0 is equal to 0 we are only entertaining decision boundaries that pass through the origin, like that one; if you allow theta 0 to be non-zero, then you are entertaining decision boundaries that do not pass through the origin, like the one I just drew. I'm not going to do the full derivation of that, but it turns out that the same large margin proof works in pretty much exactly the same way. There is a generalization of the argument we just went through which shows that even when theta 0 is non-zero, what the SVM is trying to do, when you have this optimization objective, which again corresponds to the case of C being very large, is still to find the large margin separator between the positive and negative examples. So that explains how the support vector machine is a large margin classifier. In the next video, we'll start to talk about how to take some of these SVM ideas and start to apply them to build complex nonlinear classifiers.
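One closing footnote, not spelled out in the lecture: the reason the magnitude of P(i) is exactly the distance from x(i) to the decision boundary (when theta 0 = 0) is that the boundary is the line theta transpose x = 0 through the origin, orthogonal to theta, so

$$
\text{dist}\!\left(x^{(i)},\ \{x : \theta^{T}x = 0\}\right) \;=\; \frac{|\theta^{T}x^{(i)}|}{\lVert\theta\rVert} \;=\; |p^{(i)}|,
$$

which is why making the P(i)'s large is the same thing as making the margin large.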