In this video, I'd like to tell you a bit about the math behind large margin classification. This video is optional, so please feel free to skip it. It may also give you better intuition about how the optimization problem of the support vector machine leads to large margin classifiers.

To get started, let me first remind you of a couple of properties of vector inner products. Let's say I have two vectors U and V that look like this, both two-dimensional. Then let's see what U transpose V looks like. U transpose V is also called the inner product between the vectors U and V. U is a two-dimensional vector, so I can plot it on this figure. So let's say that's the vector U, and what I mean by that is that on the horizontal axis it takes whatever value U1 is, and the height on the vertical axis is whatever U2 is, the second component of the vector U.

Now, one quantity that will be nice to have is the norm of the vector U. These double bars on the left and right denote the norm, or length, of U; this just means the Euclidean length of the vector U. By the Pythagorean theorem, it is equal to the square root of U1 squared plus U2 squared. That is the length of the vector U, and it's a real number: it tells you the length of this vector down here, the length of this arrow that I just drew; that is the norm of U.

Now let's go back and look at the vector V, because we want to compute the inner product. V will be some other vector with components V1 and V2, so the vector V will look like that. Here's how you can compute the inner product between U and V. Take the vector V and project it down onto the vector U; that is, take an orthogonal, 90-degree projection down onto U, like so. Then measure the length of this red line that I just drew, and call that length P. So P is the length, or the magnitude, of the projection of the vector V onto the vector U.

It is possible to show that the inner product U transpose V is equal to P times the norm, or the length, of the vector U. So this is one way to compute the inner product: if you actually do the geometry and figure out what P is and what the norm of U is, this gives you the same answer as the other way of computing the inner product, which is to take U transpose, that is [U1 U2], a one-by-two matrix, times V, which gives you U1 V1 plus U2 V2. It is a theorem of linear algebra that these two formulas give you the same answer. And by the way, U transpose V is also equal to V transpose U, so if you do the same process in reverse, projecting U onto V instead of V onto U, with the roles of U and V reversed, you should get the same number, whatever that number is. Just to clarify what's going on in this equation: the norm of U is a real number and P is also a real number, so U transpose V is just the ordinary multiplication of two real numbers, P times the norm of U.
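If it helps to see this numerically, here is a small sketch, not part of the lecture itself, with made-up vectors u and v, showing that the component formula and the projection formula give the same inner product:

```python
import numpy as np

# Illustrative vectors (made-up values, not from the lecture)
u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

# Inner product computed componentwise: u1*v1 + u2*v2
inner = u @ v

# Signed length p of the projection of v onto u
p = (u @ v) / np.linalg.norm(u)

# Both ways of computing the inner product agree: u^T v = p * ||u||
print(inner)                    # 10.0
print(p * np.linalg.norm(u))    # 10.0
```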
Just one last detail: P is actually a signed quantity, and it can be either positive or negative. Let me say what I mean by that. Suppose U is a vector that looks like this and V is a vector that looks like this, so that the angle between U and V is greater than ninety degrees. Then if I project V onto U, what I get is a projection that looks like this, with length P, and in this case I will still have that U transpose V is equal to P times the norm of U, except that in this example P will be negative. So for inner products, if the angle between U and V is less than ninety degrees, then P is the positive length of that red line, whereas if the angle here is greater than ninety degrees, then P is the negative of the length of that little line segment. So the inner product between two vectors can also be negative if the angle between them is greater than ninety degrees. That's how vector inner products work.

We're going to use these properties of the vector inner product to try to understand the support vector machine optimization objective. Here is the optimization objective for the support vector machine that we worked out earlier. Just for the purposes of this slide, I'm going to make one simplification, just to make the objective easier to analyze: I'm going to ignore the intercept term, so we'll set theta 0 equal to 0. To make things easier to plot, I'm also going to set N, the number of features, equal to 2, so we have only two features, X1 and X2.

Now let's look at the objective function, the optimization objective of the SVM. When we have only two features, when N is equal to 2, this can be written as one half of theta 1 squared plus theta 2 squared, because we only have two parameters, theta 1 and theta 2. What I'm going to do is rewrite this a bit: as one half of the square of the square root of theta 1 squared plus theta 2 squared. The reason I can do that is that for any number W, the square root of W, squared, is just equal to W, so taking the square root and then squaring gives you back the same thing. What you may notice is that the term inside is equal to the norm, or the length, of the vector theta. What I mean by that is that if we write out the vector theta as theta 1, theta 2, then the term that I've just underlined in red is exactly the length, or the norm, of the vector theta, recalling the definition of the norm of a vector from the previous slide. In fact, this is equal to the length of the vector theta whether you write it as theta 0, theta 1, theta 2, with theta 0 equal to zero as I assume here, or just as theta 1, theta 2. For this slide I am going to ignore theta 0, so I'll treat theta, and the norm of theta, as just theta 1, theta 2, but the math works out either way, whether we include theta 0 or not; it's not going to matter for the rest of our derivation. So finally, this means that my optimization objective is equal to one half of the norm of theta squared. So all the support vector machine is doing in the optimization objective is minimizing the squared norm, the squared length, of the parameter vector theta.
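Written out in my notation, the rewriting just described, with n = 2 and theta 0 = 0 as assumed in the lecture, is:

$$
\min_{\theta}\ \frac{1}{2}\left(\theta_1^2 + \theta_2^2\right)
\;=\; \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\,\right)^{2}
\;=\; \frac{1}{2}\,\lVert\theta\rVert^{2}.
$$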
Now what I'd like to do is look at the terms theta transpose X(i) and understand better what they're doing. So given the parameter vector theta and given an example X(i), what is this equal to? On the previous slide we figured out what U transpose V looks like for two vectors U and V, so we're going to take those definitions, with theta and X(i) playing the roles of U and V, and see what that picture looks like.

Let's say I look at just a single training example. Let's say I have a positive example, which I'll draw as a cross there, and let's say that is my example X(i). What that really means is that on the horizontal axis it takes some value X(i)1 and on the vertical axis X(i)2; that's how I plot my training examples. And although we haven't really been thinking of this as a vector, what this really is, is a vector from the origin, from (0, 0), out to the location of this training example. Now let's say we have a parameter vector, and I'm going to plot that as a vector as well: I plot theta 1 here and theta 2 there. So what is the inner product theta transpose X(i)? Well, using our earlier method, the way we compute it is to take my example and project it onto my parameter vector theta, and then look at the length of the segment that I'm coloring in red. I'm going to call that length P superscript (i), to denote that this is the projection of the i-th training example onto the parameter vector theta. So, following what we had on the previous slide, theta transpose X(i) is equal to P(i) times the norm of the vector theta. This is of course also equal to theta 1 times X(i)1 plus theta 2 times X(i)2; each of these is an equally valid way of computing the inner product between theta and X(i).

So where does this leave us? The constraints say that theta transpose X(i) must be greater than or equal to one, or less than or equal to minus one. What this means is that we can replace those constraints with the constraints that P(i) times the norm of theta be greater than or equal to one, or less than or equal to minus one, because theta transpose X(i) is equal to P(i) times the norm of theta. So writing that into our optimization objective, this is what we get: instead of theta transpose X(i), I now have P(i) times the norm of theta. And just to remind you, we worked out earlier that this optimization objective can be written as one half times the norm of theta squared.

So now let's consider the training examples that we have at the bottom and, continuing to use the simplification that theta 0 is equal to 0, let's see what decision boundary the support vector machine will choose. Here's one option: let's say the support vector machine were to choose this decision boundary. This is not a very good choice, because it has very small margins; this decision boundary comes very close to the training examples. Let's see why the support vector machine will not do this. For this choice of parameters, it's possible to show that the parameter vector theta is actually at 90 degrees to the decision boundary, so that green decision boundary corresponds to a parameter vector theta that points in that direction. And by the way, the simplification that theta 0 equals 0 just means that the decision boundary must pass through the origin, (0, 0), over there. So now, let's look at what this implies for the optimization objective.
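As a quick reference for the analysis that follows, here is the problem in my notation, with theta 0 = 0 and C very large as assumed in this video:

$$
\min_{\theta}\ \frac{1}{2}\lVert\theta\rVert^{2}
\quad\text{subject to}\quad
p^{(i)}\lVert\theta\rVert \ge 1 \ \text{for positive examples},\qquad
p^{(i)}\lVert\theta\rVert \le -1 \ \text{for negative examples},
$$

where $p^{(i)}$ is the signed projection of $x^{(i)}$ onto $\theta$, so that $\theta^{T}x^{(i)} = p^{(i)}\lVert\theta\rVert$.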
Let's say this is my first example, X(1). If we look at the projection of this example onto my parameter vector theta, that's the projection, that little red line segment, and its length P(1) is going to be pretty small. Similarly, if this example here happens to be X(2), my second example, and I look at its projection onto theta, let me draw this one in magenta, then this little magenta line segment is P(2), the projection of the second example onto the direction of my parameter vector theta. And this projection line segment is also pretty small; in fact, P(2) will be a negative number, because it points in the opposite direction: this vector makes an angle of more than 90 degrees with my parameter vector theta, so its projection is less than 0.

So what we're finding is that these terms P(i) are going to be pretty small numbers. Now look at the optimization objective: for a positive example we need P(i) times the norm of theta to be greater than or equal to one. But if P(1) here is pretty small, that means we need the norm of theta to be pretty large, right? If P(1) is small and we want P(1) times the norm of theta to be at least one, the only way for the product of those two numbers to be that large when P(1) is small is for the norm of theta to be large. Similarly, for our negative example we need P(2) times the norm of theta to be less than or equal to minus one, and we saw in this example that P(2) is a small negative number, so again the only way for that to happen is for the norm of theta to be large. But what we are doing in the optimization objective is trying to find a setting of parameters where the norm of theta is small, and so this doesn't seem like such a good direction for the parameter vector theta.

In contrast, let's look at a different decision boundary. Let's say the SVM chooses that decision boundary instead. Now the picture is going to be very different. If that is the decision boundary, here is the corresponding direction for theta: with this decision boundary, that vertical line, it is possible to show using linear algebra that the corresponding vector theta is at 90 degrees to it. And now if you look at the projections of your data onto the vector theta: let's say, as before, this example is my example X(1). When I project it onto theta, what I find is that this length there is P(1). For the other example, X(2), I do the same projection and find that this length here is P(2), which again is going to be less than 0. You'll notice that now P(1) and P(2), these lengths of the projections, are going to be much bigger, so if we still need to enforce the constraint that P(1) times the norm of theta be at least one, then because P(1) is so much bigger now, the norm of theta can be smaller. So what this means is that by choosing the decision boundary shown on the right instead of the one on the left, the SVM can make the norm of the parameter vector theta much smaller, and therefore make the squared norm of theta smaller, which is why the SVM would choose the hypothesis on the right instead. And this is how the SVM gives rise to this large margin classification effect.
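To make the comparison concrete, here is a small numerical sketch, not from the lecture, with made-up example points and two made-up candidate directions for theta. The smallest feasible norm of theta along a given direction is one over the smallest projection magnitude, so small projections force a large norm and large projections allow a small one:

```python
import numpy as np

# Made-up 2-D training examples (theta_0 = 0, so the boundary passes through the origin)
x_pos = np.array([2.0, 1.0])    # positive example: needs theta^T x >= 1
x_neg = np.array([-2.0, -1.0])  # negative example: needs theta^T x <= -1

def min_norm_needed(direction):
    """Smallest ||theta|| along this direction that satisfies both
    margin constraints, i.e. 1 / min(|p_pos|, |p_neg|)."""
    u = direction / np.linalg.norm(direction)
    p_pos = x_pos @ u   # signed projection of the positive example onto theta
    p_neg = x_neg @ u   # signed projection of the negative example onto theta
    assert p_pos > 0 and p_neg < 0, "this direction must separate the examples"
    return 1.0 / min(abs(p_pos), abs(p_neg))

# Direction nearly orthogonal to the data: tiny projections, large norm required
print(min_norm_needed(np.array([-1.0, 2.5])))  # ~5.4
# Direction aligned with the data: large projections, small norm suffices
print(min_norm_needed(np.array([2.0, 1.0])))   # ~0.45
```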
Namely, if you look at this green line, at this green hypothesis, we want the projections of my positive and negative examples onto theta to be large, and the only way for that to hold true is if, surrounding the green line, there is a large margin, a large gap that separates the positive and negative examples. The magnitude of this margin is exactly the value of P(1), P(2), P(3), and so on, and so by making the margin large, by making these P(1), P(2), P(3), and so on be large, the SVM can end up with a smaller value for the norm of theta, which is what it is trying to do in the objective. This is why the support vector machine ends up being a large margin classifier: it is trying to make these P(i)'s large, and they are the distances from the training examples to the decision boundary.

Finally, we did this whole derivation using the simplification that the parameter theta 0 is equal to 0. The effect of that, as I mentioned briefly, is that if theta 0 is equal to 0 we are only entertaining decision boundaries that pass through the origin, like that one; if you allow theta 0 to be non-zero, then you are entertaining decision boundaries that do not pass through the origin, like the one I just drew. I'm not going to do the full derivation of that, but it turns out that the same large margin proof works in pretty much exactly the same way. There is a generalization of the argument we just went through which shows that even when theta 0 is non-zero, what the SVM is trying to do, when you have this optimization objective, which again corresponds to the case of C being very large, is still to find the large margin separator between the positive and negative examples. So that explains how the support vector machine is a large margin classifier. In the next video, we'll start to talk about how to take some of these SVM ideas and start to apply them to build complex nonlinear classifiers.
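One closing footnote, not spelled out in the lecture: the reason the magnitude of P(i) is exactly the distance from x(i) to the decision boundary (when theta 0 = 0) is that the boundary is the line theta transpose x = 0 through the origin, orthogonal to theta, so

$$
\text{dist}\!\left(x^{(i)},\ \{x : \theta^{T}x = 0\}\right) \;=\; \frac{|\theta^{T}x^{(i)}|}{\lVert\theta\rVert} \;=\; |p^{(i)}|,
$$

which is why making the P(i)'s large is the same thing as making the margin large.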