By now, you've seen a range of different learning algorithms. Within supervised learning, the performance of many algorithms will be pretty similar, and what matters will less often be whether you use learning algorithm A or learning algorithm B, and more often be things like the amount of data you train these algorithms on and your skill in applying them: things like your choice of the features you design to give the learning algorithm, how you choose the regularization parameter, and so on. But there's one more algorithm that is very powerful and very widely used, both in industry and in academia, called the support vector machine. Compared to both logistic regression and neural networks, the support vector machine, or SVM, sometimes gives a cleaner and sometimes a more powerful way of learning complex nonlinear functions, so I'd like to take the next videos to talk about it. Later in this course I will do a quick survey of a range of different supervised learning algorithms, just to describe them very briefly, but the support vector machine, given its popularity, will be the last supervised learning algorithm that I spend a significant amount of time on in this course.

As with our development of earlier learning algorithms, we're going to start by talking about the optimization objective, so let's get started. In order to describe the support vector machine, I'm actually going to start with logistic regression and show how we can modify it a bit to get what is essentially the support vector machine. In logistic regression we have our familiar form of the hypothesis, with the sigmoid activation function shown on the right, and to simplify the math I'm going to use z to denote theta transpose x.

Now let's think about what we would like logistic regression to do. If we have an example with y equal to 1, and by this I mean an example in either the training set, the test set, or the cross validation set where y is equal to 1, then we're hoping that h of x will be close to 1; that is, we're hoping to correctly classify that example. Having h of x close to 1 means that theta transpose x must be much larger than 0 (the stacked greater-than signs mean "much, much greater than 0"), because it's when z, that is theta transpose x, is far to the right of this figure that the output of logistic regression becomes close to 1. Conversely, if we have an example where y is equal to 0, then we're hoping the hypothesis will output a value close to 0, and that corresponds to theta transpose x, or z, being much less than 0.

If you look at the cost function of logistic regression, what you find is that each example (x, y) contributes a term like this to the overall cost function. For the overall cost function we would also have a sum over all the training examples and a 1 over m term, but this expression here is the term that a single training example contributes to the overall objective function for logistic regression.
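(To make that per-example term concrete, here is a small Python sketch; it is my own illustration rather than anything shown in the lecture, and the function names are made up for this example.)

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_example_cost(theta, x, y):
    # Term a single example (x, y) contributes: -y*log(h(x)) - (1 - y)*log(1 - h(x))
    z = theta @ x              # z = theta' * x
    h = sigmoid(z)             # hypothesis output, between 0 and 1
    return -y * np.log(h) - (1 - y) * np.log(1 - h)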
Now, if I take the definition of my hypothesis and plug it in over here, what I get is that each training example contributes this term (ignoring the 1 over m) to my overall cost function for logistic regression. Let's consider the two cases of y equal to 1 and y equal to 0.

In the first case, suppose that y is equal to 1. Then only the first term in the objective matters, because the 1 minus y factor is equal to 0. So when y is equal to 1, what we get is the term minus log of 1 over 1 plus e to the negative z, where, as on the last slide, I'm using z to denote theta transpose x. (In the cost we actually had a minus y in front, but since y is equal to 1, I've simplified it away in the expression written here.) If we plot this function against z, you get the curve shown on the lower left of the slide, and you see that when z is large, that is, when theta transpose x is large, we get a very small value, a very small contribution to the cost function. This explains why, when logistic regression sees a positive example with y equal to 1, it tries to set theta transpose x to be very large, because that corresponds to this term in the cost function being small.

Now, to build the support vector machine, here is what we're going to do. We're going to take this cost function, minus log 1 over 1 plus e to the negative z, and modify it a little bit. Let me take this point 1 over here and draw the cost function I'm going to use: the new cost function is going to be flat from here on out, and then I'm going to draw something that grows as a straight line, similar to logistic regression, but it's going to be a straight line in this portion. So that's the curve I just drew in magenta. It's a pretty close approximation to the cost function used by logistic regression, except that it is now made out of two line segments: a flat portion on the right and a straight-line portion on the left. Don't worry too much about the slope of the straight-line portion; it doesn't matter that much. That's the new cost function we're going to use when y is equal to 1, and you can imagine it does something pretty similar to logistic regression, but it turns out this will give the support vector machine computational advantages and, as we'll see later on, an easier optimization problem to solve.

We just talked about the case of y equal to 1. The other case is when y is equal to 0. In that case, if you look at the cost, only the second term applies, because the first term is multiplied by y and so goes away when y is equal to 0. You're left with only the second term of the expression above, so the cost of an example, the contribution to the cost function, is given by this term over here. If you plot that as a function of z, with z on the horizontal axis, you end up with this curve, and for the support vector machine, once again, we're going to replace this blue line with something similar: a new cost that is flat out here, equal to 0 out here, and then grows as a straight line, like so.
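(As a rough sketch of those two piecewise-linear costs, here is my own Python illustration; placing the kinks at z = 1 and z = -1 is my assumption based on the curves drawn on the slide, and the names cost1 and cost0 match the names given in the next part of the lecture.)

import numpy as np

def cost1(z):
    # Cost used when y = 1: flat (zero) on the right, straight line on the left
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Cost used when y = 0: flat (zero) on the left, straight line on the right
    return np.maximum(0.0, 1.0 + z)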
So let me give these two functions names. The function on the left I'm going to call cost subscript 1 of z, and the function on the right I'm going to call cost subscript 0 of z, where the subscript refers to the cost corresponding to y equal to 1 versus y equal to 0. Armed with these definitions, we are now ready to build the support vector machine.

Here is the cost function J of theta that we have for logistic regression. In case this equation looks a bit unfamiliar, it's because previously we had a minus sign outside, whereas here I've moved the minus signs inside the expression, which just makes it look a little different. For the support vector machine, what we're going to do is take this term and replace it with cost 1 of z, that is, cost 1 of theta transpose x, and take this term and replace it with cost 0 of z, that is, cost 0 of theta transpose x, where the cost 1 function is what we had on the previous slide that looks like this, and the cost 0 function, again from the previous slide, looks like this. So what we have for the support vector machine is a minimization problem of 1 over m times the sum over my training examples of y(i) times cost 1 of theta transpose x(i), plus 1 minus y(i) times cost 0 of theta transpose x(i), and then plus my usual regularization term, like so.

Now, by convention, for the support vector machine we actually write things slightly differently; we parameterize this just very slightly differently. First, we're going to get rid of the 1 over m terms. This just happens to be a slightly different convention that people use for support vector machines compared to logistic regression. What I'm going to do is just get rid of these 1 over m terms, and this should give me the same optimal value for theta, because 1 over m is just a constant: whether I solve this minimization problem with 1 over m in front or not, I should end up with the same optimal value of theta. To give you a concrete example, suppose I had the minimization problem: minimize over a real number u of (u minus 5) squared, plus 1. Well, the minimum of this is u equals 5. Now if I take this objective function and multiply it by 10, so my minimization problem becomes the minimum over u of 10 times (u minus 5) squared, plus 10, the value of u that minimizes this is still u equals 5. So multiplying something you are minimizing by some constant, 10 in this case, does not change the value of u that minimizes the function. In the same way, all I'm doing by crossing out the 1 over m is multiplying my objective function by some constant, and it doesn't change the value of theta that achieves the minimum.

The second bit of notational change is just the more standard convention when using the SVM, rather than logistic regression, and it's as follows. For logistic regression, we had two terms in our objective function. The first is this term, which is the cost that comes from the training set, and the second is this term, which is the regularization term. We controlled the trade-off between them by minimizing A plus lambda times B, where lambda is my regularization parameter.
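(To tie the pieces together, here is a rough Python sketch of the resulting SVM training objective after dropping the 1 over m terms, written with the C parameter that replaces lambda, which is introduced in the next paragraph. It is my own illustration under the same assumptions as the cost1 and cost0 sketches above, and it leaves theta 0 out of the regularization term, as is conventional.)

import numpy as np

def svm_objective(theta, X, y, C):
    # C * sum_i [ y_i*cost1(theta'x_i) + (1 - y_i)*cost0(theta'x_i) ] + 0.5 * sum_j theta_j^2
    z = X @ theta                          # one value of z = theta' * x per training example
    cost1 = np.maximum(0.0, 1.0 - z)       # per-example cost when y = 1
    cost0 = np.maximum(0.0, 1.0 + z)       # per-example cost when y = 0
    data_cost = np.sum(y * cost1 + (1 - y) * cost0)
    reg = 0.5 * np.sum(theta[1:] ** 2)     # skip theta_0 (the intercept term)
    return C * data_cost + reg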
Here I'm using A to denote that first term and B to denote the second term, without the lambda. Instead of parameterizing this as A plus lambda B, what we did was set different values for the regularization parameter lambda, so we could trade off the relative weighting between how much we want to fit the training set well, that is, minimize A, versus how much we care about keeping the values of the parameters small, that is, minimize B. For the support vector machine, just by convention, we're going to use a different parameter: instead of using lambda here to control the relative weighting between the first and second terms, we use a parameter which by convention is called C, and we minimize C times A plus B. So for logistic regression, setting a very large value of lambda means giving B a very high weight; the analogue here is setting C to be a very small value, which corresponds to giving B a much larger weight than A. This is just a different way of controlling the trade-off, a different way of parameterizing how much we care about optimizing the first term versus the second term. If you want, you can think of the parameter C as playing a role similar to 1 over lambda. It's not that these two expressions are equal, or that C is literally equal to 1 over lambda; rather, if C were equal to 1 over lambda, then these two optimization objectives would give you the same optimal value of theta. So, filling that in, I'm going to cross out lambda here and write in the constant C there. That gives us our overall optimization objective function for the support vector machine, and when you minimize that function, what you have are the parameters learned by the SVM.

Finally, unlike logistic regression, the support vector machine doesn't output a probability. Instead, we have this cost function, which we minimize to get the parameters theta, and the support vector machine just makes a prediction of y being equal to 1 or 0 directly. So the hypothesis predicts 1 if theta transpose x is greater than or equal to 0, and predicts 0 otherwise. Having learned the parameters theta, this is the form of the hypothesis for the support vector machine. So that was a mathematical definition of what a support vector machine does. In the next few videos, let's try to get intuition about what this optimization objective leads to and what sort of hypotheses an SVM will learn, and also talk about how to modify it just a little bit to learn complex, nonlinear functions.
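(As a final sketch, again my own illustration assuming theta has already been learned by minimizing the objective above, the SVM hypothesis simply thresholds theta transpose x at zero rather than outputting a probability.)

import numpy as np

def svm_predict(theta, X):
    # Predict 1 wherever theta' * x >= 0, and 0 otherwise; no probability is produced
    return (X @ theta >= 0).astype(int)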