In the last video, I talked about how, when faced with a machine learning problem, there are often lots of different ideas for how to improve the algorithm. In this video, let's talk about the concept of error analysis, which will give you a way to make some of these decisions more systematically.

If you're starting work on a machine learning product or building a machine learning application, it is often considered very good practice to start not by building a very complicated system with lots of complex features and so on, but instead by building a very simple algorithm that you can implement quickly. When I start on a learning problem, what I usually do is spend at most one day, literally at most 24 hours, getting something really quick and dirty running, frankly not at all a sophisticated system, and then test it on my cross-validation data. Once you've done that, you can plot learning curves, which is what we talked about in the previous set of videos: plot learning curves of the training and test errors to try to figure out whether your learning algorithm may be suffering from high bias, high variance, or something else, and use that to decide whether more data, more features, and so on are likely to help.

The reason this is a good approach is that when you're just starting out on a learning problem, there's really no way to tell in advance whether you need more complex features, more data, or something else. In the absence of evidence, in the absence of seeing a learning curve, it's just incredibly difficult to figure out where you should be spending your time. It's often by implementing even a very quick and dirty system and plotting learning curves that you can make these decisions. If you like, you can think of this as a way of avoiding what's sometimes called premature optimization in computer programming: the idea that we should let evidence guide our decisions on where to spend our time, rather than gut feeling, which is often wrong.

In addition to plotting learning curves, one other thing that's often very useful to do is what's called error analysis. What I mean by that is that when building, say, a spam classifier, I will often look at my cross-validation set and manually examine the emails that my algorithm is making errors on. So look at the spam emails and non-spam emails that the algorithm is misclassifying, and see if you can spot any systematic patterns in the types of examples it is misclassifying. Often this process will inspire you to design new features, or it will tell you what the current shortcomings of the system are and give you the inspiration you need to come up with improvements.

Concretely, here's a specific example. Let's say you've built a spam classifier and you have 500 examples in your cross-validation set, and let's say the algorithm has a very high error rate and misclassifies a hundred of these cross-validation examples. What I would do is manually examine these 100 errors and categorize them based on things like what type of email each one is and what cues or features you think might have helped the algorithm classify them correctly.
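To make that first step concrete, here is a minimal sketch of pulling the misclassified cross-validation examples out for manual inspection. It assumes a scikit-learn-style classifier that has already been trained; the names clf, X_cv, y_cv, and emails_cv are hypothetical stand-ins for whatever your quick and dirty implementation produces, not anything defined in this video.

```python
# A minimal sketch, assuming a fitted scikit-learn-style classifier `clf`,
# cross-validation features `X_cv`, labels `y_cv`, and the raw email texts
# `emails_cv` (all hypothetical names used for illustration).
import numpy as np

def misclassified_examples(clf, X_cv, y_cv, emails_cv):
    preds = clf.predict(X_cv)              # predictions on the cross-validation set
    wrong = np.flatnonzero(preds != y_cv)  # indices where prediction != true label
    # Return the raw emails the algorithm got wrong, with true and predicted
    # labels, ready for manual categorization.
    return [(emails_cv[i], y_cv[i], preds[i]) for i in wrong]
```

The list this returns is exactly what you would then read through by hand in the categorization step that follows.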
Specifically, by what type of email it is: if I look through these hundred errors, I may find that the most common types of spam the classifier gets wrong are, say, pharma emails, basically emails trying to sell drugs; emails trying to sell replicas, meaning fake watches and other counterfeit goods; and emails trying to steal passwords, also called phishing emails, which is another big category; plus maybe some other categories. So, in terms of classifying what type of email each one is, I would actually go through and count them up. Of my 100 misclassified emails, maybe I find that twelve are pharma emails, four are emails trying to sell replicas such as fake watches, 53 are phishing emails, basically emails trying to persuade you to give up your password, and 31 are other types of email. It's by counting up the number of emails in these different categories that you might discover, for example, that the algorithm is doing particularly poorly on emails trying to steal passwords, and that may suggest it's worth your effort to look more carefully at that type of email and see if you can come up with better features to categorize them correctly.

I would also look at what cues, or what additional features, might have helped the algorithm classify the emails. Let's say some of our hypotheses about features that might help us classify emails better are detecting deliberate misspellings, unusual email routing, and unusual spamming punctuation, such as lots of exclamation marks. Once again, I would manually go through and, say, find five cases of the first, 16 of the second, 32 of the third, and a bunch of other cues as well. If this is what you get on your cross-validation set, then it tells you that deliberate misspellings may be a sufficiently rare phenomenon that it's not really worth your time trying to write algorithms to detect them. But if you find that a lot of spammers are using unusual punctuation, then that's a strong sign it might be worth your while to spend the time to develop more sophisticated features based on punctuation.

So this sort of error analysis, which is really the process of manually examining the mistakes that the algorithm makes, can often help guide you to the most fruitful avenues to pursue. It also explains why I often recommend implementing a quick and dirty version of an algorithm: what we really want to do is figure out which examples are the most difficult for an algorithm to classify, and very often different learning algorithms will find similar categories of examples difficult. Having a quick and dirty implementation is often a fast way to identify some errors and quickly see which examples are hard, so that you can focus your efforts on those.
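As a small illustration of the tallying step, once each misclassified email has been hand-labeled with a category and the cues it exhibits, a plain counter is enough to surface where the algorithm is struggling. The lists below simply reproduce this video's illustrative counts; in practice they would come from your own notes.

```python
# A rough sketch of counting up hand-assigned error categories and cues.
# The entries reproduce the lecture's illustrative numbers.
from collections import Counter

error_categories = (["pharma"] * 12 + ["replica"] * 4 +
                    ["phishing"] * 53 + ["other"] * 31)
print(Counter(error_categories).most_common())
# e.g. [('phishing', 53), ('other', 31), ('pharma', 12), ('replica', 4)]

error_cues = (["deliberate misspelling"] * 5 + ["unusual routing"] * 16 +
              ["unusual punctuation"] * 32)
print(Counter(error_cues).most_common())
# e.g. [('unusual punctuation', 32), ('unusual routing', 16),
#       ('deliberate misspelling', 5)]
```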
Lastly, when developing learning algorithms, one other useful tip is to make sure that you have a numerical evaluation of your learning algorithm. What I mean by that is that it is often incredibly helpful to have a way of evaluating your learning algorithm that gives you back a single real number, maybe accuracy, maybe error, but a single real number that tells you how well it is doing. I'll talk more about this concept in later videos, but here's a specific example.

Let's say we are trying to decide whether or not we should treat words like discount, discounts, discounter, and discounting as the same word. One way to do that is to just look at the first few characters of each word; if you do, you might conclude that all of these words have roughly similar meanings. In natural language processing, this is done using what's called stemming software. If you ever want to do this yourself, search the web for the Porter Stemmer, which is one reasonable piece of software for this sort of stemming; it will let you treat discount, discounts, and so on as the same word. But stemming software, which basically looks at just the first few characters of a word, can help, and it can also hurt. It can hurt because, for example, it may treat the words universe and university as the same thing, since the two words start with the same characters. So if you're trying to decide whether or not to use stemming software for your spam classifier, it is not always easy to tell, and in particular, error analysis may not actually be helpful for deciding whether this sort of stemming idea is a good one.

Instead, the best way to figure out whether stemming software helps your classifier is to have a way to very quickly just try it and see if it works, and for that, having a way to numerically evaluate your algorithm is going to be very helpful. Concretely, maybe the most natural thing to do is to look at the cross-validation error of the algorithm with and without stemming. If you run your algorithm without stemming and end up with, say, five percent classification error, and you re-run it with stemming and end up with, say, three percent classification error, then this decrease in error very quickly lets you decide that using stemming looks like a good idea for this particular problem. Here there is a very natural single real number evaluation metric, namely the cross-validation error. We'll see examples later where coming up with this sort of single real number evaluation metric takes a bit more work, but as we'll see in a later video, doing so will let you make decisions like whether or not to use stemming much more quickly.
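If you want to try the stemming idea itself, here is a minimal sketch using NLTK's implementation of the Porter Stemmer, which is one readily available version of the software mentioned above. The word lists just replay the examples from this video, and the exact stems you get may depend on the NLTK version installed.

```python
# A minimal sketch using NLTK's Porter Stemmer (requires the nltk package).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# discount / discounts / discounter / discounting should collapse to a
# single stem, which is the behavior we want here...
print([stemmer.stem(w) for w in ["discount", "discounts", "discounter", "discounting"]])

# ...but stemming can also hurt: universe and university start with the
# same characters and can end up conflated into the same stem.
print([stemmer.stem(w) for w in ["universe", "university"]])
```

Whether that trade-off helps or hurts your classifier is exactly what the single real number, the cross-validation error, lets you check quickly.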
Here's just one more quick example. Let's say you're also trying to decide whether or not to distinguish between upper and lower case: should the word Mom with an uppercase M and mom with a lowercase m be treated as the same word or as different words, the same feature or different features? Once again, because we have a way to evaluate our algorithm, I can just try it out. If I stop distinguishing upper and lower case, maybe I end up with 3.2% error, and I find that this does worse than the three percent I got using stemming alone, so the comparison very quickly tells me to keep distinguishing between upper and lower case.

When you're developing a learning algorithm, very often you'll be trying out lots of new ideas and lots of new versions of the algorithm. If every time you try out a new idea you end up manually examining a bunch of examples to judge whether it's better or worse, that's going to make it really hard to decide whether to use stemming or whether to distinguish upper and lower case. But with a single real number evaluation metric, you can just look at whether the error went up or went down. You can use it to try out new ideas much more rapidly and tell almost right away whether your new idea has improved or worsened the performance of the learning algorithm, and this will often let you make much faster progress.

One note: the strongly recommended place to do error analysis is the cross-validation set rather than the test set. Some people do it on the test set, but that is mathematically less appropriate and less recommended than doing your error analysis on the cross-validation set.

So, to wrap up this video: when starting on a new machine learning problem, what I almost always recommend is to implement a quick and dirty version of your learning algorithm. I've almost never seen anyone spend too little time on this quick and dirty implementation; I pretty much only ever see people spend much too much time building their first, supposedly quick and dirty, implementation. So really, don't worry about it being too quick or too dirty. Implement something as quickly as you can. Once you have that initial implementation, it becomes a powerful tool for deciding where to spend your time next, because first, you can look at the errors it makes and do this sort of error analysis to see what mistakes it makes and use that to inspire further development, and second, assuming your quick and dirty implementation incorporates a single real number evaluation metric, it becomes a vehicle for trying out different ideas and quickly seeing whether they improve the performance of your algorithm, and therefore lets you much more quickly decide which ideas to fold in and incorporate into your learning algorithm.
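Putting the numbers from this video together, here is a compact sketch of what letting a single real number drive these decisions looks like. The error figures (5%, 3%, 3.2%) are just the illustrative values used above, not measured results; in practice each entry would come from re-running your own pipeline with that preprocessing variant.

```python
# A compact sketch of comparing preprocessing variants by one metric.
# The error values are the lecture's illustrative numbers, not real measurements.
cv_error = {
    "no stemming":                0.050,
    "stemming, case-sensitive":   0.030,
    "stemming, case-insensitive": 0.032,
}

for variant, err in sorted(cv_error.items(), key=lambda kv: kv[1]):
    print(f"{variant:30s} cross-validation error = {err:.1%}")

best = min(cv_error, key=cv_error.get)
print("pick:", best)  # one glance at the numbers settles each decision
```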