[MUSIC] In the regression module, we talked about the relationship between error, or accuracy, and the complexity of the model. Let's talk a little bit about that relationship in terms of the amount of data you have to learn from, and explore the question of how much data we need to learn. This is a really difficult and complex question in machine learning. Of course, the more data you have, the better, as long as the quality of the data is good. Lots of bad data is much worse than having far fewer, but really clean, high-quality data points. Now, there are some theoretical techniques to analyze how much data you need. Many of those help you understand the overall trends, but they tend to be too loose to use in practice. In practice, there are empirical techniques to try to understand how much error we're making and what that error looks like. In the follow-up courses, we're gonna explore those techniques much further, but let me give you a little bit of guidance and insight into what they can do on the classification side.

Now, an important representation of this relationship between data and quality is what's called the learning curve. A learning curve relates the amount of data that we have for training with the error that we're making, and here we're talking about test error. If you have very little data for training, then your test error is going to be high. But if you have a lot of data for training, your test error is going to be low. The curve is gonna get better and better as you get more and more data. This is an example learning curve where the quality is getting better as we add more data.

Now you may ask, is there a limit? Is this quality just going to get better and better forever as you add more data? We know that the test error is going to decrease as we add more data. However, there is some gap here, and the question is whether that gap can go to zero. The answer, in general, is no. This gap is called the bias.

So let's discuss a little bit what this bias, or this gap, is. Intuitively, it says that even with infinite data, the test error will not go to zero. So let's understand why. More complex models tend to have less bias. If you look at a sentiment analysis classifier that we may be building, and you just use single words like awesome, good, great, terrible, awful, it can do okay. Maybe it does really well, maybe it just does okay. But even if you have infinite data, even with all the data in the world, you're never gonna get this sentence right: "The sushi was not good." That's because you're not looking at pairs of words; you're just looking at the words "good" and "not" individually. More complex models deal with combinations of words, for example what's simply called the bigram model, where you look at pairs of consecutive words like "not good". Those models require more parameters, because there are more possibilities. They can do better: they may have a parameter for "good", say +1.5, but also a parameter for "not good", say -2.1, and actually get that sentence, "The sushi was not good," right. So they have less bias. They can represent sentences that couldn't be represented with single words, so they're potentially more accurate. But they need more data to learn, because there are more parameters.
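To make that contrast concrete, here's a minimal sketch in Python. This is assumed code, not from the course; the weights +1.5 for "good" and -2.1 for "not good" are the illustrative values from the lecture, and everything else is made up for the example.

def unigram_features(sentence):
    # Count each single word in the sentence.
    words = sentence.lower().split()
    return {w: words.count(w) for w in set(words)}

def bigram_features(sentence):
    # Count each pair of consecutive words, e.g. "not good".
    words = sentence.lower().split()
    pairs = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return {p: pairs.count(p) for p in set(pairs)}

def score(features, weights):
    # Linear score: sum of weight * count over the features the model knows.
    return sum(weights.get(f, 0.0) * n for f, n in features.items())

sentence = "the sushi was not good"

# Unigram model: one parameter per single word. It only sees "good".
print(score(unigram_features(sentence), {"good": 1.5}))   # 1.5 -> wrongly positive

# Bigram model: parameters for single words AND consecutive pairs.
features = {**unigram_features(sentence), **bigram_features(sentence)}
print(score(features, {"good": 1.5, "not good": -2.1}))   # -0.6 -> correctly negative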
There's not just a parameter for "good"; there's now a parameter for "not good", and for all possible combinations of words. And in general, the more parameters your model has, the more data you need to learn. So let's go back to our example, where we talked about the effect of the amount of training data on the test error. Let's say that I'm building a classifier using single words. The question is, how does that compare to a classifier based on pairs of words? Now, for a classifier based on bigrams, when you have less data, it's not going to do as well, because it has more parameters to fit. But when you have more data, it's going to do better, because it's going to be able to capture sentences like "The sushi was not good." And so the behavior you're gonna get is something like this: at some point, there's a crossover, where the bigram classifier starts doing better than the classifier with single words. But notice that the bigram model still has some bias here. Although the bias is smaller, it still has some bias.
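As a rough sketch of how you might trace these learning curves empirically, here's some assumed Python code (not from the course) using scikit-learn, where ngram_range=(1, 1) gives the single-word model and ngram_range=(1, 2) adds pairs of consecutive words; the tiny synthetic corpus is a hypothetical stand-in for a real labeled sentiment dataset.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in corpus: positive and negative snippets, with negations.
pos = ["the sushi was good", "awesome food and great service", "really good meal"]
neg = ["the sushi was not good", "terrible awful service", "not great at all"]
texts = (pos + neg) * 200
labels = ([1] * len(pos) + [0] * len(neg)) * 200

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0)

def test_error(ngram_range, n_train):
    # Fit on the first n_train training sentences; report error on the test set.
    model = make_pipeline(CountVectorizer(ngram_range=ngram_range),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train[:n_train], y_train[:n_train])
    return 1.0 - model.score(X_test, y_test)

for n in [20, 50, 100, 200, 400, 800]:
    print(n,
          test_error((1, 1), n),   # single words (unigrams)
          test_error((1, 2), n))   # words plus pairs (bigrams)

On a real corpus, you'd expect the bigram curve to start above the unigram curve when the training set is small (more parameters to fit) and cross below it as the data grows, flattening out at a lower, but still nonzero, bias. [MUSIC]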