[MUSIC] In the regression module, we talked about the relationship between error, or accuracy, and the complexity of the model. Let's talk a little bit about that relationship in terms of the amount of data you have to learn from, and explore the question of how much data we need to learn. This is a really difficult and complex question in machine learning. Of course, the more data you have, the better, as long as the quality of the data is good. Lots of bad data is much worse than having far fewer, but really clean, high-quality data points.

Now, there are some theoretical techniques to analyze how much data you need. Many of those help you understand the overall trends, but they tend to be too loose to use in practice. In practice, there are some empirical techniques to really try to understand how much error we're making and what that kind of error looks like. In the follow-up courses, we're going to explore those techniques much further, but let me give you a little bit of guidance and a little insight into what they can do on the classification side.

Now, an important representation of this relationship between data and quality is what's called the learning curve. A learning curve relates the amount of data that we have for training with the error that we're making, and here we're talking about test error. If you have very little data for training, then your test error is going to be high. But if you have a lot of data for training, your test error is going to be low. And the curve is going to get better and better as you get more and more data.

Whoops, didn't go through that point, so I'm going to just erase it. Now here we go. This is an example learning curve, where the quality is getting better as we add more data. Now you may ask, is there a limit? Is the quality just going to get better and better forever as you add more data? We know that the test error is going to decrease as we add more data. However, there is some gap here, and the question is whether that gap can go to zero. The answer, in general, is no. This gap is called the bias.
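To make the learning curve concrete, here is a minimal sketch, not from the lecture, that plots test error against the number of training examples; the synthetic dataset and the logistic regression model are illustrative assumptions.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy classification data (an assumption for illustration).
X, y = make_classification(n_samples=20000, n_features=20, flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sizes = [50, 100, 250, 500, 1000, 2500, 5000, 10000]
errors = []
for n in sizes:
    # Train on the first n examples, then measure error on the held-out test set.
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    errors.append(1.0 - model.score(X_test, y_test))  # test error = 1 - accuracy

# Error drops as training data grows, but flattens out above zero:
# that remaining gap is the bias the lecture describes.
plt.plot(sizes, errors, marker="o")
plt.xscale("log")
plt.xlabel("# training examples")
plt.ylabel("test error")
plt.show()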
So let's discuss a little bit what this bias, or this gap, is. Intuitively, it says that even with infinite data, the test error will not go to zero. So let's understand why a little bit.

More complex models tend to have less bias. If you look at a sentiment analysis classifier that we may be building, and you just use single words like awesome, good, great, terrible, awful, it can do okay. Maybe it does really well, maybe it just does okay. But even if you have infinite data, even with all the data in the world, you're never going to get this sentence right: "The sushi was not good." That's because you're not looking at pairs of words; you're just looking at the words "good" and "not" individually.

And so more complex models, for example ones that deal with combinations of words, like what's simply called the bigram model, where you look at pairs of adjacent words like "not good", require more parameters, because there are more possibilities. They can do better. They may have a parameter for "good", say 1.5, but for "not good", say -2.1, and actually get that sentence, "The sushi was not good", right. So they have less bias. They can represent sentences that couldn't be represented with single words, so they're potentially more accurate. But they need more data to learn, because there are more parameters. There's not just a parameter for "good"; there's now a parameter for "not good", and for all possible combinations of words. And the more parameters your model has, in general, the more data you need to learn.

So let's go back to our example. We talked about the effect of the amount of training data on the test error. Let's say that I'm building a classifier using single words. The question is, how does that compare to a classifier based on pairs of words? Now, for a classifier based on bigrams, when you have less data, it's not going to do as well, because it has more parameters to fit. But when you have more data, it's going to do better, because it's going to be able to capture settings like "The sushi was not good." And so the behavior you're going to get is something like this. At some point, there's a crossover, where it starts doing better than the classifier with single words.
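Here is a tiny sketch of that "not good" point, using the illustrative weights from the lecture (1.5 for "good", -2.1 for "not good"); treating every other word's weight as zero is my assumption.

def score(sentence, weights):
    """Sum feature weights over the unigrams and bigrams in the sentence."""
    tokens = sentence.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return sum(weights.get(g, 0.0) for g in grams)

unigram_model = {"good": 1.5}                    # single-word features only
bigram_model = {"good": 1.5, "not good": -2.1}   # adds a pair-of-words feature

s = "the sushi was not good"
print(score(s, unigram_model))  # 1.5  -> positive prediction (wrong)
print(score(s, bigram_model))   # -0.6 -> negative prediction (right)

No amount of data fixes the unigram model here, because no weight on "good" alone can flip its sign when "not" precedes it; the bigram feature is what removes that bias.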
But notice that the bigram model still has some bias here. Although the bias is smaller, it still has some bias.

[MUSIC]
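To see that crossover behavior in code, here is a minimal sketch, again with a synthetic sentiment dataset and model choices of my own rather than the lecture's experiment. The unigram model plateaus well above zero error, its bias: no linear weighting of single words can handle "not good" and "not awful" at the same time. The richer bigram model can, but it has many more parameters to fit, so with very little data it may do worse before crossing over.

import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = random.Random(0)
FILLER = [f"word{i}" for i in range(200)]

def review():
    """Synthetic review: 'not' flips the sentiment, which unigrams can't capture."""
    word, label = rng.choice([("good", 1), ("great", 1), ("awful", 0), ("terrible", 0)])
    negate = rng.random() < 0.4
    text = " ".join(rng.sample(FILLER, 5)) + (" not " if negate else " ") + word
    return text, (1 - label) if negate else label

data = [review() for _ in range(11000)]
texts, labels = [t for t, _ in data], [l for _, l in data]
test_texts, test_labels = texts[10000:], labels[10000:]

for ngram_max, name in [(1, "unigram model"), (2, "bigram model")]:
    for n in [100, 1000, 10000]:
        clf = make_pipeline(CountVectorizer(ngram_range=(1, ngram_max)),
                            LogisticRegression(max_iter=2000))
        clf.fit(texts[:n], labels[:n])
        err = 1.0 - clf.score(test_texts, test_labels)
        print(f"{name}, n={n}: test error = {err:.3f}")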