1 00:00:04,920 --> 00:00:07,270 Let's take a couple minutes now to dig in and 2 00:00:07,270 --> 00:00:10,630 see what's going to happen in each module of this course. 3 00:00:10,630 --> 00:00:11,880 So this was the overview. 4 00:00:11,880 --> 00:00:15,740 We're going to have 9 modules, and some concepts are going to appear in 5 00:00:15,740 --> 00:00:19,870 multiple modules, while some modules are going to cover just one concept. 6 00:00:19,870 --> 00:00:22,880 Some concepts are going to span multiple modules. 7 00:00:22,880 --> 00:00:27,400 But in general, we're going to see a pretty cohesive presentation where we start 8 00:00:27,400 --> 00:00:30,930 with everything about linear classifiers and logistic regression and 9 00:00:30,930 --> 00:00:32,750 everything about decision trees. 10 00:00:32,750 --> 00:00:36,120 And then we'll cover boosting and some advanced topics. 11 00:00:37,710 --> 00:00:40,400 So in particular, we're going to start with linear classifiers. 12 00:00:40,400 --> 00:00:42,850 And we discussed those in the first course. 13 00:00:42,850 --> 00:00:48,050 So for example, in a sentiment analysis case we have two words that I care about: 14 00:00:48,050 --> 00:00:50,590 the number of times the word awful appears in the review and 15 00:00:50,590 --> 00:00:53,180 the number of times the word awesome appears in the review. 16 00:00:53,180 --> 00:00:57,070 And the linear classifier might say that every awesome is worth plus one, and 17 00:00:57,070 --> 00:00:58,780 every awful is worth minus 1.5. 18 00:00:58,780 --> 00:01:03,830 And when you have a particular review, for 19 00:01:03,830 --> 00:01:07,650 example like the one at the bottom with three awesomes and zero awfuls, we'll 20 00:01:07,650 --> 00:01:12,350 classify it as positive because its score is greater than zero, and 21 00:01:12,350 --> 00:01:14,440 that's true for everything below that line. 22 00:01:14,440 --> 00:01:18,030 And everything above the line has a score of less than zero, and 23 00:01:18,030 --> 00:01:20,440 we're going to classify those as negative points. 24 00:01:20,440 --> 00:01:23,400 We discussed this concept of linear classifiers in the first course. 25 00:01:23,400 --> 00:01:26,360 We'll go in depth and really understand those in the first module. 26 00:01:26,360 --> 00:01:30,060 But then we'll extend it with something called logistic regression, 27 00:01:30,060 --> 00:01:33,500 which is going to allow us to not only predict plus one or 28 00:01:33,500 --> 00:01:38,230 minus one, but also assign a probability to every data point. 29 00:01:38,230 --> 00:01:42,420 So for example, the points in bright green are very likely to be positive. 30 00:01:42,420 --> 00:01:47,160 The points in bright fuchsia are very likely to be negative. 31 00:01:47,160 --> 00:01:49,928 But the white band in between is an area where we're less certain, 32 00:01:49,928 --> 00:01:51,630 so it's a 50/50 probability. 33 00:01:51,630 --> 00:01:53,730 We're going to be able to predict those probabilities, and 34 00:01:53,730 --> 00:01:58,540 that's going to really change how our classifiers get used. 35 00:01:58,540 --> 00:02:01,430 Now that we've defined linear classifiers and logistic regression, 36 00:02:01,430 --> 00:02:05,510 in the second module we're going to figure out how to learn the parameters, 37 00:02:05,510 --> 00:02:08,690 the coefficients of the classifier, from data.
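To make the sentiment example above concrete, here is a minimal sketch in Python of the scoring and probability computation just described. The coefficients are simply the plus one / minus 1.5 values from the example, and the names (score_review, sigmoid) are illustrative, not anything taken from the course materials.

import math

# Word coefficients from the example: each "awesome" is worth +1,
# each "awful" is worth -1.5.
coefficients = {"awesome": 1.0, "awful": -1.5}

def score_review(word_counts):
    # Linear classifier score: sum of coefficient * count over the words we care about.
    return sum(coefficients.get(word, 0.0) * count
               for word, count in word_counts.items())

def sigmoid(score):
    # Logistic regression maps the score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

review = {"awesome": 3, "awful": 0}          # the review at the bottom of the plot
s = score_review(review)                     # 3*1.0 + 0*(-1.5) = 3.0 > 0, so predict +1
print("P(positive):", round(sigmoid(s), 2))  # about 0.95, well away from the 50/50 white band

Here the coefficients are fixed by hand; the second module is about learning them from data instead.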
38 00:02:08,690 --> 00:02:12,530 So we'll define the notion of likelihood, which measures how well a line 39 00:02:13,590 --> 00:02:18,650 classifies the data, and for different values of the coefficients 40 00:02:18,650 --> 00:02:21,650 we're going to have different lines, or different classifiers, and 41 00:02:21,650 --> 00:02:27,230 we're going to use gradient descent to find the best possible classifier, 42 00:02:27,230 --> 00:02:32,290 in a similar way to what we did in linear regression in the previous course. 43 00:02:33,970 --> 00:02:37,340 Overfitting can be a really significant problem in classification. 44 00:02:37,340 --> 00:02:40,140 So here I'm plotting the classification error on the y-axis, 45 00:02:40,140 --> 00:02:42,660 as we make models more and more complex. 46 00:02:42,660 --> 00:02:43,840 And as you know, 47 00:02:43,840 --> 00:02:48,090 the training error will go to zero as you make the model more complex. 48 00:02:48,090 --> 00:02:53,620 But the true error will go down and eventually go up as we overfit. 49 00:02:53,620 --> 00:02:56,560 So, in classification, the way we're going to experience that is our decision 50 00:02:56,560 --> 00:03:00,215 boundaries will start very simple, like a line separating positives and negatives, 51 00:03:00,215 --> 00:03:04,370 and then maybe become a curve, a parabola, that fits the data pretty well. 52 00:03:04,370 --> 00:03:07,310 But as we make the model too complex, we end up with these 53 00:03:07,310 --> 00:03:11,580 crazy decision boundaries, and eventually, in the last one here, we see a boundary fitting 54 00:03:11,580 --> 00:03:16,306 the data way too well, in an unbelievable way. 55 00:03:16,306 --> 00:03:20,230 So we'll see these amazing visualizations, and we'll address this issue by 56 00:03:20,230 --> 00:03:24,170 introducing regularization in a very similar way to what we did with linear 57 00:03:24,170 --> 00:03:27,830 regression in the previous course, and that will be the focus of the third module. 58 00:03:29,250 --> 00:03:30,280 In the fourth module, 59 00:03:30,280 --> 00:03:34,140 we're going to explore a whole new kind of classifier called the decision tree. 60 00:03:34,140 --> 00:03:36,720 Decision trees are extremely useful in practice. 61 00:03:36,720 --> 00:03:38,740 They're simple, easy to understand. 62 00:03:38,740 --> 00:03:43,870 They can fit data really well, providing nonlinear fits to the data. 63 00:03:43,870 --> 00:03:48,140 So, here's an example where I'm trying to predict whether a loan I might take out 64 00:03:48,140 --> 00:03:52,760 at the bank is going to be a good, safe loan, or is it going to be a risky loan. 65 00:03:52,760 --> 00:03:56,960 So, you might walk into the bank and the loan officer might ask you, what's your 66 00:03:56,960 --> 00:04:00,750 credit history like? Have you been really good, have you paid your previous loans? 67 00:04:00,750 --> 00:04:04,950 If you have, then the loan is likely to be safe, likely to be a good loan. 68 00:04:04,950 --> 00:04:08,960 If your credit has just been okay, fair, then it depends. 69 00:04:08,960 --> 00:04:11,230 If it's a short loan, then I'm worried, but 70 00:04:11,230 --> 00:04:14,130 if it's a long loan, then eventually you'll pay it off. 71 00:04:14,130 --> 00:04:19,340 Now, if your credit history is bad, don't worry, not all is lost. 72 00:04:19,340 --> 00:04:20,780 Maybe they'll ask about your income.
73 00:04:20,780 --> 00:04:23,890 If your income is high and you're asking for a 74 00:04:23,890 --> 00:04:26,540 long-term loan, then maybe it's okay to make you the loan. 75 00:04:26,540 --> 00:04:30,070 But if your credit history is bad and your income is low, then, you know, 76 00:04:30,070 --> 00:04:31,890 don't even ask for it. 77 00:04:31,890 --> 00:04:36,510 So that's how a decision tree can capture these really elaborate, yet 78 00:04:36,510 --> 00:04:39,670 very explainable, cuts over the data (see the sketch below). 79 00:04:40,830 --> 00:04:45,850 In Module 5, we're going to see that overfitting is not just a bad, bad problem 80 00:04:45,850 --> 00:04:50,450 with logistic regression, but it's also a bad, bad problem with decision trees. 81 00:04:50,450 --> 00:04:54,730 Here, as you make those trees deeper and deeper and deeper, 82 00:04:54,730 --> 00:04:58,980 those decision boundaries can get very, very complicated, and really overfit. 83 00:04:58,980 --> 00:05:01,870 So, we're going to have to do something about it. 84 00:05:01,870 --> 00:05:06,100 What we're going to do is use a fundamental concept called Occam's Razor, 85 00:05:06,100 --> 00:05:09,100 where you try to find the simplest explanation for your data. 86 00:05:09,100 --> 00:05:14,440 And this concept goes back way before Occam, who was around in the 13th century. 87 00:05:14,440 --> 00:05:18,130 It goes back to Pythagoras and Aristotle, and 88 00:05:18,130 --> 00:05:22,870 those folks said the simplest explanation is often the best one. 89 00:05:22,870 --> 00:05:27,807 So we're going to take these really complex, deep trees and find simpler ones that give 90 00:05:27,807 --> 00:05:31,248 you better performance and are less prone to overfitting. 91 00:05:31,248 --> 00:05:31,748 [MUSIC]
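Here is the promised sketch of the loan example written as a decision tree of nested if/else branches in Python. The feature names and splits simply mirror the story told above and are illustrative; in the course, trees like this are learned automatically from data rather than written by hand.

# A hand-written stand-in for the loan decision tree described above.
def predict_loan(credit, term, income):
    if credit == "excellent":
        return "safe"                                    # good history: likely a good loan
    elif credit == "fair":
        return "risky" if term == "short" else "safe"    # fair credit: the loan term decides
    else:                                                # bad credit history
        if income == "high" and term == "long":
            return "safe"                                # high income, long term: maybe okay
        return "risky"                                   # low income: don't even ask

print(predict_loan("excellent", "short", "low"))   # safe
print(predict_loan("fair", "short", "high"))       # risky
print(predict_loan("poor", "long", "high"))        # safe

Each branch is a simple, explainable cut on one feature, which is what makes the resulting predictions easy to justify.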