Now that we've seen how regularization can play a role in the classification setting, let's observe in our data set what happens to the decision boundaries as we introduce our regularization penalty. We're going to work with those degree-20 features: a logistic regression model with polynomial features of degree 20, which led to what I gave the technical term the "crazy decision boundary." And the parameters had very large magnitude; in fact, they ranged from -3,170 to 3,803. They were very big. Now we're going to take the same setting, same number of features, same features, same data, same everything, but just vary the parameter lambda and see what happens, and here we're showing the results of doing exactly that. When lambda is equal to zero, we get very large coefficients; when lambda is large, like ten, we get reasonably sized, much smaller coefficients.

Okay, so for lambda equals zero, we had that crazy decision boundary, and for large lambdas, we have a nicer, smoother boundary. In fact, I trust this boundary with lambda equal to ten much more than the one with lambda equal to zero. And the decision boundary for lambda equals ten looks a lot like that really beautiful one I got with the parabola, which fit my data really well. But here there are tons more features, and nevertheless, adding a little bit of regularization helps us get that really nice separating boundary that I can trust.

We can also look at the coefficient path: what happens to each coefficient as we increase the penalty lambda. In the beginning, when we have an unregularized problem, the coefficients tend to be large. But as we increase lambda, they become smaller and smaller, going towards zero.
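To make that experiment concrete, here is a minimal sketch in Python with scikit-learn, under assumptions that differ from the lecture: a synthetic two-feature data set stands in for the course's data, and scikit-learn expresses the L2 penalty through C = 1/lambda, so "lambda = 0" is approximated by a tiny lambda.

# Minimal sketch of the lecture's experiment (assumptions: synthetic data,
# scikit-learn's C = 1/lambda parameterization; not the course's own code).
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for lam in [1e-6, 0.01, 1.0, 10.0]:   # 1e-6 stands in for "no regularization"
    model = make_pipeline(
        PolynomialFeatures(degree=20, include_bias=False),
        LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=10000),
    )
    model.fit(X, y)
    coefs = model.named_steps["logisticregression"].coef_.ravel()
    print(f"lambda={lam:g}: coefficients range from "
          f"{coefs.min():.1f} to {coefs.max():.1f}")

The exact numbers will differ from the lecture's plots, but the pattern should match: huge coefficients when lambda is near zero, and modest ones when lambda is around ten.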
I've used the product review data set here, picked a few words, and fit a logistic regression model using just those words, with different levels of regularization. The words that have positive coefficients tend to be associated with positive aspects of reviews, while the ones with negative coefficients tend to be associated with negative aspects of reviews. What is the "word," in quotes, that has the most positive weight? Well, if you look at the key here, you'll see that the word with the most positive weight is actually an emoticon, the smiley face, while the word with the most negative weight is another emoticon, the sad face. In the beginning, all these words have pretty large coefficients, except the ones near zero, which are words like "this" and "review" that are not associated with either positive or negative things. Although, if the word "review" shows up, it is slightly correlated with a negative review, in general these coefficients are much smaller than the others. And as I increase the regularization penalty lambda, you see the coefficients become smaller and smaller, and if I were to keep drawing this, they would eventually go to zero. Now, if I were to use cross validation to pick the best lambda, I would get a result around here, and I'm going to call that lambda star. And that's what you do with cross validation: find the point where the model fits the data pretty well but doesn't over-fit it too much.

As a last point, I'm going to show you something that is pretty exciting, really beautiful, about regularization in logistic regression. Regularization doesn't only address the crazy, wiggly decision boundaries; it also addresses those over-confidence problems that we saw with over-fitting in logistic regression. So I'm taking the same coefficients, the same models that I've learned: as lambda increases, the range of the coefficients decreases, they get smaller. But at the bottom here I'm showing the actual decision boundaries that we learned and the notion of uncertainty in the predictions. So if lambda is equal to zero, we have these highly over-confident predictions. If lambda is equal to one, not only do I get a more natural, parabola-like decision boundary, even though I'm using polynomial degree-20 features, I also get a very natural uncertainty region. The region where I don't know whether a point is positive or negative is really those points near the boundary, between the cluster of positive points and the cluster of negative points. And you get this kind of beautiful, smooth transition. So by introducing regularization, we've now addressed those two fundamental problems where over-fitting comes in in logistic regression.
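As a follow-up, here is a minimal sketch of the lambda-selection step and the over-confidence check, again with scikit-learn on a synthetic two-feature data set rather than the review data; the lambda grid and the probe points are assumptions for illustration, and as before C = 1/lambda.

# Minimal sketch: pick lambda by cross validation, then compare how
# extreme the predicted probabilities are with and without regularization.
# Synthetic data and lambda grid are assumptions, not the course's setup.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Cross-validate over a grid of lambdas (scikit-learn takes C = 1/lambda).
lambdas = np.logspace(-3, 3, 13)
cv_model = make_pipeline(
    PolynomialFeatures(degree=20, include_bias=False),
    LogisticRegressionCV(Cs=1.0 / lambdas, cv=5, penalty="l2", max_iter=10000),
)
cv_model.fit(X, y)
best_C = cv_model.named_steps["logisticregressioncv"].C_[0]
print(f"lambda* chosen by cross validation: {1.0 / best_C:g}")

# Compare predicted probabilities for an (almost) unregularized fit and
# the cross-validated one; the regularized model should be less extreme.
def fit_with_C(C):
    return make_pipeline(
        PolynomialFeatures(degree=20, include_bias=False),
        LogisticRegression(penalty="l2", C=C, max_iter=10000),
    ).fit(X, y)

probe = X[:5]  # a few training points, purely for illustration
print("lambda ~ 0:", fit_with_C(1e6).predict_proba(probe)[:, 1].round(3))
print("lambda*   :", fit_with_C(best_C).predict_proba(probe)[:, 1].round(3))

The particular lambda* will not match the lecture's; the point is only that the cross-validated fit pulls the predicted probabilities away from 0 and 1 near the boundary, which is the same smooth uncertainty region described above.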