Let's take a moment to explore the decision boundary that we played with quite a bit in this lecture. This is the one where 1.0 times the number of #awesomes minus 1.5 times the number of #awfuls is equal to 0. In this case, all the points below the line have a score of xi greater than 0, so they're all labeled as positive. And all the ones on the other side have a score of xi less than 0, so we label them as negative.

But let's think about the probabilities a little bit more. If we take this point over here, it's close to the boundary but it's on the positive side, so it should be getting a probability that is greater than 0.5, but not that much greater than 0.5. So let's see what the score looks like for this one, which we can measure directly. It has four #awesomes, so those count as 4. And it has two #awfuls, so those count as -1.5 times 2, which is -3. So the score is 4 - 1.5 times 2, which gives you a total of 1. And if you push that through the sigmoid, you see that the probability that y = +1 given this particular xi is 0.73. You can read that from the graph as well: if I take the score of 1 and trace it over to the curve, I get 0.73.

Now take another example that's further away from the boundary, like this one over here, where I feel more sure, more confident. The probability that y = +1 should be higher. In this case, the score of xi is 3: it has three #awesomes and no #awfuls, which implies that our prediction, p hat of y = +1 given xi, is 0.95, which is much larger than for the one that was closer to the boundary.

Very good. Now let's take a final datapoint, on the other side of the boundary. If you take this datapoint over here and compute its score, it has one #awesome, so that counts as 1. But it has three #awfuls, so those count as -1.5 times 3. The total score is -3.5, so it's very negative. We should be very sure that this is a negative example, so the probability that y = +1 given this particular xi should be really low.
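If you want to check these three worked examples yourself, here is a minimal Python sketch. The sigmoid and the coefficients 1.0 and -1.5 are straight from the lecture; the function and variable names are just illustrative.

```python
import math

def sigmoid(score):
    # Logistic function: maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

# Coefficients of the lecture's decision boundary:
# Score(x) = 1.0 * #awesomes - 1.5 * #awfuls
W_AWESOME, W_AWFUL = 1.0, -1.5

def prob_positive(n_awesome, n_awful):
    score = W_AWESOME * n_awesome + W_AWFUL * n_awful
    return score, sigmoid(score)

# The three reviews discussed above:
for n_awesome, n_awful in [(4, 2), (3, 0), (1, 3)]:
    score, p = prob_positive(n_awesome, n_awful)
    print(f"#awesomes={n_awesome}, #awfuls={n_awful}: "
          f"score={score:+.1f}, P(y=+1|x)={p:.2f}")
# -> score=+1.0, P=0.73;  score=+3.0, P=0.95;  score=-3.5, P=0.03
```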
So we're at a score of -3.5 here, and the probability should be really low. You can see that from the graph, and it turns out the probability is 0.03 when you push the score through the sigmoid. An extremely low probability, and it makes sense: a review with three #awfuls is just awful.

Now that we've explored the decision boundaries that come from a logistic regression model, and the notions of probability, let's explore a little bit more the effect of the coefficients that we learn from data, both on the boundary and on the confidence, or sureness, that we have in a particular prediction, that is, on the actual probabilities that we predict. So take a very simple model with just two features, the number of #awesomes and the number of #awfuls, and look at the coefficients of those features as well as the constant w0, and let's see what happens.

If the constant w0 is 0, then the probability of y = +1 given xi and w is given by the curve below. We see that if the number of #awesomes is exactly equal to the number of #awfuls, the two contributions cancel out, and the score of xi is exactly equal to 0. That gives you a predicted probability of 0.5, just like we saw. Now, if you have one more #awesome than you have #awfuls, that difference becomes 1, and your predicted probability of being a positive review, just from having that extra #awesome, is 0.73.

Now let's see what happens if you change the constant. If you change the constant w0 to -2 and keep everything else the same, you see that the curve has shifted to the right. So now I need the number of #awesomes to be two more than the number of #awfuls for the prediction to be 0.5, that is, for the probability that y = +1 given xi and w to be 50-50. So in this case, the #awfuls count a lot more heavily against a positive prediction, or at least that extra negative constant does.

Finally, if I keep w0 at 0 but increase the magnitude of the coefficients, I get the curve on the right, which is similar to the curve in the middle: if the difference between the two counts is 0, I still predict 0.5. However, the curve rises much, much more steeply. In other words, if you have just one more #awesome than you have #awfuls, you're going to say that the probability of y = +1 given xi and w is almost 1. So the bigger you make the coefficients in magnitude, the more quickly you become sure; and when you change the constant, you shift that curve to the left or to the right.
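Here is a small sketch of those three effects. With opposite-sign coefficients of equal magnitude on the two counts, the score depends only on the difference d = #awesomes - #awfuls. The constant -2 and the 0.5 and 0.73 readings come from the lecture; the per-count magnitudes 1.0 and 3.0 are illustrative assumptions, not values read off the slide.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def prob_positive(d, w0, w):
    # Score = w0 + w * (#awesomes - #awfuls) = w0 + w * d
    return sigmoid(w0 + w * d)

for d in range(-3, 4):
    middle  = prob_positive(d, w0=0.0,  w=1.0)  # curve in the middle
    shifted = prob_positive(d, w0=-2.0, w=1.0)  # constant -2: shifts right
    steeper = prob_positive(d, w0=0.0,  w=3.0)  # bigger magnitude: steeper
    print(f"d={d:+d}: middle={middle:.2f}, "
          f"shifted={shifted:.2f}, steeper={steeper:.2f}")
# d=0 -> middle 0.50; d=+1 -> middle 0.73; the shifted curve needs d=+2
# to reach 0.50; the steeper curve already jumps to 0.95 at d=+1.
```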
Now we can state our logistic regression learning problem. We have some training data, which has some features h. We have this ML model that says the probability that a review is positive is given by the sigmoid of the score w transpose h, which is 1 / (1 + e^(-w transpose h)). We're going to learn a w hat that fits the data really well. So next, we'll discuss the algorithmic foundations of how we fit that w hat from data.
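As a summary sketch of the model we'll be fitting, here it is in vector form. The formula is the one from the lecture; the feature layout [constant, #awesomes, #awfuls] and the names are assumptions for illustration, and learning w hat is the subject of the next part.

```python
import numpy as np

def predict_prob_positive(w, h):
    """P(y = +1 | x, w) = sigmoid(w^T h(x)) = 1 / (1 + exp(-w^T h(x)))."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, h)))

# Features: [constant, #awesomes, #awfuls], coefficients from the running example.
w_hat = np.array([0.0, 1.0, -1.5])
h_x   = np.array([1.0, 4.0, 2.0])   # the review with 4 #awesomes and 2 #awfuls
print(predict_prob_positive(w_hat, h_x))  # prints ~0.731
```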