[MUSIC] So we've made a small change to the gradient algorithm, but it's still a big change: we now look at the data points one at a time. Why should this even work? Why is this a good idea at all? Let's spend a little bit of time getting intuition for why it works, and this intuition is going to help us understand the behavior of the algorithm in practice.

In this picture, I'm showing you the landscape, the contour plot, for our sentiment analysis problem, which is the data we'll be using throughout our module today, but just for a subset of the data where we're only looking at two possible features: the coefficient for "awful" and the coefficient for "awesome". If we were to start here at this point, this would be the exact gradient that we'd compute. And that exact gradient gives you the best possible improvement in going from w^t to w^(t+1). That's the best thing you could possibly do.

Now, there are many other directions that also improve the value, improve the likelihood, the quality. So for example, if we were to take this other direction over here and reach some parameter w', that's still okay, because we're still going uphill: we're increasing the likelihood. In other words, the likelihood of w' is going to be greater than the likelihood of w^t. So in fact, any direction that takes us uphill will be good; the gradient is just the best direction.

The gradient direction is the sum of the contributions from each one of the data points. So if I look at that gradient direction, it is the sum over my data points of the contributions from each one of these x_i's. Each one of these red lines here is the contribution from a data point, from each (x_i, y_i), which is this part of the equation here.

Interestingly, most contributions point upwards. All of these over here are pointing in an upward direction, so if I were to pick any of those, I would make some progress. If I picked any of the other ones, like the ones back here, I would not make progress; those would be bad directions. But on average, most of them are good directions, and this is why stochastic gradient works.
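To make that picture concrete, here is a minimal sketch (plain NumPy on made-up toy data, not the course's own code or dataset) of the fact just described: the exact log-likelihood gradient for logistic regression is the sum of per-data-point contributions x_i * (y_i - P(y=1|x_i,w)), and most of those individual contributions point "uphill", i.e., have a positive dot product with the full gradient.

```python
import numpy as np

# Toy 2-feature logistic regression data (think: counts of "awesome" and "awful").
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # one row per data point
y = (X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=100) > 0).astype(float)

def per_point_gradients(w, X, y):
    """Contribution of each (x_i, y_i) to the log-likelihood gradient:
    x_i * (y_i - P(y = 1 | x_i, w))."""
    probs = 1.0 / (1.0 + np.exp(-X @ w))            # predicted P(y = 1 | x_i, w)
    return X * (y - probs)[:, None]                 # shape (N, 2), one row per point

w = np.zeros(2)                                     # current coefficients w^t
contribs = per_point_gradients(w, X, y)
full_gradient = contribs.sum(axis=0)                # exact gradient = sum of contributions

# How many single-point directions point "uphill", i.e. agree with the exact gradient?
uphill = int((contribs @ full_gradient > 0).sum())
print(f"{uphill} of {len(X)} single-point directions point uphill")
```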
In stochastic gradient, we're just going to pick one of these directions, and most of them are okay. So most of the time we're going to make progress. Sometimes, when we take a bad direction, we won't make progress; we'll make negative progress. But on average we're going to be making positive progress, and that's exactly why stochastic gradient works.

So if you think about the stochastic gradient algorithm, we go one data point at a time, pick the direction associated with that data point, and take a little step. In the first iteration we might pick this data point and make some progress; in the second, maybe this one, number 2 here, and make negative progress. But in the third one I pick another one that makes positive progress, and I pick a fourth one that makes positive progress. I pick a fifth one that doesn't, a sixth one that does, and so on. So most of the time we're making positive progress, and overall the likelihood is improving.

[MUSIC]
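As a rough illustration of that loop (again a sketch with made-up data and step size, not the course's implementation), here is stochastic gradient ascent that shuffles the data and then steps along one data point's contribution at a time; individual steps can lower the log likelihood, but across a pass over the data it climbs on average.

```python
import numpy as np

# Same toy 2-feature data as in the previous sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=100) > 0).astype(float)

def log_likelihood(w, X, y):
    """Data log likelihood of the logistic regression model."""
    scores = X @ w
    return np.sum(y * scores - np.log(1.0 + np.exp(scores)))

def stochastic_gradient_ascent(X, y, step_size=0.1, n_passes=5, seed=1):
    """Visit one data point at a time and step along its (noisy) gradient contribution."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for p in range(n_passes):
        for i in rng.permutation(len(X)):              # shuffle, then go point by point
            prob_i = 1.0 / (1.0 + np.exp(-X[i] @ w))   # P(y = 1 | x_i, w)
            w += step_size * X[i] * (y[i] - prob_i)    # noisy step; sometimes downhill
        print(f"pass {p + 1}: log likelihood = {log_likelihood(w, X, y):.2f}")
    return w

w_hat = stochastic_gradient_ascent(X, y)
```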