Next, let's visualize the path that gradient takes, as opposed to stochastic gradient, what I call the convergence paths. As you will see, stochastic gradient oscillates a bit more, but it gets you close to that optimal solution.

In the black line, I'm showing the path of gradient, and you see that the path is very smooth and does very nicely. In the red line, I show you the path of stochastic gradient. You see that this is a noisier path. It does get us to the right solution, but one thing to note is that it doesn't converge and stop the way gradient does; it oscillates around the optimum. This is going to be one of the practical issues we address when we talk about how to get stochastic gradient to work in practice, and it's a significant issue.

Another view of stochastic gradient oscillating around the optimum can be seen in this plot, the one we've been using for quite a while. You see that gradient makes smooth progress, while stochastic gradient traces a noisy curve as it makes progress and, as it converges, oscillates around the optimum.

Let's summarize. Gradient ascent looks for the direction of greatest improvement, the steepest ascent direction, and it finds that direction by summing over all the data points. Stochastic gradient, on the other hand, looks for a direction that usually makes progress, for example by picking a single data point to estimate the gradient. On average it makes progress, and because of that it tends to converge much faster, but it is noisier near the optimum. Even in the simple example we've been using today, it converges over a hundred times faster than gradient, but it gets noisy in the end.
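The two update rules are only described in words above, so here is a minimal sketch of the comparison, assuming a logistic-regression setting like the one used throughout this module. The synthetic data, step size, and iteration count are illustrative choices of mine, not the course's code. Full-batch gradient ascent sums the gradient over every data point, while stochastic gradient ascent estimates it from one randomly chosen point, which is why its path keeps jittering around the optimum:

```python
# Sketch: full-batch gradient ascent vs. stochastic gradient ascent
# on logistic regression (illustrative, not the lecture's actual code).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data: N points, 2 features.
N = 1000
X = rng.normal(size=(N, 2))
true_w = np.array([1.5, -2.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gradient(w):
    # Gradient of the average log-likelihood: sums over ALL data points.
    return X.T @ (y - sigmoid(X @ w)) / N

def stochastic_gradient(w):
    # Unbiased estimate of the same gradient from ONE random data point.
    i = rng.integers(N)
    return X[i] * (y[i] - sigmoid(X[i] @ w))

step = 0.5
w_gd = np.zeros(2)
w_sgd = np.zeros(2)
path_gd, path_sgd = [w_gd.copy()], [w_sgd.copy()]

for t in range(500):
    w_gd = w_gd + step * full_gradient(w_gd)           # smooth path
    w_sgd = w_sgd + step * stochastic_gradient(w_sgd)  # noisy path
    path_gd.append(w_gd.copy())
    path_sgd.append(w_sgd.copy())

# Gradient ascent settles near the optimum; stochastic gradient keeps
# jittering around it. Compare the size of the final update step.
print("last GD step size: ", np.linalg.norm(path_gd[-1] - path_gd[-2]))
print("last SGD step size:", np.linalg.norm(path_sgd[-1] - path_sgd[-2]))
```

With a constant step size, the printed numbers make the point from the plots concrete: gradient ascent's updates shrink toward zero as it converges, while stochastic gradient's updates stay roughly the same size, which is exactly the oscillation around the optimum described above.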