The next few sections of this module are going to talk about more practical issues with stochastic gradient, which you need to address for implementing those algorithms. I made the next few sections optional because it can get really tedious to go through all these practical details. You've already gotten a sense of how finicky it can be, and there are many other practical details, so I'm going to make those sections optional.

The first issue is that learning from one data point is just way too noisy; usually you use a few data points, and this is called mini-batches. So far we've really illustrated two extremes. We illustrated gradient ascent, where you make a full pass through the data and use N data points for every update of your coefficients. And then we talked about stochastic gradient, where you look at just one data point when you make an update to the coefficients. The question is, is there something in between, where you look at B data points, say 100? That's called mini-batches: it reduces noise, increases stability, and it's the right thing to do. And 100 is a really good number to use, by the way.

Here I'm showing the convergence paths for the same problem we've been looking at, but comparing a batch size of 1 with a batch size of 25. You observe two things. First, the batch size of 25 makes the convergence path smoother, which is a good thing. The second thing to observe, which is even more interesting, is what happens around the optimum: with a batch size of 1, the path really oscillates around the optimum, while with a batch size of 25 it oscillates much less, so it has better behavior around the optimum. And by better behavior, I mean it's going to be much easier to use this approach in practice. So mini-batches are a great thing to do.

Now we've introduced one more parameter to be tuned in this stochastic gradient algorithm: the batch size B. If it's too large, it behaves just like gradient ascent; for example, if you use a batch size of N, it's exactly the gradient ascent algorithm. So in this case, the red line here is a batch size that's too large. If the batch size is too small, you get bad oscillation, or bad behavior: B too small, in this case, doesn't converge very well.
But if you pick the best batch size B, you get very nice behavior: you quickly get to a great solution, and you stay around it. So picking the right batch size makes a big difference.

Let's go back to the simple stochastic gradient algorithm and modify it to introduce the notion of batch sizes. Instead of looking at one data point at a time, we're going to look at one batch at a time. If the batch size is B, we have N over B batches for a data set of size N. So if we have 1 billion data points and a batch size of 100, there are 1 billion over 100 of those batches. Now we go batch by batch, but instead of considering one data point at a time in the computation of the gradient, we consider the B data points in that mini-batch. The equation here just shows you the math behind basically the obvious thing of taking 100 data points and using them to estimate the gradient, instead of 1 or instead of 1 billion.
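To make the batch-by-batch loop concrete, here is a minimal sketch of mini-batch stochastic gradient ascent for a logistic regression log-likelihood, in the spirit of the update described above. The function name and the default values for step_size, batch_size, and num_passes are illustrative assumptions, not the course's reference code.

```python
import numpy as np

def minibatch_gradient_ascent(X, y, batch_size=100, step_size=1e-4, num_passes=10):
    """Mini-batch stochastic gradient ascent sketch for logistic regression.

    X: (N, D) feature matrix; y: (N,) labels in {-1, +1}.
    Each coefficient update uses B = batch_size data points,
    instead of 1 (plain stochastic gradient) or all N (full gradient ascent).
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_passes):
        # Shuffle once per pass so the same batches are not reused in the same order.
        order = np.random.permutation(N)
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            # P(y = +1 | x, w) for the points in this mini-batch.
            scores = X[batch].dot(w)
            prob_pos = 1.0 / (1.0 + np.exp(-scores))
            indicator = (y[batch] == +1).astype(float)
            # Sum of per-point likelihood gradients over the B points in the batch.
            gradient = X[batch].T.dot(indicator - prob_pos)
            w += step_size * gradient
    return w
```

Note that setting batch_size=1 recovers the one-point stochastic gradient update, and batch_size=N recovers full gradient ascent, which is exactly the interpolation between the two extremes discussed above.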