1
00:00:00,493 --> 00:00:03,492
You now know about the stochastic gradient descent algorithm.

2
00:00:03,492 --> 00:00:09,907
But when you're running the algorithm, how do you make sure that it's completely debugged and is converging okay?

3
00:00:09,907 --> 00:00:15,813
Equally important, how do you tune the learning rate alpha with Stochastic Gradient Descent.

4
00:00:15,813 --> 00:00:25,950
In this video we'll talk about some techniques for doing these things, for making sure it's converging and for picking the learning rate alpha.

5
00:00:25,950 --> 00:00:30,600
Back when we were using batch gradient descent, our standard way for making sure that

6
00:00:30,600 --> 00:00:36,493
gradient descent was converging was we would plot the optimization cost function as a function of the number of iterations.

7
00:00:36,493 --> 00:00:44,366
So that was the cost function and we would make sure that this cost function is decreasing on every iteration.

8
00:00:44,366 --> 00:00:50,438
When the training set sizes were small, we could do that because we could compute the sum pretty efficiently.

9
00:00:50,438 --> 00:00:57,950
But when you have a massive training set size then you don't want to have to pause your algorithm periodically.

10
00:00:57,950 --> 00:01:04,045
You don't want to have to pause stochastic gradient descent periodically in order to compute this cost function

11
00:01:04,045 --> 00:01:07,442
since it requires a sum of your entire training set size.

12
00:01:07,442 --> 00:01:12,466
And the whole point of stochastic gradient was that you wanted to start to make progress

13
00:01:12,466 --> 00:01:19,130
after looking at just a single example without needing to occasionally scan through your entire training set

14
00:01:19,130 --> 00:01:25,583
right in the middle of the algorithm, just to compute things like the cost function of the entire training set.

15
00:01:25,583 --> 00:01:32,472
So for stochastic gradient descent, in order to check the algorithm is converging, here's what we can do instead.

16
00:01:32,472 --> 00:01:36,367
Let's take the definition of the cost that we had previously.

17
00:01:36,367 --> 00:01:42,647
So the cost of the parameters theta with respect to a single training example is just one half of the square error on that training example.

18
00:01:42,647 --> 00:01:49,754
Then, while stochastic gradient descent is learning, right before we train on a specific example.

19
00:01:49,754 --> 00:01:54,601
So, in stochastic gradient descent we're going to look at the examples xi, yi, in order, and

20
00:01:54,601 --> 00:01:57,329
then sort of take a little update with respect to this example.

21
00:01:57,329 --> 00:02:04,095
And we go on to the next example, xi plus 1, yi plus 1, and so on, right?

22
00:02:04,095 --> 00:02:05,880
That's what stochastic gradient descent does.

23
00:02:05,880 --> 00:02:15,024
So, while the algorithm is looking at the example xi, yi, but before it has updated the parameters theta

24
00:02:15,024 --> 00:02:20,255
using that an example, let's compute the cost of that example.

25
00:02:20,255 --> 00:02:23,577
Just to say the same thing again, but using slightly different words.

26
00:02:23,577 --> 00:02:33,294
A stochastic gradient descent is scanning through our training set right before we have updated theta using a specific training example x(i) comma y(i)

27
00:02:33,294 --> 00:02:38,198
let's compute how well our hypothesis is doing on that training example.

28
00:02:38,198 --> 00:02:43,852
And we want to do this before updating theta because if we've just updated theta using example,

29
00:02:43,852 --> 00:02:49,061
you know, that it might be doing better on that example than what would be representative.

30
00:02:49,061 --> 00:02:57,438
Finally, in order to check for the convergence of  stochastic gradient descent, what we can do is every, say, every thousand iterations,

31
00:02:57,438 --> 00:03:01,511
we can plot these costs that we've been computing in the previous step.

32
00:03:01,511 --> 00:03:07,450
We can plot those costs average over, say, the last thousand examples processed by the algorithm.

33
00:03:07,450 --> 00:03:12,714
And if you do this, it kind of gives you a running estimate of how well the algorithm is doing.

34
00:03:12,714 --> 00:03:17,049
on, you know, the last 1000 training examples that your algorithm has seen.

35
00:03:17,049 --> 00:03:23,974
So, in contrast to computing J<u>train periodically which needed to scan through the entire training set.</u>

36
00:03:23,974 --> 00:03:27,973
With this other procedure, well, as part of stochastic gradient descent,

37
00:03:27,973 --> 00:03:32,965
it doesn't cost much to compute these costs as well right before updating to parameter theta.

38
00:03:32,965 --> 00:03:40,276
And all we're doing is every thousand integrations or so, we just average the last 1,000 costs that we computed and plot that.

39
00:03:40,276 --> 00:03:47,537
And by looking at those plots, this will allow us to check if stochastic gradient descent is converging.

40
00:03:47,537 --> 00:03:51,708
So here are a few examples of what these plots might look like.

41
00:03:51,708 --> 00:03:55,519
Suppose you have plotted the cost average over the last thousand examples,

42
00:03:55,519 --> 00:04:01,073
because these are averaged over just a thousand examples, they are going to be a little bit noisy and so,

43
00:04:01,073 --> 00:04:03,873
it may not decrease on every single iteration.

44
00:04:03,873 --> 00:04:07,828
Then if you get a figure that looks like this, So the plot is noisy

45
00:04:07,828 --> 00:04:11,721
because it's average over, you know, just a small subset, say a thousand training examples.

46
00:04:11,721 --> 00:04:17,283
If you get a figure that looks like this, you know that would be a pretty decent run with the algorithm,

47
00:04:17,283 --> 00:04:24,195
maybe, where it looks like the cost has gone down and then this plateau that looks kind of flattened out, you know, starting from around that point.

48
00:04:24,195 --> 00:04:29,603
look like, this is what your cost looks like then maybe your learning algorithm has converged.

49
00:04:29,603 --> 00:04:34,252
If you want to try using a smaller learning rate, something you might see is that

50
00:04:34,252 --> 00:04:39,229
the algorithm may initially learn more slowly so the cost goes down more slowly.

51
00:04:39,229 --> 00:04:47,585
But then eventually you have a smaller learning rate is actually possible for the algorithm to end up at a, maybe very slightly better solution.

52
00:04:47,585 --> 00:04:53,426
So the red line may represent the behavior of stochastic gradient descent using a slower, using a smaller leaning rate.

53
00:04:53,426 --> 00:05:00,594
And the reason this is the case is because, you remember, stochastic gradient descent doesn't just converge to the global minimum,

54
00:05:00,594 --> 00:05:05,068
is that what it does is the parameters will oscillate a bit around the global minimum.

55
00:05:05,068 --> 00:05:09,231
And so by using a smaller learning rate, you'll end up with smaller oscillations.

56
00:05:09,231 --> 00:05:12,896
And sometimes this little difference will be negligible

57
00:05:12,896 --> 00:05:19,686
and sometimes with a smaller than you can get a slightly better value for the parameters.

58
00:05:19,686 --> 00:05:22,269
Here are some other things that might happen.

59
00:05:22,269 --> 00:05:27,986
Let's say you run stochastic gradient descent and you average over a thousand examples when plotting these costs.

60
00:05:27,986 --> 00:05:32,369
So, you know, here might be the result of another one of these plots.

61
00:05:32,369 --> 00:05:34,353
Then again, it kind of looks like it's converged.

62
00:05:34,353 --> 00:05:42,119
If you were to take this number, a thousand, and increase to averaging over 5 thousand examples.

63
00:05:42,119 --> 00:05:47,913
Then it's possible that you might get a smoother curve that looks more like this.

64
00:05:47,913 --> 00:05:56,547
And by averaging over, say 5,000 examples instead of 1,000, you might be able to get a smoother curve like this.

65
00:05:56,547 --> 00:06:00,248
And so that's the effect of increasing the number of examples you average over.

66
00:06:00,248 --> 00:06:06,229
The disadvantage of making this too big of course is that now you get one date point only every 5,000 examples.

67
00:06:06,229 --> 00:06:12,001
And so the feedback you get on how well your learning learning algorithm is doing is, sort of, maybe it's more delayed

68
00:06:12,001 --> 00:06:17,681
because you get one data point on your plot only every 5,000 examples rather than every 1,000 examples.

69
00:06:17,681 --> 00:06:23,911
Along a similar vein some times you may run a gradient descent and end up with a plot that looks like this.

70
00:06:23,911 --> 00:06:32,079
And with a plot that looks like this, you know, it looks like the cost just is not decreasing at all.

71
00:06:32,079 --> 00:06:34,023
It looks like the algorithm is just not learning.

72
00:06:34,023 --> 00:06:39,261
It's just, looks like this here a flat curve and the cost is just not decreasing.

73
00:06:39,261 --> 00:06:46,260
But again if you were to increase this to averaging over a larger number of examples

74
00:06:46,260 --> 00:06:49,729
it is possible that you see something like this red line

75
00:06:49,729 --> 00:06:55,127
it looks like the cost actually is decreasing, it's just that the blue line averaging over 2, 3 examples,

76
00:06:55,127 --> 00:07:01,374
the blue line was too noisy so you couldn't see the actual trend in the cost actually decreasing

77
00:07:01,374 --> 00:07:06,688
and possibly averaging over 5,000 examples instead of 1,000 may help.

78
00:07:06,688 --> 00:07:12,358
Of course we averaged over a larger number examples that we've averaged here over 5,000 examples,

79
00:07:12,358 --> 00:07:16,998
I'm just using a different color, it is also possible that you that see a learning curve ends up looking like this.

80
00:07:16,998 --> 00:07:21,197
That it's still flat even when you average over a larger number of examples.

81
00:07:21,197 --> 00:07:25,908
And as you get that, then that's maybe just a more firm verification that

82
00:07:25,908 --> 00:07:29,287
unfortunately the algorithm just isn't learning much for whatever reason.

83
00:07:29,287 --> 00:07:34,969
And you need to either change the learning rate or change the features or change something else about the algorithm.

84
00:07:34,969 --> 00:07:39,235
Finally, one last thing that you might see would be if you were to plot these curves

85
00:07:39,235 --> 00:07:43,273
and you see a curve that looks like this, where it actually looks like it's increasing.

86
00:07:43,273 --> 00:07:48,066
And if that's the case then this is a sign that the algorithm is diverging.

87
00:07:48,066 --> 00:07:53,965
And what you really should do is use a smaller value of the learning rate alpha.

88
00:07:53,965 --> 00:07:58,143
So hopefully this gives you a sense of the range of phenomena you might see

89
00:07:58,143 --> 00:08:02,946
when you plot these cost average over some range of examples as well as

90
00:08:02,946 --> 00:08:07,765
suggests the sorts of things you might try to do in response to seeing different plots.

91
00:08:07,765 --> 00:08:15,070
So if the plots looks too noisy, or if it wiggles up and down too much, then try increasing the number of examples

92
00:08:15,070 --> 00:08:18,734
you're averaging over so you can see the overall trend in the plot better.

93
00:08:18,734 --> 00:08:25,836
And if you see that the errors are actually increasing, the costs are actually increasing, try using a smaller value of alpha.

94
00:08:25,836 --> 00:08:31,649
Finally, it's worth examining the issue of the learning rate just a little bit more.

95
00:08:31,649 --> 00:08:38,922
We saw that when we run stochastic gradient descent, the algorithm will start here and sort of meander towards the minimum

96
00:08:38,922 --> 00:08:43,494
And then it won't really converge, and instead it'll wander around the minimum forever.

97
00:08:43,494 --> 00:08:50,225
And so you end up with a parameter value that is hopefully close to the global minimum that won't be exact at the global minimum.

98
00:08:50,225 --> 00:08:57,991
In most typical implementations of stochastic gradient descent, the learning rate alpha is typically held constant.

99
00:08:57,991 --> 00:09:02,022
And so what you we end up is exactly a picture like this.

100
00:09:02,022 --> 00:09:06,523
If you want stochastic gradient descent to actually converge to the global minimum,

101
00:09:06,523 --> 00:09:11,825
there's one thing which you can do which is you can slowly decrease the learning rate alpha over time.

102
00:09:11,825 --> 00:09:22,240
So, a pretty typical way of doing that would be to set alpha equals some constant 1 divided by iteration number plus constant 2.

103
00:09:22,240 --> 00:09:28,169
So, iteration number is the number of iterations you've run of stochastic gradient descent,

104
00:09:28,169 --> 00:09:29,519
so it's really the number of training examples you've seen

105
00:09:29,519 --> 00:09:34,103
And const 1 and const 2 are additional parameters of the algorithm

106
00:09:34,103 --> 00:09:38,160
that you might have to play with a bit in order to get good performance.

107
00:09:38,160 --> 00:09:43,004
One of the reasons people tend not to do this is because you end up needing to spend time

108
00:09:43,004 --> 00:09:48,122
playing with these 2 extra parameters, constant 1 and constant 2, and so this makes the algorithm more finicky.

109
00:09:48,122 --> 00:09:52,113
You know, it's just more parameters able to fiddle with in order to make the algorithm work well.

110
00:09:52,113 --> 00:09:57,246
But if you manage to tune the parameters well, then the picture you can get is that

111
00:09:57,246 --> 00:10:02,834
the algorithm will actually around towards the minimum, but as it gets closer

112
00:10:02,834 --> 00:10:07,024
because you're decreasing the learning rate the meanderings will get smaller and smaller

113
00:10:07,024 --> 00:10:12,729
until it pretty much just to the global minimum. I hope this makes sense, right?

114
00:10:12,729 --> 00:10:21,608
And the reason this formula makes sense is because as the algorithm runs, the iteration number becomes large So alpha will slowly become small,

115
00:10:21,608 --> 00:10:27,506
and so you take smaller and smaller steps until it hopefully converges to the global minimum.

116
00:10:27,506 --> 00:10:33,484
So If you do slowly decrease alpha to zero you can end up with a slightly better hypothesis.

117
00:10:33,484 --> 00:10:40,078
But because of the extra work needed to fiddle with the constants and because frankly usually we're pretty happy

118
00:10:40,078 --> 00:10:43,892
with any parameter value that is, you know, pretty close to the global minimum.

119
00:10:43,892 --> 00:10:50,863
Typically this process of decreasing alpha slowly is usually not done and keeping the learning rate alpha constant

120
00:10:50,863 --> 00:10:56,983
is the more common application of stochastic gradient descent although you will see people use either version.

121
00:10:56,983 --> 00:11:03,595
To summarize in this video we talk about a way for approximately monitoring

122
00:11:03,595 --> 00:11:08,256
how the stochastic gradient descent is doing in terms for optimizing the cost function.

123
00:11:08,256 --> 00:11:17,043
And this is a method that does not require scanning over the entire training set periodically to compute the cost function on the entire training set.

124
00:11:17,043 --> 00:11:20,693
But instead it looks at say only the last thousand examples or so.

125
00:11:20,693 --> 00:11:27,592
And you can use this method both to make sure the stochastic gradient descent is okay and is converging

126
00:11:27,592 --> 00:11:31,468
or to use it to tune the learning rate alpha.