The second practical issue we're going to talk about is how you measure convergence for stochastic gradient. If you have to look at all the data points to figure out whether you've converged, then the whole process is going to be pointless and meaningless. So we need to think about new techniques for measuring convergence. This, again, is going to be an optional section. It's very practical for those who actually want to use and implement a really practical stochastic gradient algorithm.

One way to think about it is, how do we actually make this plot? Here's a plot where stochastic gradient gets to the optimum before a single pass over the data, while gradient is taking 100 or more passes over the data. If, to get one point in this plot, I had to compute the whole likelihood over the entire data set, that would require me, for every little blue dot, to make a pass over the entire data set, which would make the whole thing really slow. And if I had to make a pass over the data set to compute the likelihood, I might as well just use full gradient and not have these noisy problems. So we need to rethink how we're going to measure convergence, how we're going to plot that we're making progress. And here there's a really, really simple trick, really easy, really beautiful, that we can do.

So I'm showing here the stochastic gradient ascent algorithm for logistic regression, the one that we've been using so far. I go data point by data point, and I compute the contribution to the gradient, which is this part, the thing I'm calling partial j. Now, if I wanted to compute the log likelihood of the data to see what my quality was, how well I'm doing, I'd have to compute the following for data point i. If the data point is a positive data point, I take the log of the probability that y = +1 given xi and the current parameters, the current coefficients. And if the data point is negative, yi = -1, then I need to take the log of the probability that yi = -1, which turns out to be log of 1 - P(y = +1). And here's the beautiful thing: the quantity I need to compute the likelihood for a data point is exactly the same as the quantity I needed to compute the gradient.
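To make that concrete, here is a minimal Python sketch of one stochastic gradient ascent update for logistic regression. It is not the course's reference code, and the function names are mine, but it shows the key point: the predicted probability P(y = +1 | xi, w) is computed once, drives the gradient contribution, and can then be reused for that data point's log likelihood contribution at no extra cost.

```python
import numpy as np

def sigmoid(score):
    """P(y = +1 | x, w) for the linear score w . x."""
    return 1.0 / (1.0 + np.exp(-score))

def stochastic_gradient_step(w, x_i, y_i, step_size):
    """One stochastic gradient ascent update, plus point i's likelihood term."""
    p_positive = sigmoid(np.dot(w, x_i))        # computed once for the gradient...

    # Gradient contribution of point i: (indicator[y_i = +1] - P(+1 | x_i, w)) * x_i
    indicator = 1.0 if y_i == +1 else 0.0
    w_new = w + step_size * (indicator - p_positive) * x_i

    # ...and reused, essentially for free, for point i's log likelihood contribution
    ll_i = np.log(p_positive) if y_i == +1 else np.log(1.0 - p_positive)
    return w_new, ll_i
```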
And so, since that probability is already computed as part of the gradient step, I can get the contribution to the likelihood of this one data point essentially for free. Which is great. I can't do it for every data point, but I can do it for one data point.

So at every iteration t, I can compute the likelihood of a particular data point. I can't use that by itself to measure convergence, because I might classify one data point perfectly but do badly on others, so it would be a very noisy measure. But if I want to compute how well I'm doing, let's say, after 75 iterations, what I can do is look at the last few data points and how well I did, take the likelihood for those data points, average it out, and then create a smoother curve.

And so, for every time stamp, I want to keep an average, called a moving average, of the last few likelihoods in order to measure convergence. And so for the plot here, the plot that I've been showing you, now I can tell you, truth in advertising, what it actually was. The blue line here was not straight-up stochastic gradient; it was mini-batches of size 100, which is a great number to use. It still converges much faster than gradient. And to draw that blue line, I averaged the likelihood over the last 30 data points.

So that's how I built the plot, and this is how you would have to build a plot if you're going to go through this whole process of stochastic gradient.
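To make the moving-average bookkeeping concrete, here is a minimal sketch of a convergence monitor, again with made-up names rather than anything from the course: it keeps the likelihood contributions from the last few iterations and returns their average, which is the smoothed value you would plot. The window of 30 just mirrors the number used for the blue line above; it is a tuning choice, not part of the algorithm.

```python
from collections import deque

def make_convergence_monitor(window=30):
    """Moving average of the last `window` per-iteration likelihood contributions."""
    recent = deque(maxlen=window)   # automatically drops values older than `window`

    def record(ll_i):
        recent.append(ll_i)
        return sum(recent) / len(recent)   # smoothed log likelihood for the plot

    return record

# Hypothetical usage, together with the update step sketched earlier:
# monitor = make_convergence_monitor(window=30)
# for t, (x_i, y_i) in enumerate(shuffled_data):
#     w, ll_i = stochastic_gradient_step(w, x_i, y_i, step_size)
#     smoothed = monitor(ll_i)    # plot (t, smoothed) to track convergence
```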