The second practical issue we're going to talk about is how you measure convergence for stochastic gradient. If you have to look at all the data points to figure out whether you've converged, then the whole process is going to be pointless and meaningless. So we need to think about new techniques for measuring convergence. This, again, is going to be an optional section. It's very practical for those who actually want to use and implement a really practical stochastic gradient algorithm.

One way to think about it is, how do we actually make this plot? Here's a plot where stochastic gradient gets to the optimum before a single pass over the data, while gradient is taking 100 or more passes over the data. If, to get one point in this plot, I had to compute the whole likelihood over the entire data set, that would require me, for every little blue dot, to make a pass over the entire data set, which would make the whole thing really slow. And if I had to make a pass over the data set to compute the likelihood, I might as well just use full gradient and not have these noisy problems. So we need to rethink how we're going to measure convergence, how we're going to plot that we're making progress. And here there's a really, really simple trick, really easy, really beautiful, that we can do.

So I'm showing here the stochastic gradient ascent algorithm for logistic regression, the one that we've been using so far. I go data point by data point, and I compute the contribution to the gradient, which is this part, the thing I'm calling partial j. Now, if I wanted to compute the log likelihood of the data to see what my quality was, how well I'm doing, I'd have to compute the following for data point i. If the data point is a positive data point, I take the log of the probability that y = +1 given xi and the current parameters, the current coefficients. And if the data point is negative, yi = -1, then I need to take the log of the probability that yi = -1, which turns out to be log of 1 - P(y = +1). And here's the beautiful thing: the quantity I need to compute the likelihood for a data point is exactly the same as the quantity I needed to compute the gradient.
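To make that concrete, here is a minimal Python sketch of one stochastic gradient ascent update for logistic regression. It is not the course's reference code, and the function names are mine, but it shows the key point: the predicted probability P(y = +1 | xi, w) is computed once, drives the gradient contribution, and can then be reused for that data point's log likelihood contribution at no extra cost.

```python
import numpy as np

def sigmoid(score):
    """P(y = +1 | x, w) for the linear score w . x."""
    return 1.0 / (1.0 + np.exp(-score))

def stochastic_gradient_step(w, x_i, y_i, step_size):
    """One stochastic gradient ascent update, plus point i's likelihood term."""
    p_positive = sigmoid(np.dot(w, x_i))        # computed once for the gradient...

    # Gradient contribution of point i: (indicator[y_i = +1] - P(+1 | x_i, w)) * x_i
    indicator = 1.0 if y_i == +1 else 0.0
    w_new = w + step_size * (indicator - p_positive) * x_i

    # ...and reused, essentially for free, for point i's log likelihood contribution
    ll_i = np.log(p_positive) if y_i == +1 else np.log(1.0 - p_positive)
    return w_new, ll_i
```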
And so, since that probability is already computed as part of the gradient step, I can get the contribution to the likelihood of this one data point essentially for free. Which is great. I can't do it for every data point, but I can do it for one data point.

So at every iteration t, I can compute the likelihood of a particular data point. I can't use that by itself to measure convergence, because I might classify one data point perfectly but do badly on others, so it would be a very noisy measure. But if I want to compute how well I'm doing, let's say, after 75 iterations, what I can do is look at the last few data points and how well I did, take the likelihood for those data points, average it out, and then create a smoother curve.

And so, for every time stamp, I want to keep an average, called a moving average, of the last few likelihoods in order to measure convergence. And so for the plot here, the plot that I've been showing you, now I can tell you, truth in advertising, what it actually was. The blue line here was not straight-up stochastic gradient; it was mini-batches of size 100, which is a great number to use. It still converges much faster than gradient. And to draw that blue line, I averaged the likelihood over the last 30 data points.

So that's how I built the plot, and this is how you would have to build a plot if you're going to go through this whole process of stochastic gradient.
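To make the moving-average bookkeeping concrete, here is a minimal sketch of a convergence monitor, again with made-up names rather than anything from the course: it keeps the likelihood contributions from the last few iterations and returns their average, which is the smoothed value you would plot. The window of 30 just mirrors the number used for the blue line above; it is a tuning choice, not part of the algorithm.

```python
from collections import deque

def make_convergence_monitor(window=30):
    """Moving average of the last `window` per-iteration likelihood contributions."""
    recent = deque(maxlen=window)   # automatically drops values older than `window`

    def record(ll_i):
        recent.append(ll_i)
        return sum(recent) / len(recent)   # smoothed log likelihood for the plot

    return record

# Hypothetical usage, together with the update step sketched earlier:
# monitor = make_convergence_monitor(window=30)
# for t, (x_i, y_i) in enumerate(shuffled_data):
#     w, ll_i = stochastic_gradient_step(w, x_i, y_i, step_size)
#     smoothed = monitor(ll_i)    # plot (t, smoothed) to track convergence
```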