1
00:00:00,000 --> 00:00:04,728
[MUSIC]

2
00:00:04,728 --> 00:00:09,168
The simplest form of stochastic gradient ascent estimates

3
00:00:09,168 --> 00:00:12,240
the gradient from one data point at a time.

4
00:00:12,240 --> 00:00:13,790
So let's see what that looks like.

5
00:00:15,505 --> 00:00:18,600
We're back to our gradient ascent setting, where we are going to start

6
00:00:18,600 --> 00:00:20,540
from some point in space.

7
00:00:20,540 --> 00:00:24,840
We're going to compute the gradient of the likelihood function,

8
00:00:24,840 --> 00:00:27,920
take a little step in that direction, then compute the gradient again,

9
00:00:27,920 --> 00:00:31,220
take another little step, another little step, and another little step.

10
00:00:31,220 --> 00:00:35,710
And this is the algorithm on the right that we've been using

11
00:00:35,710 --> 00:00:38,410
throughout this module and before that.

12
00:00:38,410 --> 00:00:42,018
The question here is, how expensive is this algorithm?

13
00:00:42,018 --> 00:00:44,642
Let's look at the case of logistic regression,

14
00:00:44,642 --> 00:00:48,100
just to remind ourselves of how expensive that is.

15
00:00:48,100 --> 00:00:53,930
What we need to do here is sum a function over each data point.

16
00:00:53,930 --> 00:00:58,300
And you can think about it as the contribution to the gradient that the data

17
00:00:58,300 --> 00:01:03,890
point (xi, yi) has, and we're adding those up over each data point.

18
00:01:03,890 --> 00:01:09,610
And so, I'm going to denote here by this term,

19
00:01:09,610 --> 00:01:14,480
the partial derivative of the likelihood li with respect to

20
00:01:14,480 --> 00:01:19,250
the parameter wj, the contribution of the i-th data point to the gradient.

21
00:01:19,250 --> 00:01:23,870
And what we're doing here is summing up each one of these contributions

22
00:01:23,870 --> 00:01:27,650
in order to compute the final gradient that we're going to use for

23
00:01:27,650 --> 00:01:30,520
the final update of the parameters.

24
00:01:30,520 --> 00:01:34,850
So how computationally costly is this approach?

25
00:01:34,850 --> 00:01:37,420
So let's ask that question.

26
00:01:37,420 --> 00:01:41,450
So for example, if I were to say

27
00:01:41,450 --> 00:01:45,200
computing the contribution of one data point takes me one millisecond.

28
00:01:45,200 --> 00:01:47,360
It's really fast, it only takes me one millisecond.

29
00:01:47,360 --> 00:01:50,945
When I have 1,000 data points, then that's not too bad.

30
00:01:50,945 --> 00:01:54,250
1,000 milliseconds is one second.

31
00:01:55,960 --> 00:01:59,550
I don't feel too bad about that outcome.

32
00:01:59,550 --> 00:02:02,720
What if you have a much more complex model, like a neural network or

33
00:02:02,720 --> 00:02:07,700
something, where one update takes 1 second instead of 1 millisecond, and

34
00:02:07,700 --> 00:02:08,940
you have 1,000 data points?

35
00:02:10,420 --> 00:02:15,350
Now computing 1 step of gradient requires 1,000 seconds,

36
00:02:15,350 --> 00:02:19,020
which, if you can't do that in your head, I can't either,

37
00:02:19,020 --> 00:02:23,019
but I cheated, is 16.7 minutes.

38
00:02:23,019 --> 00:02:25,890
Now things are getting a little bit more complicated.

39
00:02:25,890 --> 00:02:30,070
So 16.7 minutes is a lot to wait, but

40
00:02:30,070 --> 00:02:34,850
a thousand data points is nothing.

41
00:02:34,850 --> 00:02:39,000
We said YouTube has five billion data points per day or more.

42
00:02:39,000 --> 00:02:42,650
So let's say that we're going back to the one millisecond case.
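
To make that per-step cost concrete, here is a minimal sketch of one full gradient ascent step for logistic regression, assuming NumPy, a feature matrix X of shape (N, D), 0/1 labels y, and parameters w (these names are illustrative, not from the lecture). Each data point contributes one term to the gradient sum, so every step must touch all N points:

```python
import numpy as np

def sigmoid(scores):
    return 1.0 / (1.0 + np.exp(-scores))

def full_gradient_step(X, y, w, step_size):
    """One full (batch) gradient ascent step for logistic regression.

    X: (N, D) feature matrix, y: (N,) labels in {0, 1}, w: (D,) parameters.
    """
    # Each data point i contributes x_i * (y_i - P(y = 1 | x_i, w)) to the
    # likelihood gradient; summing over all N points makes one step O(N * D).
    gradient = X.T @ (y - sigmoid(X @ w))
    return w + step_size * gradient
```

That sum over all N contributions is exactly what the timing examples below are pricing out.
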
43
00:02:42,650 --> 00:02:46,420
It only takes 1 millisecond to compute that gradient contribution, but

44
00:02:46,420 --> 00:02:50,280
you have 10 million data points, what happens?

45
00:02:50,280 --> 00:02:53,310
Well, the total time now becomes 2.8 hours.

46
00:02:57,060 --> 00:03:00,700
So things are starting to get nasty, but again,

47
00:03:00,700 --> 00:03:03,440
10 million data points is not that much.

48
00:03:03,440 --> 00:03:07,340
YouTube has five billion, ten billion, or more per day.

49
00:03:07,340 --> 00:03:09,800
So what happens if you have ten billion data points and

50
00:03:09,800 --> 00:03:13,248
it takes you just one millisecond to compute that gradient contribution?

51
00:03:13,248 --> 00:03:20,700
Then it's going to take us 10 billion milliseconds,

52
00:03:20,700 --> 00:03:27,050
which, I cheated here, is 115.7 days.

53
00:03:27,050 --> 00:03:31,904
So it's going to take about four months, approximately,

54
00:03:31,904 --> 00:03:35,210
just to compute one gradient.

55
00:03:35,210 --> 00:03:37,598
And if you remember from our implementation,

56
00:03:37,598 --> 00:03:40,861
you have to do many gradient updates before you converge.

57
00:03:40,861 --> 00:03:46,465
So this is definitely not going to work, you need to do something about it.

58
00:03:46,465 --> 00:03:47,360
[MUSIC]
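
All of the running times in this lecture come from one multiplication: the time per contribution times the number of data points. A quick sketch (hypothetical numbers taken from the examples above) that reproduces the quoted figures:

```python
# One full gradient costs (time per contribution) x (number of data points).
MS = 1e-3  # one millisecond, in seconds

cases = [
    (1_000, MS),           # 1,000 points at 1 ms each  -> 1 second
    (1_000, 1.0),          # 1,000 points at 1 s each   -> 16.7 minutes
    (10_000_000, MS),      # 10 million at 1 ms each    -> 2.8 hours
    (10_000_000_000, MS),  # 10 billion at 1 ms each    -> 115.7 days
]
for n_points, per_point in cases:
    total = n_points * per_point  # seconds for ONE gradient step
    print(f"{n_points:>14,} points: {total:,.0f} s "
          f"= {total / 3600:.1f} h = {total / 86400:.1f} days")
```

And as the lecture notes, these figures are the cost of a single gradient step; converging requires many such steps, which is why the full-gradient approach breaks down at this scale.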