[MUSIC]

In this module we're going to address a really important problem with machine learning today: how to scale the algorithms we've discussed to really large data sets. The ideas we discuss today are going to be broadly applicable. They're going to be applicable for all of classification, and they're also going to be applicable for the regression course, the second course of the specialization. We're going to talk about a technique called stochastic gradient, and we'll relate it to something called online learning, where you're learning from data streams in which one piece of data arrives at a time.

Let's take a moment to review how gradient ascent works in the context of classification, and how it's impacted by the data set size. We have a very large data set here, and we have a set of coefficients w(t) which we're hoping to update. So we're going to use gradient ascent. We compute the gradient on this data set, which requires us to make a pass, or scan, over the data, computing the contribution of each one of these data points to the gradient. Then we take the gradient, go ahead and update the coefficients, and get w(t+1). Now we have to go back to the data set and make another pass, where we visit every single data point, compute a new gradient, and update the parameters, the coefficients, to get w(t+2). So in this process, every time we do a coefficient update we have to make a full scan, or a full pass, over the entire data set, which can be really slow if the data set is really big.

And these days data sets are getting huge. You can think about the 4.8 billion webpages that are out there on the web. You can think about the fact that Twitter, for example, is generating 500 million tweets a day. That's a lot. By the way, follow me on Twitter. [LAUGH] Or you can think about how the world is getting embedded with sensors, something we today call the Internet of Things, where you have devices throughout the home, devices we carry with us like the smartwatch here, and everything else connected to each other, generating tons and tons and tons of data.
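To make concrete why each of these full-pass updates is expensive on data sets like these, here is a minimal sketch of one gradient ascent update for logistic regression. It is written in Python with NumPy; the function and variable names are illustrative assumptions, not code from the course.

```python
import numpy as np

def predict_probability(X, w):
    """P(y = +1 | x, w) under a logistic regression model."""
    return 1.0 / (1.0 + np.exp(-X.dot(w)))

def full_pass_gradient_ascent_step(X, y, w, step_size):
    """One coefficient update w(t) -> w(t+1) using the entire data set.

    Every call scans all N rows of X, so each update costs a full
    pass over the data.
    """
    indicator = (y == +1).astype(float)           # 1[y_i = +1]
    errors = indicator - predict_probability(X, w)
    gradient = X.T.dot(errors)                    # sum of per-point contributions
    return w + step_size * gradient
```

Because the gradient sums a contribution from every data point, the cost of a single coefficient update grows linearly with the size of the data set.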
And you can think about specific websites like YouTube, where 300 hours of video are uploaded every minute, and nobody watches. And this is a really fundamental problem for machine learning: how to tackle these huge, massive data sets.

Let's use YouTube as an example. We have tons of videos being uploaded, and there's a billion users visiting the website. And YouTube makes money out of ad revenue, out of showing the right ad to each one of its users. Now, the number you should be thinking about is not the 300 hours of video a minute or the 1 billion users, but the 4 billion page views, or video views, that they have every day. For each one of those page views they have to serve ads, they have to figure out what ads to put with those videos, and they have to go back and retrain their learning algorithm. In other words, they need the machine learning algorithm that figures out what ad to show to deal with 5 billion events per day, and to be fast enough that it can make predictions about what ad to show within milliseconds, as you're about to watch those videos.

[MUSIC]
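As a preview of where the module is headed, here is a sketch of the kind of per-example update that stochastic gradient and online learning build on. Again this is illustrative Python/NumPy under my own naming assumptions, not the course's own code.

```python
def online_gradient_ascent_step(x_i, y_i, w, step_size):
    """Update the coefficients from a single arriving example (x_i, y_i).

    Each event triggers one cheap update (O(D) work for D features),
    with no full pass over the data set between updates.
    """
    indicator = 1.0 if y_i == +1 else 0.0
    error = indicator - 1.0 / (1.0 + np.exp(-np.dot(x_i, w)))
    return w + step_size * error * x_i

# Streaming usage (event_stream is a hypothetical source of labeled events):
# for x_i, y_i in event_stream:
#     w = online_gradient_ascent_step(x_i, y_i, w, step_size=0.1)
```

Because each arriving event touches only one data point, the cost of an update no longer grows with the size of the data set.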