[MUSIC] In this module we're going to address a really important problem in machine learning today: how to scale the algorithms we've discussed to really large data sets. The ideas we discuss today are broadly applicable. They apply to all of classification, and they also apply to regression, the second course of this specialization. We're going to talk about a technique called stochastic gradient, and we'll relate it to something called online learning, where you learn from data streams in which one data point arrives at a time.

Let's take a moment to review how gradient ascent works in the context of classification, and how it's impacted by the size of the data set. Suppose we have a very large data set and a set of coefficients w(t) that we're hoping to update. With gradient ascent, we compute the gradient on this data set, which requires us to make a pass, or scan, over the data, computing the contribution of each data point to the gradient. Once we have the gradient, we update the coefficients and get w(t+1). Then we have to go back to the data set and make another full pass, visiting every single data point, to compute a new gradient and update the coefficients again, getting w(t+2). So in this process, every single coefficient update requires a full scan, or full pass, over the entire data set, which can be really slow if the data set is really big.

And these days data sets are getting huge. You can think about the 4.8 billion webpages out there on the web. You can think about the fact that Twitter, for example, is generating 500 million tweets a day. That's a lot. By the way, follow me on Twitter. [LAUGH] Or you can think about how the world is getting embedded with sensors, which is something we today call the Internet of Things, where you have devices throughout our homes and devices we carry with us, like the smart watch here, all connected to each other and generating tons and tons of data. And you can think about specific websites like YouTube, where 300 hours of video are uploaded every minute, far more than anyone could ever watch. This is a really fundamental problem for machine learning: how to tackle these huge, massive data sets.

Let's use YouTube as an example. We have tons of videos being uploaded, and a billion users visiting the website. YouTube makes money from ad revenue, by showing the right ad to each one of its users. Now, the number you should really be thinking about is not the 300 hours of video a minute or the 1 billion users, but the roughly 4 billion video views they serve every day. For each one of these views they have to serve ads: they have to figure out which ads to pair with those videos, and they have to go back and retrain their learning algorithm. In other words, they need a machine learning algorithm that can deal with these billions of events per day, and that is fast enough to predict which ad to show within milliseconds, as you're about to watch a video. [MUSIC]
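To make the contrast concrete, here is a minimal sketch in NumPy of the two update schemes mentioned above: full-batch gradient ascent for logistic regression, where every coefficient update requires a complete pass over the data, and a stochastic variant, where each data point triggers its own cheap update. The function names, step size, and synthetic data are illustrative assumptions, not the course's official implementation.

# Minimal sketch (assumed names and settings): full-batch vs. stochastic
# gradient ascent for logistic regression.
import numpy as np

def predict_probability(features, coefficients):
    # P(y = +1 | x, w) under the logistic model.
    scores = features.dot(coefficients)
    return 1.0 / (1.0 + np.exp(-scores))

def batch_gradient_ascent(features, labels, step_size, num_passes):
    # Every coefficient update requires a full pass over ALL data points.
    coefficients = np.zeros(features.shape[1])
    indicator = (labels == +1)
    for _ in range(num_passes):
        # Full scan: the gradient sums contributions from every data point.
        errors = indicator - predict_probability(features, coefficients)
        gradient = features.T.dot(errors)
        coefficients += step_size * gradient      # one update per full pass
    return coefficients

def stochastic_gradient_ascent(features, labels, step_size, num_passes):
    # One coefficient update per data point: many cheap, noisy updates.
    coefficients = np.zeros(features.shape[1])
    indicator = (labels == +1)
    n = features.shape[0]
    for _ in range(num_passes):
        for i in np.random.permutation(n):        # stream data one point at a time
            error_i = indicator[i] - predict_probability(features[i], coefficients)
            gradient_i = features[i] * error_i    # contribution of a single point
            coefficients += step_size * gradient_i  # update immediately
    return coefficients

# Tiny usage example on synthetic data (assumed, for illustration only).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # intercept + 2 features
    true_w = np.array([0.5, 2.0, -1.0])
    y = np.where(rng.random(200) < 1.0 / (1.0 + np.exp(-X.dot(true_w))), +1, -1)

    w_batch = batch_gradient_ascent(X, y, step_size=1e-2, num_passes=200)
    w_sgd = stochastic_gradient_ascent(X, y, step_size=1e-2, num_passes=10)
    print("batch:", w_batch)
    print("sgd:  ", w_sgd)

Under these assumed settings, the stochastic version typically reaches comparable coefficients after far fewer passes over the data, which is exactly why this style of update scales to the data set sizes discussed in this module.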