[MUSIC] It's no news to anybody that datasets have been getting larger over the last two decades, but what is interesting is the impact this has had on machine learning research and machine learning practice. If you look back 20 years ago, datasets were pretty small, so we ended up working on a lot of very complex models to squeeze the most out of that data, because we needed good accuracy with very little data. So we worked on things like kernels and graphical models; things were pretty complex. Ten years later, the scale of data started growing tremendously, and we entered what we call the big data era. It turns out that because it was so hard to scale machine learning algorithms to big datasets, and because there was so much data, we reverted back to simpler algorithms, to simpler days, using things like logistic regression and matrix factorization to address machine learning problems. There was even a lot of discussion about what impact this had on machine learning approaches, and there was a seminal paper about the unreasonable effectiveness of data. What it basically says is that if you have a massive dataset, you can use a very simple approach, like a linear classifier or logistic regression, and beat a fancier approach, like a graphical model, because the graphical model can only handle smaller datasets. So on bigger datasets, simple approaches can do extremely well.

Well, since then a lot of things have changed. Our data has gotten even bigger, and our ambitions have gotten bigger. We want even more accurate models on this even larger data, and so we're forced to come up with new kinds of algorithms that can scale complex models to huge datasets to get really amazing accuracy. This is where things like parallelism, using GPUs, using big compute clusters, and the type of technique we're going to talk about today are going to help us deal with complex models: things like boosted decision trees, tensor factorization, and deep learning. Deep learning requires a tremendous amount of computation time to get those amazing accuracy gains. And going back to the models of 20 years ago, people are again building these massive, massive graphical models, because they now have techniques to scale them in parallel and in distributed settings, to be able to get even more accuracy with bigger datasets. So the summary of the story is that machine learning has evolved over the last few years: first it went back to its simpler roots to be able to use large datasets, but today it is coming up with new algorithms that can scale complex models to massive datasets. That's where we are today.

So, going back to the same framework we'll be discussing throughout the course: we're taking some training data, we're extracting some features, and we're building a machine learning model, but the machine learning algorithm that we're going to use is a small modification of gradient ascent, a seemingly tiny change. It will allow us to scale to much bigger datasets, and it often performs really well. This modification is called stochastic gradient, and it does something extremely simple. It takes your massive dataset and your current parameters, w(t), and when computing the gradient, instead of looking at all the data, because you don't need to look at everything, it just looks at a small subset of the data. So it looks at a bit of data and then updates the coefficients to w(t+1).
Then it looks at a little bit more data and updates the coefficients to w(t+2). Then it looks at a little bit more data and updates to w(t+3), and then a little bit more data and updates the coefficients to w(t+4). So instead of making a massive pass over the whole dataset before making a single coefficient update, here we're just looking at little bits of data and updating the coefficients in an interleaved fashion. This small change is going to really change everything for us and allow us to scale to much bigger datasets, as we'll see today. [MUSIC]
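To make the idea concrete, here is a minimal sketch of stochastic gradient for logistic regression, not the course's reference implementation: the function name `stochastic_gradient_ascent`, the `step_size` and `batch_size` defaults, and the use of NumPy are all illustrative assumptions. The point it shows is that each update uses only a small batch of rows, so the coefficients move from w(t) to w(t+1) many times within a single pass over the data.

```python
import numpy as np

def sigmoid(score):
    # Logistic function; adequate for this illustration.
    return 1.0 / (1.0 + np.exp(-score))

def stochastic_gradient_ascent(X, y, step_size=0.1, batch_size=100, num_passes=5):
    """Illustrative stochastic gradient ascent for logistic regression.

    X: (N, D) feature matrix, y: (N,) labels in {0, 1}.
    Instead of using all N rows per update (batch gradient ascent),
    each update uses only `batch_size` rows.
    """
    N, D = X.shape
    w = np.zeros(D)                        # w(0): start at zero coefficients
    for _ in range(num_passes):
        order = np.random.permutation(N)   # shuffle the rows each pass
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            # Gradient of the log-likelihood on this small batch only:
            # sum over the batch of (y_i - P(y = 1 | x_i, w)) * x_i
            errors = y_b - sigmoid(X_b @ w)
            gradient = X_b.T @ errors
            w = w + step_size * gradient / len(idx)   # w(t) -> w(t+1)
    return w

if __name__ == "__main__":
    # Tiny synthetic example (made-up data, just to show usage).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 5))
    true_w = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
    y = (sigmoid(X @ true_w) > rng.uniform(size=10000)).astype(float)
    print("learned coefficients:", stochastic_gradient_ascent(X, y))
```

The only difference from ordinary (batch) gradient ascent is the inner loop over small batches; setting `batch_size` to N would recover one full-data gradient update per pass.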