[MUSIC] It's no news to anybody that datasets have been getting larger over the last two decades, but what is interesting is the impact this has had on machine learning research and machine learning practice. If you look back 20 years ago, datasets were pretty small, so we ended up working on a lot of very complex models to squeeze the most out of that data, because we needed good accuracy with very little data. So we worked on things like kernels and graphical models; things were pretty complex. Ten years later, the scale of data started growing tremendously, and we entered what we call the big data era. It turns out that because it was so hard to scale machine learning algorithms to big datasets, and because there was so much data, we reverted back to simpler algorithms, to simpler days, using things like logistic regression and matrix factorization to address machine learning problems. There was even a lot of discussion about what impact this had on machine learning approaches, and there was a seminal paper about the unreasonable effectiveness of data. What it basically says is that if you have a massive dataset, you can use a very simple approach, like a linear classifier or logistic regression, and beat a fancier approach, like a graphical model, because the graphical model can only handle smaller datasets. So on bigger datasets, simple approaches can do extremely well.

Well, since then a lot of things have changed. Our data has gotten even bigger, and our ambitions have gotten bigger. We want even more accurate models on this even larger data, and so we're forced to come up with new kinds of algorithms that can scale complex models to huge datasets to get really amazing accuracy. This is where things like parallelism, using GPUs, using big compute clusters, and the type of technique we're going to talk about today are going to help us deal with complex models: things like boosted decision trees, tensor factorization, and deep learning. Deep learning requires a tremendous amount of computation time to get those amazing accuracy gains. And going back to the models of 20 years ago, people are again building these massive, massive graphical models, because they now have techniques to scale them in parallel and in distributed settings, to be able to get even more accuracy with bigger datasets. So the summary of the story is that machine learning has evolved over the last few years: first it went back to its simpler roots to be able to use large datasets, but today it is coming up with new algorithms that can scale complex models to massive datasets. That's where we are today.

So, going back to the same framework we'll be discussing throughout the course: we're taking some training data, we're extracting some features, and we're building a machine learning model, but the machine learning algorithm that we're going to use is a small modification of gradient ascent, a seemingly tiny change. It will allow us to scale to much bigger datasets, and it often performs really well. This modification is called stochastic gradient, and it does something extremely simple. It takes your massive dataset and your current parameters, w(t), and when computing the gradient, instead of looking at all the data, because you don't need to look at everything, it just looks at a small subset of the data. So it looks at a bit of data and then updates the coefficients to w(t+1).
Then it looks at a little bit more data and updates the coefficients to w(t+2). Then it looks at a little bit more data and updates to w(t+3), and then a little bit more data and updates the coefficients to w(t+4). So instead of making a massive pass over the whole dataset before making a single coefficient update, here we're just looking at little bits of data and updating the coefficients in an interleaved fashion. This small change is going to really change everything for us and allow us to scale to much bigger datasets, as we'll see today. [MUSIC]
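To make the idea concrete, here is a minimal sketch of stochastic gradient for logistic regression, not the course's reference implementation: the function name `stochastic_gradient_ascent`, the `step_size` and `batch_size` defaults, and the use of NumPy are all illustrative assumptions. The point it shows is that each update uses only a small batch of rows, so the coefficients move from w(t) to w(t+1) many times within a single pass over the data.

```python
import numpy as np

def sigmoid(score):
    # Logistic function; adequate for this illustration.
    return 1.0 / (1.0 + np.exp(-score))

def stochastic_gradient_ascent(X, y, step_size=0.1, batch_size=100, num_passes=5):
    """Illustrative stochastic gradient ascent for logistic regression.

    X: (N, D) feature matrix, y: (N,) labels in {0, 1}.
    Instead of using all N rows per update (batch gradient ascent),
    each update uses only `batch_size` rows.
    """
    N, D = X.shape
    w = np.zeros(D)                        # w(0): start at zero coefficients
    for _ in range(num_passes):
        order = np.random.permutation(N)   # shuffle the rows each pass
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            # Gradient of the log-likelihood on this small batch only:
            # sum over the batch of (y_i - P(y = 1 | x_i, w)) * x_i
            errors = y_b - sigmoid(X_b @ w)
            gradient = X_b.T @ errors
            w = w + step_size * gradient / len(idx)   # w(t) -> w(t+1)
    return w

if __name__ == "__main__":
    # Tiny synthetic example (made-up data, just to show usage).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 5))
    true_w = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
    y = (sigmoid(X @ true_w) > rng.uniform(size=10000)).astype(float)
    print("learned coefficients:", stochastic_gradient_ascent(X, y))
```

The only difference from ordinary (batch) gradient ascent is the inner loop over small batches; setting `batch_size` to N would recover one full-data gradient update per pass.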