In the next few videos, we'll talk about large scale machine learning.
That is, algorithms but viewing with big data sets.
If you look back at a recent 5 or 10-year history of machine learning.
One of the reasons that learning algorithms work so much better now than even say, 5-years ago,
is just the sheer amount of data that we have now and that we can train our algorithms on.
In these next few videos, we'll talk about algorithms for dealing when we have such massive data sets.
So why do we want to use such large data sets?
We've already seen that one of the best ways to get a high performance machine learning system,
is if you take a low-bias learning algorithm, and train that on a lot of data.
And so, one early example we have already seen was this example of classifying between confusable words.
So, for breakfast, I ate two (TWO) eggs and we saw in this example, these sorts of results,
where, you know, so long as you feed the algorithm a lot of data, it seems to do very well.
And so it's results like these that has led to the saying in machine learning that
often it's not who has the best algorithm that wins. It's who has the most data.
So you want to learn from large data sets, at least when we can get such large data sets.
But learning with large data sets comes with its own unique problems, specifically, computational problems.
Let's say your training set size is M equals 100,000,000.
And this is actually pretty realistic for many modern data sets.
If you look at the US Census data set, if there are, you know,
300 million people in the US, you can usually get hundreds of millions of records.
If you look at the amount of traffic that popular websites get,
you easily get training sets that are much larger than hundreds of millions of examples.
And let's say you want to train a linear regression model, or maybe a logistic regression model,
in which case this is the gradient descent rule.
And if you look at what you need to do to compute the gradient,
which is this term over here, then when M is a hundred million,
you need to carry out a summation over a hundred million terms,
in order to compute these derivatives terms and to perform a single step of decent.
Because of the computational expense of summing over a hundred million entries
in order to compute just one step of gradient descent,
in the next few videos we've spoken about techniques
for either replacing this with something else or to find more efficient ways to compute this derivative.
By the end of this sequence of videos on large scale machine learning,
you know how to fit models, linear regression, logistic regression, neural networks and so on
even today's data sets with, say, a hundred million examples.
Of course, before we put in the effort into training a model with a hundred million examples,
We should also ask ourselves, well, why not use just a thousand examples.
Maybe we can randomly pick the subsets of a thousand examples
out of a hundred million examples and train our algorithm on just a thousand examples.
So before investing the effort into actually developing and the software needed to train these massive models
is often a good sanity check, if training on just a thousand examples might do just as well.
The way to sanity check of using a much smaller training set might do just as well,
that is if using a much smaller n equals 1000 size training set,
that might do just as well, it is the usual method of plotting the learning curves,
so if you were to plot the learning curves and if your training objective were to look like this,
that's J train theta.
And if your cross-validation set objective, Jcv of theta would look like this,
then this looks like a high-variance learning algorithm,
and we will be more confident that adding extra training examples would improve performance.
Whereas in contrast if you were to plot the learning curves,
if your training objective were to look like this, and if your cross-validation objective were to look like that,
then this looks like the classical high-bias learning algorithm.
And in the latter case, you know, if you were to plot this up to,
say, m equals 1000 and so that is m equals 500 up to m equals 1000,
then it seems unlikely that increasing m to a hundred million will do much better
and then you'd be just fine sticking to n equals 1000,
rather than investing a lot of effort to figure out how the scale of the algorithm.
Of course, if you were in the situation shown by the figure on the right,
then one natural thing to do would be to add extra features,
or add extra hidden units to your neural network and so on,
so that you end up with a situation closer to that on the left, where maybe this is up to n equals 1000,
and this then gives you more confidence that trying to add infrastructure to change the algorithm
to use much more than a thousand examples that might actually be a good use of your time.
So in large-scale machine learning, we like to come up with computationally reasonable ways,
or computationally efficient ways, to deal with very big data sets.
In the next few videos, we'll see two main ideas.
The first is called stochastic gradient descent and the second is called Map Reduce, for viewing with very big data sets.
And after you've learned about these methods, hopefully that will allow you to scale up your learning algorithms to big data
and allow you to get much better performance on many different applications.