1 00:00:00,332 --> 00:00:04,284 In the next few videos, we'll talk about large scale machine learning. 2 00:00:04,284 --> 00:00:08,316 That is, algorithms for dealing with big data sets. 3 00:00:08,316 --> 00:00:12,839 If you look back at the recent 5 or 10-year history of machine learning, 4 00:00:12,839 --> 00:00:17,853 one of the reasons that learning algorithms work so much better now than even, say, 5 years ago 5 00:00:17,853 --> 00:00:22,657 is just the sheer amount of data that we now have and can train our algorithms on. 6 00:00:22,657 --> 00:00:29,741 In these next few videos, we'll talk about algorithms for dealing with such massive data sets. 7 00:00:32,926 --> 00:00:35,527 So why do we want to use such large data sets? 8 00:00:35,527 --> 00:00:40,564 We've already seen that one of the best ways to get a high-performance machine learning system 9 00:00:40,564 --> 00:00:46,168 is to take a low-bias learning algorithm and train it on a lot of data. 10 00:00:46,168 --> 00:00:53,561 And so, one early example we have already seen was this example of classifying between confusable words. 11 00:00:53,561 --> 00:01:00,726 So, "For breakfast I ate two (TWO) eggs," and we saw in this example these sorts of results, 12 00:01:00,726 --> 00:01:06,436 where, you know, so long as you feed the algorithm a lot of data, it seems to do very well. 13 00:01:06,436 --> 00:01:10,419 And so it's results like these that have led to the saying in machine learning that 14 00:01:10,419 --> 00:01:15,151 often it's not who has the best algorithm that wins. It's who has the most data. 15 00:01:15,151 --> 00:01:19,568 So we want to learn from large data sets, at least when we can get such large data sets. 16 00:01:19,568 --> 00:01:27,027 But learning with large data sets comes with its own unique problems, specifically, computational problems. 17 00:01:27,027 --> 00:01:33,870 Let's say your training set size is m equals 100,000,000. 18 00:01:33,870 --> 00:01:37,934 And this is actually pretty realistic for many modern data sets. 19 00:01:37,934 --> 00:01:40,518 If you look at the US Census data set, if there are, you know, 20 00:01:40,518 --> 00:01:44,663 300 million people in the US, you can usually get hundreds of millions of records. 21 00:01:44,663 --> 00:01:47,856 If you look at the amount of traffic that popular websites get, 22 00:01:47,856 --> 00:01:52,509 you can easily get training sets that are much larger than hundreds of millions of examples. 23 00:01:52,509 --> 00:01:57,407 And let's say you want to train a linear regression model, or maybe a logistic regression model, 24 00:01:57,407 --> 00:02:01,692 in which case this is the gradient descent rule. 25 00:02:01,692 --> 00:02:05,372 And if you look at what you need to do to compute the gradient, 26 00:02:05,372 --> 00:02:09,992 which is this term over here, then when m is a hundred million, 27 00:02:09,992 --> 00:02:13,976 you need to carry out a summation over a hundred million terms 28 00:02:13,976 --> 00:02:18,977 in order to compute these derivative terms and to perform a single step of gradient descent. 29 00:02:18,977 --> 00:02:25,627 Because of the computational expense of summing over a hundred million entries 30 00:02:25,627 --> 00:02:28,628 in order to compute just one step of gradient descent, 31 00:02:28,628 --> 00:02:31,530 in the next few videos we'll talk about techniques 32 00:02:31,530 --> 00:02:38,413 for either replacing this with something else or finding more efficient ways to compute this derivative.
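For reference, since the slide itself isn't reproduced in this transcript, the update being described is the standard batch gradient descent rule for linear regression, sketched here in the course's usual notation; the summation is the derivative term the lecture points to:

```latex
% Batch gradient descent update (linear regression), repeated for every
% parameter theta_j on each step. With m = 100,000,000, the sum below runs
% over every training example for a single step of descent.
\theta_j := \theta_j \;-\; \alpha \,\frac{1}{m} \sum_{i=1}^{m}
  \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)\, x_j^{(i)}
```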
33 00:02:38,413 --> 00:02:41,709 By the end of this sequence of videos on large scale machine learning, 34 00:02:41,709 --> 00:02:47,045 you'll know how to fit models such as linear regression, logistic regression, neural networks, and so on, 35 00:02:47,045 --> 00:02:50,990 even to today's data sets with, say, a hundred million examples. 36 00:02:50,990 --> 00:02:56,035 Of course, before we put in the effort of training a model on a hundred million examples, 37 00:02:56,035 --> 00:03:01,276 we should also ask ourselves, well, why not use just a thousand examples? 38 00:03:01,276 --> 00:03:04,923 Maybe we can randomly pick a subset of a thousand examples 39 00:03:04,923 --> 00:03:10,254 out of a hundred million examples and train our algorithm on just that thousand. 40 00:03:10,254 --> 00:03:16,076 So before investing the effort into actually developing the software needed to train these massive models, 41 00:03:16,076 --> 00:03:22,461 it's often a good sanity check to ask whether training on just a thousand examples might do just as well. 42 00:03:22,461 --> 00:03:29,731 The way to sanity-check whether using a much smaller training set might do just as well, 43 00:03:29,731 --> 00:03:33,958 that is, using a much smaller training set of size m equals 1,000, 44 00:03:33,958 --> 00:03:37,797 is the usual method of plotting the learning curves. 45 00:03:37,797 --> 00:03:46,872 So if you were to plot the learning curves and your training objective were to look like this, 46 00:03:46,872 --> 00:03:49,553 that's J train of theta, 47 00:03:49,553 --> 00:03:56,422 and your cross-validation set objective, J cv of theta, were to look like this, 48 00:03:56,422 --> 00:04:00,310 then this looks like a high-variance learning algorithm, 49 00:04:00,310 --> 00:04:05,913 and we would be more confident that adding extra training examples would improve performance. 50 00:04:05,913 --> 00:04:10,462 Whereas, in contrast, if you were to plot the learning curves 51 00:04:10,462 --> 00:04:20,339 and your training objective were to look like this, and your cross-validation objective were to look like that, 52 00:04:20,339 --> 00:04:24,292 then this looks like the classical high-bias learning algorithm. 53 00:04:24,292 --> 00:04:28,084 And in the latter case, you know, if you were to plot this up to, 54 00:04:28,084 --> 00:04:33,437 say, m equals 1,000, so that is m equals 500 up to m equals 1,000, 55 00:04:33,437 --> 00:04:39,400 then it seems unlikely that increasing m to a hundred million will do much better, 56 00:04:39,400 --> 00:04:42,736 and then you'd be just fine sticking to m equals 1,000 57 00:04:42,736 --> 00:04:47,000 rather than investing a lot of effort to figure out how to scale up the algorithm. 58 00:04:47,000 --> 00:04:51,029 Of course, if you were in the situation shown by the figure on the right, 59 00:04:51,029 --> 00:04:53,885 then one natural thing to do would be to add extra features, 60 00:04:53,885 --> 00:04:58,484 or add extra hidden units to your neural network, and so on, 61 00:04:58,484 --> 00:05:04,627 so that you end up with a situation closer to that on the left, where maybe this is up to m equals 1,000, 62 00:05:04,627 --> 00:05:09,553 and this then gives you more confidence that putting in the infrastructure to change the algorithm 63 00:05:09,553 --> 00:05:14,735 to use much more than a thousand examples might actually be a good use of your time.
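As a rough illustration of this sanity check, here is a minimal sketch in Python, assuming scikit-learn and matplotlib are available and that X and y are NumPy arrays; the function and variable names are illustrative, not from the lecture. It randomly subsamples about a thousand examples and plots J train and J cv against the training set size m:

```python
# A minimal sketch of the learning-curve sanity check, assuming scikit-learn
# is available; function and variable names here are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve


def sanity_check_learning_curves(X, y, max_m=1000, seed=0):
    # Randomly pick a subset of roughly a thousand examples out of the
    # full (possibly hundred-million-example) data set.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(max_m, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]

    # Train on growing slices of the subset and measure the training and
    # cross-validation objectives at each training set size.
    sizes, train_scores, cv_scores = learning_curve(
        LogisticRegression(max_iter=1000), Xs, ys,
        train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
        scoring="neg_log_loss")

    j_train = -train_scores.mean(axis=1)  # J_train(theta), averaged over folds
    j_cv = -cv_scores.mean(axis=1)        # J_cv(theta), averaged over folds

    plt.plot(sizes, j_train, label="J_train(theta)")
    plt.plot(sizes, j_cv, label="J_cv(theta)")
    plt.xlabel("training set size m")
    plt.ylabel("cost")
    plt.legend()
    plt.show()

    # A wide gap between the curves that is still closing suggests high
    # variance: more data should help. Curves that have already converged
    # at a high cost suggest high bias: more data alone is unlikely to help.
```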
64 00:05:14,735 --> 00:05:19,642 So in large-scale machine learning, we like to come up with computationally reasonable ways, 65 00:05:19,642 --> 00:05:24,026 or computationally efficient ways, to deal with very big data sets. 66 00:05:24,026 --> 00:05:26,826 In the next few videos, we'll see two main ideas. 67 00:05:26,826 --> 00:05:33,464 The first is called stochastic gradient descent and the second is called MapReduce, for dealing with very big data sets. 68 00:05:33,464 --> 00:05:39,986 And after you've learned about these methods, hopefully they will allow you to scale up your learning algorithms to big data 69 00:05:39,986 --> 00:05:43,986 and allow you to get much better performance on many different applications.