1 00:00:00,332 --> 00:00:04,284 In the next few videos, we'll talk about large scale machine learning. 2 00:00:04,284 --> 00:00:08,316 That is, algorithms for dealing with big data sets. 3 00:00:08,316 --> 00:00:12,839 If you look back at the recent 5 or 10-year history of machine learning, 4 00:00:12,839 --> 00:00:17,853 one of the reasons that learning algorithms work so much better now than even, say, 5 years ago 5 00:00:17,853 --> 00:00:22,657 is just the sheer amount of data that we now have and can train our algorithms on. 6 00:00:22,657 --> 00:00:29,741 In these next few videos, we'll talk about algorithms for dealing with such massive data sets. 7 00:00:32,926 --> 00:00:35,527 So why do we want to use such large data sets? 8 00:00:35,527 --> 00:00:40,564 We've already seen that one of the best ways to get a high-performance machine learning system 9 00:00:40,564 --> 00:00:46,168 is to take a low-bias learning algorithm and train it on a lot of data. 10 00:00:46,168 --> 00:00:53,561 And so, one early example we have already seen was this example of classifying between confusable words. 11 00:00:53,561 --> 00:01:00,726 So, "For breakfast I ate two (TWO) eggs," and we saw in this example these sorts of results, 12 00:01:00,726 --> 00:01:06,436 where, you know, so long as you feed the algorithm a lot of data, it seems to do very well. 13 00:01:06,436 --> 00:01:10,419 And so it's results like these that have led to the saying in machine learning that 14 00:01:10,419 --> 00:01:15,151 often it's not who has the best algorithm that wins. It's who has the most data. 15 00:01:15,151 --> 00:01:19,568 So we want to learn from large data sets, at least when we can get such large data sets. 16 00:01:19,568 --> 00:01:27,027 But learning with large data sets comes with its own unique problems, specifically, computational problems. 17 00:01:27,027 --> 00:01:33,870 Let's say your training set size is m equals 100,000,000. 18 00:01:33,870 --> 00:01:37,934 And this is actually pretty realistic for many modern data sets. 19 00:01:37,934 --> 00:01:40,518 If you look at the US Census data set, if there are, you know, 20 00:01:40,518 --> 00:01:44,663 300 million people in the US, you can usually get hundreds of millions of records. 21 00:01:44,663 --> 00:01:47,856 If you look at the amount of traffic that popular websites get, 22 00:01:47,856 --> 00:01:52,509 you can easily get training sets that are much larger than hundreds of millions of examples. 23 00:01:52,509 --> 00:01:57,407 And let's say you want to train a linear regression model, or maybe a logistic regression model, 24 00:01:57,407 --> 00:02:01,692 in which case this is the gradient descent rule. 25 00:02:01,692 --> 00:02:05,372 And if you look at what you need to do to compute the gradient, 26 00:02:05,372 --> 00:02:09,992 which is this term over here, then when m is a hundred million, 27 00:02:09,992 --> 00:02:13,976 you need to carry out a summation over a hundred million terms 28 00:02:13,976 --> 00:02:18,977 in order to compute these derivative terms and to perform a single step of gradient descent. 29 00:02:18,977 --> 00:02:25,627 Because of the computational expense of summing over a hundred million entries 30 00:02:25,627 --> 00:02:28,628 in order to compute just one step of gradient descent, 31 00:02:28,628 --> 00:02:31,530 in the next few videos we'll talk about techniques 32 00:02:31,530 --> 00:02:38,413 for either replacing this with something else or finding more efficient ways to compute this derivative.
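For reference, since the slide itself isn't reproduced in this transcript, the update being described is the standard batch gradient descent rule for linear regression, sketched here in the course's usual notation; the summation is the derivative term the lecture points to:

```latex
% Batch gradient descent update (linear regression), repeated for every
% parameter theta_j on each step. With m = 100,000,000, the sum below runs
% over every training example for a single step of descent.
\theta_j := \theta_j \;-\; \alpha \,\frac{1}{m} \sum_{i=1}^{m}
  \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)\, x_j^{(i)}
```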
33 00:02:38,413 --> 00:02:41,709 By the end of this sequence of videos on large scale machine learning, 34 00:02:41,709 --> 00:02:47,045 you'll know how to fit models such as linear regression, logistic regression, neural networks, and so on, 35 00:02:47,045 --> 00:02:50,990 even to today's data sets with, say, a hundred million examples. 36 00:02:50,990 --> 00:02:56,035 Of course, before we put in the effort of training a model on a hundred million examples, 37 00:02:56,035 --> 00:03:01,276 we should also ask ourselves, well, why not use just a thousand examples? 38 00:03:01,276 --> 00:03:04,923 Maybe we can randomly pick a subset of a thousand examples 39 00:03:04,923 --> 00:03:10,254 out of a hundred million examples and train our algorithm on just that thousand. 40 00:03:10,254 --> 00:03:16,076 So before investing the effort into actually developing the software needed to train these massive models, 41 00:03:16,076 --> 00:03:22,461 it's often a good sanity check to ask whether training on just a thousand examples might do just as well. 42 00:03:22,461 --> 00:03:29,731 The way to sanity-check whether using a much smaller training set might do just as well, 43 00:03:29,731 --> 00:03:33,958 that is, using a much smaller training set of size m equals 1,000, 44 00:03:33,958 --> 00:03:37,797 is the usual method of plotting the learning curves. 45 00:03:37,797 --> 00:03:46,872 So if you were to plot the learning curves and your training objective were to look like this, 46 00:03:46,872 --> 00:03:49,553 that's J train of theta, 47 00:03:49,553 --> 00:03:56,422 and your cross-validation set objective, J cv of theta, were to look like this, 48 00:03:56,422 --> 00:04:00,310 then this looks like a high-variance learning algorithm, 49 00:04:00,310 --> 00:04:05,913 and we would be more confident that adding extra training examples would improve performance. 50 00:04:05,913 --> 00:04:10,462 Whereas, in contrast, if you were to plot the learning curves 51 00:04:10,462 --> 00:04:20,339 and your training objective were to look like this, and your cross-validation objective were to look like that, 52 00:04:20,339 --> 00:04:24,292 then this looks like the classical high-bias learning algorithm. 53 00:04:24,292 --> 00:04:28,084 And in the latter case, you know, if you were to plot this up to, 54 00:04:28,084 --> 00:04:33,437 say, m equals 1,000, so that is m equals 500 up to m equals 1,000, 55 00:04:33,437 --> 00:04:39,400 then it seems unlikely that increasing m to a hundred million will do much better, 56 00:04:39,400 --> 00:04:42,736 and then you'd be just fine sticking to m equals 1,000 57 00:04:42,736 --> 00:04:47,000 rather than investing a lot of effort to figure out how to scale up the algorithm. 58 00:04:47,000 --> 00:04:51,029 Of course, if you were in the situation shown by the figure on the right, 59 00:04:51,029 --> 00:04:53,885 then one natural thing to do would be to add extra features, 60 00:04:53,885 --> 00:04:58,484 or add extra hidden units to your neural network, and so on, 61 00:04:58,484 --> 00:05:04,627 so that you end up with a situation closer to that on the left, where maybe this is up to m equals 1,000, 62 00:05:04,627 --> 00:05:09,553 and this then gives you more confidence that putting in the infrastructure to change the algorithm 63 00:05:09,553 --> 00:05:14,735 to use much more than a thousand examples might actually be a good use of your time.
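As a rough illustration of this sanity check, here is a minimal sketch in Python, assuming scikit-learn and matplotlib are available and that X and y are NumPy arrays; the function and variable names are illustrative, not from the lecture. It randomly subsamples about a thousand examples and plots J train and J cv against the training set size m:

```python
# A minimal sketch of the learning-curve sanity check, assuming scikit-learn
# is available; function and variable names here are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve


def sanity_check_learning_curves(X, y, max_m=1000, seed=0):
    # Randomly pick a subset of roughly a thousand examples out of the
    # full (possibly hundred-million-example) data set.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(max_m, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]

    # Train on growing slices of the subset and measure the training and
    # cross-validation objectives at each training set size.
    sizes, train_scores, cv_scores = learning_curve(
        LogisticRegression(max_iter=1000), Xs, ys,
        train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
        scoring="neg_log_loss")

    j_train = -train_scores.mean(axis=1)  # J_train(theta), averaged over folds
    j_cv = -cv_scores.mean(axis=1)        # J_cv(theta), averaged over folds

    plt.plot(sizes, j_train, label="J_train(theta)")
    plt.plot(sizes, j_cv, label="J_cv(theta)")
    plt.xlabel("training set size m")
    plt.ylabel("cost")
    plt.legend()
    plt.show()

    # A wide gap between the curves that is still closing suggests high
    # variance: more data should help. Curves that have already converged
    # at a high cost suggest high bias: more data alone is unlikely to help.
```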
64 00:05:14,735 --> 00:05:19,642 So in large-scale machine learning, we like to come up with computationally reasonable ways, 65 00:05:19,642 --> 00:05:24,026 or computationally efficient ways, to deal with very big data sets. 66 00:05:24,026 --> 00:05:26,826 In the next few videos, we'll see two main ideas. 67 00:05:26,826 --> 00:05:33,464 The first is called stochastic gradient descent and the second is called MapReduce, for dealing with very big data sets. 68 00:05:33,464 --> 00:05:39,986 And after you've learned about these methods, hopefully they will allow you to scale up your learning algorithms to big data 69 00:05:39,986 --> 00:05:43,986 and allow you to get much better performance on many different applications.