It's no news to anybody that datasets have been getting larger over the last two decades, but what is interesting is the impact this has had on machine learning research and machine learning practice.

If you look back 20 years ago, datasets were pretty small, and so we ended up working on a lot of very complex models to be able to squeeze the most out of that data, because we needed good accuracy with very little data. So we worked on things like kernels and graphical models; things were pretty complex.

Ten years later, the scale of data started growing tremendously. We entered an era that we call the big data era. And it turns out that because it was so hard to scale machine learning algorithms to big datasets, and because there was so much data, we reverted back to simpler algorithms, to simpler days, using things like logistic regression and matrix factorization to address machine learning problems. There was even a lot of discussion about what impact this had on machine learning approaches. There was a seminal paper that talked about the unreasonable effectiveness of data. And what it basically says is: if you have a massive dataset, you can use a very simple approach, like a linear classifier or logistic regression, and beat a fancier approach, like a graphical model, because the graphical model can only handle smaller datasets. So on bigger datasets, simple approaches can do extremely well.

Well, since then a lot of things have changed. Our data has gotten even bigger, and our ambitions have gotten bigger. We want to be able to build even more accurate models on this even larger data. And so we're forced to come up with new kinds of algorithms that can scale up complex models to huge datasets to get really amazing accuracy. This is where things like parallelism, using GPUs, using big computer clusters, and the type of technique that we're going to talk about today are going to be helpful for us to deal with complex models: things like boosted decision trees, tensor factorization, and deep learning, which requires a tremendous amount of computation time to get its amazing accuracy gains. And, in contrast with 20 years ago, people are now building these massive, massive graphical models,
because they have new techniques to scale them in parallel and in distributed settings, to be able to get even more accuracy with bigger datasets. So the summary of the story is that machine learning has evolved over the last few years: first it went back to its simpler roots, to be able to use large datasets, but today it is coming up with new algorithms that can scale complex models to massive datasets. That's where we are today.

So, going back to the same framework we'll be discussing throughout the course: we're taking some training data, we're extracting some features, and we're building a machine learning model. But the machine learning algorithm that we're going to use is a small modification of gradient ascent, a tiny change. It will allow us to scale to much bigger datasets, and it often performs really well.

This modification is called stochastic gradient, and it does something extremely simple. What it does is take your massive dataset and your current parameters, w(t), and when computing the gradient, instead of looking at all the data, because you don't need to look at everything, it just looks at a small subset of the data. So it looks at a little bit of data and then updates the coefficients to w(t+1). Then it looks at a little bit more data and updates the coefficients to w(t+2). Then it looks at a little bit more data and updates to w(t+3). And then it looks at a little bit more data and updates the coefficients to w(t+4). So instead of making a massive pass over the entire dataset before making a coefficient update, here we're just looking at little bits of data and updating the coefficients in an interleaved fashion. This small change is going to really change everything for us, and allow us to scale to much bigger datasets, as we'll see today.
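To make the update pattern concrete, here is a minimal sketch of the stochastic gradient idea applied to logistic regression, assuming NumPy, a feature matrix X of shape (N, D), binary labels y in {0, 1}, and hypothetical names (step_size, batch_size, num_passes) chosen for illustration; it is not the course's reference implementation.

```python
import numpy as np

def sigmoid(scores):
    return 1.0 / (1.0 + np.exp(-scores))

def stochastic_gradient_ascent(X, y, step_size=0.1, batch_size=25, num_passes=5):
    """Ascend the logistic regression log likelihood using small batches."""
    N, D = X.shape
    w = np.zeros(D)                       # current coefficients w(t)
    for _ in range(num_passes):
        order = np.random.permutation(N)  # shuffle so batches differ each pass
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the log likelihood computed on this small batch only,
            # instead of a full pass over all N data points.
            errors = y[batch] - sigmoid(X[batch] @ w)
            gradient = X[batch].T @ errors
            w = w + step_size * gradient  # update: w(t) -> w(t+1)
    return w
```

Each inner-loop iteration is one of the interleaved updates described above: a little bit of data, then a coefficient update, rather than one update per full sweep of the dataset.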