It's no news to anybody that datasets have been getting larger over the last two decades, but what is interesting is the impact this has had on machine learning research and machine learning practice.

If you look back 20 years ago, datasets were pretty small, and so we ended up working on a lot of very complex models to be able to squeeze the most out of that data, because we needed good accuracy with very little data. So we worked on things like kernels and graphical models; things were pretty complex.

Ten years later, the scale of data started growing tremendously. We entered an era that we call the big data era. And it turns out that because it was so hard to scale machine learning algorithms to big datasets, and because there was so much data, we reverted back to simpler algorithms, to simpler days, using things like logistic regression and matrix factorization to address machine learning problems. There was even a lot of discussion about what impact this had on machine learning approaches. There was a seminal paper that talked about the unreasonable effectiveness of data. And what it basically says is: if you have a massive dataset, you can use a very simple approach, like a linear classifier or logistic regression, and beat a fancier approach, like a graphical model, because the graphical model can only handle smaller datasets. So on bigger datasets, simple approaches can do extremely well.

Well, since then a lot of things have changed. Our data has gotten even bigger, and our ambitions have gotten bigger. We want to be able to build even more accurate models on this even larger data. And so we're forced to come up with new kinds of algorithms that can scale up complex models to huge datasets to get really amazing accuracy. This is where things like parallelism, using GPUs, using big computer clusters, and the type of technique that we're going to talk about today are going to be helpful for us to deal with complex models: things like boosted decision trees, tensor factorization, and deep learning, which requires a tremendous amount of computation time to get its amazing accuracy gains. And, in contrast with 20 years ago, people are now building these massive, massive graphical models,
because they have new techniques to scale them in parallel and in distributed settings, to be able to get even more accuracy with bigger datasets. So the summary of the story is that machine learning has evolved over the last few years: first it went back to its simpler roots, to be able to use large datasets, but today it is coming up with new algorithms that can scale complex models to massive datasets. That's where we are today.

So, going back to the same framework we'll be discussing throughout the course: we're taking some training data, we're extracting some features, and we're building a machine learning model. But the machine learning algorithm that we're going to use is a small modification of gradient ascent, a tiny change. It will allow us to scale to much bigger datasets, and it often performs really well.

This modification is called stochastic gradient, and it does something extremely simple. What it does is take your massive dataset and your current parameters, w(t), and when computing the gradient, instead of looking at all the data, because you don't need to look at everything, it just looks at a small subset of the data. So it looks at a little bit of data and then updates the coefficients to w(t+1). Then it looks at a little bit more data and updates the coefficients to w(t+2). Then it looks at a little bit more data and updates to w(t+3). And then it looks at a little bit more data and updates the coefficients to w(t+4). So instead of making a massive pass over the entire dataset before making a coefficient update, here we're just looking at little bits of data and updating the coefficients in an interleaved fashion. This small change is going to really change everything for us, and allow us to scale to much bigger datasets, as we'll see today.
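To make the update pattern concrete, here is a minimal sketch of the stochastic gradient idea applied to logistic regression, assuming NumPy, a feature matrix X of shape (N, D), binary labels y in {0, 1}, and hypothetical names (step_size, batch_size, num_passes) chosen for illustration; it is not the course's reference implementation.

```python
import numpy as np

def sigmoid(scores):
    return 1.0 / (1.0 + np.exp(-scores))

def stochastic_gradient_ascent(X, y, step_size=0.1, batch_size=25, num_passes=5):
    """Ascend the logistic regression log likelihood using small batches."""
    N, D = X.shape
    w = np.zeros(D)                       # current coefficients w(t)
    for _ in range(num_passes):
        order = np.random.permutation(N)  # shuffle so batches differ each pass
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the log likelihood computed on this small batch only,
            # instead of a full pass over all N data points.
            errors = y[batch] - sigmoid(X[batch] @ w)
            gradient = X[batch].T @ errors
            w = w + step_size * gradient  # update: w(t) -> w(t+1)
    return w
```

Each inner-loop iteration is one of the interleaved updates described above: a little bit of data, then a coefficient update, rather than one update per full sweep of the dataset.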