[MUSIC]

In this module we're going to address a really important problem with machine learning today: how to scale the algorithms we've discussed to really large data sets. The ideas we discuss today are going to be broadly applicable. They're going to be applicable for all of classification, and they're also going to be applicable for the regression course, the second course of the specialization. We're going to talk about a technique called stochastic gradient, and we'll relate it to something called online learning, where you're learning from data streams in which one piece of data arrives at a time.

Let's take a moment to review how gradient ascent works in the context of classification, and how it's impacted by the data set size. We have a very large data set here, and we have a set of coefficients w(t) which we're hoping to update. So we're going to use gradient ascent. We compute the gradient on this data set, which requires us to make a pass, or scan, over the data, computing the contribution of each one of these data points to the gradient. Then we take the gradient, go ahead and update the coefficients, and get w(t+1). Now we have to go back to the data set and make another pass, where we visit every single data point, compute a new gradient, and update the parameters, the coefficients, to get w(t+2). So in this process, every time we do a coefficient update we have to make a full scan, or a full pass, over the entire data set, which can be really slow if the data set is really big.

And these days data sets are getting huge. You can think about the 4.8 billion webpages that are out there on the web. You can think about the fact that Twitter, for example, is generating 500 million tweets a day. That's a lot. By the way, follow me on Twitter. [LAUGH] Or you can think about how the world is getting embedded with sensors, something we today call the Internet of Things, where you have devices throughout the home, devices we carry with us like the smartwatch here, and everything else connected to each other, generating tons and tons and tons of data.
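To make concrete why each of these full-pass updates is expensive on data sets like these, here is a minimal sketch of one gradient ascent update for logistic regression. It is written in Python with NumPy; the function and variable names are illustrative assumptions, not code from the course.

```python
import numpy as np

def predict_probability(X, w):
    """P(y = +1 | x, w) under a logistic regression model."""
    return 1.0 / (1.0 + np.exp(-X.dot(w)))

def full_pass_gradient_ascent_step(X, y, w, step_size):
    """One coefficient update w(t) -> w(t+1) using the entire data set.

    Every call scans all N rows of X, so each update costs a full
    pass over the data.
    """
    indicator = (y == +1).astype(float)           # 1[y_i = +1]
    errors = indicator - predict_probability(X, w)
    gradient = X.T.dot(errors)                    # sum of per-point contributions
    return w + step_size * gradient
```

Because the gradient sums a contribution from every data point, the cost of a single coefficient update grows linearly with the size of the data set.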
And you can think about specific websites like YouTube, where 300 hours of video are uploaded every minute, and nobody watches. And this is a really fundamental problem for machine learning: how to tackle these huge, massive data sets.

Let's use YouTube as an example. We have tons of videos being uploaded, and there's a billion users visiting the website. And YouTube makes money out of ad revenue, out of showing the right ad to each one of its users. Now, the number you should be thinking about is not the 300 hours of video a minute or the 1 billion users, but the 4 billion page views, or video views, that they have every day. For each one of those page views they have to serve ads, they have to figure out what ads to put with those videos, and they have to go back and retrain their learning algorithm. In other words, they need the machine learning algorithm that figures out what ad to show to deal with 5 billion events per day, and to be fast enough that it can make predictions about what ad to show within milliseconds, as you're about to watch those videos.

[MUSIC]
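As a preview of where the module is headed, here is a sketch of the kind of per-example update that stochastic gradient and online learning build on. Again this is illustrative Python/NumPy under my own naming assumptions, not the course's own code.

```python
def online_gradient_ascent_step(x_i, y_i, w, step_size):
    """Update the coefficients from a single arriving example (x_i, y_i).

    Each event triggers one cheap update (O(D) work for D features),
    with no full pass over the data set between updates.
    """
    indicator = 1.0 if y_i == +1 else 0.0
    error = indicator - 1.0 / (1.0 + np.exp(-np.dot(x_i, w)))
    return w + step_size * error * x_i

# Streaming usage (event_stream is a hypothetical source of labeled events):
# for x_i, y_i in event_stream:
#     w = online_gradient_ascent_step(x_i, y_i, w, step_size=0.1)
```

Because each arriving event touches only one data point, the cost of an update no longer grows with the size of the data set.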