[MUSIC] From here, we're going to explore one of the most fundamental things that happens in practice with data: the issue of missing data. So here I have a dataset, but unlike what we've done so far in this course and specialization, where we assumed that all data is observed, here if you look at the second row, you have a question mark for that person. You don't know if the loan they took was a three-year loan or a five-year loan. So what do we do when we have missing data? We're going to talk about many extremely practical ways to address it, including a modification of decision trees where the tree can be learned to take missing data into account, and make decisions that depend not only on observed values, like whether your credit was excellent, fair, or poor, but also on what to do if your credit is unobserved. Those techniques are going to be extremely useful in practice and widely applicable.

The seventh module is going to be amazing. We're going to look at a question that was asked by Kearns and Valiant in 1988. In fact, Valiant is a Turing Award winner, so this is a fundamental question. The question was: can you combine simple classifiers in a way that gives you the performance of a really complex classifier? And that question, which was purely theoretical, was answered a couple of years later by Schapire in the positive, using something called boosting, which is an amazing algorithm that has had an incredible impact in practice. In fact, if you know what a Kaggle competition is, which is one of those online machine learning competitions, more than half of the winners use boosting in their solution. Boosting is a simple technique that has really changed the world, and we're going to learn the fundamentals of the technique, and you're going to be able to implement it yourself. We're going to talk about one kind of boosting algorithm called AdaBoost, where you take the output of a classifier. For example, this decision tree here might say that a loan is likely to be okay: safe, +1. But you might have another one that says no, it's risky, -1. And you have others that might say +1 or -1, and you have the vote of many classifiers in what's called an ensemble.
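As a quick illustration of that voting idea, here is a minimal sketch of an ensemble vote in Python. It assumes each simple classifier returns +1 (safe) or -1 (risky) and has already been assigned a weight; in AdaBoost those weights come from each classifier's weighted training error, which the course covers later. The stumps, weights, and loan record below are made up for illustration, not from the course.

```python
def ensemble_predict(weak_classifiers, weights, x):
    """Combine the +1/-1 votes of many simple classifiers into one prediction."""
    score = sum(w * clf(x) for clf, w in zip(weak_classifiers, weights))
    return +1 if score >= 0 else -1

# Example: three hypothetical "decision stumps" voting on a loan application.
loan = {"credit": "fair", "term": "3 years", "income": "high"}
stumps = [
    lambda x: +1 if x["credit"] == "excellent" else -1,
    lambda x: +1 if x["term"] == "3 years" else -1,
    lambda x: +1 if x["income"] == "high" else -1,
]
weights = [0.9, 0.8, 0.6]  # stronger classifiers get a larger say in the vote
print(ensemble_predict(stumps, weights, loan))  # +1 -> the weighted vote says "safe"
```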
So boosting is about building these ensembles, where many classifiers vote, and what we want to learn is how to combine those votes to get the best possible prediction. By learning those combinations with a boosting algorithm, we're going to be able to start from very simple classifiers but come up with very complex decision boundaries. And this is exactly the technique that wins most of those Kaggle competitions.

In the eighth module we're going to step back and look at fundamental concepts in machine learning. One is called precision and recall. Let me give you an example. Say I own a restaurant and I want to increase the number of guests that I have, the number of customers, by 30%. How do I do that? Well, I'm going to start a marketing campaign, but I don't want to be just like everybody else; I want it to be an authentic, nice marketing campaign. So I want to use the reviews of my restaurant to find great things to say, and every time somebody enters a review on our website, I get some sentence shown on my website saying how great my restaurant is. So given the reviews, I want to predict which sentences are most positive, like "easily the best sushi in Seattle," and also which people we should showcase because they say great things about the restaurant. When you have a setting like this, accuracy is not a good metric. What you really care about is what's called precision and recall. Precision asks: if I pick out a few sentences from the reviews and show them on my website, how likely is it that I'm going to show a really negative sentence? Because if I show a bad sentence like "the sushi was terrible," that's really bad for my website. So precision makes sure that I show only positive sentences. Recall, on the other hand, is about finding all the great positive things that people are saying. So if a classifier has good precision and recall, it means that I find all the great sentences, and I only show great sentences about my restaurant. We're going to talk about that in quite a lot of detail, because precision and recall are what you will most likely use if you build a classifier in practice. It's what basically every company that builds classifiers uses as its core metric.
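To make those two quantities concrete, here is a minimal sketch of computing precision and recall for the restaurant-review example, assuming labels are +1 (positive sentence) and -1 (negative sentence). The tiny label arrays below are made up purely for illustration.

```python
def precision_recall(true_labels, predicted_labels):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == +1 and t == +1)  # shown and truly positive
    fp = sum(1 for t, p in pairs if p == +1 and t == -1)  # shown but actually negative
    fn = sum(1 for t, p in pairs if p == -1 and t == +1)  # positive sentence we missed
    precision = tp / (tp + fp) if tp + fp else 0.0  # of the sentences I show, how many are truly positive?
    recall = tp / (tp + fn) if tp + fn else 0.0     # of all the positive sentences, how many did I find?
    return precision, recall

true_labels      = [+1, +1, -1, +1, -1]
predicted_labels = [+1, -1, -1, +1, +1]
print(precision_recall(true_labels, predicted_labels))  # (0.666..., 0.666...)
```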
And in the final module we're going to address the issue of scalability. How do we scale to really massive datasets? As you can see, the number of web pages on the web is growing tremendously; there are about 4.8 billion pages today. There are about 500 million tweets per day. Hey, follow me on Twitter, by the way. I send about one a day, maybe less. And if you think about YouTube, there are 5 billion video views on YouTube every day. So there's tons of data out there, and gradient-type methods don't tend to scale very well when you have massive amounts of data. So what we're going to show is a technique called stochastic gradient, which converges much faster than gradient to the solution. It's just a very small modification to gradient that gives you amazing performance; there's a small sketch of the difference at the end of this overview. In this simple example from sentiment analysis, we see over 100 times faster performance on the same dataset. However, stochastic gradient is an extremely finicky technique to get to work right. There are many practical problems that you need to address to make it work. So we're going to talk about the technique, explain why it works, and also explain those practical issues that you must address in order to get it to work well.

So as you can see, it's going to be an action-packed course that's going to cover a wide range of topics in machine learning. [MUSIC]
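Coming back to the stochastic gradient discussion above, here is a minimal sketch of the difference between a full gradient step and a stochastic gradient step for logistic regression. This is an illustrative toy under assumed conventions (labels in {0, 1}, a fixed learning rate, NumPy arrays), not the course's implementation, and the generated data is synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gradient_step(w, X, y, lr=0.1):
    # Uses the entire dataset for one update: accurate but expensive at scale.
    grad = X.T @ (y - sigmoid(X @ w)) / len(y)
    return w + lr * grad

def stochastic_gradient_step(w, x_i, y_i, lr=0.1):
    # Uses a single example: a noisy estimate of the gradient, but far cheaper,
    # so many more updates fit into the same amount of computation.
    return w + lr * (y_i - sigmoid(x_i @ w)) * x_i

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w = np.zeros(3)
for i in rng.permutation(len(y)):  # one stochastic pass over shuffled data
    w = stochastic_gradient_step(w, X[i], y[i], lr=0.05)
```

The practical issues the module mentions (shuffling the data, choosing and decaying the step size, averaging the parameters) are exactly the knobs this sketch glosses over.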