In this video I'm going to define what is probably the most common type of machine learning problem, which is supervised learning. I'll define supervised learning more formally later, but it's probably best to start with an example, and we'll do the formal definition afterward.

Let's say you want to predict housing prices. A while back, a student collected a data set from Portland, Oregon. Suppose you plot that data set and it looks like this: here on the horizontal axis is the size of different houses in square feet, and on the vertical axis is the price of different houses in thousands of dollars. Given this data, let's say you have a friend who owns a house that is, say, 750 square feet, and they're hoping to sell the house and want to know how much they can get for it.

So how can a learning algorithm help? One thing a learning algorithm might be able to do is fit a straight line to the data, and based on that, it looks like maybe the house can be sold for about $150,000. But maybe this isn't the only learning algorithm you can use; there might be a better one. For example, instead of fitting a straight line to the data, we might decide that it's better to fit a quadratic function, or second-order polynomial, to this data. If you do that and make a prediction here, then it looks like maybe we can sell the house for closer to $200,000. One of the things we'll talk about later is how to decide whether you want to fit a straight line or a quadratic function to the data, rather than just picking whichever one gives your friend the better price for the house. But each of these would be a fine example of a learning algorithm.

So this is an example of a supervised learning algorithm. The term supervised learning refers to the fact that we gave the algorithm a data set in which the "right answers" were given. That is, we gave it a data set of houses in which, for every example, we told it the right price, the actual price that house sold for, and the task of the algorithm was to produce more of these right answers, such as for the new house that your friend may be trying to sell. To define a bit more terminology, this is also called a regression problem, and by regression problem I mean we're trying to predict a continuous-valued output, namely the price.
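To make the two fits concrete, here is a minimal sketch in Python. The numbers are invented for illustration (the lecture's Portland data isn't reproduced here), and NumPy's polynomial fitting simply stands in for whatever fitting procedure the learning algorithm actually uses:

```python
import numpy as np

# Hypothetical (size, price) pairs; prices are in thousands of dollars.
sizes = np.array([500, 750, 1000, 1250, 1500, 2000, 2500])
prices = np.array([100, 160, 210, 250, 280, 340, 400])

# Fit a straight line (degree 1) and a quadratic (degree 2) to the data.
line = np.polyfit(sizes, prices, deg=1)
quad = np.polyfit(sizes, prices, deg=2)

# Predict the price of the friend's 750-square-foot house under each model.
print(f"straight-line prediction: ${np.polyval(line, 750):.0f}k")
print(f"quadratic prediction:     ${np.polyval(quad, 750):.0f}k")
```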
So technically, I guess prices can be rounded off to the nearest cent, so maybe prices are actually discrete values. But usually we think of the price of a house as a real number, a scalar, a continuous value, and the term regression refers to the fact that we're trying to predict this sort of continuous-valued attribute.

Here's another supervised learning example; some friends and I were actually working on this earlier. Let's say you want to look at medical records and try to predict whether a breast cancer is malignant or benign. If someone discovers a breast tumor, a lump in their breast, a malignant tumor is a tumor that is harmful and dangerous, and a benign tumor is a tumor that is harmless. So obviously people care a lot about this.

Let's say you've collected a data set, and suppose in your data set you have on the horizontal axis the size of the tumor, and on the vertical axis I'm going to plot one or zero, yes or no: whether the tumors we've seen before are malignant, which is one, or zero if benign, not malignant. So let's say our data set looks like this, where we saw a tumor of this size that turned out to be benign, one of this size, one of this size, and so on. And sadly we also saw a few malignant tumors: one of that size, one of that size, one of that size, and so on. So in this example I have five examples of benign tumors shown down here, and five examples of malignant tumors shown with a vertical-axis value of one.

And let's say we have a friend who tragically has a breast tumor, and let's say her breast tumor size is maybe somewhere around this value. The machine learning question is: can you estimate the probability, the chance, that the tumor is malignant versus benign?
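Here is one way such an estimate could be computed. The lecture doesn't name a specific algorithm for this step, so this sketch uses logistic regression from scikit-learn as a stand-in, with invented tumor sizes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: tumor size is the single feature; 0 = benign, 1 = malignant.
X = np.array([[1.0], [1.4], [1.8], [2.2], [2.6],    # benign examples
              [3.4], [3.8], [4.2], [4.6], [5.0]])   # malignant examples
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Estimated chance that the friend's tumor (size 3.1, say) is malignant.
p = clf.predict_proba([[3.1]])[0, 1]
print(f"P(malignant) ~ {p:.2f}")
```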
To introduce a bit more terminology, this is an example of a classification problem. The term classification refers to the fact that here we're trying to predict a discrete-valued output: zero or one, malignant or benign. And it turns out that in classification problems you can sometimes have more than two possible values for the output. As a concrete example, maybe there are three types of breast cancer, and so you may try to predict a discrete value of zero, one, two, or three, with zero being benign, so no cancer; one meaning type-one cancer, whatever type one means; two meaning a second type of cancer; and three meaning a third type of cancer. But this would also be a classification problem, because the output is again a discrete set of values, corresponding to no cancer, cancer type one, cancer type two, or cancer type three.

In classification problems there is another way to plot this data. Let me show you what I mean, using a slightly different set of symbols. If tumor size is going to be the attribute I use to predict malignancy or benignness, I can also draw my data like this. I'm going to use different symbols to denote my benign and malignant, or my negative and positive, examples: instead of drawing crosses, I'm now going to draw O's for the benign tumors, like so, and I'm going to keep using X's to denote my malignant tumors. I hope this is beginning to make sense. All I did was take my data set on top and map it down onto this real line, using different symbols, circles and crosses, to denote malignant versus benign examples.

Now, in this example we used only one feature, or one attribute, namely the tumor size, in order to predict whether the tumor is malignant or benign. In other machine learning problems we may have more than one feature, more than one attribute. Here's an example: let's say that instead of just knowing the tumor size, we know both the age of the patient and the tumor size. In that case maybe your data set looks like this, where I have one set of patients with these ages and tumor sizes, and a different set of patients, who look a little different, whose tumors turned out to be malignant, as denoted by the crosses.

So let's say you have a friend who tragically has a tumor, and maybe their tumor size and age fall around there. Given a data set like this, what the learning algorithm might do is fit a straight line through the data to try to separate the malignant tumors from the benign ones, and so the learning algorithm may decide to put the straight line like that to separate the two classes of tumors. And with this, if your friend's tumor falls over there, hopefully your learning algorithm will say that it falls on the benign side and is therefore more likely to be benign than malignant. In this example we had two features, namely the age of the patient and the size of the tumor.
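A sketch of that two-feature version, again with invented numbers, and again using logistic regression as a stand-in, whose decision boundary over two features is exactly a straight line, matching the picture described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: each row is (age, tumor size); 0 = benign, 1 = malignant.
X = np.array([[25, 1.0], [32, 1.6], [38, 1.3], [45, 2.0], [52, 1.8],   # benign
              [41, 4.1], [55, 4.6], [61, 3.9], [66, 5.0], [71, 4.3]])  # malignant
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# The friend's (age, tumor size) point: which side of the line is it on?
friend = [[48, 2.2]]
print("predicted class:", clf.predict(friend)[0])  # 0 = benign side
print("P(malignant): %.2f" % clf.predict_proba(friend)[0, 1])
```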
In other machine learning problems we will often have more features. My friends who work on this problem actually use other features, such as clump thickness, the clump thickness of the breast tumor; uniformity of cell size of the tumor; uniformity of cell shape of the tumor; and so on, and other features as well. And it turns out that one of the most interesting learning algorithms we'll see in this class is a learning algorithm that can deal with not just two or three or five features, but an infinite number of features. On this slide I've listed a total of five different features: two on the axes and three more up here. But it turns out that for some learning problems, what you really want is not to use three or five features, but an infinite number of features, an infinite number of attributes, so that your learning algorithm has lots of attributes or features or cues with which to make predictions. So how do you deal with an infinite number of features? How do you even store an infinite number of things in a computer when your computer will run out of memory? It turns out that when we talk about an algorithm called the Support Vector Machine, there will be a neat mathematical trick that allows a computer to deal with an infinite number of features. Imagine that I didn't just write down two features here and three features on the right, but instead wrote down an infinitely long list and just kept writing more and more features. It turns out we'll be able to come up with an algorithm that can deal with that, as sketched below.
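As a preview, here is a minimal sketch of that idea using scikit-learn's support vector machine (my choice of library; the lecture only names the algorithm). A Gaussian, or RBF, kernel corresponds to an infinite-dimensional feature space, yet the computer only ever evaluates pairwise kernel values:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-feature data that no straight line can separate:
# the label is 1 when a point lies outside the unit circle, else 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# The RBF kernel's implicit feature space is infinite-dimensional,
# but training only computes kernel values between pairs of examples.
clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))
```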
So, just to recap: in this class we'll talk about supervised learning, and the idea is that in supervised learning, for every example in our data set, we are told the "correct answer" that we would have liked the algorithm to predict on that example, such as the price of the house, or whether a tumor is malignant or benign. We also talked about the regression problem, where the goal is to predict a continuous-valued output, and we talked about the classification problem, where the goal is to predict a discrete-valued output.

Just a quick wrap-up question. Suppose you're running a company and you want to develop learning algorithms to address each of two problems. In the first problem, you have a large inventory of identical items; imagine that you have thousands of copies of some identical item to sell, and you want to predict how many of these items you will sell within the next three months. In the second problem, you have lots of users, and you want to write software to examine each individual customer's account, and for each account decide whether or not it has been hacked or compromised. So, for each of these problems, should they be treated as a classification problem or as a regression problem? When the video pauses, please use your mouse to select whichever of the four options on the left you think is the correct answer.

So hopefully you got that this is the answer. Problem one I would treat as a regression problem, because if I have thousands of items, I would probably just treat the number of items I sell as a real value, a continuous value. And the second problem I would treat as a classification problem, because I might set the value I want to predict to zero to denote an account that has not been hacked, and set the value one to denote an account that has been hacked into, just like with breast cancer, where zero is benign and one is malignant. So I might set this to be zero or one depending on whether the account has been hacked, and have an algorithm try to predict each of these two discrete values. And because there is a small number of discrete values, I would therefore treat it as a classification problem.

So that's it for supervised learning, and in the next video I'll talk about unsupervised learning, which is the other major category of learning algorithms.
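As a footnote, the quiz answer can be expressed in code: the same kind of pipeline, but problem one gets a continuous target and problem two a zero/one target. All feature names and numbers here are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Problem 1 (regression): predict a continuous target, units sold,
# from a made-up feature such as last quarter's sales.
past_sales = np.array([[120], [150], [90], [200], [170]])
units_next = np.array([380, 450, 300, 610, 520])  # continuous target
demand = LinearRegression().fit(past_sales, units_next)

# Problem 2 (classification): predict a 0/1 target, hacked or not,
# from a made-up feature such as the count of failed logins.
failed_logins = np.array([[0], [1], [2], [15], [22], [30]])
hacked = np.array([0, 0, 0, 1, 1, 1])  # discrete 0/1 target
detector = LogisticRegression().fit(failed_logins, hacked)

print("forecast units sold:", demand.predict([[160]])[0])
print("account hacked (0/1):", detector.predict([[18]])[0])
```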