In this and in the next set of videos, I'd like to tell you about a learning algorithm called a Neural Network. We're going to first talk about the representation, and then in the next set of videos talk about learning algorithms for it. Neural networks are actually a pretty old idea, but they had fallen out of favor for a while. Today, though, they are the state-of-the-art technique for many different machine learning problems. So why do we need yet another learning algorithm? We already have linear regression and we have logistic regression, so why do we need neural networks? In order to motivate the discussion of neural networks, let me start by showing you a few examples of machine learning problems where we need to learn complex non-linear hypotheses. Consider a supervised learning classification problem where you have a training set like this. If you want to apply logistic regression to this problem, one thing you could do is apply logistic regression with a lot of non-linear features like that. So here, g as usual is the sigmoid function, and we can include lots of polynomial terms like these. And if you include enough polynomial terms, then maybe you can get a hypothesis that separates the positive and negative examples. This particular method works well when you have only, say, two features - x1 and x2 - because you can then include all those polynomial terms of x1 and x2. But many interesting machine learning problems have a lot more features than just two.
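As a rough illustration of what "logistic regression with a lot of polynomial terms" looks like in code, here is a minimal sketch assuming a degree-6 expansion of two features x1 and x2; the degree and the parameter vector theta are illustrative choices, not something specified in the lecture:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), the logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-z))

def polynomial_features(x1, x2, degree=6):
    # All terms x1^i * x2^j with i + j <= degree, including the constant (bias) term.
    return np.array([x1**i * x2**j
                     for i in range(degree + 1)
                     for j in range(degree + 1 - i)])

def hypothesis(theta, x1, x2):
    # h_theta(x) = g(theta^T * features(x)); interpreted as P(y = 1 | x).
    return sigmoid(theta @ polynomial_features(x1, x2))
```

With enough such terms (28 of them for degree 6) the decision boundary can curve around the positive examples, which is exactly the point being made here.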
We've been talking for a while about housing prediction. Suppose you have a housing classification problem rather than a regression problem: maybe you have different features of a house, and you want to predict the odds that your house will be sold within the next six months, so that would be a classification problem. And as we saw, we can come up with quite a lot of features - maybe a hundred different features of different houses. For a problem like this, if you were to include all the quadratic terms - that is, all the second-order polynomial terms - there would be a lot of them. There would be terms like x1 squared, x1x2, x1x3, x1x4, up to x1x100, and then you have x2 squared, x2x3, and so on. And if you include just the second-order terms - that is, the terms that are a product of two of these features, like x1 times x2, and so on - then for the case of n equals 100 you end up with about five thousand features. Asymptotically, the number of quadratic features grows roughly as order n squared, where n is the number of original features, like the x1 through x100 that we had. It's actually closer to n squared over two. So including all the quadratic features doesn't seem like a good idea, because that is a lot of features: you might end up overfitting the training set, and it can also be computationally expensive to work with that many features. One thing you could do is include only a subset of these. If you include only the features x1 squared, x2 squared, x3 squared, up to maybe x100 squared, then the number of features is much smaller.
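To make the "about five thousand" and "roughly n squared over two" claims concrete, here is a quick sketch that counts every product x_i * x_j with i <= j (so the squares are included), which is one standard way to enumerate the quadratic terms:

```python
from itertools import combinations_with_replacement

n = 100  # number of original features, x1 ... x100
# Every quadratic term is a product x_i * x_j with i <= j.
quadratic_terms = list(combinations_with_replacement(range(1, n + 1), 2))
print(len(quadratic_terms))   # 5050, i.e. n*(n+1)/2 -- roughly n^2 / 2
```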
Here you would have only 100 such quadratic features, but this is not enough features, and it certainly won't let you fit a data set like the one on the upper left. In fact, if you include only these quadratic features together with the original features x1, and so on up to x100, then you can actually fit some interesting hypotheses - you can fit things like axis-aligned ellipses like these - but you certainly cannot fit a more complex data set like the one shown here. So 5000 features seems like a lot, and if you were to include the cubic, or third-order, terms as well - terms like x1x2x3, x1 squared x2, x10x11x17, and so on - you can imagine there are going to be a lot of these features. In fact, there are going to be on the order of n cubed such features, and if n is 100, you can compute that you end up with on the order of about 170,000 such cubic features. So including these higher-order polynomial features when your original number of features n is large really dramatically blows up your feature space, and this doesn't seem like a good way to come up with additional features with which to build non-linear classifiers when n is large. For many machine learning problems, n will be pretty large. Here's an example. Let's consider the problem of computer vision. Suppose you want to use machine learning to train a classifier to examine an image and tell us whether or not the image is a car. Many people wonder why computer vision could be difficult. I mean, when you and I look at this picture, it is so obvious what this is. You wonder how it is that a learning algorithm could possibly fail to know what this picture is.
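As a quick sanity check on the "about 170,000" figure, here is the same counting sketch extended to cubic terms, assuming we count every product x_i * x_j * x_k with i <= j <= k (so squares times a feature and pure cubes are included):

```python
from itertools import combinations_with_replacement

n = 100
cubic_terms = list(combinations_with_replacement(range(1, n + 1), 3))
print(len(cubic_terms))   # 171700 = C(n + 2, 3), which grows on the order of n^3
```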
To understand why computer vision is hard, let's zoom in to a small part of the image, like the area where the little red rectangle is. It turns out that where you and I see a car, the computer sees this. What it sees is this matrix, or this grid, of pixel intensity values that tells us the brightness of each pixel in the image. So the computer vision problem is to look at this matrix of pixel intensity values and tell us that these numbers represent the door handle of a car. Concretely, when we use machine learning to build a car detector, what we do is come up with a labeled training set with, let's say, a few labeled examples of cars and a few labeled examples of things that are not cars. Then we give our training set to the learning algorithm to train a classifier, and then we can test it by showing it a new image and asking, "What is this new thing?" And hopefully it will recognize that that is a car. To understand why we need non-linear hypotheses, let's take a look at some of the images of cars, and maybe non-cars, that we might feed to our learning algorithm. Let's pick a couple of pixel locations in our images - so that's pixel 1's location and pixel 2's location - and let's plot this car at a point whose position depends on the intensities of pixel 1 and pixel 2. And let's do this with a few other images. So let's take a different example of a car and look at the same two pixel locations. That image has a different intensity for pixel 1 and a different intensity for pixel 2, so it ends up at a different location on the figure. And then let's plot some negative examples as well. That's a non-car, that's a non-car.
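A rough sketch of this idea in code - the tiny image array and the two pixel locations below are made-up placeholders, not taken from the lecture slides:

```python
import numpy as np

# Pretend this is a small grayscale image patch: a grid of pixel intensities, 0-255.
image = np.array([[ 12,  40,  55],
                  [200, 180,  90],
                  [230, 220, 110]], dtype=np.uint8)

# Fix two pixel locations (row, column): the "pixel 1" and "pixel 2" of the plot.
pixel1_loc, pixel2_loc = (0, 1), (2, 2)

# Each image is then summarized by the 2-D point
# (intensity at pixel 1, intensity at pixel 2).
point = (int(image[pixel1_loc]), int(image[pixel2_loc]))
print(point)   # (40, 110); cars and non-cars land in different regions of this plane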
And if we do this for more and more examples, using pluses to denote cars and minuses to denote non-cars, what we'll find is that the cars and non-cars end up lying in different regions of the space, and what we need, therefore, is some sort of non-linear hypothesis to try to separate out the two classes. What is the dimension of the feature space? Suppose we were to use just 50 by 50 pixel images - so suppose our images are pretty small ones, just 50 pixels on a side. Then we would have 2500 pixels, and so the dimension of our feature space would be n equals 2500, where our feature vector x is a list of all the pixel intensities: the brightness of pixel 1, the brightness of pixel 2, and so on, down to the brightness of the last pixel. In a typical computer representation, each of these may be a value between, say, 0 and 255, giving us the grayscale value. So we have n equals 2500, and that's if we were using grayscale images. If we were using RGB images with separate red, green, and blue values, we would have n equals 7500. So, if we were to try to learn a non-linear hypothesis by including all the quadratic features - that is, all the terms of the form xi times xj - then with the 2500 pixels we would end up with a total of about three million features. And that's just too large to be reasonable; the computation would be very expensive to find and to represent all of these three million features per training example.
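The arithmetic behind these numbers, as a short sketch using the same i <= j counting convention as before:

```python
n_gray = 50 * 50          # 2500 grayscale pixel intensities
n_rgb  = 3 * n_gray       # 7500 values with separate red, green, and blue channels

# All quadratic terms x_i * x_j with i <= j over the grayscale features:
n_quadratic = n_gray * (n_gray + 1) // 2
print(n_gray, n_rgb, n_quadratic)   # 2500 7500 3126250 -- roughly three million
```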
So simple logistic regression, together with adding in maybe the quadratic or the cubic features, is just not a good way to learn complex non-linear hypotheses when n is large, because you just end up with too many features. In the next few videos, I'd like to tell you about Neural Networks, which turn out to be a much better way to learn complex hypotheses - complex non-linear hypotheses - even when your input feature space is large, even when n is large. And along the way, I'll also get to show you a couple of fun videos of historically important applications of neural networks, which I hope will be fun for you to watch as well.