In this video, I'd like to work through an example to show how a neural network can compute complex nonlinear hypotheses.

In the last video, we saw how a neural network can be used to compute the functions x1 AND x2, and x1 OR x2, when x1 and x2 are binary, that is, when they take on values of 0 or 1. We can also have a network compute negation, that is, compute the function "NOT x1". Let me just write down the weights associated with this network. We have only one input feature, x1, in this case, plus the bias unit +1, and if I associate these with the weights +10 and -20, then my hypothesis computes h(x) = g(10 - 20*x1). So when x1 is equal to 0, my hypothesis computes g(10 - 20*0), which is just g(10), and that's approximately 1. And when x1 is equal to 1, this is g(-10), which is approximately equal to 0. And if you look at what these values are, that's essentially the "NOT x1" function.

So to include negations, the general idea is to put a large negative weight in front of the variable you want to negate. So it's -20 multiplied by x1, and that's the general idea of how you end up negating x1. And so, as an example that I hope you will figure out yourself: if you want to compute a function like "NOT x1 AND NOT x2", part of that will involve putting large negative weights in front of x1 and x2, but it should be feasible to get a neural network with just one output unit to compute this as well. All right?
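As a minimal sketch (not part of the original lecture), here is the "NOT x1" unit in Python, assuming the usual logistic sigmoid as the activation function; the helper name not_gate is just an illustrative choice:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def not_gate(x1):
    """Single unit with bias weight +10 and input weight -20, as in the lecture."""
    return sigmoid(10 - 20 * x1)

for x1 in (0, 1):
    print(x1, round(not_gate(x1), 4))  # ~1.0 when x1 = 0, ~0.0 when x1 = 1
```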
So this logical function, "NOT x1 AND NOT x2", is going to be equal to 1 if, and only if, x1 = x2 = 0, right? So this is a logical function: "NOT x1" means x1 must be zero, and "NOT x2" means x2 must be equal to zero as well. So this logical function is equal to 1 if, and only if, both x1 and x2 are equal to zero. And hopefully you should be able to figure out how to make a small neural network that computes this logical function as well.

Now, taking the three pieces that we have put together, the network for computing x1 AND x2, the network for computing NOT x1 AND NOT x2, and one last network for computing x1 OR x2, we should be able to put these three pieces together to compute the x1 XNOR x2 function. And just to remind you, if these axes are x1 and x2, the function that we want to compute has negative examples here and here, and positive examples there and there. So clearly we'll need a nonlinear decision boundary in order to separate the positive and negative examples.

Let's draw the network. I'm going to take my inputs, +1, x1, and x2, and create my first hidden unit here. I'm going to call it a(2)1 because it's my first hidden unit. And I'm going to copy the weights over from the red network, the x1 AND x2 network, so that's -30, 20, 20. Next, let me create a second hidden unit, which I'm going to call a(2)2; that is the second hidden unit of layer two. And I'm going to copy over the cyan network in the middle, so I'm going to have the weights 10, -20, -20. And now let's fill in some of the truth table values. A short sketch of these two hidden units follows below.
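Here is a small Python sketch (again, not from the lecture itself) of the two hidden units just described, using the weights copied from the red (AND) and cyan (NOR) networks; hidden_layer is a hypothetical helper name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(x1, x2):
    # a(2)1: red network, x1 AND x2, weights -30, 20, 20
    a21 = sigmoid(-30 + 20 * x1 + 20 * x2)
    # a(2)2: cyan network, (NOT x1) AND (NOT x2), weights 10, -20, -20
    a22 = sigmoid(10 - 20 * x1 - 20 * x2)
    return a21, a22

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a21, a22 = hidden_layer(x1, x2)
    print(x1, x2, round(a21), round(a22))  # a21: 0,0,0,1  a22: 1,0,0,0
```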
For the red network, we know that it computes x1 AND x2, so a(2)1 is going to be approximately 0, 0, 0, 1, depending on the values of x1 and x2. And a(2)2, that's the cyan network; we know the function NOT x1 AND NOT x2 outputs 1, 0, 0, 0 for the four values of x1 and x2. Finally, I'm going to create my output node, my output unit, that is a(3)1; this is what outputs h(x). I'm going to copy over the OR network for that, and I'm going to need a +1 bias unit here, so let me draw that in and copy over the weights from the green network: -10, 20, 20. And we saw earlier that this computes the OR function.

So, let's fill in the truth table entries. The first entry is 0 OR 1, which is going to be 1; the next is 0 OR 0, which is 0; then 0 OR 0, which is 0; and then 1 OR 0, which evaluates to 1. And thus h(x) is equal to 1 when either both x1 and x2 are 0, or when x1 and x2 are both 1. Concretely, h(x) outputs 1 at exactly these two locations and outputs 0 otherwise. And thus, with this neural network, which has an input layer, one hidden layer, and one output layer, we end up with a nonlinear decision boundary that computes this XNOR function.

And the more general intuition is that in the input layer, we just had our raw inputs; then we had a hidden layer, which computed some slightly more complex functions of the inputs, as shown here; and then by adding yet another layer, we end up with an even more complex nonlinear function.
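Putting the three pieces together, a minimal Python sketch of the full forward pass (assuming the same sigmoid activation and the weight values quoted in the lecture; xnor_network is an illustrative name) reproduces the XNOR truth table:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xnor_network(x1, x2):
    """Forward pass through the two-layer network built from AND, NOR, and OR units."""
    a21 = sigmoid(-30 + 20 * x1 + 20 * x2)    # red network:  x1 AND x2
    a22 = sigmoid(10 - 20 * x1 - 20 * x2)     # cyan network: (NOT x1) AND (NOT x2)
    h = sigmoid(-10 + 20 * a21 + 20 * a22)    # green network: a21 OR a22
    return h

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xnor_network(x1, x2)))  # prints 1, 0, 0, 1: x1 XNOR x2
```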
And this is the sort of intuition for why neural networks can compute pretty complicated functions: when you have multiple layers, you have relatively simple functions of the inputs in the second layer, but the third layer can build on that to compute even more complex functions, and then the layer after that can compute functions that are more complex still.

To wrap up this video, I want to show you a fun example of an application of a neural network that captures this intuition of the deeper layers computing more complex features. I want to show you a video that I got from a good friend of mine, Yann LeCun. Yann is a professor at New York University, at NYU, and he was one of the early pioneers of neural network research; he's something of a legend in the field now, and his ideas are used in all sorts of products and applications throughout the world.

So, I want to show you a video from some of his early work in which he was using a neural network to recognize handwriting, to do handwritten digit recognition. You might remember that early in this class, at the start of this class, I said that one of the early successes of neural networks was trying to use them to read zip codes, to help us send mail along; so, to read postal codes. So this is one of the attempts, one of the algorithms used to try to address that problem. In the video I'll show you, this area here is the input area that shows a handwritten character presented to the network.
This column here shows a visualization of the features computed by the first hidden layer of the network, so for the first hidden layer this visualization shows the different features, different edges and lines and so on, being detected. This is a visualization of the next hidden layer; it's harder to understand what the deeper hidden layers are doing, and that's the visualization of what the next hidden layer is computing. You'll probably have a hard time seeing what's going on much beyond the first hidden layer. But then finally, all of these learned features get fed to the output layer, and shown over here is the final answer, the final predicted value for what handwritten digit the neural network thinks it is being shown. So, let's take a look at the video.

So, I hope you enjoyed the video, and that it gave you some intuition about the sorts of pretty complicated functions neural networks can learn, in which the network takes as input an image, just the raw pixels, and the first hidden layer computes some set of features, the next layer computes even more complex features, and then even more complex features, and these features can then be used by what is essentially the final layer, a logistic regression classifier, to make accurate predictions about what are the numbers that the network sees.