In this video, I want to start telling you about how we represent neural networks; in other words, how we represent our hypotheses, or how we represent our model, when using neural networks. Neural networks were developed as a way of simulating neurons, or networks of neurons, in the brain.

So, to explain the hypothesis representation, let's start by looking at what a single neuron in the brain looks like. Your brain and mine are jam-packed full of neurons like these, and neurons are cells in the brain. The two things to draw attention to are, first, that the neuron has a cell body, like so, and moreover, that the neuron has a number of input wires, and these are called the dendrites. You can think of them as input wires, and these receive inputs from other locations. The neuron also has an output wire called the axon, and this output wire is what it uses to send signals to other neurons, or to send messages to other neurons. So, at a simplistic level, what a neuron is is a computational unit that gets a number of inputs through its input wires, does some computation, and then sends the output via its axon to other nodes or other neurons in the brain.

Here's an illustration of a group of neurons. The way that neurons communicate with each other is with little pulses of electricity; these are also called spikes, but that's just a name for a little pulse of electricity. So, here's one neuron, and if it wants to send a message, what it does is send a little pulse of electricity via its axon to some different neuron. Here, this axon, this output wire,
connects to the input wire, or connects to the dendrite, of this second neuron over here, which then accepts this incoming message, does some computation, and may in turn decide to send out its own messages on its axon to other neurons. And this is the process by which all human thought happens: these neurons doing computations and passing messages to other neurons as a result of what inputs they've received.

And by the way, this is how our senses and our muscles work as well. If you want to move one of your muscles, the way that works is that a neuron may send these pulses of electricity to your muscle, and that causes your muscle to contract. And if some sensor like your eye wants to send a message to your brain, what it does is send its pulses of electricity to a neuron in your brain, like so.

In a neural network, or rather in an artificial neural network that we implement on a computer, we're going to use a very simple model of what a neuron does. We're going to model a neuron as just a logistic unit. So, when I draw a yellow circle like that, you should think of that as playing a role analogous to, maybe, the body of a neuron. We then feed the neuron a few inputs via its dendrites, or its input wires, and the neuron does some computation and outputs some value on its output wire; in a biological neuron, that corresponds to the axon. And whenever I draw a diagram like this, what this means is that it represents the computation h(x) = 1 / (1 + e^(-θᵀx)), where, as usual, x is our feature vector and θ is our parameter vector.
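To make this single-unit model concrete, here is a minimal sketch in Python/NumPy. This is my own illustration, not from the lecture, and the function and variable names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, theta):
    """A single artificial neuron: h(x) = g(theta' x).

    x     -- input vector with x[0] = 1 as the bias unit
    theta -- parameter (weight) vector of the same length
    """
    return sigmoid(theta @ x)

# Example: features x1, x2, x3 preceded by the bias unit x0 = 1.
x = np.array([1.0, 0.5, -1.2, 3.0])
theta = np.array([0.1, 0.4, -0.3, 0.05])
print(logistic_unit(x, theta))  # a value strictly between 0 and 1
```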
So, this is a very simple, maybe vastly oversimplified, model of the computation that the neuron does, where it gets a number of inputs, x1, x2, x3, and it outputs some value computed like so.

When I draw a neural network, usually I draw only the input nodes x1, x2, x3; sometimes, when it's useful to do so, I draw an extra node for x0. This x0 node is sometimes called the bias unit or the bias neuron. Because x0 is always equal to 1, sometimes I draw it and sometimes I won't, depending on whether it's more notationally convenient for that example.

Finally, one last bit of terminology: when we talk about neural networks, sometimes we'll say that this is a neuron, an artificial neuron, with a sigmoid or logistic activation function. So "activation function", in the neural network terminology, is just another term for that nonlinearity g(z) = 1 / (1 + e^(-z)). And whereas so far I've been calling θ the parameters of the model, and I'll mostly continue to use that terminology to refer to the parameters, in the neural networks literature you might sometimes hear people talk about the weights of a model, and weights means exactly the same thing as the parameters of the model. I'll mostly use the parameters terminology in these videos, but sometimes you may hear others use the weights terminology.

So, this little diagram represents a single neuron. What a neural network is, is just a group of these different neurons strung together.
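Written out with the bias unit made explicit (my own expansion of the formula above, for the three-input case), the single-unit hypothesis is:

h(x) = g(θ0·x0 + θ1·x1 + θ2·x2 + θ3·x3), with x0 = 1 and g(z) = 1 / (1 + e^(-z)),

so the bias unit simply contributes the constant intercept term θ0.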
Concretely, here we have input units x1, x2, and x3, and once again, sometimes I draw this extra node x0 and sometimes not; here, I've drawn it in. And here we have three neurons, which I've written as a(2)1, a(2)2, and a(2)3 (I'll say more about these superscript and subscript indices later), and once again, we can, if we want, add an a(2)0 here, an extra bias unit; it always outputs the value 1. Then finally, we have this third node in the final layer, and it's this third node that outputs the value that the hypothesis h(x) computes.

To introduce a bit more terminology: in a neural network, the first layer is also called the input layer, because this is where we input our features x1, x2, x3. The final layer is also called the output layer, because that layer has the neuron, this one over here, that outputs the final value computed by the hypothesis. And layer 2, in between, is called the hidden layer. The term hidden layer isn't great terminology, but the intuition is that, in supervised learning, you get to see the inputs and you get to see the correct outputs, whereas the hidden layer contains values you don't get to observe in the training set; they're not x and they're not y, and so we call them hidden. Later on we'll see neural networks with more than one hidden layer, but in this example we have one input layer, layer 1; one hidden layer, layer 2; and one output layer, layer 3. But basically, anything that isn't an input layer and isn't an output layer is called a hidden layer.

So, I want to be really clear about what this neural network is doing.
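As a quick reference, here is a hypothetical summary (my own, not from the lecture) of the example architecture just described, unit by unit:

```python
# Layer-by-layer listing of the example 3-3-1 network.
# Bias units always output 1 and are optional in the diagrams.
network = {
    "layer 1 (input)":  ["x0 (bias)", "x1", "x2", "x3"],
    "layer 2 (hidden)": ["a(2)0 (bias)", "a(2)1", "a(2)2", "a(2)3"],
    "layer 3 (output)": ["a(3)1 = h(x)"],
}
for layer, units in network.items():
    print(layer, "->", ", ".join(units))
```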
Let's step through the computational steps that are embodied by, represented by, this diagram. To explain the specific computations represented by a neural network, here's a little bit more notation. I'm going to use a, superscript j, subscript i, to denote the activation of neuron i, or of unit i, in layer j. So concretely, this a superscript 2, subscript 1, denotes the activation of the first unit in layer 2, in our hidden layer. And by activation, I just mean the value that is computed by, and that is output by, a specific unit. In addition, our neural network is parameterized by these matrices Θ superscript j, where Θ(j) is going to be a matrix of weights controlling the function mapping from one layer to the next, say from the first layer to the second layer, or from the second layer to the third layer.

So, here are the computations that are represented by this diagram. This first hidden unit here has its value computed as follows: a(2)1 is equal to the sigmoid function, or the sigmoid activation function, also called the logistic activation function, applied to this linear combination of its inputs. And then this second hidden unit has its activation value computed as the sigmoid of this, and similarly, for this third hidden unit, it's computed by that formula. So here, we have three input units and three hidden units.
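The on-screen formulas being pointed to are only gestured at in the transcript, so I'm writing them out here from the definitions above. With Θ(1)_ik denoting the entry in row i, column k of Θ(1), the standard forms are:

a(2)1 = g(Θ(1)_10 x0 + Θ(1)_11 x1 + Θ(1)_12 x2 + Θ(1)_13 x3)
a(2)2 = g(Θ(1)_20 x0 + Θ(1)_21 x1 + Θ(1)_22 x2 + Θ(1)_23 x3)
a(2)3 = g(Θ(1)_30 x0 + Θ(1)_31 x1 + Θ(1)_32 x2 + Θ(1)_33 x3)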
And so the dimension of Θ(1), the matrix of parameters governing our mapping from the three input units to the three hidden units, is going to be 3 by 4. More generally, if a network has s_j units in layer j and s_(j+1) units in layer j+1, then the matrix Θ(j), which governs the function mapping from layer j to layer j+1, will have dimension s_(j+1) by (s_j + 1). Just to be clear about this notation: in the first dimension, that's s subscript j+1, where the whole j+1 is part of the subscript; in the second dimension, that's s subscript j, plus 1, where the plus 1 is not part of the subscript.

So, we've talked about what the three hidden units do to compute their values. Finally, in this last layer, the output layer, we have one more unit, which computes h(x); that can also be written as a(3)1, and it's equal to this. And you'll notice that I've written this with a superscript 2 here, because Θ superscript 2 is the matrix of parameters, or the matrix of weights, that controls the function mapping from the hidden units, that is, the layer 2 units, to the one layer 3 unit, that is, the output unit.

To summarize, what we've done is shown how a picture like this over here defines an artificial neural network, which defines a function h that maps from input values x to, hopefully, good predictions y.
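Putting the pieces together, here is a minimal vectorized sketch of the full forward computation for this 3-3-1 network. It's my own illustration with randomly chosen parameters, but the shapes follow the dimension rule above, so Θ(1) is 3 by 4 and Θ(2) is 1 by 4:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Theta1 maps layer 1 (3 inputs + bias) to layer 2 (3 hidden units):
# shape s_2 x (s_1 + 1) = 3 x 4.
Theta1 = rng.standard_normal((3, 4))
# Theta2 maps layer 2 (3 hidden units + bias) to layer 3 (1 output unit):
# shape s_3 x (s_2 + 1) = 1 x 4.
Theta2 = rng.standard_normal((1, 4))

x = np.array([0.5, -1.2, 3.0])       # features x1, x2, x3

a1 = np.concatenate(([1.0], x))      # prepend bias unit x0 = 1
a2 = sigmoid(Theta1 @ a1)            # hidden activations a(2)1..a(2)3
a2 = np.concatenate(([1.0], a2))     # prepend bias unit a(2)0 = 1
h = sigmoid(Theta2 @ a2)             # output a(3)1 = h(x)
print(h)                             # one value strictly between 0 and 1
```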
And these hypotheses are parameterized by parameters that I'm denoting with a capital Θ, so that, as we vary Θ, we get different hypotheses; we get different functions mapping, say, from x to y. So, this gives us a mathematical definition of how to represent the hypothesis in a neural network. In the next few videos, what I'd like to do is give you more intuition about what these hypothesis representations do, as well as go through a few examples and talk about how to compute them efficiently.