Neural networks are one of the most powerful learning algorithms that we have today. In this, and in the next few videos, I'd like to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network. I'm going to focus on the application of neural networks to classification problems.

So, suppose we have a network like that shown on the left, and suppose we have a training set like this of (x^(i), y^(i)) pairs of m training examples. I'm going to use capital L to denote the total number of layers in this network. So, for the network shown on the left, we would have capital L equals 4. And I'm going to use s subscript l to denote the number of units, that is, the number of neurons, not counting the bias unit, in layer l of the network. So, for example, we would have s1, which is the input layer, equal to 3 units; s2 in my example is 5 units; and the output layer s4, which also equals sL because capital L is equal to 4, in my example on the left has 4 units.

We're going to consider two types of classification problems. The first is binary classification, where the labels y are either zero or one. In this case, we would have one output unit. So, this neural network on top has four output units, but if we had binary classification, we would have only one output unit that computes h(x). And the output of the neural network, h(x), is going to be a real number.
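To summarize the notation so far (the layer sizes here are read off the example network on the slide):

```latex
L = \text{total number of layers in the network}, \qquad
s_l = \text{number of units in layer } l \text{ (not counting the bias unit)}
```

For the example network: L = 4, s_1 = 3, s_2 = 5, and s_4 = s_L = 4.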
And in this case the number of output units, sL, where L is again the index of the final layer, because that's the number of layers we have in the network, so the number of units we have in the output layer, is going to be equal to 1. In this case, to simplify notation later, I'm also going to set K equals 1. So, you can think of K as also denoting the number of units in the output layer.

The second type of classification problem we'll consider is the multiclass classification problem, where we may have K distinct classes. So, in our earlier example, I had this representation for y if we have four classes, and in this case we would have capital K output units, and our hypothesis will output vectors that are K-dimensional. And the number of output units will be equal to K. And usually we will have K greater than or equal to 3 in this case, because if we had two classes, then we don't need to use the one-versus-all method. We need to use the one-versus-all method only if we have K greater than or equal to 3 classes; if we have only two classes, we need only one output unit.

Now, let's define the cost function for our neural network. The cost function we use for the neural network is going to be a generalization of the one that we use for logistic regression. For logistic regression, we used to minimize the cost function J(theta), which was minus 1 over m of this cost function, and then plus this extra regularization term here, where this was a sum from j equals 1 through n, because we did not regularize the bias term theta 0. For a neural network, our cost function is going to be a generalization of this.
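For reference, the regularized logistic regression cost just described can be written out as:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)})
          + (1 - y^{(i)})\log\big(1 - h_\theta(x^{(i)})\big) \Big]
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
```

Note that the regularization sum starts at j = 1, leaving the bias term theta_0 unpenalized, exactly as the lecture says.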
Instead of having basically just one logistic regression output unit, we may instead have K of them. So here's our cost function. The neural network now outputs vectors in R^K, where K might be equal to 1 if we have a binary classification problem. I'm going to use this notation, (h(x)) subscript i, to denote the ith output. That is, h(x) is a K-dimensional vector, and so this subscript i just selects out the ith element of the vector that is output by my neural network.

My cost function, J(Theta), is now going to be the following: minus 1 over m of a sum of a term similar to what we have in logistic regression, except that we have this sum from k equals 1 through K. The summation is basically a sum over my K output units. So, if I have four output units, that is, if the final layer of my neural network has four output units, then this is a sum from k equals 1 through 4 of basically the logistic regression cost function, summing that cost function over each of my four output units in turn. And so, you notice in particular that this applies to y_k and h_k, because we're basically taking the kth output unit and comparing that to the value of y_k, which is that element of one of those vectors saying which class it should be.

And finally, the second term here is the regularization term, similar to what we had for logistic regression. This summation term looks really complicated, but all it's doing is summing over these terms, Theta_ji^(l), for all values of i, j, and l, except that we don't sum over the terms corresponding to the bias values, like we had for logistic regression.
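Written out, the neural network cost function being described is:

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}
            \Big[\, y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k
          + (1 - y_k^{(i)})\log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big]
          + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}
            \big(\Theta_{ji}^{(l)}\big)^2
```

The inner sum over k runs the logistic cost over each of the K output units; the triple sum in the regularization term covers every weight in every layer, with i starting at 1 so the bias weights are skipped.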
Concretely, we don't sum over the terms corresponding to where i is equal to zero. That is because when we are computing the activation of a neuron, we have terms like Theta_i0 plus Theta_i1 x1 plus, and so on, where I guess we could have a superscript 2 there if this is the first hidden layer. And so the values with the 0 there correspond to something that multiplies into an x0 or an a0, and so this is kind of like a bias unit. By analogy to what we were doing for logistic regression, we won't sum over those terms in our regularization term, because we don't want to regularize them and shrink their values toward 0.

But this is just one possible convention, and even if you were to sum over i equals 0 up to sl, it would work about the same, and it doesn't make a big difference. But this convention of not regularizing the bias term is perhaps slightly more common.

So, that's the cost function we're going to use to fit our neural network. In the next video, we'll start to talk about an algorithm for trying to optimize the cost function.
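As an illustration, here is a minimal NumPy sketch of computing this cost. The function name, the layout of the weight matrices (bias column first), and the sigmoid forward pass are assumptions for the sketch, not something specified in the video:

```python
import numpy as np

def nn_cost(Theta_list, X, Y, lam):
    """Regularized neural-network cost J(Theta) -- a minimal sketch.

    Assumptions (illustrative, not from the lecture):
      Theta_list : list of weight matrices; Theta_list[l] has shape
                   (s_{l+1}, s_l + 1), with the bias column first.
      X : (m, n) inputs; Y : (m, K) labels as one-hot rows (K = 1 for binary).
      lam : regularization parameter lambda.
    """
    m = X.shape[0]
    A = X
    for Theta in Theta_list:                    # forward propagation
        A = np.hstack([np.ones((m, 1)), A])     # prepend bias unit a0 = 1
        A = 1.0 / (1.0 + np.exp(-A @ Theta.T))  # sigmoid activation
    H = A                                       # (m, K) hypothesis outputs

    # Sum the logistic cost over all m examples and all K output units.
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

    # Regularize every weight except the bias column (i = 0).
    reg = sum(np.sum(Theta[:, 1:] ** 2) for Theta in Theta_list)
    return cost + lam / (2 * m) * reg
```

Skipping `Theta[:, 0]` in the penalty mirrors the convention above of not regularizing the bias weights.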