1
00:00:00,250 --> 00:00:01,530
In the previous video, we talked

2
00:00:01,850 --> 00:00:02,870
about how to use back propagation

3
00:00:03,980 --> 00:00:05,810
to compute the derivatives of your cost function.

4
00:00:06,780 --> 00:00:07,770
In this video, I want

5
00:00:08,030 --> 00:00:10,260
to quickly tell you about one implementational detail of

6
00:00:11,220 --> 00:00:13,110
unrolling your parameters from

7
00:00:13,670 --> 00:00:15,500
matrices into vectors, which we

8
00:00:15,610 --> 00:00:17,870
need in order to use the advanced optimization routines.

9
00:00:20,230 --> 00:00:21,470
Concretely, let's say

10
00:00:21,640 --> 00:00:23,120
you've implemented a cost function

11
00:00:23,660 --> 00:00:24,870
that takes this input, you know, parameters

12
00:00:25,420 --> 00:00:28,690
theta and returns the cost function and returns derivatives.

13
00:00:30,050 --> 00:00:31,260
Then you can pass this to

14
00:00:31,510 --> 00:00:33,820
an advanced authorization algorithm by fminunc

15
00:00:34,080 --> 00:00:34,790
and fminunc

16
00:00:34,890 --> 00:00:35,900
isn't the only one by the way.

17
00:00:36,060 --> 00:00:38,660
There are also other advanced authorization algorithms.

18
00:00:39,710 --> 00:00:40,910
But what all of them

19
00:00:41,030 --> 00:00:41,970
do is take those input

20
00:00:42,730 --> 00:00:43,560
pointedly the cost function,

21
00:00:44,490 --> 00:00:45,730
and some initial value of theta.

22
00:00:47,010 --> 00:00:48,490
And both, and these

23
00:00:48,730 --> 00:00:51,600
routines assume that theta and

24
00:00:51,740 --> 00:00:53,360
the initial value of theta, that

25
00:00:53,580 --> 00:00:55,410
these are parameter vectors, maybe

26
00:00:55,640 --> 00:00:57,040
Rn or Rn plus 1.

27
00:00:57,870 --> 00:01:00,440
But these are vectors and it

28
00:01:00,530 --> 00:01:01,880
also assumes that, you know, your cost

29
00:01:02,150 --> 00:01:03,770
function will return as

30
00:01:03,960 --> 00:01:05,640
a second return value this

31
00:01:05,830 --> 00:01:07,410
gradient which is also Rn

32
00:01:07,640 --> 00:01:09,860
and Rn plus 1. So also a vector.

33
00:01:10,840 --> 00:01:11,890
This worked fine when we

34
00:01:12,040 --> 00:01:14,030
were using logistic progression but

35
00:01:14,220 --> 00:01:15,120
now that we're using a neural

36
00:01:15,280 --> 00:01:17,160
network our parameters are

37
00:01:17,220 --> 00:01:18,370
no longer vectors, but instead

38
00:01:18,980 --> 00:01:21,110
they are these matrices where for

39
00:01:21,310 --> 00:01:22,670
a full neural network we would

40
00:01:22,830 --> 00:01:26,050
have parameter matrices theta 1, theta 2, theta 3

41
00:01:26,700 --> 00:01:28,080
that we might represent in Octave

42
00:01:28,680 --> 00:01:30,660
as these matrices theta 1, theta 2, theta 3.

43
00:01:31,450 --> 00:01:33,160
And similarly these gradient

44
00:01:33,760 --> 00:01:35,030
terms that were expected to return.

45
00:01:35,720 --> 00:01:36,890
Well, in the previous video we

46
00:01:36,980 --> 00:01:38,430
showed how to compute these

47
00:01:38,840 --> 00:01:40,520
gradient matrices, which was

48
00:01:40,980 --> 00:01:42,290
capital D1, capital D2,

49
00:01:42,560 --> 00:01:43,950
capital D3, which we

50
00:01:44,080 --> 00:01:46,130
might represent an octave as matrices D1, D2, D3.

51
00:01:48,080 --> 00:01:49,150
In this video I want

52
00:01:49,480 --> 00:01:50,420
to quickly tell you about the

53
00:01:50,510 --> 00:01:51,480
idea of how to take

54
00:01:51,980 --> 00:01:54,060
these matrices and unroll them into vectors.

55
00:01:54,590 --> 00:01:55,750
So that they end up

56
00:01:55,910 --> 00:01:57,790
being in a format suitable for

57
00:01:57,930 --> 00:02:00,090
passing into as theta here off for getting

58
00:02:00,460 --> 00:02:01,850
out for a gradient there.

59
00:02:03,220 --> 00:02:04,540
Concretely, let's say we

60
00:02:04,670 --> 00:02:06,740
have a neural network with one

61
00:02:06,950 --> 00:02:08,250
input layer with ten units,

62
00:02:09,010 --> 00:02:10,000
hidden layer with ten units

63
00:02:10,540 --> 00:02:11,870
and one output layer with

64
00:02:12,020 --> 00:02:13,090
just one unit, so s1

65
00:02:13,270 --> 00:02:14,030
is the number of units in layer one

66
00:02:14,440 --> 00:02:15,710
and s2 is the

67
00:02:15,860 --> 00:02:18,220
number of units in layer two, and s3 is a number

68
00:02:18,520 --> 00:02:20,700
of units in layer three.

69
00:02:21,560 --> 00:02:23,200
In this case, the dimension of

70
00:02:23,460 --> 00:02:25,240
your matrices theta and

71
00:02:25,350 --> 00:02:26,380
D are going to be

72
00:02:26,570 --> 00:02:28,110
given by these expressions.

73
00:02:28,520 --> 00:02:30,300
For example, theta one

74
00:02:30,630 --> 00:02:33,220
is going to a 10 by 11 matrix and so on.

75
00:02:34,420 --> 00:02:35,740
So in if you want

76
00:02:35,950 --> 00:02:37,960
to convert between these matrices.

77
00:02:38,580 --> 00:02:38,580
vectors.

78
00:02:39,330 --> 00:02:40,590
What you can do is take

79
00:02:40,830 --> 00:02:42,130
your theta 1, theta

80
00:02:42,350 --> 00:02:44,220
2, theta 3, and write this

81
00:02:44,410 --> 00:02:45,470
piece of code and this will

82
00:02:45,610 --> 00:02:46,820
take all the elements of

83
00:02:46,900 --> 00:02:48,540
your three theta matrices and

84
00:02:48,770 --> 00:02:49,400
take all the elements

85
00:02:49,860 --> 00:02:51,150
of theta one, all the

86
00:02:51,260 --> 00:02:52,290
elements of theta 2, all the

87
00:02:52,400 --> 00:02:53,840
elements of theta 3,

88
00:02:54,130 --> 00:02:55,510
and unroll them and put

89
00:02:55,770 --> 00:02:57,420
all the elements into a big long vector.

90
00:02:58,540 --> 00:02:59,880
Which is thetaVec and similarly

91
00:03:00,960 --> 00:03:02,510
the second command would take

92
00:03:02,830 --> 00:03:04,350
all of your D matrices and

93
00:03:04,490 --> 00:03:05,600
unroll them into a big

94
00:03:05,930 --> 00:03:07,340
long vector and call them

95
00:03:07,510 --> 00:03:08,810
DVec.
And finally

96
00:03:09,370 --> 00:03:10,330
if you want to go back from

97
00:03:10,520 --> 00:03:13,380
the vector representations to the matrix representations.

98
00:03:14,620 --> 00:03:15,630
What you do to get back

99
00:03:15,840 --> 00:03:17,720
to theta one say is take

100
00:03:17,940 --> 00:03:19,250
thetaVec and pull

101
00:03:19,530 --> 00:03:20,980
out the first 110 elements.

102
00:03:21,470 --> 00:03:22,930
So theta 1 has 110

103
00:03:23,390 --> 00:03:24,650
elements because it's a

104
00:03:24,720 --> 00:03:26,420
10 by 11 matrix so that

105
00:03:26,810 --> 00:03:28,200
pulls out the first 110 elements

106
00:03:28,540 --> 00:03:30,200
and then you can

107
00:03:30,370 --> 00:03:32,960
use the reshape command to reshape those back into theta 1.

108
00:03:33,010 --> 00:03:34,730
And similarly, to get

109
00:03:34,900 --> 00:03:35,850
back theta 2 you pull

110
00:03:36,280 --> 00:03:39,010
out the next 110 elements and reshape it.

111
00:03:39,670 --> 00:03:41,410
And for theta 3, you pull out

112
00:03:41,450 --> 00:03:43,320
the final eleven elements and run

113
00:03:43,500 --> 00:03:45,210
reshape to get back the theta 3.

114
00:03:48,840 --> 00:03:50,700
Here's a quick Octave demo of that process.

115
00:03:51,270 --> 00:03:52,370
So for this example

116
00:03:53,010 --> 00:03:54,530
let's set theta 1 equal

117
00:03:55,340 --> 00:03:57,440
to be ones of 10 by

118
00:03:57,670 --> 00:03:59,580
11, so it's a matrix of all ones. And

119
00:04:00,360 --> 00:04:01,400
just to make this easier seen,

120
00:04:01,750 --> 00:04:03,060
let's set that to be 2

121
00:04:03,280 --> 00:04:05,150
times ones, 10 by

122
00:04:05,310 --> 00:04:07,390
11 and let's

123
00:04:07,600 --> 00:04:09,570
set theta 3 equals 3

124
00:04:10,290 --> 00:04:12,110
times 1's of 1 by 11.

125
00:04:12,390 --> 00:04:13,680
So this is 3

126
00:04:13,980 --> 00:04:17,030
separate matrices: theta 1, theta 2, theta 3.

127
00:04:17,770 --> 00:04:19,010
We want to put all of these as a vector.

128
00:04:19,670 --> 00:04:22,740
ThetaVec equals theta

129
00:04:23,380 --> 00:04:26,660
1; theta 2

130
00:04:28,540 --> 00:04:28,990
theta 3.

131
00:04:29,260 --> 00:04:32,060
Right, that's a colon

132
00:04:32,540 --> 00:04:34,220
in the middle and like so

133
00:04:35,350 --> 00:04:37,420
and now thetavec is

134
00:04:37,590 --> 00:04:40,090
going to be a very long vector.

135
00:04:41,050 --> 00:04:41,910
That's 231 elements.

136
00:04:42,970 --> 00:04:46,000
If I display it, I find

137
00:04:46,290 --> 00:04:47,640
that this very long vector with

138
00:04:47,780 --> 00:04:48,610
all the elements of the first

139
00:04:48,880 --> 00:04:49,630
matrix, all the elements of

140
00:04:50,090 --> 00:04:52,360
the second matrix, then all the elements of the third matrix.

141
00:04:53,480 --> 00:04:54,450
And if I want to get back

142
00:04:54,930 --> 00:04:56,420
my original matrices, I can

143
00:04:56,500 --> 00:05:00,040
do reshape thetaVec.

144
00:05:01,400 --> 00:05:02,580
Let's pull out the first 110

145
00:05:03,100 --> 00:05:05,640
elements and reshape them to a 10 by 11 matrix.

146
00:05:06,810 --> 00:05:08,240
This gives me back theta 1.

147
00:05:08,690 --> 00:05:09,770
And if I then pull

148
00:05:10,280 --> 00:05:12,220
out the next 110 elements.

149
00:05:12,720 --> 00:05:14,690
So that's indices 111 to 220.

150
00:05:14,850 --> 00:05:16,470
I get back all of my 2's.

151
00:05:18,030 --> 00:05:19,330
And if I go

152
00:05:20,850 --> 00:05:22,110
from 221 up to

153
00:05:22,280 --> 00:05:24,240
the last element, which is

154
00:05:24,440 --> 00:05:25,970
element 231, and reshape to

155
00:05:26,070 --> 00:05:28,130
1 by 11, I get back theta 3.

156
00:05:30,810 --> 00:05:32,110
To make this process really concrete,

157
00:05:32,950 --> 00:05:34,750
here's how we use the unrolling

158
00:05:35,320 --> 00:05:36,990
idea to implement our learning algorithm.

159
00:05:38,200 --> 00:05:39,180
Let's say that you have some

160
00:05:39,490 --> 00:05:40,600
initial value of the parameters

161
00:05:41,170 --> 00:05:42,410
theta 1, theta 2, theta 3.

162
00:05:42,950 --> 00:05:43,740
What we're going to do

163
00:05:44,020 --> 00:05:45,880
is take these and unroll

164
00:05:46,290 --> 00:05:47,610
them into a long vector

165
00:05:47,960 --> 00:05:50,380
we're gonna call initial theta to

166
00:05:50,600 --> 00:05:52,170
pass in to fminunc

167
00:05:52,360 --> 00:05:54,900
as this initial setting of the parameters theta.

168
00:05:56,160 --> 00:05:58,310
The other thing we need to do is implement the cost function.

169
00:05:59,310 --> 00:06:01,510
Here's my implementation of the cost function.

170
00:06:02,900 --> 00:06:04,070
The cost function is going to

171
00:06:04,160 --> 00:06:05,500
give us input, thetaVec,

172
00:06:05,980 --> 00:06:07,090
which is going to be all

173
00:06:07,350 --> 00:06:08,770
of my parameters vectors that in

174
00:06:08,870 --> 00:06:10,680
the form that's been unrolled into a vector.

175
00:06:11,960 --> 00:06:12,800
So the first thing I'm going to

176
00:06:13,000 --> 00:06:13,890
do is I'm going to use

177
00:06:14,100 --> 00:06:16,580
thetaVec and I'm going to use the reshape functions.

178
00:06:17,040 --> 00:06:18,120
So I'll pull out elements from

179
00:06:18,320 --> 00:06:19,440
thetaVec and use reshape

180
00:06:19,750 --> 00:06:20,950
to get back my

181
00:06:21,320 --> 00:06:23,560
original parameter matrices, theta 1, theta 2, theta 3.

182
00:06:24,120 --> 00:06:26,530
So these are going to be matrices that I'm going to get.

183
00:06:26,620 --> 00:06:28,000
So that gives me a

184
00:06:28,060 --> 00:06:29,920
more convenient form in which

185
00:06:30,130 --> 00:06:31,580
to use these matrices so that I

186
00:06:31,750 --> 00:06:33,590
can run forward propagation and

187
00:06:33,880 --> 00:06:35,400
back propagation to compute my

188
00:06:35,570 --> 00:06:38,140
derivatives, and to compute my cost function j of theta.

189
00:06:39,710 --> 00:06:40,900
And finally, I can then

190
00:06:41,120 --> 00:06:42,620
take my derivatives and unroll

191
00:06:43,030 --> 00:06:44,530
them, to keeping the elements

192
00:06:45,140 --> 00:06:47,440
in the same ordering as I did when I unroll my thetas.

193
00:06:48,390 --> 00:06:49,780
But I'm gonna unroll D1, D2,

194
00:06:50,030 --> 00:06:51,330
D3, to get gradientVec

195
00:06:52,190 --> 00:06:55,180
which is now what my cost function can return.

196
00:06:55,490 --> 00:06:57,420
It can return a vector of these derivatives.

197
00:06:59,150 --> 00:07:00,310
So, hopefully, you now have

198
00:07:00,490 --> 00:07:01,650
a good sense of how to

199
00:07:01,890 --> 00:07:03,200
convert back and forth between

200
00:07:03,360 --> 00:07:04,970
the matrix representation of the

201
00:07:05,090 --> 00:07:08,220
parameters versus the vector representation of the parameters.

202
00:07:09,360 --> 00:07:10,290
The advantage of the matrix

203
00:07:10,760 --> 00:07:12,330
representation is that when

204
00:07:12,470 --> 00:07:13,530
your parameters are stored as

205
00:07:13,670 --> 00:07:15,670
matrices it's more convenient when

206
00:07:15,830 --> 00:07:17,430
you're doing forward propagation and

207
00:07:17,530 --> 00:07:19,110
back propagation and it's easier

208
00:07:19,850 --> 00:07:21,160
when your parameters are stored as

209
00:07:21,360 --> 00:07:22,770
matrices to take advantage

210
00:07:23,400 --> 00:07:24,780
of the, sort of, vectorized implementations.

211
00:07:26,230 --> 00:07:27,900
Whereas in contrast the advantage of

212
00:07:28,090 --> 00:07:30,250
the vector representation, when you

213
00:07:30,320 --> 00:07:31,820
have like thetaVec or DVec is that

214
00:07:32,500 --> 00:07:34,540
when you are using the advanced optimization algorithms.

215
00:07:34,770 --> 00:07:36,640
Those algorithms tend to

216
00:07:36,760 --> 00:07:37,730
assume that you have

217
00:07:38,090 --> 00:07:40,730
all of your parameters unrolled into a big long vector.

218
00:07:41,720 --> 00:07:42,930
And so with what we just

219
00:07:43,140 --> 00:07:44,650
went through, hopefully you can now quickly

220
00:07:45,410 --> 00:07:47,020
convert between the two as needed.