In an earlier video, I said that PCA can sometimes be used to speed up the running time of a learning algorithm. In this video, I'd like to explain how to actually do that, and also give some advice about how to apply PCA.

Here's how you can use PCA to speed up a learning algorithm, and this supervised-learning speed-up is actually the most common use that I personally make of PCA. Let's say you have a supervised learning problem, so a problem with inputs x and labels y, and let's say that your examples x^(i) are very high-dimensional, say 10,000-dimensional feature vectors. One example would be a computer vision problem where you have 100x100 images: 100x100 is 10,000 pixels, so if each x^(i) is a feature vector containing the 10,000 pixel intensity values, then you have 10,000-dimensional feature vectors. With very high-dimensional feature vectors like this, running a learning algorithm can be slow. If you feed 10,000-dimensional feature vectors into logistic regression, a neural network, a support vector machine, or what have you, just because that's a lot of data, that's 10,000 numbers, it can make your learning algorithm run more slowly. Fortunately, with PCA we'll be able to reduce the dimension of the data and so make our algorithms run more efficiently. Here's how you do that. We first take our labeled training set and extract just the inputs: we extract the x's and temporarily set aside the y's.
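Here is a minimal NumPy sketch of that setup, assuming the lecture's 100x100 image example; the array names and the random stand-in data are mine, not from the video:

```python
import numpy as np

# Hypothetical labeled training set: m grayscale 100x100 images with labels.
rng = np.random.default_rng(0)
m = 50
images = rng.random((m, 100, 100))      # stand-in for real pixel intensities
y = rng.integers(0, 2, size=m)          # stand-in labels

# Each 100x100 image becomes a 10,000-dimensional feature vector x^(i).
X = images.reshape(m, -1)               # shape (m, 10000)

# For the PCA step we work with X alone and temporarily set y aside.
print(X.shape)                          # (50, 10000)
```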
This now gives us an unlabeled training set x^(1) through x^(m); these are, say, the ten-thousand-dimensional examples we have. So we just extract the input vectors x^(1) through x^(m). Then we apply PCA, and this gives me a reduced-dimension representation of the data: instead of 10,000-dimensional feature vectors, I now have maybe 1,000-dimensional feature vectors, so that's a 10x savings. This gives me, if you will, a new training set. Whereas previously I might have had an example (x^(1), y^(1)), my first training input is now represented by z^(1), and so we'll have a new sort of training example, z^(1) paired with y^(1), and similarly (z^(2), y^(2)), and so on, up to (z^(m), y^(m)), because my training examples are now represented with this much lower-dimensional representation z^(1), z^(2), up to z^(m). Finally, I can take this reduced-dimension training set and feed it to a learning algorithm, maybe a neural network, maybe logistic regression, and learn a hypothesis h that takes these low-dimensional representations z as input and tries to make predictions. If I were using logistic regression, for example, I would train a hypothesis that outputs h_theta(z) = 1 / (1 + e^(-theta' z)); it takes one of these z vectors as input and tries to make a prediction. And finally, if you have a new example, maybe a new test example x, what you do is take your test example x, map it through the same mapping that was found by PCA to get the corresponding z, and that z then gets fed to the hypothesis, which makes a prediction for your input x.
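As a concrete sketch of that whole recipe, here is a short Python example using scikit-learn; the lecture itself doesn't use this library, the data is random stand-in data, and the sizes are scaled down from the lecture's 10,000-to-1,000 example so the toy runs quickly:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Stand-in training data: m examples, n-dimensional inputs, binary labels.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((600, 1000))    # m = 600, n = 1000
y_train = rng.integers(0, 2, size=600)
x_test = rng.standard_normal(1000)            # a new example x

# 1. Run PCA on the (unlabeled) training inputs to get z^(1), ..., z^(m).
pca = PCA(n_components=100)
Z_train = pca.fit_transform(X_train)          # shape (600, 100)

# 2. Train the classifier on the reduced-dimension pairs (z^(i), y^(i)).
clf = LogisticRegression(max_iter=1000)
clf.fit(Z_train, y_train)

# 3. Map the test example through the SAME PCA mapping, then predict.
z_test = pca.transform(x_test.reshape(1, -1))
print(clf.predict(z_test))
```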
One final note: what PCA does is define a mapping from x to z, and this mapping from x to z should be defined by running PCA only on the training set. In particular, this mapping that PCA is learning computes a set of parameters: the feature scaling and mean normalization parameters, and also the matrix U_reduce. All of these things, U_reduce and so on, are parameters that are learned by PCA, and we should be fitting our parameters only to our training set and not to our cross-validation or test sets, so these things, U_reduce and so on, should be obtained by running PCA only on your training set. Then, having found U_reduce, and having found the parameters for feature scaling, that is, the means for mean normalization and the scales you divide the features by to get them onto comparable ranges, having found all those parameters on the training set, you can then apply the same mapping to other examples, maybe in your cross-validation set or in your test set. Just to summarize: when you're running PCA, run it only on the training-set portion of the data, not the cross-validation-set or test-set portion of your data. That defines the mapping from x to z, and you can then apply that same mapping to your cross-validation set and your test set.
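To make the "fit on the training set, reuse everywhere else" point concrete, here is a small NumPy sketch of the mapping itself, roughly following the mean-normalization and svd recipe from the earlier PCA videos; the toy sizes and array names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 50))   # toy training inputs
X_cv = rng.standard_normal((40, 50))       # toy cross-validation inputs
k = 10                                     # target dimension

# Fit the mapping on the training set ONLY.
mu = X_train.mean(axis=0)                  # mean normalization parameters
sigma = X_train.std(axis=0)                # feature scaling parameters
Xn = (X_train - mu) / sigma

Cov = (Xn.T @ Xn) / Xn.shape[0]            # covariance matrix
U, S, _ = np.linalg.svd(Cov)               # columns of U are principal directions
U_reduce = U[:, :k]

Z_train = Xn @ U_reduce                    # z^(i) = U_reduce' * x^(i)

# Apply the SAME mu, sigma, and U_reduce to cross-validation / test examples;
# do not re-run PCA on them.
Z_cv = ((X_cv - mu) / sigma) @ U_reduce
```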
By the way, in this example I talked about reducing the data from ten thousand dimensions to one thousand dimensions, and this is actually not that unrealistic. For many problems we can reduce the dimension of the data by 5x, maybe by 10x, and still retain most of the variance, barely affecting the performance, in terms of classification accuracy, let's say, barely affecting the classification accuracy of the learning algorithm. And by working with lower-dimensional data, our learning algorithm can often run much, much faster.

To summarize, we've so far talked about the following applications of PCA. First is the compression application, where we might reduce the memory or the disk space needed to store data, and we just talked about how to use PCA to speed up a learning algorithm. In these applications, in order to choose k, we'll often do so by figuring out what percentage of the variance is retained, and for this learning algorithm speed-up application we'll often retain 99% of the variance. That would be a very typical choice for how to choose k for these compression applications. Whereas for visualization applications, since we usually only know how to plot two-dimensional or three-dimensional data, we'll usually choose k = 2 or k = 3, because we can plot only 2D and 3D data sets. So that summarizes the main applications of PCA, as well as how to choose the value of k for these different applications.
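As a reminder of the choose-k procedure from the earlier video, here is a small NumPy sketch that picks the smallest k retaining at least 99% of the variance, using the singular values of the covariance matrix; the data and names here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 80))      # toy training inputs, already scaled

Cov = (X.T @ X) / X.shape[0]            # covariance matrix
_, S, _ = np.linalg.svd(Cov)            # S holds the variance along each direction

# Smallest k such that sum(S[:k]) / sum(S) >= 0.99 (99% of variance retained).
retained = np.cumsum(S) / np.sum(S)
k = int(np.searchsorted(retained, 0.99)) + 1
print(k, retained[k - 1])
```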
I should mention that there is one frequent misuse of PCA, and you sometimes hear about others doing this, hopefully not too often. I just want to mention it so that you know not to do it. The bad use of PCA is to try to use it to prevent over-fitting. This is not a great way to use PCA, but here's the reasoning behind the method: if we have x^(i) with n features, and we compress the data and use z^(i) instead, that reduces the number of features to k, which could be much lower-dimensional. And so, with a much smaller number of features, if k is 1,000 and n is 10,000, then with only 1,000-dimensional data maybe we're less likely to over-fit than if we were using the 10,000-dimensional data. So some people think of PCA as a way to prevent over-fitting. But just to emphasize, this is a bad application of PCA, and I do not recommend doing this. It's not that the method works badly; if you use it to reduce the dimension of the data to try to prevent over-fitting, it might actually work OK. But it just is not a good way to address over-fitting, and instead, if you're worried about over-fitting, there is a much better way to address it: use regularization instead of using PCA to reduce the dimension of the data. The reason is that, if you think about how PCA works, it does not use the labels y. You are just looking at your inputs x^(i) and using them to find a lower-dimensional approximation to your data. So what PCA does is throw away some information: it throws away or reduces the dimension of your data without knowing what the values of y are. Using PCA this way is probably okay if, say, 99 percent of the variance is retained, if you're keeping most of the variance, but it might also throw away some valuable information.
And it turns out that even if you're retaining 99% of the variance, or 95% of the variance, or whatever, just using regularization will often give you at least as good a method for preventing over-fitting, and regularization will often just work better. That's because when you are applying linear regression or logistic regression or some other method with regularization, the minimization problem actually knows what the values of y are, and so it is less likely to throw away valuable information, whereas PCA doesn't make use of the labels and is more likely to throw away valuable information. So, to summarize: it is a good use of PCA if your main motivation is to speed up your learning algorithm, but using PCA to prevent over-fitting is not a good use of PCA, and using regularization instead is really what many people would recommend doing.
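As a reminder of what "use regularization instead" means concretely, here is the regularized logistic regression cost from earlier in the course; note that minimizing it uses the labels y^(i) directly, which is exactly the information PCA ignores:

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$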
Finally, one last misuse of PCA. I should say that PCA is a very useful algorithm, and I often use it for compression or visualization purposes. But what I sometimes see is people using PCA where it shouldn't be used. Here's a pretty common thing that I see: if someone is designing a machine-learning system, they may write down a plan like this. Let's design a learning system: get a training set, then run PCA, then train logistic regression, and then test on the test data. So often, at the very start of a project, someone will just write out a project plan that says let's do these four steps, with PCA inside.

Before writing down a project plan that incorporates PCA like this, one very good question to ask is: what if we were to just do the whole thing without using PCA? Often people do not consider this step before coming up with a complicated project plan and implementing PCA and so on. So specifically, what I often advise people is this: before you implement PCA, take whatever it is you want to do and first consider doing it with your original raw data x^(i), and only if that doesn't do what you want, then implement PCA and use z^(i). So before using PCA, instead of reducing the dimension of the data, I would consider ditching the PCA step and just training my learning algorithm on my original data, using my original raw inputs x^(i). Instead of putting PCA into the algorithm, just try doing whatever it is you're doing with the x^(i) first. Only if you have a reason to believe that doesn't work, so only if your learning algorithm ends up running too slowly, or only if the memory requirement or the disk-space requirement is too large and you want to compress your representation, only if you have evidence or strong reason to believe that using the x^(i) won't work, then implement PCA and consider using the compressed representation. Because what I do see is that sometimes people start off with a project plan that incorporates PCA inside, and sometimes whatever they're doing would have worked just fine even without using PCA.
So just consider that as an alternative as well, before you go and spend a lot of time getting PCA in, figuring out what k is, and so on. So, that's it for PCA. Despite these last sets of comments, PCA is an incredibly useful algorithm when you use it for the appropriate applications, and I've actually used PCA pretty often; for me, I use it mostly to speed up the running time of my learning algorithms. But I think just as common an application of PCA is to use it to compress data, to reduce the memory or disk-space requirements, or to use it to visualize data. PCA is one of the most commonly used and one of the most powerful unsupervised learning algorithms, and with what you've learned in these videos, I hope you'll be able to implement PCA and use it for all of these purposes as well.