1
00:00:00,090 --> 00:00:01,330
In this video, I'd like to

2
00:00:01,500 --> 00:00:02,560
start talking about a second

3
00:00:03,030 --> 00:00:04,620
type of unsupervised learning problem

4
00:00:04,950 --> 00:00:06,320
called dimensionality reduction.

5
00:00:07,600 --> 00:00:08,460
There are a couple of different

6
00:00:08,660 --> 00:00:09,710
reasons why one might

7
00:00:09,890 --> 00:00:11,270
want to do dimensionality reduction.

8
00:00:12,220 --> 00:00:14,420
One is data compression, and as

9
00:00:14,600 --> 00:00:15,860
we'll see later, a few videos

10
00:00:16,570 --> 00:00:18,200
later, data compression not only

11
00:00:18,490 --> 00:00:19,660
allows us to compress the

12
00:00:19,970 --> 00:00:20,940
data and have it therefore

13
00:00:21,330 --> 00:00:22,670
use up less computer memory

14
00:00:23,050 --> 00:00:24,410
or disk space, but it will

15
00:00:24,730 --> 00:00:26,960
also allow us to speed up our learning algorithms.

16
00:00:27,980 --> 00:00:29,490
But first, let's start by

17
00:00:29,620 --> 00:00:31,840
talking about what is dimensionality reduction.

18
00:00:33,490 --> 00:00:35,800
As a motivating example, let's say

19
00:00:35,990 --> 00:00:37,440
that we've collected a data set

20
00:00:37,680 --> 00:00:38,700
with many, many, many features,

21
00:00:39,290 --> 00:00:40,600
and I've plotted just two of them here.

22
00:00:41,600 --> 00:00:42,770
And let's say that unknown to

23
00:00:42,890 --> 00:00:44,000
us two of the

24
00:00:44,070 --> 00:00:45,730
features were actually the length

25
00:00:45,860 --> 00:00:47,150
of something in centimeters, and

26
00:00:47,550 --> 00:00:48,920
a different feature, x2, is

27
00:00:49,460 --> 00:00:51,150
the length of the same thing in inches.

28
00:00:52,250 --> 00:00:53,030
So, this gives us a highly

29
00:00:53,500 --> 00:00:55,910
redundant representation and maybe

30
00:00:56,170 --> 00:00:57,920
instead of having two separate features x1

31
00:00:58,430 --> 00:00:58,820
then x2,

32
00:00:59,090 --> 00:01:00,240
both of which basically measure the

33
00:01:00,370 --> 00:01:01,490
length, maybe what we

34
00:01:01,640 --> 00:01:03,340
want to do is reduce the

35
00:01:03,430 --> 00:01:06,800
data to one-dimensional and

36
00:01:06,920 --> 00:01:08,840
just have one number measuring this length.

37
00:01:09,620 --> 00:01:11,080
In case this example seems a

38
00:01:11,150 --> 00:01:13,920
bit contrived, this centimeter and

39
00:01:14,030 --> 00:01:15,850
inches example is actually not that

40
00:01:16,220 --> 00:01:17,140
unrealistic, and not that different

41
00:01:17,510 --> 00:01:18,870
from things that I see happening in industry.

42
00:01:19,970 --> 00:01:21,320
If you have hundreds

43
00:01:21,790 --> 00:01:23,160
or thousands of features, it is

44
00:01:23,240 --> 00:01:24,450
often this easy to

45
00:01:24,680 --> 00:01:26,580
lose track of exactly what features you have.

46
00:01:26,930 --> 00:01:28,190
And sometimes may have

47
00:01:28,420 --> 00:01:29,650
a few different engineering teams, maybe

48
00:01:30,110 --> 00:01:31,090
one engineering team gives you

49
00:01:31,200 --> 00:01:32,500
two hundred features, a second

50
00:01:32,770 --> 00:01:34,000
engineering team gives you another

51
00:01:34,340 --> 00:01:35,420
three hundred features, and a

52
00:01:35,550 --> 00:01:36,640
third engineering team gives you five

53
00:01:36,940 --> 00:01:38,150
hundred features so you have

54
00:01:38,290 --> 00:01:39,220
a thousand features all together,

55
00:01:39,940 --> 00:01:40,910
and it actually becomes hard to

56
00:01:41,040 --> 00:01:42,820
keep track of you know, exactly which features

57
00:01:43,200 --> 00:01:44,540
you got from which team, and

58
00:01:44,860 --> 00:01:47,310
it's actually not that want to have highly redundant features like these.

59
00:01:47,530 --> 00:01:49,440
And so if the

60
00:01:50,090 --> 00:01:51,520
length in centimeters were rounded

61
00:01:51,940 --> 00:01:53,920
off to the nearest centimeter and

62
00:01:54,060 --> 00:01:56,480
lengthened inches was rounded off to the nearest inch.

63
00:01:57,070 --> 00:01:58,050
Then, that's why these examples

64
00:01:58,720 --> 00:01:59,900
don't lie perfectly on a

65
00:02:00,100 --> 00:02:01,270
straight line, because of, you know, round-off

66
00:02:01,740 --> 00:02:03,420
error to the nearest centimeter or the nearest inch.

67
00:02:04,260 --> 00:02:05,160
And if we can reduce

68
00:02:05,610 --> 00:02:06,680
the data to one dimension

69
00:02:07,130 --> 00:02:10,320
instead of two dimensions, that reduces the redundancy.

70
00:02:11,590 --> 00:02:14,030
For a different example, again maybe when there seems fairly less contrives.

71
00:02:14,590 --> 00:02:16,560
For may years I've

72
00:02:16,920 --> 00:02:19,920
been working with autonomous helicopter pilots.

73
00:02:20,990 --> 00:02:22,610
Or I've been working with pilots that fly helicopters.

74
00:02:23,950 --> 00:02:24,040
And so.

75
00:02:25,080 --> 00:02:28,090
If you were to measure--if you

76
00:02:28,250 --> 00:02:29,100
were to, you know, do a survey

77
00:02:29,590 --> 00:02:30,500
or do a test of these different

78
00:02:30,770 --> 00:02:32,200
pilots--you might have one

79
00:02:32,440 --> 00:02:33,780
feature, x1, which is maybe

80
00:02:34,050 --> 00:02:35,600
the skill of these

81
00:02:35,820 --> 00:02:38,190
helicopter pilots, and maybe

82
00:02:38,460 --> 00:02:41,810
"x2" could be the pilot enjoyment.

83
00:02:42,700 --> 00:02:43,770
That is, you know, how

84
00:02:43,870 --> 00:02:45,050
much they enjoy flying, and maybe

85
00:02:45,280 --> 00:02:46,810
these two features will be highly correlated. And

86
00:02:48,310 --> 00:02:49,730
what you really care about might

87
00:02:49,940 --> 00:02:52,530
be this sort of

88
00:02:53,610 --> 00:02:55,120
this sort of, this direction, a different feature that really

89
00:02:55,370 --> 00:02:57,190
measures pilot aptitude.

90
00:03:00,450 --> 00:03:01,240
And I'm making up the name

91
00:03:01,590 --> 00:03:03,220
aptitude of course, but again, if

92
00:03:03,320 --> 00:03:04,780
you highly correlated features, maybe

93
00:03:04,990 --> 00:03:06,500
you really want to reduce the dimension.

94
00:03:07,570 --> 00:03:08,760
So, let me say a

95
00:03:09,040 --> 00:03:09,950
little bit more about what it

96
00:03:10,060 --> 00:03:11,390
really means to reduce the

97
00:03:11,520 --> 00:03:12,950
dimension of the data from

98
00:03:13,150 --> 00:03:14,400
2 dimensions down from 2D

99
00:03:14,600 --> 00:03:16,300
to 1 dimensional or to 1D.

100
00:03:16,840 --> 00:03:18,660
Let me color in

101
00:03:18,830 --> 00:03:19,940
these examples by using different

102
00:03:21,600 --> 00:03:21,600
colors.

103
00:03:21,730 --> 00:03:22,890
And in this case

104
00:03:23,370 --> 00:03:24,740
by reducing the dimension what

105
00:03:25,010 --> 00:03:26,320
I mean is that I would

106
00:03:26,540 --> 00:03:28,400
like to find maybe this

107
00:03:28,660 --> 00:03:30,560
line, this, you know, direction on

108
00:03:30,710 --> 00:03:31,700
which most of the data seems

109
00:03:31,910 --> 00:03:33,150
to lie and project all

110
00:03:33,380 --> 00:03:34,740
the data onto that line which

111
00:03:34,910 --> 00:03:36,230
is true, and by doing

112
00:03:36,510 --> 00:03:37,430
so, what I can do

113
00:03:37,970 --> 00:03:39,420
is just measure the

114
00:03:39,580 --> 00:03:41,480
position of each of the examples on that line.

115
00:03:42,010 --> 00:03:42,820
And what I can do is come

116
00:03:43,100 --> 00:03:45,080
up with a new feature, z1,

117
00:03:46,830 --> 00:03:48,200
and to specify the position

118
00:03:48,730 --> 00:03:49,530
on the line I need only

119
00:03:49,890 --> 00:03:50,940
one number, so it says

120
00:03:51,200 --> 00:03:51,980
z1 is a new feature

121
00:03:52,750 --> 00:03:54,630
that specifies the location of

122
00:03:54,830 --> 00:03:57,610
each of those points on this green line.

123
00:03:58,060 --> 00:03:59,300
And what this means, is

124
00:03:59,400 --> 00:04:00,680
that where as previously if i

125
00:04:00,930 --> 00:04:02,540
had an example x1, maybe

126
00:04:03,430 --> 00:04:04,740
this was my first example, x1.

127
00:04:05,040 --> 00:04:06,480
So in order to

128
00:04:06,820 --> 00:04:08,550
represent x1 originally x1.

129
00:04:09,620 --> 00:04:10,760
I needed a two dimensional number,

130
00:04:11,570 --> 00:04:12,800
or a two dimensional feature vector.

131
00:04:13,700 --> 00:04:14,920
Instead now I can represent

132
00:04:18,120 --> 00:04:20,330
z1. I could

133
00:04:20,520 --> 00:04:22,170
use just z1 to represent my first

134
00:04:23,270 --> 00:04:25,380
example, and that's going to be a real number.

135
00:04:25,940 --> 00:04:29,260
And similarly x2 you know, if x2

136
00:04:29,590 --> 00:04:31,400
is my second example there,

137
00:04:32,690 --> 00:04:35,110
then previously, whereas this required

138
00:04:35,830 --> 00:04:37,520
two numbers to represent if I

139
00:04:37,720 --> 00:04:39,930
instead compute the projection

140
00:04:40,930 --> 00:04:42,730
of that black cross

141
00:04:43,130 --> 00:04:44,250
onto the line.

142
00:04:44,700 --> 00:04:45,980
And now I only need one

143
00:04:46,210 --> 00:04:47,350
real number which is

144
00:04:47,550 --> 00:04:49,580
z2 to represent the

145
00:04:49,620 --> 00:04:51,230
location of this point

146
00:04:51,790 --> 00:04:53,070
z2 on the line.

147
00:04:54,300 --> 00:04:56,730
And so on through my M examples.

148
00:04:57,790 --> 00:04:59,560
So, just to summarize, if

149
00:04:59,810 --> 00:05:01,310
we allow ourselves to approximate

150
00:05:02,340 --> 00:05:03,800
the original data set by

151
00:05:04,000 --> 00:05:05,270
projecting all of my

152
00:05:05,590 --> 00:05:07,690
original examples onto this green

153
00:05:07,880 --> 00:05:10,260
line over here, then I

154
00:05:10,360 --> 00:05:12,090
need only one number, I

155
00:05:12,170 --> 00:05:13,700
need only real number to

156
00:05:13,820 --> 00:05:15,270
specify the position of

157
00:05:15,370 --> 00:05:16,710
a point on the line,

158
00:05:17,080 --> 00:05:18,220
and so what I can

159
00:05:18,300 --> 00:05:19,730
do is therefore use just

160
00:05:20,070 --> 00:05:21,850
one number to represent the

161
00:05:21,930 --> 00:05:23,170
location of each of

162
00:05:23,280 --> 00:05:26,520
my training examples after they've been projected onto that green line.

163
00:05:27,570 --> 00:05:29,060
So this is an approximation to

164
00:05:29,210 --> 00:05:30,300
the original training self because

165
00:05:30,570 --> 00:05:32,770
I have projected all of my training examples onto a line.

166
00:05:33,630 --> 00:05:34,790
But

167
00:05:35,130 --> 00:05:36,140
now, I need to keep around

168
00:05:36,530 --> 00:05:39,800
only one number for each of my examples.

169
00:05:41,220 --> 00:05:42,960
And so this halves the memory

170
00:05:43,340 --> 00:05:44,640
requirement, or a space requirement,

171
00:05:45,090 --> 00:05:47,760
or what have you, for how to store my data.

172
00:05:49,100 --> 00:05:50,530
And perhaps more interestingly, more

173
00:05:50,700 --> 00:05:51,940
importantly, what we'll see

174
00:05:52,200 --> 00:05:53,520
later, in the later

175
00:05:53,780 --> 00:05:55,730
video as well is that this

176
00:05:55,930 --> 00:05:56,940
will allow us to make

177
00:05:57,200 --> 00:05:59,170
our learning algorithms run more quickly as well.

178
00:05:59,480 --> 00:06:00,600
And that is actually,

179
00:06:00,920 --> 00:06:02,060
perhaps, even the more interesting

180
00:06:02,140 --> 00:06:03,800
application of this data compression

181
00:06:04,580 --> 00:06:06,220
rather than reducing the memory

182
00:06:06,680 --> 00:06:08,620
or disk space requirement for storing the data.

183
00:06:10,250 --> 00:06:11,490
On the previous slide we

184
00:06:11,580 --> 00:06:13,140
showed an example of reducing

185
00:06:13,620 --> 00:06:15,060
data from 2D to 1D.

186
00:06:15,210 --> 00:06:16,290
On this slide, I'm going

187
00:06:16,660 --> 00:06:18,010
to show another example of reducing

188
00:06:18,450 --> 00:06:21,080
data from three dimensional 3D to two dimensional 2D.

189
00:06:22,590 --> 00:06:23,360
By the way, in the more typical

190
00:06:23,750 --> 00:06:25,570
example of dimensionality reduction

191
00:06:26,390 --> 00:06:27,790
we might have a thousand dimensional

192
00:06:28,230 --> 00:06:30,330
data or 1000D data that

193
00:06:30,720 --> 00:06:31,880
we might want to reduce to

194
00:06:32,150 --> 00:06:34,080
let's say a hundred dimensional or

195
00:06:34,110 --> 00:06:35,590
100D, but because of

196
00:06:35,700 --> 00:06:37,760
the limitations of what I can plot on the slide.

197
00:06:38,460 --> 00:06:41,520
I'm going to use examples of 3D to 2D, or 2D to 1D.

198
00:06:43,160 --> 00:06:45,830
So, let's have a data set like that shown here.

199
00:06:46,050 --> 00:06:47,420
And so, I would have a set of examples

200
00:06:48,110 --> 00:06:49,430
x(i) which are points

201
00:06:49,800 --> 00:06:51,790
in r3. So, I have three dimension examples.

202
00:06:52,740 --> 00:06:53,300
I know it might be a little

203
00:06:53,690 --> 00:06:54,610
bit hard to see this on the slide,

204
00:06:54,920 --> 00:06:55,980
but I'll show a 3D point

205
00:06:56,310 --> 00:06:58,190
cloud in a little bit.

206
00:06:59,050 --> 00:07:00,280
And it might be hard to see

207
00:07:00,380 --> 00:07:01,970
here, but all of this

208
00:07:02,230 --> 00:07:04,020
data maybe lies roughly on

209
00:07:04,130 --> 00:07:05,700
the plane, like so.

210
00:07:07,110 --> 00:07:08,130
And so what we can do

211
00:07:08,380 --> 00:07:09,970
with dimensionality reduction, is take

212
00:07:10,210 --> 00:07:11,960
all of this data and

213
00:07:12,110 --> 00:07:13,800
project the data down onto

214
00:07:14,630 --> 00:07:15,350
a two dimensional plane.

215
00:07:15,700 --> 00:07:16,670
So, here what I've done is,

216
00:07:16,730 --> 00:07:18,060
I've taken all the data and I've

217
00:07:18,300 --> 00:07:19,250
projected all of the data,

218
00:07:19,770 --> 00:07:21,390
so that it all lies on the plane.

219
00:07:22,590 --> 00:07:23,910
Now, finally, in order to

220
00:07:24,040 --> 00:07:25,580
specify the location of a

221
00:07:25,750 --> 00:07:27,810
point within a plane, we need two numbers, right?

222
00:07:28,000 --> 00:07:29,150
We need to, maybe, specify the

223
00:07:29,290 --> 00:07:30,660
location of a point along

224
00:07:30,970 --> 00:07:32,370
this axis, and then also

225
00:07:32,650 --> 00:07:35,090
specify it's location along that axis.

226
00:07:35,730 --> 00:07:37,470
So, we need two numbers, maybe called

227
00:07:37,850 --> 00:07:39,900
z1 and z2 to specify

228
00:07:40,600 --> 00:07:42,450
the location of a point within a plane.

229
00:07:43,290 --> 00:07:44,730
And so, what that means,

230
00:07:44,890 --> 00:07:45,910
is that we can now represent

231
00:07:46,690 --> 00:07:48,310
each example, each training example,

232
00:07:48,740 --> 00:07:50,310
using two numbers that

233
00:07:50,630 --> 00:07:52,950
I've drawn here, z1, and z2.

234
00:07:53,990 --> 00:07:55,890
So, our data can be represented

235
00:07:56,610 --> 00:07:59,130
using vector z which are in r2.

236
00:08:00,580 --> 00:08:02,110
And these subscript, z subscript

237
00:08:02,350 --> 00:08:03,990
1, z subscript 2, what

238
00:08:04,560 --> 00:08:05,440
I just mean by that is that my

239
00:08:05,500 --> 00:08:07,520
vectors here, z, you know, are two

240
00:08:07,750 --> 00:08:09,680
dimensional vectors, z1, z2.

241
00:08:10,600 --> 00:08:11,580
And so if I have some

242
00:08:11,790 --> 00:08:13,690
particular examples, z(i), or

243
00:08:13,760 --> 00:08:15,700
that's the two dimensional vector, z(i)1,

244
00:08:16,350 --> 00:08:19,110
z(i)2.

245
00:08:20,580 --> 00:08:21,990
And on the previous slide when

246
00:08:22,230 --> 00:08:23,750
I was reducing data to one

247
00:08:23,950 --> 00:08:25,270
dimensional data then I

248
00:08:25,360 --> 00:08:27,500
had only z1, right?

249
00:08:27,760 --> 00:08:28,610
And that is what a z1 subscript 1

250
00:08:28,700 --> 00:08:29,830
on the previous slide was,

251
00:08:30,550 --> 00:08:31,720
but here I have two dimensional data,

252
00:08:32,100 --> 00:08:32,730
so I have z1 and z2 as

253
00:08:33,040 --> 00:08:34,940
the two components of the data.

254
00:08:36,690 --> 00:08:37,830
Now, let me just make sure

255
00:08:38,020 --> 00:08:39,200
that these figures make sense. So

256
00:08:39,290 --> 00:08:40,790
let me just reshow these exact

257
00:08:41,600 --> 00:08:45,080
three figures again but with 3D plots.

258
00:08:45,540 --> 00:08:46,570
So the process we went through was that

259
00:08:47,040 --> 00:08:48,110
shown in the lab is the optimal

260
00:08:48,480 --> 00:08:49,520
data set, in the middle the

261
00:08:49,590 --> 00:08:50,540
data set projects on the 2D,

262
00:08:51,040 --> 00:08:52,140
and on the right the 2D

263
00:08:52,820 --> 00:08:54,900
data sets with z1 and z2 as the axis.

264
00:08:55,780 --> 00:08:56,610
Let's look at them a little

265
00:08:56,820 --> 00:08:57,960
bit further. Here's my original

266
00:08:58,270 --> 00:08:59,210
data set, shown on the

267
00:08:59,410 --> 00:09:00,680
left, and so I had started

268
00:09:01,380 --> 00:09:02,420
off with a 3D point

269
00:09:02,660 --> 00:09:04,000
cloud like so, where the

270
00:09:04,360 --> 00:09:05,390
axis are labeled x1,

271
00:09:05,570 --> 00:09:07,410
x2, x3, and so there's a 3D

272
00:09:07,960 --> 00:09:08,970
point but most of the data,

273
00:09:09,500 --> 00:09:10,750
maybe roughly lies on some,

274
00:09:10,850 --> 00:09:12,800
you know, not too far from some 2D plain.

275
00:09:13,930 --> 00:09:14,950
So, what we can

276
00:09:15,040 --> 00:09:17,460
do is take this data and here's my middle figure.

277
00:09:17,800 --> 00:09:19,110
I'm going to project it onto 2D.

278
00:09:19,370 --> 00:09:20,790
So, I've projected this data so

279
00:09:20,900 --> 00:09:23,220
that all of it now lies on this 2D surface.

280
00:09:23,750 --> 00:09:25,330
As you can see all the data

281
00:09:26,190 --> 00:09:27,470
lies on a plane, 'cause we've

282
00:09:27,700 --> 00:09:30,520
projected everything onto a

283
00:09:30,570 --> 00:09:31,490
plane, and so what this means is that

284
00:09:31,800 --> 00:09:33,190
now I need only two numbers,

285
00:09:33,820 --> 00:09:35,090
z1 and z2, to represent

286
00:09:35,620 --> 00:09:37,470
the location of point on the plane.

287
00:09:40,530 --> 00:09:41,480
And so that's the process that

288
00:09:41,810 --> 00:09:42,990
we can go through to reduce our

289
00:09:43,500 --> 00:09:45,180
data from three dimensional to

290
00:09:45,340 --> 00:09:48,520
two dimensional. So that's

291
00:09:49,230 --> 00:09:50,850
dimensionality reduction and how

292
00:09:51,070 --> 00:09:52,740
we can use it to compress our data.

293
00:09:54,010 --> 00:09:55,400
And as we'll see

294
00:09:55,580 --> 00:09:56,970
later this will allow us to

295
00:09:57,110 --> 00:09:58,020
make some of our learning algorithms

296
00:09:58,580 --> 00:09:59,670
run much later as well, but

297
00:09:59,740 --> 00:10:01,210
we'll get to that only in a later video.