I've seen over and over that one of the most reliable ways to get a high-performance machine learning system is to take a low-bias learning algorithm and train it on a massive training set. But where do you get so much training data from? It turns out that in machine learning there's a fascinating idea called artificial data synthesis. It doesn't apply to every single problem, and applying it to a specific problem often takes some thought, innovation, and insight. But if this idea does apply to your machine learning problem, it can sometimes be an easy way to get a huge training set to give to your learning algorithm. Artificial data synthesis comprises two main variations: the first is creating new data essentially from scratch, and the second is taking a small labeled training set we already have and somehow amplifying it into a larger training set. In this video we'll go over both of those ideas.

To talk about artificial data synthesis, let's use the character recognition portion of the photo OCR pipeline, where we want to take an input image and recognize what character it is. If we go out and collect a large labeled data set, here's what it would look like. For this particular example I've chosen a square aspect ratio, so we're taking square image patches, and the goal is to take an image patch and recognize the character in the middle of that patch. For the sake of simplicity, I'm going to treat these images as grayscale images rather than color images; it turns out that using color doesn't seem to help much for this particular problem. So given this image patch, we'd like to recognize that it's a 'T'.
Given this image patch, we'd like to recognize that it's an 'S'. Given that image patch, we'd like to recognize it as an 'I', and so on. So all of these are examples of raw images; how can we come up with a much larger training set? Modern computers often have a huge font library, and if you use word processing software, depending on which word processor you use, you might have all of these fonts, and many more, already stored inside. And in fact, if you go to different websites, there are huge free font libraries on the internet from which you can download many different types of fonts, hundreds or perhaps thousands of them. So if you want more training examples, one thing you can do is take characters from different fonts and paste those characters against different random backgrounds. For example, you might take this 'C' and paste it against a random background. If you do that, you now have a training example of an image of the character 'C'. So after some amount of work, and it is a little bit of work to synthesize realistic-looking data, you can get a synthetic training set like that. Every image shown on the right was actually a synthesized image, where you take a font, maybe a random font downloaded off the web, and paste an image of one character, or a few characters, from that font against some other random background image, and then apply maybe a little blurring, plus affine distortions, meaning small shearing, scaling, and rotation operations. If you do that, you get a synthetic training set like the one shown here; a minimal sketch of this process appears below.
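As a concrete illustration, here is a minimal sketch of this from-scratch synthesis in Python, assuming Pillow is installed and that you have a folder of .ttf fonts and a folder of background photos to sample from; the paths, patch size, and distortion parameters are made-up placeholders rather than anything prescribed in the lecture.

```python
import glob
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont

PATCH = 32  # side length of the square grayscale patches (an assumption)

def synthesize_character(char, font_paths, background_paths):
    """Render one character in a random font, paste it onto a random
    background crop, and apply a small rotation and blur."""
    # Take a random grayscale crop from a real background image.
    bg = Image.open(random.choice(background_paths)).convert("L")
    x = random.randint(0, bg.width - PATCH)
    y = random.randint(0, bg.height - PATCH)
    patch = bg.crop((x, y, x + PATCH, y + PATCH))

    # Draw the character, centered, on a blank layer.
    font = ImageFont.truetype(random.choice(font_paths), size=24)
    layer = Image.new("L", (PATCH, PATCH), 0)
    ImageDraw.Draw(layer).text((PATCH // 2, PATCH // 2), char,
                               fill=255, font=font, anchor="mm")

    # A small random rotation stands in for the shear/scale warps.
    layer = layer.rotate(random.uniform(-10, 10), resample=Image.BILINEAR)

    # Composite the glyph over the background, then blur slightly.
    out = Image.composite(layer, patch, layer)
    return out.filter(ImageFilter.GaussianBlur(radius=0.5))

fonts = glob.glob("fonts/*.ttf")              # hypothetical font library
backgrounds = glob.glob("backgrounds/*.jpg")  # hypothetical backgrounds
dataset = [(synthesize_character("C", fonts, backgrounds), "C")
           for _ in range(1000)]
```

In practice you'd also vary the font size and position, and check by eye that the synthesized patches resemble the real ones.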
Now, this does require work; it takes thought and effort to make the synthetic data look realistic, and if you do a sloppy job of creating it, then it actually won't work well. But if you look at this synthetic data, it looks remarkably similar to the real data. And so by using this sort of synthetic data, you have an essentially unlimited supply of labeled training examples for the character recognition problem. So this is an example of artificial data synthesis where you're basically creating new data from scratch: you're generating brand-new images from scratch.

The other main approach to artificial data synthesis is where you take examples you currently have, a real example from a real image, say, and create additional data from it, so as to amplify your training set. So here is an image of the character 'A' taken from a real image, not a synthesized one, and I have overlaid it with grid lines just for the purpose of illustration; the actual image doesn't have these grid lines. What you can do is take this image and introduce artificial warpings, or artificial distortions, into it, so as to take this one image of an 'A' and turn it into sixteen new examples. In this way you can take a small labeled training set and amplify it to suddenly get a lot more examples.
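To make this second variation concrete, here is one hedged sketch of the amplification step. The warpings shown in the lecture look like smooth elastic distortions; as a simpler stand-in, this sketch applies small random rotations and shears with scipy, so treat the distortion family and the parameter ranges as assumptions.

```python
import numpy as np
from scipy.ndimage import affine_transform

def amplify(image, n_copies=16, max_rot=10.0, max_shear=0.1):
    """Turn one labeled grayscale image (a 2-D float array) into
    n_copies distorted variants via random rotations and shears."""
    h, w = image.shape
    center = np.array([h / 2.0, w / 2.0])
    variants = []
    for _ in range(n_copies):
        theta = np.deg2rad(np.random.uniform(-max_rot, max_rot))
        shear = np.random.uniform(-max_shear, max_shear)
        # Rotation combined with a small horizontal shear.
        A = np.array([[np.cos(theta), -np.sin(theta) + shear],
                      [np.sin(theta),  np.cos(theta)]])
        # Choose the offset so the warp is applied about the center.
        offset = center - A @ center
        variants.append(affine_transform(image, A, offset=offset, order=1))
    return variants

# One real patch of the letter 'A' becomes 16 new labeled examples:
# new_examples = [(v, 'A') for v in amplify(real_patch)]
```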
Again, in order to do this for your application, it does take thought and insight to figure out what reasonable sets of distortions are, which ways of amplifying and multiplying your training set make sense. For the specific example of character recognition, introducing these warpings seems like a natural choice, but for a different machine learning application, different distortions might make more sense. Let me show one example from the totally different domain of speech recognition. In speech recognition, say you have audio clips and you want to learn to recognize the words spoken in each clip. Let's see what one labeled training example looks like: say you have one labeled training example of someone saying a few specific words. Let's play that audio clip: "Zero, one, two, three, four, five." All right, so that's someone counting from zero to five, and you want to apply a learning algorithm to recognize the words said in that clip. So, how can we amplify the data set? Well, one thing we can do is introduce additional audio distortions into the data set. Here I'm going to add background sounds to simulate a bad cell phone connection; when you hear beeping sounds, that's actually part of the audio track, there's nothing wrong with your speakers. I'm going to play this now: "Zero, one, two, three, four, five." Right, so you can still listen to that audio clip and recognize the words, and that seems like another useful training example to have. Here's another one, with a noisy background.
"Zero, one, two, three, four, five," with the sound of cars driving past and people walking in the background. And here's another one. So, taking the original clean audio clip of someone saying "zero, one, two, three, four, five," we can automatically synthesize these additional training examples, and thus amplify one training example into maybe four different training examples. Let me play this final example as well: "Zero, one, two, three, four, five." So by taking just one labeled example, where we had to go through the effort of collecting just one labeled recording of the words zero through five, and by synthesizing additional distortions, introducing different background sounds, we've multiplied this one example into many more examples without much work, just by automatically adding these different background sounds to the clean audio. A minimal sketch of this kind of audio mixing is shown below.
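Here is a minimal sketch, in the same spirit, of overlaying background noise on a clean clip. The file names are hypothetical, the clips are assumed to be mono and at the same sample rate, and the target signal-to-noise ratio is an illustrative knob; none of this is prescribed by the lecture.

```python
import numpy as np
from scipy.io import wavfile

def mix_with_background(clean, noise, snr_db=10.0):
    """Overlay background noise on a clean mono clip at a target
    signal-to-noise ratio, given in decibels."""
    # Loop or trim the noise so it matches the clip length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)].astype(np.float64)
    clean = clean.astype(np.float64)

    # Scale the noise so clean_power / noise_power hits the target SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rate, clean = wavfile.read("counting_0_to_5.wav")  # hypothetical files
_, crowd = wavfile.read("crowd_noise.wav")
_, phone = wavfile.read("bad_cellphone.wav")

# One labeled clip becomes several: same transcript, new backgrounds.
augmented = [mix_with_background(clean, n) for n in (crowd, phone)]
```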
Just one word of warning about synthesizing data by introducing distortions: if you try this yourself, the distortions you introduce should be representative of the sources of noise, or distortion, that you might see in the test set. For the character recognition example, the warpings being introduced are actually quite reasonable, because an image of an 'A' that looks like that could be an image we'd actually see in a test set, and that image on the upper right could be an image we could imagine seeing. And for audio, we do want to recognize speech even over a bad cell phone connection and against different types of background noise, so for the audio we are again synthesizing examples that are representative of the sorts of examples we want to classify, the ones we want to recognize correctly.

In contrast, it usually does not help to add purely meaningless noise to your data. I'm not sure you can see this, but what we've done here is take the image and, for each pixel in each of these four images, just add some random Gaussian noise to that pixel's brightness. That's totally meaningless noise, right? So unless you're expecting to see this sort of per-pixel noise in your test set, this kind of purely random, meaningless noise is less likely to be useful. Now, the process of artificial data synthesis is a little bit of an art as well, and sometimes you just have to try it and see if it works. But if you're trying to decide what sorts of distortions to add, do think about what meaningful distortions you might add that will cause you to generate additional training examples that are at least somewhat representative of the sorts of images you expect to see in your test set.
Finally, to wrap up this video, I just want to say a couple more words about this idea of getting lots of data via artificial data synthesis. As always, before expending a lot of effort figuring out how to create artificial training examples, it's often good practice to make sure that you really have a low-bias classifier, so that having a lot more training data will actually help. The standard way to do this is to plot learning curves and confirm that you have a low-bias, high-variance classifier. Or, if you don't have a low-bias classifier, one thing worth trying is to keep increasing the number of features your classifier has, or the number of hidden units in your neural network, until you actually do have a low-bias classifier; only then should you put the effort into creating a large artificial training set. What you really want to avoid is spending a whole week, or a few months, figuring out how to get a great artificially synthesized data set, only to realize afterward that your learning algorithm's performance doesn't improve much even when given a huge training set. So my usual advice is to first test that you really can make use of a large training set before spending a lot of effort going out to get one; one way to run that sanity check is sketched below.
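For instance, here is one way to run the learning-curve sanity check using scikit-learn's learning_curve helper. The course's own exercises use Octave, so this Python version is only an illustration, with stand-in data in place of your real training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in data; replace with your current labeled training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

train_err = 1.0 - train_scores.mean(axis=1)
val_err = 1.0 - val_scores.mean(axis=1)

# Low bias, high variance: training error stays low while validation
# error is higher but still falling as m grows, so more data should
# help. If both curves have flattened out close together at a high
# error (high bias), add features or hidden units before investing
# in data synthesis.
for m, tr, va in zip(sizes, train_err, val_err):
    print(f"m={m:4d}  train error={tr:.3f}  validation error={va:.3f}")
```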
Second, when I'm working on machine learning problems, one question I often ask the team I'm working with, and often ask my students, is: how much work would it be to get ten times as much data as we currently have? When I face a new machine learning application, very often I will sit down with a team and ask exactly this question. I've asked it over and over, and I've been very surprised by how often the answer is that it's really not that hard, maybe a few days of work at most, to get ten times as much data as we currently have; and very often, if you can get ten times as much data, there will be a way to make your algorithm do much better. So if you ever join a product team working on some machine learning application, this is a very good question to ask yourself, and to ask the team. Don't be too surprised if, after a few minutes of brainstorming, your team comes up with a way to get literally ten times as much data, in which case I think you'd be a hero to that team, because with ten times as much data I think you'll really get much better performance, just from learning from so much data.

So there are several ways to get more data. The first is artificial data synthesis, which comprises both the idea of generating data from scratch, using random fonts and so on, and the idea of taking existing examples and introducing distortions that amplify and enlarge the training set. A second way to get a lot of data is to collect the data and label it yourself.
What I mean by this is that I will often sit down and do a calculation to figure out how much time, how many hours or how many days, it would take for me, or for someone else, to just sit down and collect and label ten times as much data as we currently have. For example, suppose that for our machine learning application we currently have 1,000 labeled examples, so m = 1,000, and ten times as much data would mean m = 10,000. Then we sit down and ask: how long does it really take me to collect and label one example? Maybe it takes about ten seconds to label one new example, so I do the calculation: how long will it take to manually label 10,000 examples at ten seconds per example? When you do this calculation, you'd often be surprised; I've seen many teams be very surprised at how little work it can be, sometimes just a small number of days, to get a lot more data, and that can be a way to give your learning algorithm a huge boost in performance. And sometimes, when you've managed to do this, you'll be a hero on whatever product development team you're working with, because this can be a great way to get much better performance. The short script below works through this arithmetic.
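Here is that back-of-the-envelope calculation worked out, using the numbers from the example above (ten seconds per label, a target of 10,000 examples):

```python
seconds_per_example = 10   # roughly ten seconds to label one example
target_m = 10_000          # ten times the current m = 1,000

total_seconds = seconds_per_example * target_m
total_hours = total_seconds / 3600.0
print(f"{total_seconds} seconds  ->  {total_hours:.1f} hours")
# 100000 seconds -> 27.8 hours: roughly three to four working days
# of labeling to multiply the training set tenfold.
```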
Third and finally, one sometimes good way to get a lot of data is to use what's now called crowdsourcing. Today there are a few websites and services that allow you to hire people on the web to label large training sets for you, fairly inexpensively. This idea of crowdsourcing, or crowd-sourced data labeling, has an entire academic literature of its own and comes with its own complications, pertaining to labeler reliability and so on; there may be hundreds of thousands of labelers around the world working fairly inexpensively to help label data for you. I just wanted to mention that this alternative exists as well, and Amazon Mechanical Turk is probably the most popular crowdsourcing option right now. It can often be quite a bit of work to get this to work well, especially if you want very high-quality labels, but it is sometimes worth considering if you want to hire many people, fairly inexpensively, on the web to label large amounts of data for you.

So in this video we talked about the idea of artificial data synthesis: either creating new data from scratch, using random fonts as an example, or amplifying an existing training set by taking existing labeled examples and introducing distortions to create extra labeled examples. And finally, one thing I hope you remember from this video: if you are facing a machine learning problem, it is often worth doing two things. One is the sanity check, with learning curves, that having more data would help.
And second, assuming that's the case, sit down and ask yourself seriously: what would it take to get ten times as much labeled data as you currently have? Not always, but sometimes, you may be surprised by how easy that turns out to be, maybe a few days or a few weeks of work, and that can be a great way to give your learning algorithm a huge boost in performance.