1
00:00:00,370 --> 00:00:01,590
In the previous video, we talked

2
00:00:01,890 --> 00:00:04,570
about the photo OCR pipeline and how that worked.

3
00:00:05,480 --> 00:00:06,370
In which we would take an image

4
00:00:07,050 --> 00:00:08,070
and pass the Through a

5
00:00:08,130 --> 00:00:10,010
sequence of machine learning

6
00:00:10,280 --> 00:00:11,680
components in order to

7
00:00:11,890 --> 00:00:13,820
try to read the text that appears in an image.

8
00:00:14,590 --> 00:00:15,820
In this video I like to.

9
00:00:16,210 --> 00:00:17,360
A little bit more about how the

10
00:00:17,780 --> 00:00:20,310
individual components of the pipeline works.

11
00:00:21,270 --> 00:00:24,070
In particular most of this video will center around the discussion.

12
00:00:24,680 --> 00:00:25,950
of whats called a sliding windows.

13
00:00:26,750 --> 00:00:31,570
The first stage

14
00:00:32,000 --> 00:00:33,390
of the filter was the

15
00:00:33,730 --> 00:00:35,090
Text detection where we look

16
00:00:35,330 --> 00:00:36,640
at an image like this and try

17
00:00:37,020 --> 00:00:39,320
to find the regions of text that appear in this image.

18
00:00:39,850 --> 00:00:42,490
Text detection is an unusual problem in computer vision.

19
00:00:43,220 --> 00:00:44,820
Because depending on the length

20
00:00:45,140 --> 00:00:46,150
of the text you're trying to

21
00:00:46,290 --> 00:00:47,870
find, these rectangles that you're

22
00:00:47,970 --> 00:00:49,600
trying to find can have different aspect.

23
00:00:51,100 --> 00:00:52,060
So in order to talk

24
00:00:52,220 --> 00:00:53,550
about detecting things in images

25
00:00:54,300 --> 00:00:55,860
let's start with a simpler example

26
00:00:56,550 --> 00:01:00,080
of pedestrian detection and we'll then later go back to.

27
00:01:00,460 --> 00:01:02,300
Ideas that were developed

28
00:01:02,570 --> 00:01:04,840
in pedestrian detection and apply them to text detection.

29
00:01:06,280 --> 00:01:08,010
So in pedestrian detection you want

30
00:01:08,360 --> 00:01:09,440
to take an image that looks

31
00:01:09,600 --> 00:01:11,010
like this and the whole

32
00:01:11,160 --> 00:01:12,920
idea is the individual pedestrians that appear in the image.

33
00:01:13,260 --> 00:01:14,440
So there's one pedestrian that we

34
00:01:14,520 --> 00:01:15,550
found, there's a second

35
00:01:15,780 --> 00:01:17,920
one, a third one a fourth one, a fifth one.

36
00:01:18,290 --> 00:01:19,390
And a one.

37
00:01:19,560 --> 00:01:20,990
This problem is maybe slightly

38
00:01:21,320 --> 00:01:22,770
simpler than text detection just

39
00:01:23,100 --> 00:01:24,200
for the reason that the aspect

40
00:01:24,560 --> 00:01:27,490
ratio of most pedestrians are pretty similar.

41
00:01:28,170 --> 00:01:29,280
Just using a fixed aspect

42
00:01:29,630 --> 00:01:31,960
ratio for these rectangles that we're trying to find.

43
00:01:32,420 --> 00:01:33,610
So by aspect ratio I mean

44
00:01:33,920 --> 00:01:36,420
the ratio between the height and the width of these rectangles.

45
00:01:37,820 --> 00:01:38,190
They're all the same.

46
00:01:38,650 --> 00:01:40,120
for different pedestrians but for

47
00:01:40,490 --> 00:01:42,650
text detection the height

48
00:01:43,030 --> 00:01:44,560
and width ratio is different

49
00:01:44,960 --> 00:01:45,830
for different lines of text

50
00:01:46,460 --> 00:01:47,940
Although for pedestrian detection, the

51
00:01:48,020 --> 00:01:49,250
pedestrians can be different distances

52
00:01:49,810 --> 00:01:51,250
away from the camera and

53
00:01:51,390 --> 00:01:52,730
so the height of these rectangles

54
00:01:53,380 --> 00:01:55,600
can be different depending on how far away they are.

55
00:01:55,990 --> 00:01:57,090
but the aspect ratio is the same.

56
00:01:57,720 --> 00:01:58,880
In order to build a pedestrian

57
00:01:59,440 --> 00:02:02,460
detection system here's how you can go about it.

58
00:02:02,520 --> 00:02:03,650
Let's say that we decide to

59
00:02:03,970 --> 00:02:06,100
standardize on this aspect

60
00:02:06,690 --> 00:02:08,010
ratio of 82 by 36

61
00:02:08,180 --> 00:02:10,040
and we could

62
00:02:10,330 --> 00:02:11,510
have chosen some rounded number

63
00:02:12,020 --> 00:02:14,000
like 80 by 40 or something, but 82 by 36 seems alright.

64
00:02:16,110 --> 00:02:17,280
What we would do is then go

65
00:02:17,650 --> 00:02:20,420
out and collect large training sets of positive and negative examples.

66
00:02:21,240 --> 00:02:22,790
Here are examples of 82

67
00:02:22,900 --> 00:02:24,230
X 36 image patches that do

68
00:02:24,360 --> 00:02:26,230
contain pedestrians and here are

69
00:02:26,550 --> 00:02:28,360
examples of images that do not.

70
00:02:29,470 --> 00:02:30,710
On this slide I show 12

71
00:02:31,050 --> 00:02:33,170
positive examples of y1

72
00:02:33,730 --> 00:02:34,990
and 12 examples of y0.

73
00:02:36,410 --> 00:02:37,790
In a more typical pedestrian detection

74
00:02:38,180 --> 00:02:39,200
application, we may have

75
00:02:39,500 --> 00:02:40,880
anywhere from a 1,000 training

76
00:02:41,230 --> 00:02:42,210
examples up to maybe

77
00:02:42,300 --> 00:02:44,410
10,000 training examples, or

78
00:02:44,460 --> 00:02:45,360
even more if you can

79
00:02:45,510 --> 00:02:47,180
get even larger training sets.

80
00:02:47,460 --> 00:02:48,590
And what you can do, is then train

81
00:02:48,910 --> 00:02:50,160
in your network or some

82
00:02:50,510 --> 00:02:52,420
other learning algorithm to

83
00:02:52,610 --> 00:02:54,570
take this input, an MS

84
00:02:54,970 --> 00:02:56,710
patch of dimension 82 by

85
00:02:56,850 --> 00:02:59,180
36, and to classify  'y'

86
00:02:59,710 --> 00:03:01,070
and to classify that image patch

87
00:03:01,510 --> 00:03:03,850
as either containing a pedestrian or not.

88
00:03:05,250 --> 00:03:06,250
So this gives you a way

89
00:03:06,470 --> 00:03:08,050
of applying supervised learning in

90
00:03:08,210 --> 00:03:09,290
order to take an image

91
00:03:09,530 --> 00:03:12,420
patch can determine whether or not a pedestrian appears in that image capture.

92
00:03:14,310 --> 00:03:15,190
Now, lets say we get

93
00:03:15,400 --> 00:03:16,520
a new image, a test set

94
00:03:16,850 --> 00:03:17,920
image like this and we

95
00:03:18,030 --> 00:03:20,240
want to try to find a pedestrian's picture image.

96
00:03:21,520 --> 00:03:22,340
What we would do is start

97
00:03:22,670 --> 00:03:25,140
by taking a rectangular patch of this image.

98
00:03:25,580 --> 00:03:26,800
Like that shown up here, so

99
00:03:26,900 --> 00:03:27,930
that's maybe a 82 X

100
00:03:28,010 --> 00:03:29,440
36 patch of this image,

101
00:03:30,270 --> 00:03:31,530
and run that image patch through

102
00:03:31,830 --> 00:03:33,660
our classifier to determine whether

103
00:03:33,840 --> 00:03:34,900
or not there is a

104
00:03:34,980 --> 00:03:36,310
pedestrian in that image patch,

105
00:03:36,620 --> 00:03:38,100
and hopefully our classifier will

106
00:03:38,260 --> 00:03:40,600
return y equals 0 for that patch, since there is no pedestrian.

107
00:03:42,020 --> 00:03:42,900
Next, we then take that green

108
00:03:43,140 --> 00:03:44,380
rectangle and we slide it

109
00:03:44,490 --> 00:03:45,680
over a bit and then

110
00:03:45,940 --> 00:03:47,180
run that new image patch

111
00:03:47,560 --> 00:03:49,700
through our classifier to decide if there's a pedestrian there.

112
00:03:50,760 --> 00:03:51,740
And having done that, we then

113
00:03:51,920 --> 00:03:53,070
slide the window further to the

114
00:03:53,160 --> 00:03:54,160
right and run that patch

115
00:03:54,420 --> 00:03:56,690
through the classifier again.

116
00:03:56,970 --> 00:03:57,850
The amount by which you shift

117
00:03:58,280 --> 00:03:59,770
the rectangle over each time

118
00:04:00,260 --> 00:04:01,720
is a parameter, that's sometimes

119
00:04:02,190 --> 00:04:04,000
called the step size of the

120
00:04:04,070 --> 00:04:06,020
parameter, sometimes also called

121
00:04:06,380 --> 00:04:08,970
the slide parameter, and if

122
00:04:09,120 --> 00:04:11,050
you step this one pixel at a time.

123
00:04:11,210 --> 00:04:12,020
So you can use the step size

124
00:04:12,360 --> 00:04:14,020
or stride of 1, that usually

125
00:04:14,340 --> 00:04:15,560
performs best, that is

126
00:04:15,700 --> 00:04:16,960
more cost effective, and

127
00:04:17,430 --> 00:04:18,940
so using a step size of

128
00:04:19,090 --> 00:04:20,010
maybe 4 pixels at a

129
00:04:20,210 --> 00:04:20,970
time, or eight pixels at a

130
00:04:21,250 --> 00:04:22,350
time or some large number of

131
00:04:22,550 --> 00:04:23,600
pixels might be more common,

132
00:04:24,010 --> 00:04:25,320
since you're then moving the

133
00:04:25,430 --> 00:04:26,570
rectangle a little bit

134
00:04:26,700 --> 00:04:28,570
more each time.

135
00:04:28,870 --> 00:04:30,090
So, using this process, you continue

136
00:04:30,870 --> 00:04:32,310
stepping the rectangle over to

137
00:04:32,340 --> 00:04:33,160
the right a bit at a

138
00:04:33,370 --> 00:04:34,450
time and running each of

139
00:04:34,520 --> 00:04:35,780
these patches through a classifier,

140
00:04:36,620 --> 00:04:38,220
until eventually, as you

141
00:04:38,900 --> 00:04:42,080
slide this window over the

142
00:04:42,150 --> 00:04:43,340
different locations in the image,

143
00:04:43,550 --> 00:04:44,680
first starting with the first

144
00:04:44,850 --> 00:04:46,080
row and then we

145
00:04:46,160 --> 00:04:47,580
go further rows in

146
00:04:47,710 --> 00:04:49,100
the image, you would

147
00:04:49,290 --> 00:04:50,490
then run all of

148
00:04:50,550 --> 00:04:52,070
these different image patches at

149
00:04:52,240 --> 00:04:53,330
some step size or some

150
00:04:53,430 --> 00:04:54,990
stride through your classifier.

151
00:04:56,990 --> 00:04:57,870
Now, that was a pretty

152
00:04:57,970 --> 00:04:59,870
small rectangle, that would only

153
00:05:00,310 --> 00:05:02,310
detect pedestrians of one specific size.

154
00:05:02,780 --> 00:05:04,210
What we do next is

155
00:05:04,470 --> 00:05:05,990
start to look at larger image patches.

156
00:05:06,730 --> 00:05:08,270
So now let's take larger images

157
00:05:08,610 --> 00:05:09,700
patches, like those shown here

158
00:05:10,310 --> 00:05:11,960
and run those through the crossfire as well.

159
00:05:13,540 --> 00:05:14,320
And by the way when I say

160
00:05:14,600 --> 00:05:15,830
take a larger image patch, what

161
00:05:16,080 --> 00:05:17,780
I really mean is when you

162
00:05:17,860 --> 00:05:18,850
take an image patch like this,

163
00:05:19,490 --> 00:05:20,720
what you're really doing is taking

164
00:05:20,880 --> 00:05:22,110
that image patch, and resizing

165
00:05:22,800 --> 00:05:24,750
it down to 82 X 36, say.

166
00:05:25,000 --> 00:05:26,260
So you take this larger

167
00:05:26,550 --> 00:05:28,180
patch and re-size it to

168
00:05:28,300 --> 00:05:29,800
be smaller image and then

169
00:05:29,970 --> 00:05:31,260
it would be the smaller size image

170
00:05:31,600 --> 00:05:32,620
that is what you

171
00:05:32,990 --> 00:05:35,340
would pass through your classifier to try and decide if there is a pedestrian in that patch.

172
00:05:37,230 --> 00:05:38,310
And finally you can do

173
00:05:38,470 --> 00:05:39,530
this at an even larger

174
00:05:39,930 --> 00:05:41,870
scales and run

175
00:05:42,080 --> 00:05:43,830
that side of Windows to

176
00:05:43,980 --> 00:05:45,920
the end And after

177
00:05:45,980 --> 00:05:47,480
this whole process hopefully your algorithm

178
00:05:48,040 --> 00:05:49,670
will detect whether theres pedestrian

179
00:05:50,140 --> 00:05:52,070
appears in the image, so

180
00:05:52,470 --> 00:05:53,850
thats how you train a

181
00:05:54,290 --> 00:05:55,630
the classifier, and then

182
00:05:55,890 --> 00:05:57,360
use a sliding windows classifier,

183
00:05:57,920 --> 00:05:59,820
or use a sliding windows detector in

184
00:05:59,970 --> 00:06:01,740
order to find pedestrians in the image.

185
00:06:03,070 --> 00:06:04,050
Let's have a turn to the

186
00:06:04,150 --> 00:06:05,910
text detection example and talk

187
00:06:06,100 --> 00:06:07,490
about that stage in our

188
00:06:07,790 --> 00:06:09,330
photo OCR pipeline, where our

189
00:06:09,570 --> 00:06:11,340
goal is to find the text regions in unit.

190
00:06:13,250 --> 00:06:15,010
similar to pedestrian detection you

191
00:06:15,250 --> 00:06:16,730
can come up with a label

192
00:06:17,030 --> 00:06:18,410
training set with positive examples

193
00:06:19,060 --> 00:06:20,930
and negative examples with examples

194
00:06:21,530 --> 00:06:23,810
corresponding to regions where text appears.

195
00:06:24,300 --> 00:06:27,290
So instead of trying to detect pedestrians, we're now trying to detect texts.

196
00:06:28,130 --> 00:06:29,670
And so positive examples are going

197
00:06:29,770 --> 00:06:31,640
to be patches of images where there is text.

198
00:06:31,970 --> 00:06:33,330
And negative examples is going

199
00:06:33,380 --> 00:06:36,000
to be patches of images where there isn't text.

200
00:06:36,330 --> 00:06:37,530
Having trained this we can

201
00:06:38,030 --> 00:06:39,450
now apply it to a

202
00:06:39,870 --> 00:06:41,190
new image, into a test

203
00:06:42,460 --> 00:06:42,910
set image.

204
00:06:43,310 --> 00:06:44,900
So here's the image that we've been using as example.

205
00:06:46,040 --> 00:06:47,300
Now, last time we run,

206
00:06:47,440 --> 00:06:48,400
for this example we are going

207
00:06:48,560 --> 00:06:50,300
to run a sliding windows at

208
00:06:50,640 --> 00:06:52,030
just one fixed scale just

209
00:06:52,370 --> 00:06:54,360
for purpose of illustration, meaning that

210
00:06:54,450 --> 00:06:56,000
I'm going to use just one rectangle size.

211
00:06:56,790 --> 00:06:58,110
But lets say I run my little

212
00:06:58,350 --> 00:07:00,070
sliding windows classifier on lots

213
00:07:00,170 --> 00:07:01,570
of little image patches like

214
00:07:01,630 --> 00:07:04,340
this if I

215
00:07:04,430 --> 00:07:05,430
do that, what Ill end

216
00:07:05,530 --> 00:07:06,670
up with is a result

217
00:07:07,040 --> 00:07:08,530
like this where the white

218
00:07:08,900 --> 00:07:10,700
region show where my

219
00:07:10,940 --> 00:07:12,190
text detection system has found

220
00:07:12,210 --> 00:07:15,960
text and so the axis' of these two figures are the same.

221
00:07:16,390 --> 00:07:17,700
So there is a region

222
00:07:18,110 --> 00:07:19,200
up here, of course also

223
00:07:19,230 --> 00:07:20,710
a region up here, so the

224
00:07:20,840 --> 00:07:22,040
fact that this black up here

225
00:07:22,850 --> 00:07:24,390
represents that the classifier

226
00:07:24,840 --> 00:07:25,940
does not think it's found any

227
00:07:26,170 --> 00:07:28,100
texts up there, whereas the

228
00:07:28,170 --> 00:07:29,630
fact that there's a lot

229
00:07:29,810 --> 00:07:31,300
of white stuff here, that reflects that

230
00:07:31,540 --> 00:07:33,260
classifier thinks that it's found a bunch of texts.

231
00:07:33,520 --> 00:07:34,310
over there on the image.

232
00:07:35,040 --> 00:07:35,700
What i have done on this

233
00:07:35,780 --> 00:07:36,870
image on the lower left is

234
00:07:37,070 --> 00:07:38,820
actually use white to

235
00:07:38,970 --> 00:07:41,050
show where the classifier thinks it has found text.

236
00:07:41,810 --> 00:07:43,280
And different shades of grey

237
00:07:43,880 --> 00:07:45,560
correspond to the probability that

238
00:07:45,670 --> 00:07:46,750
was output by the classifier,

239
00:07:47,110 --> 00:07:48,000
so like the shades of grey

240
00:07:48,520 --> 00:07:49,860
corresponds to where it

241
00:07:49,930 --> 00:07:50,750
thinks it might have found text

242
00:07:51,210 --> 00:07:53,900
but has lower confidence the bright

243
00:07:54,260 --> 00:07:55,980
white response to whether the classifier,

244
00:07:57,440 --> 00:07:58,400
up with a very high

245
00:07:58,660 --> 00:08:00,470
probability, estimated probability of

246
00:08:00,630 --> 00:08:03,110
there being pedestrians in that location.

247
00:08:04,110 --> 00:08:05,270
We aren't quite done yet because

248
00:08:05,690 --> 00:08:06,580
what we actually want to do

249
00:08:06,830 --> 00:08:08,620
is draw rectangles around all

250
00:08:08,850 --> 00:08:09,780
the region where this text

251
00:08:10,490 --> 00:08:12,540
in the image, so were

252
00:08:12,650 --> 00:08:13,540
going to take one more step

253
00:08:13,840 --> 00:08:14,990
which is we take the output

254
00:08:15,230 --> 00:08:16,880
of the classifier and apply

255
00:08:17,290 --> 00:08:19,280
to it what is called an expansion operator.

256
00:08:20,750 --> 00:08:22,250
So what that does is, it

257
00:08:22,430 --> 00:08:24,270
take the image here,

258
00:08:25,450 --> 00:08:26,700
and it takes each of

259
00:08:26,800 --> 00:08:28,200
the white blobs, it takes each

260
00:08:28,270 --> 00:08:30,590
of the white regions and it expands that white region.

261
00:08:31,460 --> 00:08:32,460
Mathematically, the way you

262
00:08:32,610 --> 00:08:34,110
implement that is, if you

263
00:08:34,270 --> 00:08:35,280
look at the image on the right, what

264
00:08:35,690 --> 00:08:36,780
we're doing to create the

265
00:08:36,930 --> 00:08:38,110
image on the right is, for every

266
00:08:38,370 --> 00:08:39,510
pixel we are going

267
00:08:39,610 --> 00:08:40,790
to ask, is it withing

268
00:08:41,370 --> 00:08:42,960
some distance of a

269
00:08:43,100 --> 00:08:44,650
white pixel in the left image.

270
00:08:45,430 --> 00:08:46,800
And so, if a specific pixel

271
00:08:47,220 --> 00:08:48,420
is within, say, five pixels

272
00:08:48,950 --> 00:08:50,280
or ten pixels of a white

273
00:08:50,610 --> 00:08:52,310
pixel in the leftmost image, then

274
00:08:52,540 --> 00:08:55,020
we'll also color that pixel white in the rightmost image.

275
00:08:56,190 --> 00:08:57,010
And so, the effect of this

276
00:08:57,300 --> 00:08:58,350
is, we'll take each of the

277
00:08:58,730 --> 00:08:59,630
white blobs in the leftmost

278
00:09:00,030 --> 00:09:01,370
image and expand them a

279
00:09:01,500 --> 00:09:02,200
bit, grow them a little

280
00:09:02,670 --> 00:09:04,110
bit, by seeing whether the

281
00:09:04,170 --> 00:09:05,420
nearby pixels, the white pixels,

282
00:09:05,900 --> 00:09:07,980
and then coloring those nearby pixels in white as well.

283
00:09:08,430 --> 00:09:09,900
Finally, we are just about done.

284
00:09:10,180 --> 00:09:11,210
We can now look at this

285
00:09:11,480 --> 00:09:12,900
right most image and just

286
00:09:13,210 --> 00:09:14,650
look at the connecting components

287
00:09:15,320 --> 00:09:16,700
and look at the as white

288
00:09:16,990 --> 00:09:19,350
regions and draw bounding boxes around them.

289
00:09:20,260 --> 00:09:20,990
And in particular, if we look at

290
00:09:21,390 --> 00:09:22,850
all the white regions, like

291
00:09:23,080 --> 00:09:24,750
this one, this one, this

292
00:09:24,990 --> 00:09:26,670
one, and so on, and

293
00:09:27,030 --> 00:09:27,810
if we use a simple heuristic

294
00:09:28,390 --> 00:09:30,240
to rule out rectangles whose aspect

295
00:09:30,660 --> 00:09:32,760
ratios look funny because we

296
00:09:32,870 --> 00:09:34,460
know that boxes around text

297
00:09:34,730 --> 00:09:36,130
should be much wider than they are tall.

298
00:09:37,110 --> 00:09:38,310
And so if we ignore the

299
00:09:38,410 --> 00:09:39,990
thin, tall blobs like this one

300
00:09:40,230 --> 00:09:42,120
and this one, and

301
00:09:42,190 --> 00:09:43,390
we discard these ones because

302
00:09:43,880 --> 00:09:45,490
they are too tall and thin, and

303
00:09:45,660 --> 00:09:46,780
we then draw a the rectangles

304
00:09:47,470 --> 00:09:48,440
around the ones whose aspect

305
00:09:48,840 --> 00:09:50,420
ratio thats a height

306
00:09:50,610 --> 00:09:51,800
to what ratio looks like for

307
00:09:51,950 --> 00:09:53,310
text regions, then we

308
00:09:53,380 --> 00:09:55,070
can draw rectangles, the bounding

309
00:09:55,450 --> 00:09:56,660
boxes around this text

310
00:09:56,970 --> 00:09:58,500
region, this text region, and

311
00:09:58,610 --> 00:10:00,550
that text region, corresponding to

312
00:10:01,060 --> 00:10:02,180
the Lula B's antique mall logo,

313
00:10:02,650 --> 00:10:04,690
the Lula B's, and this little open sign.

314
00:10:05,840 --> 00:10:06,000
Of over there.

315
00:10:07,100 --> 00:10:09,550
This example by the actually misses one piece of text.

316
00:10:09,860 --> 00:10:12,550
This is very hard to read, but there is actually one piece of text there.

317
00:10:13,080 --> 00:10:14,710
That says [xx] are corresponding

318
00:10:14,950 --> 00:10:16,180
to this but the aspect ratio

319
00:10:16,530 --> 00:10:17,960
looks wrong so we discarded that one.

320
00:10:19,100 --> 00:10:20,240
So you know it's ok

321
00:10:20,530 --> 00:10:21,460
on this image, but in

322
00:10:21,660 --> 00:10:22,760
this particular example the classifier

323
00:10:23,290 --> 00:10:24,400
actually missed one piece of text.

324
00:10:24,760 --> 00:10:25,780
It's very hard to read because

325
00:10:25,960 --> 00:10:26,900
there's a piece of text

326
00:10:27,240 --> 00:10:28,700
written against a transparent window.

327
00:10:29,750 --> 00:10:31,200
So that's text detection

328
00:10:32,430 --> 00:10:33,120
using sliding windows.

329
00:10:33,800 --> 00:10:35,300
And having found these rectangles

330
00:10:36,100 --> 00:10:37,010
with the text in it, we

331
00:10:37,110 --> 00:10:38,240
can now just cut out

332
00:10:38,450 --> 00:10:39,890
these image regions and then

333
00:10:40,070 --> 00:10:42,100
use later stages of pipeline to try to meet the texts.

334
00:10:45,390 --> 00:10:46,820
Now, you recall that the

335
00:10:46,880 --> 00:10:48,360
second stage of pipeline was

336
00:10:48,570 --> 00:10:50,620
character segmentation, so given an

337
00:10:50,890 --> 00:10:52,530
image like that shown on top,

338
00:10:52,790 --> 00:10:55,660
how do we segment out the individual characters in this image?

339
00:10:56,580 --> 00:10:57,460
So what we can do is

340
00:10:57,910 --> 00:10:59,590
again use a supervised learning

341
00:11:00,010 --> 00:11:01,020
algorithm with some set of

342
00:11:01,100 --> 00:11:01,990
positive and some set of

343
00:11:02,100 --> 00:11:03,810
negative examples, what were

344
00:11:03,880 --> 00:11:04,840
going to do is look in

345
00:11:04,900 --> 00:11:06,160
the image patch and try

346
00:11:06,390 --> 00:11:08,110
to decide if there

347
00:11:08,370 --> 00:11:09,690
is split between two characters

348
00:11:10,700 --> 00:11:12,070
right in the middle of that image match.

349
00:11:13,030 --> 00:11:14,100
So for initial positive examples.

350
00:11:14,960 --> 00:11:17,040
This first cross example, this image

351
00:11:17,290 --> 00:11:18,590
patch looks like the

352
00:11:18,650 --> 00:11:20,050
middle of it is indeed

353
00:11:21,320 --> 00:11:22,890
the middle has splits between two

354
00:11:23,110 --> 00:11:24,120
characters and the second example

355
00:11:24,680 --> 00:11:25,770
again this looks like a

356
00:11:25,950 --> 00:11:27,370
positive example, because if I split

357
00:11:27,840 --> 00:11:29,020
two characters by putting a

358
00:11:29,160 --> 00:11:31,190
line right down the middle, that's the right thing to do.

359
00:11:31,350 --> 00:11:33,310
So, these are positive examples, where

360
00:11:33,510 --> 00:11:35,370
the middle of the image represents

361
00:11:35,970 --> 00:11:36,930
a gap or a split

362
00:11:37,960 --> 00:11:40,320
between two distinct characters, whereas

363
00:11:40,560 --> 00:11:41,870
the negative examples, well, you

364
00:11:42,010 --> 00:11:43,160
know, you don't want to split

365
00:11:43,690 --> 00:11:44,810
two characters right in the

366
00:11:44,900 --> 00:11:46,610
middle, and so

367
00:11:46,820 --> 00:11:48,160
these are negative examples because

368
00:11:48,460 --> 00:11:50,660
they don't represent the midpoint between two characters.

369
00:11:51,760 --> 00:11:52,490
So what we will do

370
00:11:52,650 --> 00:11:53,940
is, we will train a classifier,

371
00:11:54,500 --> 00:11:55,910
maybe using new network, maybe

372
00:11:56,180 --> 00:11:58,000
using a different learning algorithm, to

373
00:11:58,120 --> 00:12:01,420
try to classify between the positive and negative examples.

374
00:12:02,770 --> 00:12:03,980
Having trained such a classifier,

375
00:12:04,320 --> 00:12:06,030
we can then run this on

376
00:12:06,690 --> 00:12:07,830
this sort of text that our

377
00:12:07,940 --> 00:12:09,410
text detection system has pulled out.

378
00:12:09,590 --> 00:12:10,970
As we start by looking at

379
00:12:11,130 --> 00:12:12,080
that rectangle, and we ask,

380
00:12:12,230 --> 00:12:13,280
"Gee, does it look

381
00:12:13,510 --> 00:12:15,000
like the middle of

382
00:12:15,100 --> 00:12:16,600
that green rectangle, does it

383
00:12:16,680 --> 00:12:18,470
look like the midpoint between two characters?".

384
00:12:18,980 --> 00:12:20,220
And hopefully, the classifier will

385
00:12:20,320 --> 00:12:21,760
say no, then we slide

386
00:12:22,170 --> 00:12:23,280
the window over and this

387
00:12:23,410 --> 00:12:24,850
is a one dimensional sliding

388
00:12:25,200 --> 00:12:26,410
window classifier, because were

389
00:12:26,500 --> 00:12:27,820
going to slide the window only

390
00:12:28,470 --> 00:12:29,560
in one straight line from

391
00:12:29,780 --> 00:12:32,070
left to right, theres no different rows here.

392
00:12:32,270 --> 00:12:34,420
There's only one row here.

393
00:12:34,520 --> 00:12:36,160
But now, with the classifier in

394
00:12:36,240 --> 00:12:37,250
this position, we ask, well,

395
00:12:37,490 --> 00:12:38,700
should we split those two characters

396
00:12:39,570 --> 00:12:41,580
or should we put a split right down the middle of this rectangle.

397
00:12:41,950 --> 00:12:43,040
And hopefully, the classifier will

398
00:12:43,190 --> 00:12:44,720
output y equals one, in

399
00:12:44,780 --> 00:12:46,460
which case we will decide to

400
00:12:46,630 --> 00:12:49,690
draw a line down there, to try to split two characters.

401
00:12:50,710 --> 00:12:51,620
Then we slide the window over

402
00:12:51,870 --> 00:12:53,440
again, optic process, don't

403
00:12:53,650 --> 00:12:55,020
close the gap, slide over again,

404
00:12:55,300 --> 00:12:56,580
optic says yes, do split

405
00:12:57,230 --> 00:12:58,830
there and so

406
00:12:59,200 --> 00:13:00,410
on, and we slowly slide the

407
00:13:00,560 --> 00:13:01,770
classifier over to the

408
00:13:01,920 --> 00:13:03,310
right and hopefully it will

409
00:13:03,380 --> 00:13:05,160
classify this as another positive example and

410
00:13:05,770 --> 00:13:07,470
so on.

411
00:13:08,010 --> 00:13:09,180
And we will slide this window

412
00:13:09,820 --> 00:13:10,990
over to the right, running the

413
00:13:11,160 --> 00:13:12,670
classifier at every step, and

414
00:13:12,800 --> 00:13:13,800
hopefully it will tell us,

415
00:13:14,210 --> 00:13:15,070
you know, what are the right locations

416
00:13:16,190 --> 00:13:17,820
to split these characters up into,

417
00:13:18,290 --> 00:13:20,410
just split this image up into individual characters.

418
00:13:21,090 --> 00:13:22,450
And so thats 1D sliding

419
00:13:22,810 --> 00:13:24,190
windows for character segmentation.

420
00:13:25,520 --> 00:13:28,430
So, here's the overall photo OCR pipe line again.

421
00:13:29,120 --> 00:13:30,280
In this video we've talked about

422
00:13:30,780 --> 00:13:32,170
the text detection step, where

423
00:13:32,360 --> 00:13:34,570
we use sliding windows to detect text.

424
00:13:35,200 --> 00:13:36,390
And we also use a one-dimensional

425
00:13:37,070 --> 00:13:38,420
sliding windows to do character

426
00:13:38,790 --> 00:13:40,160
segmentation to segment out,

427
00:13:40,730 --> 00:13:42,860
you know, this text image in division of characters.

428
00:13:43,900 --> 00:13:44,770
The final step through the

429
00:13:44,810 --> 00:13:46,040
pipeline is the character qualification

430
00:13:46,720 --> 00:13:48,150
step and that step you might

431
00:13:48,370 --> 00:13:49,750
already be much more familiar

432
00:13:50,020 --> 00:13:51,490
with the early videos

433
00:13:52,080 --> 00:13:54,470
on supervised learning

434
00:13:55,170 --> 00:13:56,440
where you can apply a standard

435
00:13:56,940 --> 00:13:58,150
supervised learning within maybe

436
00:13:58,360 --> 00:13:59,250
on your network or maybe something

437
00:13:59,570 --> 00:14:00,650
else in order to

438
00:14:00,860 --> 00:14:02,100
take it's input, an image

439
00:14:02,980 --> 00:14:05,030
like that and classify which alphabet

440
00:14:05,480 --> 00:14:07,120
or which 26 characters A

441
00:14:07,230 --> 00:14:08,320
to Z, or maybe we should

442
00:14:08,570 --> 00:14:09,670
have 36 characters if you

443
00:14:09,780 --> 00:14:11,140
have the numerical digits as

444
00:14:11,270 --> 00:14:12,650
well, the multi class

445
00:14:13,080 --> 00:14:14,410
classification problem where you

446
00:14:14,510 --> 00:14:15,690
take it's input and image

447
00:14:16,050 --> 00:14:17,390
contained a character and decide

448
00:14:18,140 --> 00:14:20,450
what is the character that appears in that image?

449
00:14:21,080 --> 00:14:22,460
So that was the photo OCR

450
00:14:23,730 --> 00:14:24,750
pipeline and how you can

451
00:14:24,910 --> 00:14:26,140
use ideas like sliding windows

452
00:14:26,520 --> 00:14:27,960
classifiers in order to

453
00:14:28,100 --> 00:14:29,790
put these different components to

454
00:14:30,060 --> 00:14:31,570
develop a photo OCR system.

455
00:14:32,430 --> 00:14:33,570
In the next few videos we

456
00:14:33,680 --> 00:14:34,930
keep on using the problem of

457
00:14:35,150 --> 00:14:36,550
photo OCR to explore somewhat

458
00:14:36,960 --> 00:14:39,070
interesting issues surrounding building an application like this.