1
00:00:00,090 --> 00:00:00,950
In this and the next few

2
00:00:01,070 --> 00:00:02,010
videos, I want to tell

3
00:00:02,160 --> 00:00:03,410
you about a machine learning application

4
00:00:04,020 --> 00:00:04,980
example, or a machine

5
00:00:05,160 --> 00:00:07,670
learning application history centered

6
00:00:08,030 --> 00:00:09,630
around an application called Photo OCR  .

7
00:00:10,520 --> 00:00:11,730
There are three reasons

8
00:00:12,170 --> 00:00:13,220
why I want to do this,

9
00:00:13,480 --> 00:00:14,350
first I wanted to show you an

10
00:00:14,770 --> 00:00:15,700
example of how a complex

11
00:00:16,290 --> 00:00:18,000
machine learning system can be put together.

12
00:00:19,350 --> 00:00:20,960
Second, once told the concepts of

13
00:00:21,170 --> 00:00:22,280
a machine learning a type line

14
00:00:22,970 --> 00:00:24,740
and how to allocate resources when

15
00:00:24,860 --> 00:00:26,550
you're trying to decide what to do next.

16
00:00:26,780 --> 00:00:27,700
And this can either be in

17
00:00:27,730 --> 00:00:28,950
the context of you working

18
00:00:29,380 --> 00:00:30,220
by yourself on the big

19
00:00:30,500 --> 00:00:31,690
application Or it can

20
00:00:31,770 --> 00:00:32,980
be the context of a team

21
00:00:33,100 --> 00:00:34,190
of developers trying to build

22
00:00:34,440 --> 00:00:35,930
a complex application together.

23
00:00:37,030 --> 00:00:38,670
And then finally, the Photo

24
00:00:39,130 --> 00:00:40,690
OCR problem also gives

25
00:00:40,880 --> 00:00:41,810
me an excuse to tell you

26
00:00:41,880 --> 00:00:42,850
about just a couple more interesting

27
00:00:43,260 --> 00:00:44,370
ideas for machine learning.

28
00:00:45,120 --> 00:00:47,300
One is some ideas of

29
00:00:47,400 --> 00:00:48,250
how to apply machine learning to

30
00:00:48,600 --> 00:00:50,210
computer vision problems, and second

31
00:00:50,340 --> 00:00:51,890
is the idea of artificial data

32
00:00:52,220 --> 00:00:53,880
synthesis, which we'll see in a couple of videos.

33
00:00:54,820 --> 00:00:57,680
So, let's start by talking about what is the Photo OCR problem.

34
00:01:00,130 --> 00:01:01,710
Photo OCR stands for

35
00:01:02,050 --> 00:01:03,760
Photo Optical Character Recognition.

36
00:01:05,180 --> 00:01:06,460
With the growth of digital photography

37
00:01:07,300 --> 00:01:08,740
and more recently the growth of

38
00:01:09,080 --> 00:01:10,360
camera in our cell phones

39
00:01:11,140 --> 00:01:12,140
we now have tons of visual

40
00:01:12,500 --> 00:01:13,790
pictures that we take all over the place.

41
00:01:14,620 --> 00:01:15,700
And one of the things that

42
00:01:16,150 --> 00:01:17,850
has interested many developers is

43
00:01:18,080 --> 00:01:19,680
how to get our computers to

44
00:01:19,990 --> 00:01:22,300
understand the content of these pictures a little bit better.

45
00:01:23,140 --> 00:01:24,690
The photo OCR problem focuses

46
00:01:25,300 --> 00:01:26,790
on how to get computers to

47
00:01:26,980 --> 00:01:29,390
read the text to the purest in images that we take.

48
00:01:30,730 --> 00:01:31,990
Given an image like this it

49
00:01:32,070 --> 00:01:32,850
might be nice if a computer

50
00:01:33,530 --> 00:01:34,480
can read the text in this

51
00:01:34,670 --> 00:01:35,540
image so that if you're

52
00:01:35,650 --> 00:01:37,040
trying to look for this

53
00:01:37,220 --> 00:01:38,530
picture again you type in

54
00:01:38,850 --> 00:01:40,220
the words, lulu bees and

55
00:01:41,000 --> 00:01:42,910
and have it automatically pull

56
00:01:43,130 --> 00:01:44,190
up this picture, so that

57
00:01:44,360 --> 00:01:45,890
you're not spending lots of

58
00:01:45,980 --> 00:01:47,130
time digging through your photo

59
00:01:47,670 --> 00:01:49,230
collection Maybe hundreds of

60
00:01:49,490 --> 00:01:50,730
thousands of pictures in.

61
00:01:50,870 --> 00:01:53,100
The Photo OCR problem

62
00:01:53,450 --> 00:01:56,080
does exactly this, and it does so in several steps.

63
00:01:56,870 --> 00:01:57,790
First, given the picture it

64
00:01:58,060 --> 00:01:58,800
has to look through the image

65
00:01:59,480 --> 00:02:01,680
and detect where there is text in the picture.

66
00:02:03,020 --> 00:02:03,960
And after it has done

67
00:02:04,160 --> 00:02:05,340
that or if it successfully does

68
00:02:05,570 --> 00:02:06,750
that it then has to

69
00:02:06,980 --> 00:02:09,020
look at these text regions and

70
00:02:09,170 --> 00:02:10,530
actually read the text in

71
00:02:10,670 --> 00:02:12,150
those regions, and hopefully if

72
00:02:12,250 --> 00:02:13,670
it reads it correctly, it'll come

73
00:02:15,040 --> 00:02:16,440
up with these transcriptions of

74
00:02:16,800 --> 00:02:18,710
what is the text that appears in the image.

75
00:02:19,480 --> 00:02:21,160
Whereas OCR, or optical

76
00:02:21,440 --> 00:02:22,850
character recognition of scanned

77
00:02:23,600 --> 00:02:25,760
documents is relatively easier

78
00:02:26,180 --> 00:02:27,840
problem, doing OCR from

79
00:02:27,980 --> 00:02:29,480
photographs today is still a

80
00:02:29,750 --> 00:02:30,970
very difficult machine learning problem,

81
00:02:31,640 --> 00:02:32,730
and you can do this.

82
00:02:33,000 --> 00:02:34,320
Not only can this help

83
00:02:34,750 --> 00:02:36,390
our computers to understand the

84
00:02:36,450 --> 00:02:38,380
content of our though

85
00:02:38,500 --> 00:02:40,030
images better, there are

86
00:02:40,240 --> 00:02:42,240
also applications like helping blind

87
00:02:42,530 --> 00:02:43,900
people, for example, if you

88
00:02:44,000 --> 00:02:45,010
could provide to a blind person

89
00:02:45,780 --> 00:02:47,210
a camera that can look

90
00:02:47,460 --> 00:02:48,430
at what's in front of

91
00:02:48,530 --> 00:02:49,700
them, and just tell them the

92
00:02:49,910 --> 00:02:52,990
words that my be on

93
00:02:53,460 --> 00:02:55,830
the street sign in front of them.

94
00:02:56,540 --> 00:02:57,780
With car navigation systems.

95
00:02:58,310 --> 00:02:59,750
For example, imagine if your

96
00:02:59,920 --> 00:03:00,900
car could read the street

97
00:03:01,250 --> 00:03:03,480
signs and help you navigate to your destination.

98
00:03:04,610 --> 00:03:07,260
In order to perform photo OCR, here's what we can do.

99
00:03:07,970 --> 00:03:08,840
First we can go through the

100
00:03:09,080 --> 00:03:11,490
image and find the regions where there's text and image.

101
00:03:11,850 --> 00:03:13,380
So, shown here is

102
00:03:13,580 --> 00:03:15,430
one example of text and

103
00:03:15,730 --> 00:03:17,700
image that the photo OCR system may find.

104
00:03:19,980 --> 00:03:21,850
Second, given the rectangle around

105
00:03:22,210 --> 00:03:23,390
that text region, we can

106
00:03:23,700 --> 00:03:25,930
then do character segmentation, where

107
00:03:26,170 --> 00:03:28,210
we might take this text box

108
00:03:28,490 --> 00:03:30,310
that says "Antique Mall" and

109
00:03:30,530 --> 00:03:31,760
try to segment it out

110
00:03:32,090 --> 00:03:34,150
into the locations of the individual characters.

111
00:03:35,450 --> 00:03:37,280
And finally, having segmented out

112
00:03:37,450 --> 00:03:39,050
into individual characters, we can

113
00:03:39,320 --> 00:03:41,040
then run a crossfire, which

114
00:03:41,290 --> 00:03:42,950
looks at the images of the

115
00:03:43,090 --> 00:03:44,620
visual characters, and tries to

116
00:03:44,760 --> 00:03:45,990
figure out the first character's an

117
00:03:46,150 --> 00:03:47,070
A, the second character's an

118
00:03:47,230 --> 00:03:48,010
N, the third character is

119
00:03:48,480 --> 00:03:49,930
a T, and so on,

120
00:03:50,110 --> 00:03:51,130
so that up by doing all

121
00:03:51,190 --> 00:03:52,350
this how that hopefully you can then

122
00:03:52,530 --> 00:03:53,610
figure out that this phrase

123
00:03:54,180 --> 00:03:55,670
is Rulegee's antique mall

124
00:03:56,340 --> 00:03:57,760
and similarly for some of

125
00:03:57,930 --> 00:04:01,690
the other words that appear in that image.

126
00:04:01,980 --> 00:04:02,390
I should say that there are some photo OCR systems

127
00:04:02,910 --> 00:04:04,350
that do even more complex things,

128
00:04:04,680 --> 00:04:06,370
like a bit of spelling correction at the end.

129
00:04:06,640 --> 00:04:08,470
So if, for example, your

130
00:04:08,710 --> 00:04:10,730
character segmentation and character

131
00:04:11,110 --> 00:04:12,450
classification system tells you

132
00:04:12,690 --> 00:04:14,390
that it sees the

133
00:04:14,530 --> 00:04:16,050
word c 1 e a

134
00:04:16,260 --> 00:04:17,930
n i n g. Then,

135
00:04:18,350 --> 00:04:19,570
you know, a sort of spelling

136
00:04:19,760 --> 00:04:21,910
correction system might tell

137
00:04:22,240 --> 00:04:23,270
you that this is probably the

138
00:04:23,350 --> 00:04:24,880
word 'cleaning', and your

139
00:04:25,340 --> 00:04:27,160
character classification algorithm had

140
00:04:27,310 --> 00:04:29,650
just mistaken the l for a 1.

141
00:04:30,370 --> 00:04:31,320
But for the purpose of what

142
00:04:31,640 --> 00:04:32,510
we want to do in

143
00:04:32,620 --> 00:04:33,980
this video, let's ignore this last

144
00:04:34,620 --> 00:04:35,780
step and just focus on the

145
00:04:36,110 --> 00:04:37,490
system that does these three

146
00:04:37,700 --> 00:04:39,340
steps of text detection, character

147
00:04:39,660 --> 00:04:41,040
segmentation, and character classification.

148
00:04:42,410 --> 00:04:43,790
A system like this is

149
00:04:44,080 --> 00:04:46,010
what we call a machine learning pipeline.

150
00:04:47,550 --> 00:04:49,220
In particular, here's a picture

151
00:04:49,950 --> 00:04:52,220
showing the photo OCR pipeline.

152
00:04:53,140 --> 00:04:54,200
We have an image, which then

153
00:04:54,470 --> 00:04:57,590
fed to the text detection system

154
00:04:57,970 --> 00:04:58,960
text regions, we then segment

155
00:04:59,420 --> 00:05:01,350
out the characters--the individual characters in

156
00:05:01,420 --> 00:05:04,360
the text--and then finally we recognize the individual characters.

157
00:05:05,730 --> 00:05:07,190
In many complex machine learning

158
00:05:07,800 --> 00:05:09,050
systems, these sorts of

159
00:05:09,490 --> 00:05:11,400
pipelines are common, where you

160
00:05:11,660 --> 00:05:13,450
can have multiple modules--in this

161
00:05:13,680 --> 00:05:14,960
example, the text detection, character

162
00:05:15,390 --> 00:05:17,820
segmentation, character recognition modules--each of

163
00:05:17,930 --> 00:05:19,170
which may be machine learning component,

164
00:05:19,880 --> 00:05:20,740
or sometimes it may not

165
00:05:20,980 --> 00:05:22,660
be a machine learning component but

166
00:05:22,810 --> 00:05:23,660
to have a set of modules

167
00:05:24,290 --> 00:05:26,050
that act one after another on

168
00:05:26,280 --> 00:05:27,780
some piece of data in order

169
00:05:28,100 --> 00:05:29,170
to produce the output you want,

170
00:05:29,640 --> 00:05:30,930
which in the photo OCR example

171
00:05:31,580 --> 00:05:32,690
is to find the

172
00:05:32,800 --> 00:05:35,050
transcription of the text that appeared in the image.

173
00:05:35,730 --> 00:05:37,370
If you're designing a machine learning

174
00:05:37,710 --> 00:05:39,090
system one of the

175
00:05:39,200 --> 00:05:41,010
most important decisions will often

176
00:05:41,330 --> 00:05:44,350
be what exactly is the pipeline that you want to put together.

177
00:05:44,970 --> 00:05:46,010
In other words, given the photo

178
00:05:46,530 --> 00:05:47,930
OCR problem, how do you

179
00:05:47,990 --> 00:05:49,390
break this problem down into

180
00:05:49,770 --> 00:05:51,220
a sequence of different modules.

181
00:05:51,690 --> 00:05:53,060
And you design the pipeline

182
00:05:53,820 --> 00:05:56,060
and each the performance of each of the modules in your pipeline.

183
00:05:56,660 --> 00:05:57,610
will often have a big

184
00:05:57,710 --> 00:05:59,880
impact on the final performance of your algorithm.

185
00:06:01,480 --> 00:06:02,330
If you have a team of

186
00:06:02,550 --> 00:06:03,610
engineers working on a

187
00:06:03,800 --> 00:06:05,150
problem like this is also very

188
00:06:05,460 --> 00:06:06,900
common to have different

189
00:06:07,340 --> 00:06:08,720
individuals work on different modules.

190
00:06:09,500 --> 00:06:11,480
So I could easily imagine tech

191
00:06:12,140 --> 00:06:13,240
easily being the of anywhere

192
00:06:13,670 --> 00:06:14,610
from 1 to 5 engineers,

193
00:06:15,460 --> 00:06:16,970
character segmentation maybe another

194
00:06:17,470 --> 00:06:19,010
1-5 engineers, and character

195
00:06:19,220 --> 00:06:20,550
recognition being another 1-5

196
00:06:21,670 --> 00:06:23,100
engineers, and so having a

197
00:06:23,340 --> 00:06:24,850
pipeline like often offers a

198
00:06:25,280 --> 00:06:26,720
natural way to divide up

199
00:06:27,110 --> 00:06:30,370
the workload amongst different members of an engineering team, as well.

200
00:06:31,040 --> 00:06:31,970
Although, or course, all of

201
00:06:32,090 --> 00:06:33,210
this work could also be done

202
00:06:33,450 --> 00:06:35,910
by just one person if that's how you want to do it.

203
00:06:39,090 --> 00:06:40,370
In complex machine learning systems

204
00:06:41,340 --> 00:06:42,700
the idea of a pipeline, of

205
00:06:42,870 --> 00:06:44,770
a machine of a pipeline, is pretty pervasive.

206
00:06:45,820 --> 00:06:47,070
And what you just saw is

207
00:06:47,400 --> 00:06:49,180
a specific example of how

208
00:06:49,440 --> 00:06:51,280
a Photo OCR pipeline might work.

209
00:06:52,230 --> 00:06:53,590
In the next few videos I'll

210
00:06:53,740 --> 00:06:54,590
tell you a little bit more

211
00:06:54,650 --> 00:06:55,780
about this pipeline, and we'll continue

212
00:06:56,290 --> 00:06:57,170
to use this as an example

213
00:06:58,120 --> 00:06:59,860
to illustrate--I think--a few more

214
00:07:00,280 --> 00:07:01,400
key concepts of machine learning.