1 00:00:00,090 --> 00:00:00,950 In this and the next few 2 00:00:01,070 --> 00:00:02,010 videos, I want to tell 3 00:00:02,160 --> 00:00:03,410 you about a machine learning application 4 00:00:04,020 --> 00:00:04,980 example, or a machine 5 00:00:05,160 --> 00:00:07,670 learning application history, centered 6 00:00:08,030 --> 00:00:09,630 around an application called Photo OCR. 7 00:00:10,520 --> 00:00:11,730 There are three reasons 8 00:00:12,170 --> 00:00:13,220 why I want to do this. 9 00:00:13,480 --> 00:00:14,350 First, I want to show you an 10 00:00:14,770 --> 00:00:15,700 example of how a complex 11 00:00:16,290 --> 00:00:18,000 machine learning system can be put together. 12 00:00:19,350 --> 00:00:20,960 Second, I want to tell you about the concept of 13 00:00:21,170 --> 00:00:22,280 a machine learning pipeline 14 00:00:22,970 --> 00:00:24,740 and how to allocate resources when 15 00:00:24,860 --> 00:00:26,550 you're trying to decide what to do next. 16 00:00:26,780 --> 00:00:27,700 And this can either be in 17 00:00:27,730 --> 00:00:28,950 the context of you working 18 00:00:29,380 --> 00:00:30,220 by yourself on a big 19 00:00:30,500 --> 00:00:31,690 application, or it can 20 00:00:31,770 --> 00:00:32,980 be in the context of a team 21 00:00:33,100 --> 00:00:34,190 of developers trying to build 22 00:00:34,440 --> 00:00:35,930 a complex application together. 23 00:00:37,030 --> 00:00:38,670 And then finally, the Photo 24 00:00:39,130 --> 00:00:40,690 OCR problem also gives 25 00:00:40,880 --> 00:00:41,810 me an excuse to tell you 26 00:00:41,880 --> 00:00:42,850 about just a couple more interesting 27 00:00:43,260 --> 00:00:44,370 ideas for machine learning.
28 00:00:45,120 --> 00:00:47,300 One is some ideas of 29 00:00:47,400 --> 00:00:48,250 how to apply machine learning to 30 00:00:48,600 --> 00:00:50,210 computer vision problems, and the second 31 00:00:50,340 --> 00:00:51,890 is the idea of artificial data 32 00:00:52,220 --> 00:00:53,880 synthesis, which we'll see in a couple of videos. 33 00:00:54,820 --> 00:00:57,680 So, let's start by talking about what the Photo OCR problem is. 34 00:01:00,130 --> 00:01:01,710 Photo OCR stands for 35 00:01:02,050 --> 00:01:03,760 Photo Optical Character Recognition. 36 00:01:05,180 --> 00:01:06,460 With the growth of digital photography 37 00:01:07,300 --> 00:01:08,740 and, more recently, the growth of 38 00:01:09,080 --> 00:01:10,360 cameras in our cell phones, 39 00:01:11,140 --> 00:01:12,140 we now have tons of digital 40 00:01:12,500 --> 00:01:13,790 pictures that we take all over the place. 41 00:01:14,620 --> 00:01:15,700 And one of the things that 42 00:01:16,150 --> 00:01:17,850 has interested many developers is 43 00:01:18,080 --> 00:01:19,680 how to get our computers to 44 00:01:19,990 --> 00:01:22,300 understand the content of these pictures a little bit better. 45 00:01:23,140 --> 00:01:24,690 The photo OCR problem focuses 46 00:01:25,300 --> 00:01:26,790 on how to get computers to 47 00:01:26,980 --> 00:01:29,390 read the text that appears in the images that we take.
48 00:01:30,730 --> 00:01:31,990 Given an image like this, it 49 00:01:32,070 --> 00:01:32,850 might be nice if a computer 50 00:01:33,530 --> 00:01:34,480 can read the text in this 51 00:01:34,670 --> 00:01:35,540 image, so that if you're 52 00:01:35,650 --> 00:01:37,040 trying to look for this 53 00:01:37,220 --> 00:01:38,530 picture again, you type in 54 00:01:38,850 --> 00:01:40,220 the words "Lula B's" 55 00:01:41,000 --> 00:01:42,910 and have it automatically pull 56 00:01:43,130 --> 00:01:44,190 up this picture, so that 57 00:01:44,360 --> 00:01:45,890 you're not spending lots of 58 00:01:45,980 --> 00:01:47,130 time digging through your photo 59 00:01:47,670 --> 00:01:49,230 collection of maybe hundreds of 60 00:01:49,490 --> 00:01:50,730 thousands of pictures. 61 00:01:50,870 --> 00:01:53,100 The Photo OCR problem 62 00:01:53,450 --> 00:01:56,080 does exactly this, and it does so in several steps. 63 00:01:56,870 --> 00:01:57,790 First, given the picture, it 64 00:01:58,060 --> 00:01:58,800 has to look through the image 65 00:01:59,480 --> 00:02:01,680 and detect where there is text in the picture. 66 00:02:03,020 --> 00:02:03,960 And after it has done 67 00:02:04,160 --> 00:02:05,340 that, or if it successfully does 68 00:02:05,570 --> 00:02:06,750 that, it then has to 69 00:02:06,980 --> 00:02:09,020 look at these text regions and 70 00:02:09,170 --> 00:02:10,530 actually read the text in 71 00:02:10,670 --> 00:02:12,150 those regions, and hopefully, if 72 00:02:12,250 --> 00:02:13,670 it reads it correctly, it'll come 73 00:02:15,040 --> 00:02:16,440 up with these transcriptions of 74 00:02:16,800 --> 00:02:18,710 the text that appears in the image.
75 00:02:19,480 --> 00:02:21,160 Whereas OCR, or optical 76 00:02:21,440 --> 00:02:22,850 character recognition, of scanned 77 00:02:23,600 --> 00:02:25,760 documents is a relatively easy 78 00:02:26,180 --> 00:02:27,840 problem, doing OCR from 79 00:02:27,980 --> 00:02:29,480 photographs today is still a 80 00:02:29,750 --> 00:02:30,970 very difficult machine learning problem, 81 00:02:31,640 --> 00:02:32,730 and if you can do this, 82 00:02:33,000 --> 00:02:34,320 not only can this help 83 00:02:34,750 --> 00:02:36,390 our computers to understand the 84 00:02:36,450 --> 00:02:38,380 content of our 85 00:02:38,500 --> 00:02:40,030 images better, there are 86 00:02:40,240 --> 00:02:42,240 also applications like helping blind 87 00:02:42,530 --> 00:02:43,900 people, for example, if you 88 00:02:44,000 --> 00:02:45,010 could provide to a blind person 89 00:02:45,780 --> 00:02:47,210 a camera that can look 90 00:02:47,460 --> 00:02:48,430 at what's in front of 91 00:02:48,530 --> 00:02:49,700 them, and just tell them the 92 00:02:49,910 --> 00:02:52,990 words that may be on 93 00:02:53,460 --> 00:02:55,830 the street sign in front of them. 94 00:02:56,540 --> 00:02:57,780 Or consider car navigation systems. 95 00:02:58,310 --> 00:02:59,750 For example, imagine if your 96 00:02:59,920 --> 00:03:00,900 car could read the street 97 00:03:01,250 --> 00:03:03,480 signs and help you navigate to your destination. 98 00:03:04,610 --> 00:03:07,260 In order to perform photo OCR, here's what we can do. 99 00:03:07,970 --> 00:03:08,840 First, we can go through the 100 00:03:09,080 --> 00:03:11,490 image and find the regions where there's text in the image. 101 00:03:11,850 --> 00:03:13,380 So, shown here is 102 00:03:13,580 --> 00:03:15,430 one example of text in the 103 00:03:15,730 --> 00:03:17,700 image that the photo OCR system may find.
104 00:03:19,980 --> 00:03:21,850 Second, given the rectangle around 105 00:03:22,210 --> 00:03:23,390 that text region, we can 106 00:03:23,700 --> 00:03:25,930 then do character segmentation, where 107 00:03:26,170 --> 00:03:28,210 we might take this text box 108 00:03:28,490 --> 00:03:30,310 that says "Antique Mall" and 109 00:03:30,530 --> 00:03:31,760 try to segment it out 110 00:03:32,090 --> 00:03:34,150 into the locations of the individual characters. 111 00:03:35,450 --> 00:03:37,280 And finally, having segmented it out 112 00:03:37,450 --> 00:03:39,050 into individual characters, we can 113 00:03:39,320 --> 00:03:41,040 then run a classifier, which 114 00:03:41,290 --> 00:03:42,950 looks at the images of the 115 00:03:43,090 --> 00:03:44,620 individual characters and tries to 116 00:03:44,760 --> 00:03:45,990 figure out the first character's an 117 00:03:46,150 --> 00:03:47,070 A, the second character's an 118 00:03:47,230 --> 00:03:48,010 N, the third character is 119 00:03:48,480 --> 00:03:49,930 a T, and so on, 120 00:03:50,110 --> 00:03:51,130 so that by doing all 121 00:03:51,190 --> 00:03:52,350 this, hopefully you can then 122 00:03:52,530 --> 00:03:53,610 figure out that this phrase 123 00:03:54,180 --> 00:03:55,670 is "Lula B's Antique Mall", 124 00:03:56,340 --> 00:03:57,760 and similarly for some of 125 00:03:57,930 --> 00:04:01,690 the other words that appear in that image. 126 00:04:01,980 --> 00:04:02,390 I should say that there are some photo OCR systems 127 00:04:02,910 --> 00:04:04,350 that do even more complex things, 128 00:04:04,680 --> 00:04:06,370 like a bit of spelling correction at the end. 129 00:04:06,640 --> 00:04:08,470 So if, for example, your 130 00:04:08,710 --> 00:04:10,730 character segmentation and character 131 00:04:11,110 --> 00:04:12,450 classification system tells you 132 00:04:12,690 --> 00:04:14,390 that it sees the 133 00:04:14,530 --> 00:04:16,050 word c 1 e a 134 00:04:16,260 --> 00:04:17,930 n i n g,
then, 135 00:04:18,350 --> 00:04:19,570 you know, a sort of spelling 136 00:04:19,760 --> 00:04:21,910 correction system might tell 137 00:04:22,240 --> 00:04:23,270 you that this is probably the 138 00:04:23,350 --> 00:04:24,880 word 'cleaning', and your 139 00:04:25,340 --> 00:04:27,160 character classification algorithm had 140 00:04:27,310 --> 00:04:29,650 just mistaken the l for a 1. 141 00:04:30,370 --> 00:04:31,320 But for the purpose of what 142 00:04:31,640 --> 00:04:32,510 we want to do in 143 00:04:32,620 --> 00:04:33,980 this video, let's ignore this last 144 00:04:34,620 --> 00:04:35,780 step and just focus on the 145 00:04:36,110 --> 00:04:37,490 system that does these three 146 00:04:37,700 --> 00:04:39,340 steps of text detection, character 147 00:04:39,660 --> 00:04:41,040 segmentation, and character classification. 148 00:04:42,410 --> 00:04:43,790 A system like this is 149 00:04:44,080 --> 00:04:46,010 what we call a machine learning pipeline. 150 00:04:47,550 --> 00:04:49,220 In particular, here's a picture 151 00:04:49,950 --> 00:04:52,220 showing the photo OCR pipeline. 152 00:04:53,140 --> 00:04:54,200 We have an image, which is then 153 00:04:54,470 --> 00:04:57,590 fed to the text detection system to find the 154 00:04:57,970 --> 00:04:58,960 text regions; we then segment 155 00:04:59,420 --> 00:05:01,350 out the characters--the individual characters in 156 00:05:01,420 --> 00:05:04,360 the text--and then finally we recognize the individual characters.
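The optional spelling-correction step mentioned above (reading "c1eaning" as "cleaning") could look roughly like the sketch below. This is only an illustration under assumptions: the look-alike table `CONFUSABLE`, the tiny `VOCABULARY`, and the function name `correct_word` are all hypothetical and do not come from the lecture, and a real system would likely use a full dictionary and a more general error model.

```python
from itertools import product

# Hypothetical table of characters a classifier might confuse (digit -> letter).
CONFUSABLE = {"1": "l", "0": "o", "5": "s"}

# Tiny stand-in vocabulary; a real system would load a full dictionary.
VOCABULARY = {"cleaning", "antique", "mall"}


def correct_word(raw, vocabulary=VOCABULARY):
    """Return a vocabulary word reachable by swapping look-alike characters.

    If no such word exists, the raw classifier output is returned unchanged.
    """
    # At each position, allow the character itself or its look-alike letter.
    options = [
        (ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,)
        for ch in raw.lower()
    ]
    # Try every combination of substitutions and keep the first dictionary hit.
    for candidate in product(*options):
        word = "".join(candidate)
        if word in vocabulary:
            return word
    return raw


print(correct_word("c1eaning"))
```

With the assumed table and vocabulary, the classifier output "c1eaning" maps to "cleaning", while strings with no dictionary match are passed through untouched.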
157 00:05:05,730 --> 00:05:07,190 In many complex machine learning 158 00:05:07,800 --> 00:05:09,050 systems, these sorts of 159 00:05:09,490 --> 00:05:11,400 pipelines are common, where you 160 00:05:11,660 --> 00:05:13,450 can have multiple modules--in this 161 00:05:13,680 --> 00:05:14,960 example, the text detection, character 162 00:05:15,390 --> 00:05:17,820 segmentation, character recognition modules--each of 163 00:05:17,930 --> 00:05:19,170 which may be a machine learning component, 164 00:05:19,880 --> 00:05:20,740 or sometimes it may not 165 00:05:20,980 --> 00:05:22,660 be a machine learning component, but 166 00:05:22,810 --> 00:05:23,660 you have a set of modules 167 00:05:24,290 --> 00:05:26,050 that act one after another on 168 00:05:26,280 --> 00:05:27,780 some piece of data in order 169 00:05:28,100 --> 00:05:29,170 to produce the output you want, 170 00:05:29,640 --> 00:05:30,930 which in the photo OCR example 171 00:05:31,580 --> 00:05:32,690 is to find the 172 00:05:32,800 --> 00:05:35,050 transcription of the text that appeared in the image. 173 00:05:35,730 --> 00:05:37,370 If you're designing a machine learning 174 00:05:37,710 --> 00:05:39,090 system, one of the 175 00:05:39,200 --> 00:05:41,010 most important decisions will often 176 00:05:41,330 --> 00:05:44,350 be what exactly is the pipeline that you want to put together. 177 00:05:44,970 --> 00:05:46,010 In other words, given the photo 178 00:05:46,530 --> 00:05:47,930 OCR problem, how do you 179 00:05:47,990 --> 00:05:49,390 break this problem down into 180 00:05:49,770 --> 00:05:51,220 a sequence of different modules? 181 00:05:51,690 --> 00:05:53,060 How you design the pipeline, 182 00:05:53,820 --> 00:05:56,060 and the performance of each of the modules in your pipeline, 183 00:05:56,660 --> 00:05:57,610 will often have a big 184 00:05:57,710 --> 00:05:59,880 impact on the final performance of your algorithm.
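The idea of modules acting one after another on a piece of data can be sketched in code. The following is a minimal illustration, not the lecture's implementation: the `Box` type, the `crop` helper, and the function `run_photo_ocr` are hypothetical names, and the three stages are passed in as plain callables so that each one could be a learned model or a hand-written rule.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed image representation: a grayscale image as rows of pixel values.
Image = List[List[int]]


@dataclass
class Box:
    """Hypothetical rectangle: top-left corner plus width and height."""
    x: int
    y: int
    width: int
    height: int


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    rows = image[box.y:box.y + box.height]
    return [row[box.x:box.x + box.width] for row in rows]


def run_photo_ocr(
    image: Image,
    detect_text: Callable[[Image], List[Box]],
    segment_characters: Callable[[Image], List[Box]],
    classify_character: Callable[[Image], str],
) -> List[str]:
    """Chain the three stages: detection, segmentation, classification."""
    transcriptions = []
    # Stage 1: find the regions of the image that contain text.
    for region_box in detect_text(image):
        region = crop(image, region_box)
        # Stage 2: split each text region into individual characters.
        char_boxes = segment_characters(region)
        # Stage 3: classify each character patch and join the results.
        chars = [classify_character(crop(region, b)) for b in char_boxes]
        transcriptions.append("".join(chars))
    return transcriptions
```

Because each stage is just a function argument here, different people could work on `detect_text`, `segment_characters`, and `classify_character` separately, and any one of them can be swapped out or tested with a stub without touching the others.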
185 00:06:01,480 --> 00:06:02,330 If you have a team of 186 00:06:02,550 --> 00:06:03,610 engineers working on a 187 00:06:03,800 --> 00:06:05,150 problem like this, it's also very 188 00:06:05,460 --> 00:06:06,900 common to have different 189 00:06:07,340 --> 00:06:08,720 individuals work on different modules. 190 00:06:09,500 --> 00:06:11,480 So I could easily imagine text detection 191 00:06:12,140 --> 00:06:13,240 easily being the work of anywhere 192 00:06:13,670 --> 00:06:14,610 from 1 to 5 engineers, 193 00:06:15,460 --> 00:06:16,970 character segmentation maybe another 194 00:06:17,470 --> 00:06:19,010 1 to 5 engineers, and character 195 00:06:19,220 --> 00:06:20,550 recognition being another 1 to 5 196 00:06:21,670 --> 00:06:23,100 engineers, and so having a 197 00:06:23,340 --> 00:06:24,850 pipeline like this often offers a 198 00:06:25,280 --> 00:06:26,720 natural way to divide up 199 00:06:27,110 --> 00:06:30,370 the workload amongst different members of an engineering team, as well. 200 00:06:31,040 --> 00:06:31,970 Although, of course, all of 201 00:06:32,090 --> 00:06:33,210 this work could also be done 202 00:06:33,450 --> 00:06:35,910 by just one person if that's how you want to do it. 203 00:06:39,090 --> 00:06:40,370 In complex machine learning systems, 204 00:06:41,340 --> 00:06:42,700 the idea of a pipeline, of 205 00:06:42,870 --> 00:06:44,770 a machine learning pipeline, is pretty pervasive. 206 00:06:45,820 --> 00:06:47,070 And what you just saw is 207 00:06:47,400 --> 00:06:49,180 a specific example of how 208 00:06:49,440 --> 00:06:51,280 a Photo OCR pipeline might work. 209 00:06:52,230 --> 00:06:53,590 In the next few videos I'll 210 00:06:53,740 --> 00:06:54,590 tell you a little bit more 211 00:06:54,650 --> 00:06:55,780 about this pipeline, and we'll continue 212 00:06:56,290 --> 00:06:57,170 to use this as an example 213 00:06:58,120 --> 00:06:59,860 to illustrate--I think--a few more 214 00:07:00,280 --> 00:07:01,400 key concepts of machine learning.