1 00:00:00,000 --> 00:00:03,075 Hello. Welcome to the Coursera course on Neural 2 00:00:03,075 --> 00:00:09,006 Networks for Machine Learning. Before we get into the details of neural 3 00:00:09,006 --> 00:00:14,004 network learning algorithms, I want to talk a little bit about machine learning, 4 00:00:14,004 --> 00:00:19,015 why we need machine learning, the kinds of things we use it for, and show you some 5 00:00:19,015 --> 00:00:23,087 examples of what it can do. So the reason we need machine learning is 6 00:00:23,087 --> 00:00:29,010 that there are some problems where it's very hard to write the programs. Recognizing a three- 7 00:00:29,010 --> 00:00:33,059 dimensional object, for example, when it's from a novel viewpoint in new 8 00:00:33,059 --> 00:00:37,026 lighting conditions in a cluttered scene, is very hard to do. 9 00:00:37,026 --> 00:00:42,018 We don't know what program to write because we don't know how it's done in our 10 00:00:42,018 --> 00:00:45,005 brain. And even if we did know what program to 11 00:00:45,005 --> 00:00:49,010 write, it might be that it was a horrendously complicated program. 12 00:00:50,029 --> 00:00:55,083 Another example is detecting a fraudulent credit card transaction, where there may 13 00:00:55,083 --> 00:01:00,014 not be any nice, simple rules that will tell you it's fraudulent. 14 00:01:00,014 --> 00:01:05,014 You really need to combine a very large number of not very reliable rules. 15 00:01:05,014 --> 00:01:10,060 And also, those rules change over time because people change the tricks they use 16 00:01:10,060 --> 00:01:13,084 for fraud. So, we need a complicated program that 17 00:01:13,084 --> 00:01:17,062 combines unreliable rules, and that we can change easily.
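A minimal sketch of what "a program that combines unreliable rules" might look like is given here. It is not from the lecture: the three rules, the transaction fields, the weights, and the bias are all hypothetical and chosen by hand purely for illustration. Each rule contributes a small amount of evidence, and the weighted sum is squashed into a score between 0 and 1.

```python
import math

# Hypothetical weak rules: each returns True when it sees some evidence of fraud.
def unusually_large(tx):
    return tx["amount"] > 1000

def foreign_country(tx):
    return tx["country"] != tx["home_country"]

def night_time(tx):
    return tx["hour"] < 6

RULES = [unusually_large, foreign_country, night_time]
WEIGHTS = [1.5, 1.0, 0.5]  # illustrative values, not learned from data
BIAS = -2.0                # illustrative: no evidence means a low score


def fraud_score(tx, weights=WEIGHTS, bias=BIAS):
    """Add up the weights of the rules that fire, then apply a logistic squash."""
    total = bias + sum(w for rule, w in zip(RULES, weights) if rule(tx))
    return 1.0 / (1.0 + math.exp(-total))


suspicious = {"amount": 5000, "country": "FR", "home_country": "CA", "hour": 3}
ordinary = {"amount": 20, "country": "CA", "home_country": "CA", "hour": 14}
print(fraud_score(suspicious), fraud_score(ordinary))
```

Changing such a program easily then amounts to changing the numbers in `WEIGHTS` and `BIAS` rather than rewriting the rules.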
18 00:01:18,087 --> 00:01:24,027 The machine learning approach is to say, instead of writing each program by hand 19 00:01:24,027 --> 00:01:29,040 for each specific task, for a particular task we collect a lot of examples that 20 00:01:29,040 --> 00:01:32,029 specify the correct output for a given input. 21 00:01:32,062 --> 00:01:37,080 A machine learning algorithm then takes these examples and produces a program that 22 00:01:37,080 --> 00:01:41,029 does the job. The program produced by the learning 23 00:01:41,029 --> 00:01:45,035 algorithm may look very different from the typical handwritten program. 24 00:01:45,035 --> 00:01:49,093 For example, it might contain millions of numbers about how you weight different 25 00:01:49,093 --> 00:01:54,014 kinds of evidence. If we do it right, the program should work 26 00:01:54,014 --> 00:01:57,004 for new cases just as well as the ones it's trained on. 27 00:01:57,051 --> 00:02:03,047 And if the data changes, we should be able to change the program too, very easily, by 28 00:02:03,047 --> 00:02:09,627 retraining it on the new data. And now massive amounts of computation 29 00:02:09,627 --> 00:02:14,084 are cheaper than paying someone to write a program for a specific task, so we can 30 00:02:14,084 --> 00:02:20,000 afford big, complicated machine learning programs to produce these task- 31 00:02:20,000 --> 00:02:26,023 specific systems for us. Some examples of the things that are best 32 00:02:26,023 --> 00:02:32,050 done by using a learning algorithm are recognizing patterns, so for example 33 00:02:32,050 --> 00:02:38,095 objects in real scenes, or the identities or expressions of people's faces, or 34 00:02:38,095 --> 00:02:42,053 spoken words. There's also recognizing anomalies. 35 00:02:42,053 --> 00:02:46,084 So, an unusual sequence of credit card transactions would be an anomaly.
36 00:02:47,002 --> 00:02:51,098 Another example of an anomaly would be an unusual pattern of sensor readings in a 37 00:02:51,098 --> 00:02:55,062 nuclear power plant. And you wouldn't really want to have to 38 00:02:55,062 --> 00:02:58,034 deal with those by doing supervised learning, 39 00:02:58,034 --> 00:03:03,025 where you look at the ones that blow up and see what caused them to blow up. 40 00:03:03,025 --> 00:03:07,067 You'd really like to recognize that something funny is happening without 41 00:03:07,067 --> 00:03:11,097 having any supervision signal. It's just not behaving in its normal way. 42 00:03:12,059 --> 00:03:16,047 And then there's prediction. So, typically, predicting future stock 43 00:03:16,047 --> 00:03:21,333 prices or currency exchange rates, or predicting which movies a person will like 44 00:03:21,333 --> 00:03:25,812 from knowing which other movies they like and which movies a lot of other people 45 00:03:25,812 --> 00:03:31,226 liked. So in this course I'm going to use a standard 46 00:03:31,226 --> 00:03:36,306 example for explaining a lot of the machine learning algorithms. 47 00:03:36,306 --> 00:03:41,669 This is done in a lot of science. In genetics, for example, a lot of genetics 48 00:03:41,669 --> 00:03:45,809 is done on fruit flies. And the reason is they're convenient. 49 00:03:45,809 --> 00:03:51,760 They breed fast and a lot is already known about the genetics of fruit flies. 50 00:03:51,760 --> 00:03:58,840 The MNIST database of handwritten digits is the machine learning equivalent of fruit flies. 51 00:03:58,840 --> 00:04:04,573 It's publicly available. We can get machine learning algorithms to 52 00:04:04,573 --> 00:04:09,769 learn how to recognize these handwritten digits quite quickly, so it's easy to try 53 00:04:09,769 --> 00:04:13,500 lots of variations. And we know huge amounts about how well 54 00:04:13,500 --> 00:04:16,425 different machine learning methods do on MNIST.
55 00:04:16,425 --> 00:04:21,036 And in particular, the different machine learning methods were implemented by 56 00:04:21,036 --> 00:04:24,492 people who believed in them, so we can rely on those results. 57 00:04:24,492 --> 00:04:29,395 So for all those reasons, we're going to use MNIST as our standard task. 58 00:04:29,395 --> 00:04:33,499 Here's an example of some of the digits in MNIST. 59 00:04:33,499 --> 00:04:38,566 These are ones that were correctly recognized by a neural net the first time it 60 00:04:38,566 --> 00:04:42,958 saw them. But they're ones where the neural net wasn't 61 00:04:42,958 --> 00:04:45,819 very confident. And you can see why. 62 00:04:45,819 --> 00:04:50,205 I've arranged these digits in standard scan line order. 63 00:04:50,205 --> 00:04:57,163 So zeros, then ones, then twos and so on. If you look at a bunch of twos like the 64 00:04:57,163 --> 00:05:02,025 ones in the green rectangle, you can see that if you knew they were a handwritten 65 00:05:02,025 --> 00:05:04,086 digit you'd probably guess they were twos. 66 00:05:04,086 --> 00:05:08,038 But it's very hard to say what it is that makes them twos. 67 00:05:08,038 --> 00:05:11,046 There's nothing simple that they all have in common. 68 00:05:11,046 --> 00:05:16,019 In particular, if you try and overlay one on another, you'll see it doesn't fit. 69 00:05:16,019 --> 00:05:21,021 And even if you skew it a bit, it's very hard to make them overlay on each other. 70 00:05:21,021 --> 00:05:25,087 So a template isn't going to do the job. And in particular, a template is going to be 71 00:05:25,087 --> 00:05:30,090 very hard to find that will fit those twos in the green box and would also fit the 72 00:05:30,090 --> 00:05:35,074 things in the red boxes. So that's one thing that makes recognizing 73 00:05:35,074 --> 00:05:38,075 handwritten digits a good task for machine learning. 74 00:05:39,062 --> 00:05:43,076 Now, I don't want you to think that's the only thing we can do.
75 00:05:43,096 --> 00:05:48,043 It's a relatively simple thing for our machine learning system to do now. 76 00:05:48,043 --> 00:05:53,078 And to motivate the rest of the course, I want to show you some examples of much 77 00:05:53,078 --> 00:05:57,039 more difficult things. So we now have neural nets with 78 00:05:57,059 --> 00:06:02,087 approaching a hundred million parameters in them, that can recognize a thousand 79 00:06:02,087 --> 00:06:08,028 different object classes in 1.3 million high-resolution training images got from 80 00:06:08,028 --> 00:06:12,006 the web. So, there was a competition in 2010, and 81 00:06:12,006 --> 00:06:17,001 the best system got a 47 percent error rate if you look at its first choice, and a 25 82 00:06:17,001 --> 00:06:21,089 percent error rate if you say it got it right if it was in its top five choices, 83 00:06:21,089 --> 00:06:24,087 which isn't bad for 1,000 different objects. 84 00:06:25,008 --> 00:06:30,070 Jitendra Malik, who's an eminent neural net skeptic and a leading computer vision 85 00:06:30,070 --> 00:06:36,046 researcher, has said that this competition is a good test of whether deep neural 86 00:06:36,046 --> 00:06:39,066 networks can work well for object recognition. 87 00:06:39,066 --> 00:06:44,068 And a very deep neural network can now do considerably better than the thing that 88 00:06:44,068 --> 00:06:48,000 won the competition. It can get less than 40 percent error for 89 00:06:48,000 --> 00:06:52,023 its first choice, and less than 20 percent error for its top five choices. 90 00:06:52,023 --> 00:06:55,060 I'll describe that in much more detail in lecture five. 91 00:06:55,060 --> 00:06:59,065 Here are some examples of the kinds of images you have to recognize. 92 00:06:59,065 --> 00:07:03,026 These are images from the test set that it's never seen before. 93 00:07:03,026 --> 00:07:08,062 And below the examples, I'm showing you what the neural net thought the right 94 00:07:08,062 --> 00:07:12,030 answer was.
Where the length of the horizontal bar is 95 00:07:12,030 --> 00:07:16,006 how confident it was, and the correct answer is in red. 96 00:07:16,006 --> 00:07:20,061 So if you look in the middle, it correctly identified that as a snow plow. 97 00:07:20,061 --> 00:07:23,086 But you can see that its other choices are fairly sensible. 98 00:07:23,086 --> 00:07:26,067 It does look a little bit like a drilling platform. 99 00:07:26,067 --> 00:07:30,091 And if you look at its third choice, a lifeboat, it actually looks very like a 100 00:07:30,091 --> 00:07:33,067 lifeboat. You can see the flag on the front of the 101 00:07:33,067 --> 00:07:38,018 boat and the bridge of the boat and the flag at the back, and the high surf in the 102 00:07:38,018 --> 00:07:41,011 background. So its errors tell you a lot about 103 00:07:41,011 --> 00:07:43,097 how it's doing it, and they're very plausible errors. 104 00:07:43,097 --> 00:07:48,049 If you look on the left, it gets it wrong, possibly because the beak of the bird is 105 00:07:48,049 --> 00:07:52,475 missing and because the feathers of the bird look very like the wet fur of an otter. 106 00:07:52,475 --> 00:07:56,027 But it gets it in its top five, and it does better than me. 107 00:07:56,027 --> 00:07:59,853 I wouldn't know if that was a quail or a ruffed grouse or a partridge. 108 00:07:59,853 --> 00:08:03,214 If you look on the right, it gets it completely wrong. 109 00:08:03,214 --> 00:08:07,827 It's a guillotine. You can possibly see why it says 110 00:08:07,827 --> 00:08:12,430 orangutan, because of the sort of jungle-looking background and something orange in 111 00:08:12,430 --> 00:08:15,449 the middle. But it fails to get the right answer. 112 00:08:15,449 --> 00:08:19,286 It can, however, deal with a wide range of different objects. 113 00:08:19,286 --> 00:08:23,888 If you look on the left, I would have said microwave as my first answer.
114 00:08:23,888 --> 00:08:28,225 The labels aren't very systematic. So actually, the correct answer there is 115 00:08:28,225 --> 00:08:30,955 electric range. And it does get it in its top five. 116 00:08:30,955 --> 00:08:34,822 In the middle, it's getting a turnstile, which is a distributed object. 117 00:08:34,822 --> 00:08:38,661 So it can do more than just recognize compact things. 118 00:08:38,661 --> 00:08:43,699 And it can also deal with pictures, as well as real scenes, like the bulletproof 119 00:08:43,699 --> 00:08:46,959 vest. And it makes some very cool errors. 120 00:08:46,959 --> 00:08:49,976 If you look at the image on the left, that's an earphone. 121 00:08:49,976 --> 00:08:54,101 It doesn't get anything like an earphone. But if you look at its fourth bet, it 122 00:08:54,101 --> 00:08:57,316 thinks it's an ant. And you might think that's crazy. 123 00:08:57,316 --> 00:09:01,581 But then if you look at it carefully, you can see it's a view of an ant from 124 00:09:01,581 --> 00:09:04,350 underneath. The eyes are looking down at you, and you 125 00:09:04,350 --> 00:09:08,698 can see the antennae behind it. It's not the kind of view of an ant you'd 126 00:09:08,698 --> 00:09:12,777 like to have if you were a greenfly. If you look at the one on the right, it 127 00:09:12,777 --> 00:09:16,547 doesn't get the right answer. But all of its answers are cylindrical 128 00:09:16,547 --> 00:09:22,002 objects. Another task that neural nets are now very 129 00:09:22,002 --> 00:09:27,441 good at is speech recognition, or at least part of a speech recognition 130 00:09:27,441 --> 00:09:30,643 system. So speech recognition systems have several 131 00:09:30,643 --> 00:09:34,051 stages. First they pre-process the sound wave to 132 00:09:34,051 --> 00:09:39,916 get a vector of acoustic coefficients for each ten milliseconds of sound wave. 133 00:09:39,916 --> 00:09:43,638 And so they get 100 of those vectors per second.
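The frame rate mentioned above can be checked with a small sketch. This is an illustration only, not the pre-processing used by any real speech system: the 16 kHz sample rate is an assumption, the function name `frame_signal` is hypothetical, and the step that turns each frame into a vector of acoustic coefficients is left out entirely. The code just slices a waveform into consecutive ten-millisecond pieces and counts them.

```python
def frame_signal(samples, sample_rate, frame_ms=10):
    """Split a list of samples into consecutive, non-overlapping frames.

    Any leftover samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


SAMPLE_RATE = 16000                  # assumed sample rate in Hz
one_second = [0.0] * SAMPLE_RATE     # one second of silence as stand-in audio
frames = frame_signal(one_second, SAMPLE_RATE)
print(len(frames), len(frames[0]))   # number of frames, samples per frame
```

With one frame per ten milliseconds, one second of audio gives 100 frames regardless of the sample rate; only the number of samples inside each frame changes.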
134 00:09:43,638 --> 00:09:49,418 They then take a few adjacent vectors of acoustic coefficients, and they need to 135 00:09:49,418 --> 00:09:52,965 place bets on which part of which phoneme is being spoken. 136 00:09:52,965 --> 00:09:57,894 So they look at this little window and they say, in the middle of this window, 137 00:09:57,894 --> 00:10:01,889 what do I think the phoneme is, and which part of the phoneme is it? 138 00:10:01,889 --> 00:10:06,507 And a good speech recognition system will have many alternative models for a 139 00:10:06,507 --> 00:10:09,131 phoneme. And each model might have three 140 00:10:09,131 --> 00:10:12,341 different parts. So it might have many thousands of 141 00:10:12,341 --> 00:10:15,609 alternative fragments that it thinks this might be. 142 00:10:15,609 --> 00:10:20,075 And you have to place bets on all those thousands of alternatives. 143 00:10:20,075 --> 00:10:26,171 And then once you place those bets, you have a decoding stage that does the best 144 00:10:26,171 --> 00:10:32,211 job it can of using plausible bets, but piecing them together into a sequence of 145 00:10:32,211 --> 00:10:37,641 bets that corresponds to the kinds of things that people say. 146 00:10:37,641 --> 00:10:44,094 Currently, deep neural networks pioneered by George Dahl and Abdel-rahman Mohamed 147 00:10:44,094 --> 00:10:48,410 of the University of Toronto are doing better than previous machine learning 148 00:10:48,410 --> 00:10:52,783 methods for the acoustic model, and they're now beginning to be used in 149 00:10:52,783 --> 00:10:58,529 practical systems. So, Dahl and Mohamed developed a system 150 00:10:58,529 --> 00:11:05,214 that uses many layers of binary neurons to take some acoustic frames and make 151 00:11:05,214 --> 00:11:09,986 bets about the labels. They were doing it on a fairly small 152 00:11:09,986 --> 00:11:13,656 database and used 183 alternative labels.
153 00:11:13,656 --> 00:11:20,094 And to get their system to work well, they did some pre-training, which will be 154 00:11:20,094 --> 00:11:23,825 described in the second half of the course. 155 00:11:23,825 --> 00:11:30,471 After standard post-processing, they got a 20.7 percent error rate on a very standard 156 00:11:30,471 --> 00:11:34,154 benchmark, which is kind of like the MNIST for speech. 157 00:11:34,154 --> 00:11:39,704 The best previous result on that benchmark for speaker-independent recognition was 158 00:11:39,704 --> 00:11:43,467 24.4 percent. And a very experienced speech researcher 159 00:11:43,467 --> 00:11:49,369 at Microsoft Research realized that that was a big enough improvement that 160 00:11:49,369 --> 00:11:54,698 probably this would change the way speech recognition systems were done. 161 00:11:54,698 --> 00:11:58,951 And indeed, it has. So, if you look at recent results from 162 00:11:58,951 --> 00:12:04,811 several different leading speech groups, Microsoft showed that this kind of deep 163 00:12:04,811 --> 00:12:09,651 neural network, when used as the acoustic model in the speech system, 164 00:12:09,651 --> 00:12:14,927 reduced the error rate from 27.4 percent to 18.5 percent, or alternatively, you could view 165 00:12:14,927 --> 00:12:21,018 it as reducing the amount of training data you needed from 2,000 hours down to 309 166 00:12:21,018 --> 00:12:26,814 hours to get comparable performance. IBM, which has the best system for one of 167 00:12:26,814 --> 00:12:33,058 the standard speech recognition tasks for large-vocabulary speech recognition, showed 168 00:12:33,058 --> 00:12:38,297 that even its very highly tuned system that was getting 18.8 percent can be 169 00:12:38,297 --> 00:12:41,613 beaten by one of these deep neural networks. 170 00:12:41,613 --> 00:12:46,768 And Google, fairly recently, trained a deep neural network on a large amount of 171 00:12:46,768 --> 00:12:51,301 speech, 5,800 hours.
That was still much less than they trained 172 00:12:51,301 --> 00:12:55,769 their mixture model on. But even with much less data, it did a lot 173 00:12:55,769 --> 00:12:58,708 better than the technology they had before. 174 00:12:58,708 --> 00:13:03,291 So it reduced the error rate from 16 percent to 12.3 percent, and the error rate 175 00:13:03,291 --> 00:13:07,284 is still falling. And in the latest Android, if you do voice 176 00:13:07,284 --> 00:13:12,770 search, it's using one of these deep neural networks in order to do very good 177 00:13:12,770 --> 00:13:14,017 speech recognition.
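To put the speech error rates quoted in this lecture on a common footing, the short sketch below computes the relative error reduction for each pair of numbers. The error rates are the ones stated in the lecture; the function name `relative_reduction` and the dictionary labels are illustrative and not from the lecture.

```python
def relative_reduction(old_error, new_error):
    """Fraction of the old error rate that the new system removes."""
    return (old_error - new_error) / old_error


# (previous error %, deep neural network error %) as quoted in the lecture
quoted_results = {
    "speaker-independent benchmark": (24.4, 20.7),
    "Microsoft": (27.4, 18.5),
    "Google": (16.0, 12.3),
}

for name, (old, new) in quoted_results.items():
    pct = 100 * relative_reduction(old, new)
    print(f"{name}: {old}% -> {new}% ({pct:.1f}% relative reduction)")
```

Under this measure the three results correspond to roughly 15, 32, and 23 percent fewer errors than the earlier systems.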