In this video, I'm gonna talk about perceptrons. These were investigated in the early 1960s, and initially they looked very promising as learning devices. But then they fell into disfavor, because Minsky and Papert showed they were rather restricted in what they could learn to do.

In statistical pattern recognition, there's a standard way to recognize patterns. We first take the raw input and convert it into a set, or vector, of feature activations. We do this using hand-written programs which are based on common sense, so that part of the system does not learn. We look at the problem and decide what the good features should be. We try some features to see if they work or don't work, we try some more features, and eventually we find a set of features that allows us to solve the problem by using a subsequent learning stage.

What we learn is how to weight each of the feature activations, in order to get a single scalar quantity. So the weights on the features represent how much evidence the feature gives you, in favor of or against the hypothesis that the current input is an example of the kind of pattern you want to recognize. When we add up all the weighted features, we get a sort of total evidence in favor of the hypothesis that this is the kind of pattern we want to recognize. And if that evidence is above some threshold, we decide that the input vector is a positive example of the class of patterns we're trying to recognize.

A perceptron is a particular example of a statistical pattern recognition system. There are actually many different kinds of perceptrons, but the standard kind, which Rosenblatt called an alpha perceptron, consists of some inputs which are then converted into feature activities. They might be converted by things that look a bit like neurons, but that stage of the system does not learn. Once you've got the activities of the features, you then learn some weights, so that you can take the feature activities times the weights and decide whether or not it's an example of the class you're interested in, by seeing whether that sum of feature activities times learned weights is greater than a threshold.
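To make that concrete, here is a minimal sketch of that paradigm in Python. The three feature extractors are hypothetical stand-ins invented for illustration; only the weights and the threshold would be learned.

```python
import numpy as np

def extract_features(raw_input):
    # Hand-written, common-sense features: this stage does not learn.
    # These particular features are made up for the sake of the example.
    return np.array([
        raw_input.sum(),           # total intensity of all pixels
        raw_input.max(),           # brightest pixel
        (raw_input > 0.5).mean(),  # fraction of bright pixels
    ])

def recognize(raw_input, weights, threshold):
    # The learned stage: weight each feature activation to get a single
    # scalar quantity of total evidence, then compare it to a threshold.
    evidence = np.dot(weights, extract_features(raw_input))
    return evidence > threshold  # True means "positive example of the class"
```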
Perceptrons have an interesting history. They were popularized in the early 1960s by Frank Rosenblatt. He wrote a great big book called Principles of Neurodynamics, in which he described many different kinds of perceptrons, and that book was full of ideas. The most important thing in the book was a very powerful learning algorithm, or something that appeared to be a very powerful learning algorithm.

A lot of grand claims were made for what perceptrons could do using this learning algorithm. For example, people claimed they could tell the difference between pictures of tanks and pictures of trucks, even if the tanks and trucks were partially obscured in a forest. Now, some of those claims turned out to be false. In the case of the tanks and the trucks, it turned out the pictures of the tanks were taken on a sunny day and the pictures of the trucks were taken on a cloudy day. All the perceptron was doing was measuring the total intensity of all the pixels. That's something we humans are fairly insensitive to; we notice the things in the picture. But a perceptron can easily learn to add up the total intensity. That's the kind of thing that gives an algorithm a bad name.

In 1969, Minsky and Papert published a book called Perceptrons that analyzed what perceptrons could do and showed their limitations. Many people thought those limitations applied to all neural network models, and the general feeling within artificial intelligence was that Minsky and Papert had shown that neural network models were nonsense, or that they couldn't learn difficult things. Minsky and Papert themselves knew that they hadn't shown that. They had just shown that perceptrons of the kind to which the powerful learning algorithm applied could not do a lot of things, or rather, that they couldn't do them by learning. They could do them if you hand-wired the answer into the inputs, but not by learning. But that result got wildly overgeneralized, and when I started working on neural network models in the 1970s, people in artificial intelligence kept telling me that Minsky and Papert had proved that these models were no good.

Actually, the perceptron convergence procedure, which we'll see in a minute, is still widely used today for tasks that have very big feature vectors. Google, for example, uses it to predict things from very big vectors of features.

The decision unit in a perceptron is a binary threshold neuron. We've seen these before, but just to refresh you: they compute a weighted sum of the inputs they get from other neurons, and they add on a bias to get their total input. Then they give an output of one if that total input exceeds zero, and an output of zero otherwise.
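As a quick sketch (the function and variable names here are mine), that decision unit looks like this:

```python
import numpy as np

def binary_threshold_neuron(inputs, weights, bias):
    # Weighted sum of the inputs from other neurons, plus a bias,
    # gives the total input.
    total_input = np.dot(weights, inputs) + bias
    # Output 1 if the total input exceeds zero, 0 otherwise.
    return 1 if total_input > 0 else 0
```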
We don't want to have to have a separate learning rule for learning biases, and it turns out we can treat biases just like weights. We take every input vector and stick a one on the front of it, and we treat the bias as the weight on that first component, which always has a value of one. So the bias is just the negative of the threshold. Using this trick, we don't need a separate learning rule for the bias; learning it is exactly equivalent to learning a weight on this extra input line.

So here's the very powerful learning procedure for perceptrons. It's a learning procedure that's guaranteed to work, which is a nice property to have. Of course, you have to look at the small print later, about why that guarantee isn't quite as good as you think it is.

First, we add this extra component with a value of one to every input vector; now we can forget about the biases. Then we keep picking training cases, using any policy we like, as long as we ensure that every training case gets picked without waiting too long. I'm not gonna define precisely what I mean by that; if you're a mathematician, you could think about what might be a good definition.

Having picked a training case, you look to see if the output is correct. If it is correct, you don't change the weights. If the output unit outputs a zero when it should have output a one (in other words, it said the input is not an instance of the pattern we're trying to recognize when it really is), then all we do is add the input vector to the weight vector of the perceptron. Conversely, if the output unit outputs a one when it should have output a zero, we subtract the input vector from the weight vector of the perceptron.

What's surprising is that this simple learning procedure is guaranteed to find you a set of weights that gets the right answer for every training case. The proviso is that it can only do so if there is such a set of weights, and for many interesting problems there is no such set of weights.
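Here is a compact sketch of that procedure in Python. The fixed number of sweeps through the training set is my own choice of picking policy; it satisfies the requirement that every case keeps getting picked, and for small problems it is more than enough.

```python
import numpy as np

def train_perceptron(inputs, targets, sweeps=100):
    # inputs:  shape (num_cases, num_features); targets: 0/1 labels.
    # Bias trick: stick a one on the front of every input vector, so the
    # bias is just the weight on that always-on component.
    x = np.hstack([np.ones((len(inputs), 1)), np.asarray(inputs, dtype=float)])
    weights = np.zeros(x.shape[1])

    for _ in range(sweeps):
        for xi, target in zip(x, targets):
            output = 1 if np.dot(weights, xi) > 0 else 0
            if output == target:
                pass              # correct output: leave the weights alone
            elif target == 1:
                weights += xi     # said 0, should have said 1: add the input
            else:
                weights -= xi     # said 1, should have said 0: subtract it
    return weights
```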
Whether or not a set of weights exists depends very much on what features you use. So it turns out that for many problems, the difficult bit is deciding what features to use. If you're using the appropriate features, learning may become easy. If you're not using the right features, learning becomes impossible, and all the work is in deciding the features.
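The lecture doesn't name a specific example here, but the classic illustration, often associated with Minsky and Papert's analysis, is XOR: no set of weights over the two raw inputs gets all four cases right, yet adding one hand-designed feature, the product of the inputs, makes the classes linearly separable, and the same learning procedure then succeeds. A sketch, reusing the train_perceptron function above:

```python
import numpy as np

raw = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 1, 1, 0])  # XOR: not learnable from the raw inputs

# Hand-designed extra feature x1*x2 makes the classes linearly separable.
x = np.hstack([raw, (raw[:, 0] * raw[:, 1]).reshape(-1, 1)])

w = train_perceptron(x, targets)
outputs = [1 if np.dot(w, np.concatenate(([1.0], xi))) > 0 else 0
           for xi in x.astype(float)]
print(outputs)  # [0, 1, 1, 0], matching the targets
```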