1 00:00:00,000 --> 00:00:04,041 This video introduces the learning algorithm for a linear neuron. 2 00:00:04,041 --> 00:00:09,050 This is quite like the learning algorithm for a perceptron, but it achieves 3 00:00:09,050 --> 00:00:13,058 something different. In a perceptron, what's happening is the 4 00:00:13,058 --> 00:00:17,045 weights are always getting closer to a good set of weights. 5 00:00:17,045 --> 00:00:22,082 In a linear neuron, the outputs are always getting closer to the target outputs. 6 00:00:25,053 --> 00:00:30,081 The perceptron convergence procedure works by ensuring that when we change the 7 00:00:30,081 --> 00:00:33,095 weights, we get closer to a good set of weights. 8 00:00:33,095 --> 00:00:38,042 That type of guarantee cannot be extended to more complex networks. 9 00:00:38,042 --> 00:00:44,003 Because in more complex networks when you average two good sets of weights, you might 10 00:00:44,003 --> 00:00:48,034 get a bad set of weights. So for multilayer neural networks, we 11 00:00:48,034 --> 00:00:51,049 don't use the perceptron learning procedure. 12 00:00:51,049 --> 00:00:57,021 And to prove that when they're learning, something is improving, we don't use the 13 00:00:57,021 --> 00:01:01,057 same kind of proof at all. They should never have been called 14 00:01:01,057 --> 00:01:05,073 multilayer perceptrons. It's partly my fault and I'm sorry. 15 00:01:05,073 --> 00:01:10,065 For multilayer nets we're gonna need a different way to show that the learning 16 00:01:10,065 --> 00:01:14,075 procedure makes progress. Instead of showing that the weights get 17 00:01:14,075 --> 00:01:19,073 closer to a good set of weights, we're gonna show that the actual output values 18 00:01:19,073 --> 00:01:24,077 get closer to the target output values. This can be true even for non-convex 19 00:01:24,077 --> 00:01:29,064 problems in which averaging the weights of two good solutions does not give you a 20 00:01:29,064 --> 00:01:33,092 good solution.
It's not true for perceptron learning. 21 00:01:33,092 --> 00:01:39,057 In perceptron learning, the outputs as a whole can get further away from the target 22 00:01:39,057 --> 00:01:44,068 outputs even though the weights are getting closer to good sets of weights. 23 00:01:45,043 --> 00:01:50,085 The simplest example of learning in which you're making the outputs get closer to 24 00:01:50,085 --> 00:01:56,007 the target outputs is learning in a linear neuron with a squared error measure. 25 00:01:56,084 --> 00:02:01,074 Linear neurons, which are also called linear filters in electrical engineering, 26 00:02:01,074 --> 00:02:06,040 have a real-valued output that's simply the weighted sum of their inputs. 27 00:02:06,070 --> 00:02:13,034 So the output y, which is the neuron's estimate of the target value, is the sum 28 00:02:13,034 --> 00:02:18,071 over all the inputs i of a weight w_i times an input x_i. 29 00:02:18,071 --> 00:02:25,009 So we can write it in summation form or we can write it in vector notation. 30 00:02:26,075 --> 00:02:32,090 The aim of the learning is to minimize the error summed over all training cases. 31 00:02:33,080 --> 00:02:39,044 We need a measure of that error and to keep life simple, we use the squared 32 00:02:39,044 --> 00:02:43,093 difference between the target output and the actual output. 33 00:02:45,024 --> 00:02:48,092 So one question is, why don't we just solve it analytically? 34 00:02:48,092 --> 00:02:53,066 It's straightforward to write down a set of equations with one equation per 35 00:02:53,066 --> 00:02:57,009 training case, and to solve for the best set of weights. 36 00:02:57,036 --> 00:03:02,018 That's the standard engineering approach, and so why don't we use it?
37 00:03:02,086 --> 00:03:07,091 The first answer, and the scientific answer, is we'd like to understand what 38 00:03:07,091 --> 00:03:13,031 real neurons might be doing, and they're probably not solving a set of equations 39 00:03:13,031 --> 00:03:16,089 symbolically. An engineering answer is that we want a 40 00:03:16,089 --> 00:03:21,047 method that we can then generalize to multilayer, nonlinear networks. 41 00:03:21,081 --> 00:03:27,014 The analytic solution relies on it being linear and having a squared error measure. 42 00:03:27,014 --> 00:03:32,041 An iterative method, which we're gonna see next, is usually less efficient, but much 43 00:03:32,041 --> 00:03:35,030 easier to generalize to more complex systems. 44 00:03:36,050 --> 00:03:42,013 So I'm now gonna go through a toy example that illustrates an iterative method for 45 00:03:42,013 --> 00:03:47,063 finding the weights of a linear neuron. Suppose that every day, you get lunch at a 46 00:03:47,063 --> 00:03:51,003 cafeteria. And your diet consists entirely of fish, 47 00:03:51,003 --> 00:03:54,085 chips, and ketchup. Each day, you order several portions of 48 00:03:54,085 --> 00:03:58,041 each, but on different days, it's different numbers of portions. 49 00:03:58,041 --> 00:04:03,000 The cashier only shows you the total price of the meal, but after a few days, you 50 00:04:03,000 --> 00:04:07,071 ought to be able to figure out what the price is for each portion of each kind of 51 00:04:07,071 --> 00:04:13,030 thing. In the iterative approach, you start with 52 00:04:13,030 --> 00:04:19,007 random guesses for the prices of portions. And then you adjust these guesses so that 53 00:04:19,007 --> 00:04:23,034 you get a better fit to the prices that the cashier tells you. 54 00:04:23,034 --> 00:04:26,043 Those are the observed prices of whole meals. 
55 00:04:27,039 --> 00:04:32,095 So for each meal, you get a price and that gives you a linear constraint on the 56 00:04:32,095 --> 00:04:38,089 prices of the individual portions. It looks like this: the price of the whole 57 00:04:38,089 --> 00:04:44,094 meal is the number of portions of fish, x fish, times the cost of a portion of fish, 58 00:04:44,094 --> 00:04:48,007 w fish. And the same for chips and ketchup. 59 00:04:49,045 --> 00:04:53,009 So the prices of the portions are like the weights of a linear neuron. 60 00:04:53,009 --> 00:04:57,025 And we can think of the whole weight vector as being the price of a portion of 61 00:04:57,025 --> 00:05:01,019 fish, the price of a portion of chips, and the price of a portion of ketchup. 62 00:05:02,068 --> 00:05:06,086 We're going to start with guesses for these prices and then we're going to 63 00:05:06,086 --> 00:05:11,027 adjust the guesses slightly, so that we agree better with what the cashier says. 64 00:05:12,016 --> 00:05:18,007 So let's suppose that the true weights that the cashier is using to figure out the 65 00:05:18,007 --> 00:05:23,069 price are 150 for a portion of fish, 50 for a portion of chips, and 100 for a 66 00:05:23,069 --> 00:05:28,038 portion of ketchup. For the meal shown here, that will lead 67 00:05:28,038 --> 00:05:32,069 to a price of 850. So that's going to be our target value. 68 00:05:33,051 --> 00:05:39,042 Let's suppose that we start with guesses that each portion costs 50. 69 00:05:40,022 --> 00:05:45,025 So for the meal with two portions of fish, five of chips, and three of ketchup, we're 70 00:05:45,025 --> 00:05:48,053 going to initially think that the price should be 500. 71 00:05:48,053 --> 00:05:53,009 That gives us a residual error of 350. The residual error is the difference 72 00:05:53,009 --> 00:05:58,025 between what the cashier says and what we think the price should be with our current 73 00:05:58,025 --> 00:06:03,012 weights.
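The arithmetic of this meal can be checked with a short sketch. The numbers are the ones quoted in the lecture; the names `predict`, `portions`, `true_prices`, and `guess` are assumptions introduced here for illustration only.

```python
# Cafeteria example: the price of a meal is a weighted sum of portion counts.
# All identifiers are illustrative; only the numbers come from the lecture.
portions = [2, 5, 3]          # portions of fish, chips, ketchup
true_prices = [150, 50, 100]  # per-portion prices the cashier uses
guess = [50, 50, 50]          # initial guess: every portion costs 50


def predict(weights, inputs):
    """Output of a linear neuron: y = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, inputs))


target = predict(true_prices, portions)  # what the cashier charges
estimate = predict(guess, portions)      # what the current guess predicts
residual = target - estimate             # residual error: target minus estimate

print(target, estimate, residual)
```

Each meal contributes one such linear constraint on the three unknown prices.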
We're then gonna use the delta rule for 74 00:06:03,012 --> 00:06:09,055 revising our prices of portions. We make the change in a weight, delta w_i, 75 00:06:09,055 --> 00:06:15,562 be equal to a learning rate, epsilon, times the number of portions of the i-th thing, 76 00:06:15,562 --> 00:06:21,077 times the residual error. The difference between the target and our 77 00:06:21,077 --> 00:06:27,013 estimate. So if we make the learning rate be one 78 00:06:27,013 --> 00:06:34,094 over 35, so the maths stays simple, then the learning rate times the residual error 79 00:06:34,094 --> 00:06:42,000 for this particular example is ten. And so, our change in the weight for fish 80 00:06:42,000 --> 00:06:46,050 will be two times ten. We'll increase that weight by twenty. 81 00:06:46,050 --> 00:06:50,092 Our change in the weight for chips will be five times ten. 82 00:06:50,092 --> 00:06:55,087 And our change in the weight for ketchup will be three times ten. 83 00:06:56,062 --> 00:06:59,062 That'll give us new weights of 70, 100, and 80. 84 00:06:59,062 --> 00:07:03,003 And notice, the weight for chips actually got worse. 85 00:07:03,003 --> 00:07:08,044 There's no guarantee with this kind of learning that the individual weights will 86 00:07:08,044 --> 00:07:12,045 keep getting better. What's getting better is the difference 87 00:07:12,045 --> 00:07:15,060 between what the cashier says and our estimate. 88 00:07:17,013 --> 00:07:20,004 So now, we're going to derive the delta rule. 89 00:07:21,007 --> 00:07:26,072 We start by defining the error measure, which is simply our squared residual 90 00:07:26,072 --> 00:07:32,022 summed over all training cases. That is the squared difference between the 91 00:07:32,022 --> 00:07:37,043 target and what the neural net predicts. Or the linear neuron predicts. 92 00:07:37,043 --> 00:07:43,044 Squared, and summed over all training cases.
And we put a one-half in front, which will 93 00:07:43,044 --> 00:07:50,036 cancel the two when we differentiate. We now differentiate that error measure 94 00:07:50,036 --> 00:07:56,040 with respect to one of the weights, w_i. To do that differentiation we need to use 95 00:07:56,040 --> 00:07:59,086 the chain rule. The chain rule says that how the error 96 00:07:59,086 --> 00:08:04,085 changes as we change a weight, will be how the output changes as we change the 97 00:08:04,085 --> 00:08:08,062 weight, times how the error changes as we change the output. 98 00:08:08,062 --> 00:08:13,352 The chain rule is easy to remember, you just cancel those two dy's, but you can only 99 00:08:13,352 --> 00:08:16,093 do that when there's no mathematicians looking. 100 00:08:17,031 --> 00:08:22,087 The reason the first one, dy by dw_i, is written with a curly d is because it's a 101 00:08:22,087 --> 00:08:27,000 partial derivative. That is, there's many different weights 102 00:08:27,000 --> 00:08:32,012 you can change to change the output. And here, we're just considering the 103 00:08:32,012 --> 00:08:37,068 change to weight i. So, dy by dw_i is actually equal to x_i, 104 00:08:37,068 --> 00:08:45,099 and that's because y is just the sum of the w_i times x_i terms, and dE by dy is just minus (t minus y), because 105 00:08:45,099 --> 00:08:53,089 when we differentiate that t minus y squared, and use the half to cancel the 106 00:08:53,089 --> 00:09:01,045 two, we just get minus (t minus y). So our learning rule is now, we change the 107 00:09:01,045 --> 00:09:07,760 weights by an amount that's equal to the learning rate epsilon times the derivative 108 00:09:07,760 --> 00:09:12,001 of the error with respect to a weight, dE by dw_i. 109 00:09:12,001 --> 00:09:17,062 And with a minus sign in front cuz we want the error to go down. 110 00:09:17,062 --> 00:09:24,023 And that minus sign cancels the minus sign in the line above and we get that.
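The spoken derivation can be written out as a worked equation. This is a reconstruction from the description above, not a copy of the slide; the superscript n indexing training cases is a notational assumption.

```latex
\begin{align}
E &= \tfrac{1}{2} \sum_{n} \left(t^{n} - y^{n}\right)^{2},
  \qquad y^{n} = \sum_{i} w_{i}\, x_{i}^{n} \\
\frac{\partial E}{\partial w_{i}}
  &= \sum_{n} \frac{\partial y^{n}}{\partial w_{i}}\,
     \frac{d E^{n}}{d y^{n}}
   = -\sum_{n} x_{i}^{n} \left(t^{n} - y^{n}\right) \\
\Delta w_{i}
  &= -\varepsilon\, \frac{\partial E}{\partial w_{i}}
   = \sum_{n} \varepsilon\, x_{i}^{n} \left(t^{n} - y^{n}\right)
\end{align}
```

Here E^n denotes the contribution of training case n to E, so the minus sign from differentiating (t^n - y^n)^2 is the one cancelled by the minus sign in front of epsilon.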
111 00:09:24,023 --> 00:09:31,034 The change in a weight is the sum over all training cases of the learning rate times 112 00:09:31,034 --> 00:09:37,077 the input value times the difference between the target and actual outputs. 113 00:09:39,077 --> 00:09:44,052 Now we can ask how does this learning procedure, this delta rule, behave? 114 00:09:44,052 --> 00:09:48,013 Does it, for example, eventually get the right answer? 115 00:09:49,062 --> 00:09:53,098 There may be no perfect answer. It may be that we give the linear neuron a 116 00:09:53,098 --> 00:09:56,064 bunch of training cases with desired answers. 117 00:09:56,064 --> 00:10:00,029 And there's no set of weights that'll give the desired answer. 118 00:10:00,029 --> 00:10:05,001 There's still some set of weights that gets the best approximation on all those 119 00:10:05,001 --> 00:10:07,061 training cases, that minimizes that error measure, 120 00:10:07,061 --> 00:10:11,080 summed over all training cases. And if we make the learning rate small 121 00:10:11,080 --> 00:10:16,052 enough and we learn for long enough, we can get as close as we like to that best 122 00:10:16,052 --> 00:10:22,008 answer. Another question is, how quickly do we get 123 00:10:22,008 --> 00:10:27,021 towards the best answer? And even for a linear system, 124 00:10:27,021 --> 00:10:31,068 the learning can be quite slow in this kind of iterative learning. 125 00:10:31,068 --> 00:10:36,091 If two input dimensions are highly correlated, it's very hard to tell how much 126 00:10:36,091 --> 00:10:41,582 of the sum of the weights on both input dimensions should be attributed to each 127 00:10:41,582 --> 00:10:45,053 input dimension. So if, for example, you always get the same 128 00:10:45,053 --> 00:10:50,354 number of portions of ketchup and chips, we can't decide how much of the price 129 00:10:50,354 --> 00:10:53,678 is due to the ketchup and how much is due to the chips.
130 00:10:53,678 --> 00:10:58,702 And if they're almost always the same, it can take a long time for the learning to 131 00:10:58,702 --> 00:11:02,944 correctly attribute the price to the ketchup and the chips. 132 00:11:02,944 --> 00:11:08,035 There's an interesting relationship between the delta rule and the learning 133 00:11:08,035 --> 00:11:11,541 rule for perceptrons. So, if you use the online version of 134 00:11:11,541 --> 00:11:15,971 the delta rule, where we change the weights after each training case, it's quite 135 00:11:15,971 --> 00:11:20,501 similar to the perceptron learning rule. In perceptron learning, we increment or 136 00:11:20,501 --> 00:11:24,716 decrement the weight vector by the input vector, but we only change the weight 137 00:11:24,716 --> 00:11:28,666 vector when we make an error. In the online version of the delta rule, 138 00:11:28,666 --> 00:11:32,665 we increment or decrement the weight vector by the input vector. 139 00:11:32,665 --> 00:11:36,729 But we scale that by both the residual error and the learning rate. 140 00:11:36,729 --> 00:11:41,259 And one annoying thing about this is we have to choose a learning rate. 141 00:11:41,259 --> 00:11:46,248 If we choose a learning rate that's too big, the system will be unstable. 142 00:11:46,248 --> 00:11:51,599 And if we choose a learning rate that's too small, it will take an unnecessarily 143 00:11:51,599 --> 00:11:54,074 long time to learn a sensible set of weights.
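A minimal sketch of the online delta rule, assuming the cafeteria numbers from earlier in the lecture. The function names, the three extra meals, the too-large learning rate of 0.1, and the 200 passes over the data are all assumptions made up for illustration; only the first meal, the prices, the initial guess, and the learning rate of 1/35 come from the lecture.

```python
import math


def predict(weights, inputs):
    """Linear neuron output: the weighted sum of the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def delta_rule_step(weights, inputs, target, learning_rate):
    """One online update: w_i <- w_i + epsilon * x_i * (t - y)."""
    residual = target - predict(weights, inputs)
    return [w + learning_rate * x * residual for w, x in zip(weights, inputs)]


def distance(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


true_prices = [150.0, 50.0, 100.0]   # fish, chips, ketchup (from the lecture)
start = [50.0, 50.0, 50.0]           # initial guess (from the lecture)
meal = [2, 5, 3]                     # the lecture's meal
target = predict(true_prices, meal)  # price quoted by the cashier

# The single update worked through in the lecture (epsilon = 1/35).
one_step = delta_rule_step(start, meal, target, 1 / 35)


def residual_history(learning_rate, steps):
    """Residuals seen while training repeatedly on the one lecture meal."""
    weights = list(start)
    history = []
    for _ in range(steps):
        history.append(target - predict(weights, meal))
        weights = delta_rule_step(weights, meal, target, learning_rate)
    return history


small_rate = residual_history(1 / 35, 4)  # residual magnitude shrinks
large_rate = residual_history(0.1, 4)     # assumed too-big rate: it grows

# Online training over several meals; the last three meals are hypothetical.
meals = [[2, 5, 3], [1, 1, 1], [3, 2, 0], [0, 4, 2]]
weights = list(start)
for _ in range(200):
    for m in meals:
        weights = delta_rule_step(weights, m, predict(true_prices, m), 1 / 35)

print(one_step, small_rate, large_rate, distance(weights, true_prices))
```

Comparing `small_rate` with `large_rate` illustrates the instability remark: on this one meal each update multiplies the residual by one minus the learning rate times the squared length of the input vector, so a rate that is too big makes the residual overshoot and grow.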