In this video, we're going to look at a proof that the perceptron learning procedure will eventually get the weights into the cone of feasible solutions. I don't want you to get the wrong idea about the course from this video. In general, it's gonna be about engineering, not about proofs of things. There'll be very few proofs in the course. But we get to understand quite a lot more about perceptrons when we try to prove that they will eventually get the right answer.

So we're going to use our geometric understanding of what's happening in weight space as the perceptron learns, to get a proof that the perceptron will eventually find a weight vector that gets the right answer for all of the training cases, if any such vector exists. And our proof is gonna assume that there is a vector that gets the right answer for all training cases. We'll call that a feasible vector. An example of a feasible vector is shown by the green dot in the diagram.

So we start with a weight vector that's getting some of the training cases wrong, and in the diagram we've shown a training case that it's getting wrong. And what we want to show, and this is the idea for the proof, is that every time it gets a training case wrong, it will update the current weight vector in a way that makes it closer to every feasible weight vector.

We can represent the squared distance of the current weight vector from a feasible weight vector as the sum of a squared distance along the line of the input vector that defines the training case, and another squared distance orthogonal to that line. The orthogonal squared distance won't change, and the squared distance along the line of the input vector will get smaller. So our hopeful claim is that every time the perceptron makes a mistake, our current weight vector is going to get closer to all feasible weight vectors.

Now this is almost right, but there's an unfortunate problem. If you look at the feasible weight vector in gold, it's just on the right side of the plane that defines one of the training cases. And the current weight vector is just on the wrong side, and the input vector is quite big. So when we add the input vector to the current weight vector, we actually get further away from that gold feasible weight vector. So our hopeful claim doesn't work, but we can fix it up so that it does.
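To see in symbols where the hopeful claim goes wrong, here is a sketch in my own notation (not from the lecture): w is the current weight vector, w* a feasible one, x the input vector of the misclassified case, and I assume the update for a missed positive case is w ← w + x.

```latex
% Decompose the squared distance from w to w* into a component
% along the input vector x and a component orthogonal to it:
\|\mathbf{w}^* - \mathbf{w}\|^2
  = \left(\frac{\mathbf{x}^\top(\mathbf{w}^* - \mathbf{w})}{\|\mathbf{x}\|}\right)^{\!2} + d_\perp^2 .
% The update w <- w + x only moves w along x, so d_perp is unchanged:
\|\mathbf{w}^* - (\mathbf{w} + \mathbf{x})\|^2
  = \|\mathbf{w}^* - \mathbf{w}\|^2 - 2\,\mathbf{x}^\top(\mathbf{w}^* - \mathbf{w}) + \|\mathbf{x}\|^2 .
% The distance shrinks only if  x^T (w* - w) > ||x||^2 / 2.
```

That last condition is exactly what fails in the example above: when w* is only barely feasible, w is only barely wrong, and x is large, the +‖x‖² term wins and the update overshoots.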
So what we're gonna do is define a generously feasible weight vector. That's a weight vector that not only gets every training case right, but gets it right by at least a certain margin, where the margin is as big as the input vector for that training case. So we take the cone of feasible solutions, and inside that we have another cone of generously feasible solutions, which get everything right by at least the size of the input vector. And now our proof will work. Now we can make the claim that every time the perceptron makes a mistake, the squared distance to all of the generously feasible weight vectors will be decreased by at least the squared length of the input vector, which is the update we make.

So given that, we can get an informal sketch of a proof of convergence. I'm not gonna try to make this formal. I'm more interested in the engineering than the mathematics. If you're a mathematician, I'm sure you can make it formal yourself. Every time the perceptron makes a mistake, the current weight vector moves, and it decreases its squared distance from every generously feasible weight vector by at least the squared length of the current input vector. So the squared distance to all the generously feasible weight vectors decreases by at least that squared length. Assuming that none of the input vectors are infinitesimally small, that means that after a finite number of mistakes, the weight vector must lie in the feasible region, if this region exists. Notice that it doesn't have to lie in the generously feasible region, but it has to get into the feasible region to stop it making mistakes.

And that's it. That's our informal sketch of a proof that the perceptron convergence procedure works. But notice, it all depends on the assumption that there is a generously feasible weight vector. And if there is no such vector, the whole proof falls apart.
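For readers who want slightly more than the informal sketch, here is why the generous margin makes every mistake productive. This is my own hedged working, not from the lecture: I read "margin as big as the input vector" as x·w* ≥ ‖x‖² for every training input x, and a mistake on a positive case as x·w ≤ 0.

```latex
% On a mistake: x^T w <= 0.  Generous feasibility: x^T w* >= ||x||^2.
\|\mathbf{w}^* - (\mathbf{w} + \mathbf{x})\|^2
  = \|\mathbf{w}^* - \mathbf{w}\|^2
    - 2\,\mathbf{x}^\top\mathbf{w}^* + 2\,\mathbf{x}^\top\mathbf{w} + \|\mathbf{x}\|^2
  \le \|\mathbf{w}^* - \mathbf{w}\|^2 - \|\mathbf{x}\|^2 .
% Each mistake cuts the squared distance to w* by at least ||x||^2,
% so a finite initial distance allows only finitely many mistakes.
```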
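And as a concrete illustration of the procedure the proof is about, here is a minimal sketch in Python. This is not code from the course, and the toy data is made up; it just shows the mistake-driven update the argument analyzes.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Sketch of the perceptron learning procedure.

    X: (n_cases, n_features) inputs, each with a bias feature of 1 appended.
    y: labels in {+1, -1}.
    Stops making mistakes only if a feasible weight vector exists,
    per the convergence argument sketched above.
    """
    w = np.zeros(X.shape[1])                # start anywhere; zeros is convenient
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X, y):
            if np.sign(x @ w) != target:    # case is on the wrong side of the plane
                w += target * x             # add or subtract the input vector
                mistakes += 1
        if mistakes == 0:                   # weights have entered the feasible cone
            return w
    return w                                # may still be infeasible if no solution exists

# Hypothetical linearly separable toy data (last column is the bias input):
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
```

If the inner loop finishes an epoch with zero mistakes, the weights are in the feasible cone and learning stops; if no feasible vector exists, the loop simply runs out of epochs, which mirrors the caveat at the end of the lecture.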