1 00:00:00,000 --> 00:00:06,019 The next few videos are about using the back propagation algorithm to learn a 2 00:00:06,019 --> 00:00:10,016 feature representation of the meaning of a word. 3 00:00:10,046 --> 00:00:16,014 I'm gonna start with a very simple case from the 1980s, when computers were very 4 00:00:16,014 --> 00:00:19,041 slow. It's a small case, but it illustrates the 5 00:00:19,041 --> 00:00:24,060 idea about how you can take some relational information, and use the back 6 00:00:24,060 --> 00:00:30,022 propagation algorithm to turn relational information into feature vectors that 7 00:00:30,022 --> 00:00:35,020 capture the meanings of words. This diagram shows a simple family tree, 8 00:00:35,020 --> 00:00:40,067 in which, for example, Christopher and Penelope marry, and have children Arthur 9 00:00:40,067 --> 00:00:45,088 and Victoria. What we'd like is to train a neural 10 00:00:45,088 --> 00:00:49,083 network to understand the information in this family tree. 11 00:00:49,083 --> 00:00:55,054 We've also given it another family tree of Italian people which has pretty much the 12 00:00:55,054 --> 00:01:00,043 same structure as the English tree. And perhaps when it tries to learn both 13 00:01:00,043 --> 00:01:05,062 sets of facts, the neural net is going to be able to take advantage of that analogy. 14 00:01:05,062 --> 00:01:10,062 The information in these family trees can be expressed as a set of propositions, 15 00:01:10,062 --> 00:01:15,012 if we make up names for the relationships depicted by the trees. 16 00:01:15,012 --> 00:01:20,054 So we're gonna use the relationships son, daughter, nephew, niece, father, mother, 17 00:01:20,054 --> 00:01:23,056 uncle, aunt, brother, sister, husband and wife. 18 00:01:23,056 --> 00:01:29,054 And using those relationships we can write down a set of triples such as: Colin has 19 00:01:29,054 --> 00:01:34,040 father James, Colin has mother Victoria, and James has wife Victoria.
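As a rough illustration of the triples just described, here is a minimal Python sketch. The tuple layout, the `FACTS` list and the `related` helper are assumptions made for this example and are not taken from the original work. The facts themselves follow from the Christopher and Penelope family mentioned above.

```python
# Each proposition is stored as a (person1, relationship, person2) triple.
# The tuple layout and the helper below are illustrative assumptions.
FACTS = [
    ("Arthur", "father", "Christopher"),
    ("Arthur", "mother", "Penelope"),
    ("Victoria", "father", "Christopher"),
    ("Victoria", "mother", "Penelope"),
    ("Christopher", "wife", "Penelope"),
    ("Penelope", "husband", "Christopher"),
]


def related(person, relationship, facts=FACTS):
    """Return every person2 such that (person, relationship, person2) is a fact."""
    return [b for (a, r, b) in facts if a == person and r == relationship]


print(related("Arthur", "father"))  # ['Christopher']
```

Storing the facts as flat triples keeps the data free of any built-in notion of generation or nationality, which is the point of the exercise: any such structure has to be discovered by the learner.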
20 00:01:34,040 --> 00:01:40,068 And in the nice simple families depicted in the diagram, the third proposition 21 00:01:40,068 --> 00:01:46,023 follows from the previous two. Similarly, the third proposition in the 22 00:01:46,023 --> 00:01:52,096 next set follows from the previous two. So the relational learning task, that is, 23 00:01:52,096 --> 00:01:59,017 the task of learning the information in those family trees, can be viewed as 24 00:01:59,017 --> 00:02:05,013 figuring out the regularities in a large set of triples that express the 25 00:02:05,013 --> 00:02:10,039 information in those trees. Now the obvious way to express 26 00:02:10,039 --> 00:02:15,041 regularities is as symbolic rules. For example, X has mother Y, and Y has 27 00:02:15,041 --> 00:02:20,043 husband Z, implies X has father Z. We could search for such rules, but this 28 00:02:20,043 --> 00:02:24,713 would involve a search through quite a large space, a combinatorially large 29 00:02:24,713 --> 00:02:30,094 space, of discrete possibilities. A very different way to try and capture 30 00:02:30,094 --> 00:02:37,008 the same information is to use a neural network that searches through a continuous 31 00:02:37,008 --> 00:02:41,083 space of real valued weights to try and capture the information. 32 00:02:41,083 --> 00:02:48,019 And the way it's going to do that is we're going to say it's captured the information 33 00:02:48,019 --> 00:02:53,031 if it can predict the third term of a triple from the first two terms. 34 00:02:53,031 --> 00:02:58,071 So at the bottom of this diagram here, we're going to put in a person and a 35 00:02:58,071 --> 00:03:04,065 relationship and the information is going to flow forwards through this neural 36 00:03:04,065 --> 00:03:08,048 network. And what we are going to try to get out of 37 00:03:08,048 --> 00:03:14,041 the neural network after it's learned is the person who's related to the first 38 00:03:14,041 --> 00:03:19,098 person by that relationship. 
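For contrast with the neural approach, the symbolic rule quoted above (X has mother Y, and Y has husband Z, implies X has father Z) can be applied directly to a set of triples. The sketch below is only an illustration: `infer_fathers` and the small `facts` list are hypothetical names made up here, and the code applies one hand-written rule rather than searching the combinatorially large space of possible rules.

```python
def infer_fathers(facts):
    """Apply one rule: X has mother Y and Y has husband Z implies X has father Z."""
    known = set(facts)
    inferred = set()
    for (x, r1, y) in known:
        if r1 != "mother":
            continue
        for (y2, r2, z) in known:
            if r2 == "husband" and y2 == y:
                inferred.add((x, "father", z))
    return inferred


# Facts consistent with the Christopher and Penelope family described earlier.
facts = [
    ("Arthur", "mother", "Penelope"),
    ("Victoria", "mother", "Penelope"),
    ("Penelope", "husband", "Christopher"),
]

print(sorted(infer_fathers(facts)))
# [('Arthur', 'father', 'Christopher'), ('Victoria', 'father', 'Christopher')]
```

Applying a known rule is cheap. The hard part, which this sketch does not attempt, is discovering which rules hold.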
The architecture of this net was designed 39 00:03:19,098 --> 00:03:23,066 by hand, that is, I decided how many layers it should have. 40 00:03:23,066 --> 00:03:28,068 And I also decided where to put bottlenecks to force it to learn interesting 41 00:03:28,068 --> 00:03:32,502 representations. So what we do is we encode the information 42 00:03:32,502 --> 00:03:36,031 in a neutral way, because there are 24 possible people. 43 00:03:36,031 --> 00:03:41,088 So the block at the bottom of the diagram that says, local encoding of person one, 44 00:03:41,088 --> 00:03:47,046 has 24 neurons, and exactly one of those will be turned on for each training case. 45 00:03:47,046 --> 00:03:53,018 Similarly, there are twelve relationships, and exactly one of the relationship units 46 00:03:53,018 --> 00:03:56,091 will be turned on. And then for a relationship that has a 47 00:03:56,091 --> 00:04:02,010 unique answer, we would like one of the 24 people at the top, one of the 24 output 48 00:04:02,010 --> 00:04:07,089 people, to turn on to represent the answer. By using a representation in which exactly 49 00:04:07,089 --> 00:04:12,079 one of the neurons is on, we don't accidentally give the network any 50 00:04:12,079 --> 00:04:17,012 similarities between people. All pairs of people are equally 51 00:04:17,012 --> 00:04:20,051 dissimilar. So, we're not cheating by giving the 52 00:04:20,051 --> 00:04:26,028 network information about who's like who. The people, as far as the network is 53 00:04:26,028 --> 00:04:33,014 concerned, are uninterpreted symbols. But now in the next layer of the network, 54 00:04:33,014 --> 00:04:39,055 we've taken the local encoding of person one, and we've connected it to a small set 55 00:04:39,055 --> 00:04:45,088 of neurons, actually six neurons for this. And because there are 24 people, it can't 56 00:04:45,088 --> 00:04:49,027 possibly dedicate one neuron to each person. 
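The claim that a local (one-of-N) encoding makes all pairs of people equally dissimilar can be checked numerically. This is a small sketch under assumed names (`one_hot`, `squared_distance`); only the sizes, 24 people and twelve relationships, come from the lecture.

```python
NUM_PEOPLE = 24          # size of the "local encoding of person one" block
NUM_RELATIONSHIPS = 12   # size of the relationship block


def one_hot(index, size):
    """Return a vector with exactly one unit turned on."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Collect the distance between every pair of distinct people.
distances = {
    squared_distance(one_hot(i, NUM_PEOPLE), one_hot(j, NUM_PEOPLE))
    for i in range(NUM_PEOPLE)
    for j in range(NUM_PEOPLE)
    if i != j
}
print(distances)  # {2.0}
```

Every pair of distinct people differs in exactly two positions, so the set of distances contains a single value. The input therefore carries no hint about who resembles whom.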
57 00:04:49,027 --> 00:04:54,091 It has to re-represent the people as patterns of activity over those six 58 00:04:54,091 --> 00:04:59,046 neurons. And what we're hoping is that when it 59 00:04:59,046 --> 00:05:05,382 learns these propositions, the way in which it encodes a person, in that 60 00:05:05,382 --> 00:05:10,074 distributed pattern of activities, will reveal structure in the task, or 61 00:05:10,074 --> 00:05:14,837 structure in the domain. So what we're going to do is we're going 62 00:05:14,837 --> 00:05:18,140 to train it up on 112 of these propositions. 63 00:05:18,140 --> 00:05:21,978 And we go through the 112 propositions many times, 64 00:05:21,978 --> 00:05:26,779 slowly changing the weights as we go, using back propagation. 65 00:05:26,779 --> 00:05:32,786 And after training we're gonna look at the six units in that layer that says 66 00:05:32,786 --> 00:05:37,390 distributed encoding of person one to see what they are doing. 67 00:05:37,390 --> 00:05:41,017 So here are those six units as the big gray blocks. 68 00:05:41,017 --> 00:05:46,621 And I laid out the 24 people, with the twelve English people in a row along the 69 00:05:46,621 --> 00:05:50,144 top, and the twelve Italian people in a row underneath. 70 00:05:50,144 --> 00:05:54,077 So each of these blocks, you'll see, has 24 blobs in it. 71 00:05:54,077 --> 00:05:59,412 And the blobs tell you the incoming weights for one of the hidden units in 72 00:05:59,412 --> 00:06:02,588 that layer. So going back to the previous slide, 73 00:06:02,588 --> 00:06:07,914 if you look at that layer that says distributed encoding of person one, 74 00:06:07,914 --> 00:06:13,008 there are six neurons there. And we're looking at the incoming weights 75 00:06:13,008 --> 00:06:19,579 of each of those six neurons. If you look at the big gray rectangle on 76 00:06:19,579 --> 00:06:24,966 the top right, you'll see an interesting structure to the weights. 
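To see why the incoming weights of the six units are the natural thing to inspect, note that with a one-of-24 input, each hidden unit's total input is just its weight from the active person. The sketch below assumes logistic units with no biases and uses random, untrained weights. Those choices, and the names `weights`, `logistic` and `encode`, are illustrative assumptions rather than details given in the lecture.

```python
import math
import random

random.seed(0)
NUM_PEOPLE, CODE_UNITS = 24, 6

# weights[u][p] is the weight from input person p to hidden unit u.
# Random values stand in for the weights that training would produce.
weights = [[random.gauss(0.0, 0.1) for _ in range(NUM_PEOPLE)]
           for _ in range(CODE_UNITS)]


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(person_index):
    """Map a one-of-24 input to a pattern of activity over six units."""
    one_hot = [1.0 if p == person_index else 0.0 for p in range(NUM_PEOPLE)]
    return [logistic(sum(w * x for w, x in zip(row, one_hot)))
            for row in weights]


code = encode(5)
print(len(code))  # 6
```

Because only one input is on, `encode(5)[u]` depends only on `weights[u][5]`. Plotting the 24 incoming weights of a unit therefore shows directly how that unit responds to each person.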
77 00:06:24,966 --> 00:06:30,017 The weights along the top that come from English people are all positive. 78 00:06:30,017 --> 00:06:33,047 And the weights along the bottom are all negative. 79 00:06:33,047 --> 00:06:38,063 That means this unit tells you whether the input person is English or Italian. 80 00:06:38,063 --> 00:06:41,060 We never gave it that information explicitly. 81 00:06:41,060 --> 00:06:46,036 But obviously, it's useful information to have in this very simple world. 82 00:06:46,036 --> 00:06:51,045 Because in the family trees that we're learning about, if the input person is 83 00:06:51,045 --> 00:06:54,036 English, the output person is always English. 84 00:06:54,036 --> 00:07:00,008 And so just knowing that someone's English already allows you to predict one bit of 85 00:07:00,008 --> 00:07:04,099 information about the output. Which is equivalent to saying it halves the 86 00:07:04,099 --> 00:07:09,042 number of possibilities. If you look at the gray block immediately 87 00:07:09,042 --> 00:07:14,061 below that, the second one down on the right, you'll see that it has four big 88 00:07:14,061 --> 00:07:20,007 positive weights at the beginning. Those correspond to Christopher and Andrew 89 00:07:20,007 --> 00:07:24,046 or their Italian equivalents. Then it has some smaller weights. 90 00:07:24,046 --> 00:07:29,010 Then it has two big negative weights, that correspond to Colin, or his Italian 91 00:07:29,010 --> 00:07:31,090 equivalent. Then there's four more big positive 92 00:07:31,090 --> 00:07:36,043 weights, corresponding to Penelope or Christine, or their Italian equivalents. 93 00:07:36,043 --> 00:07:40,065 And right at the end, there's two big negative weights, corresponding to 94 00:07:40,065 --> 00:07:46,094 Charlotte, or her Italian equivalent. By now you've probably realized that that 95 00:07:46,094 --> 00:07:50,034 neuron represents what generation somebody is. 
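The remark that knowing the nationality gives one bit of information, or equivalently halves the number of possibilities, can be worked through as simple arithmetic. The variable names below are made up for this example. The counts, 24 people split into twelve English and twelve Italian, come from the lecture.

```python
import math

people = 24             # possible output people
same_nationality = 12   # candidates left once the nationality is known

bits_before = math.log2(people)            # uncertainty over 24 equally likely people
bits_after = math.log2(same_nationality)   # uncertainty over the remaining 12
bits_gained = bits_before - bits_after

print(round(bits_gained, 6))  # 1.0
```

This treats all candidates as equally likely, which is a simplifying assumption. The point is only that halving the candidate set corresponds to exactly one bit.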
96 00:07:50,034 --> 00:07:56,026 It has big positive weights to the oldest generation, big negative weights to the 97 00:07:56,026 --> 00:08:01,081 youngest generation, and intermediate weights which are roughly zero to the 98 00:08:01,081 --> 00:08:06,047 intermediate generation. So that's really a three-value feature, 99 00:08:06,047 --> 00:08:10,016 and it's telling you the generation of the person. 100 00:08:10,016 --> 00:08:15,093 Finally, if you look at the bottom gray rectangle on the left hand side, you'll 101 00:08:15,093 --> 00:08:22,558 see that it has a different structure. If we look at the 102 00:08:22,558 --> 00:08:25,646 negative weights in the top row of that unit, 103 00:08:25,646 --> 00:08:31,053 it has a negative weight to Andrew, James, Charles, Christine and Jennifer. Now if 104 00:08:31,053 --> 00:08:36,947 you look at the English family tree, you'll see Andrew, James, Charles, Christine, and 105 00:08:36,947 --> 00:08:41,366 Jennifer are all in the right hand branch of the family tree. 106 00:08:41,366 --> 00:08:46,844 So that unit has learned to represent which branch of the family tree someone is 107 00:08:46,844 --> 00:08:49,063 in. Again, that's a very useful feature to 108 00:08:49,063 --> 00:08:53,546 have for predicting the output person, because if you know it's a close family 109 00:08:53,546 --> 00:08:58,485 relationship, you expect the output to be in the same branch of the family tree as 110 00:08:58,485 --> 00:09:01,536 the input. So the units in the bottleneck have 111 00:09:01,536 --> 00:09:06,681 learned to represent features of people that are useful for predicting the answer. 112 00:09:06,681 --> 00:09:10,607 And notice, we didn't tell it anything about what features to use. 113 00:09:10,607 --> 00:09:15,694 We never mentioned things like nationality or branch of the family tree or generation. 
114 00:09:15,694 --> 00:09:20,397 It figured out that those are good features for expressing the regularity in 115 00:09:20,397 --> 00:09:23,663 this domain. Of course, those features are only useful 116 00:09:23,663 --> 00:09:28,058 if the other bottlenecks, the one for relationships, and the one near the top of 117 00:09:28,058 --> 00:09:31,526 the network before the output person, use similar representations. 118 00:09:31,526 --> 00:09:36,053 And the central layer is able to say how the features of the input person and the 119 00:09:36,053 --> 00:09:40,620 features of the relationship predict the features of the output person. 120 00:09:40,620 --> 00:09:46,552 So for example, if the input person is of generation three, and the relationship 121 00:09:46,552 --> 00:09:51,977 requires the output person to be one generation up, then the output person is of 122 00:09:51,977 --> 00:09:55,623 generation two. But notice that to capture that rule, you have 123 00:09:55,623 --> 00:10:01,288 to extract appropriate features at the first hidden layer, and the last hidden 124 00:10:01,288 --> 00:10:05,497 layer of the network. And you have to make the units in the 125 00:10:05,497 --> 00:10:12,650 middle relate those features correctly. Another way to see that the network works 126 00:10:12,650 --> 00:10:15,742 is to train it on all but a few of the triples, 127 00:10:15,742 --> 00:10:19,682 and see if it can complete those triples correctly. 128 00:10:19,682 --> 00:10:24,210 So does it generalize? So there's 112 triples, and I trained it 129 00:10:24,210 --> 00:10:30,022 on 108 of them and tested it on the remaining four. I did that several times 130 00:10:30,022 --> 00:10:33,401 and it got either two or three of those right. 
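The hold-out experiment just described can be sketched as a split of the 112 triples into 108 training cases and four test cases. The helper `hold_out_split`, the seed, and the integer stand-ins for triples are assumptions for illustration only. The chance baseline assumes uniform guessing among the 24 output people.

```python
import random


def hold_out_split(triples, num_test, seed=0):
    """Shuffle a copy of the triples and set aside num_test of them for testing."""
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    return shuffled[num_test:], shuffled[:num_test]


all_triples = list(range(112))  # integer stand-ins for the 112 real triples
train, test = hold_out_split(all_triples, num_test=4)

# Expected number of correct answers from uniform guessing among 24 people.
chance_correct = len(test) / 24

print(len(train), len(test))  # 108 4
```

Under that baseline, guessing would get about 0.17 of the four test cases right on average, so two or three correct is well above chance. Repeating the split with different seeds corresponds to running the experiment several times.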
131 00:10:33,401 --> 00:10:39,268 That's not so bad for a 24-way choice. It's true it makes mistakes, but it didn't 132 00:10:39,268 --> 00:10:44,652 have much training data; there aren't enough triples in this domain to really 133 00:10:44,652 --> 00:10:50,282 nail down the regularities very well. And it does much better than chance. 134 00:10:50,282 --> 00:10:56,642 If you train it on a much bigger data set, it can generalize from a much smaller 135 00:10:56,642 --> 00:11:00,696 fraction of the data. So if you have thousands and thousands of 136 00:11:00,696 --> 00:11:05,138 relationships, you only need to show a small percentage before it can start 137 00:11:05,138 --> 00:11:09,454 guessing the other ones correctly. That research was done in the 1980s, and 138 00:11:09,454 --> 00:11:13,755 was a way of showing that back-propagation could learn interesting features. 139 00:11:13,755 --> 00:11:17,546 And it was a toy example. Now we have much bigger computers, and we 140 00:11:17,546 --> 00:11:20,445 have databases of millions of relational facts, 141 00:11:20,445 --> 00:11:25,877 many of which are of the form A, R, B: A has relationship R to B. We could imagine 142 00:11:25,877 --> 00:11:31,968 training a net to discover feature vector representations of A and R that allow it 143 00:11:31,968 --> 00:11:35,769 to predict the feature vector representation of B. 144 00:11:35,769 --> 00:11:41,069 If we did that, it would be a very good way of cleaning a database. 145 00:11:41,069 --> 00:11:45,703 It wouldn't necessarily be able to make perfect predictions. 146 00:11:45,703 --> 00:11:51,982 But it could find things in the database that it thought were highly implausible. 147 00:11:51,982 --> 00:11:57,474 So if the database contained information like, for example, Bach was born in 1902, 
148 00:11:57,474 --> 00:12:02,540 it could probably realize that was wrong, 'cuz Bach's a much older kind of person, 149 00:12:02,540 --> 00:12:06,069 and everything else he's related to is much older than 1902. 150 00:12:06,069 --> 00:12:10,562 Instead of actually using the first two terms to predict the third term, we could 151 00:12:10,562 --> 00:12:14,986 use the whole set of terms, three of them in this case, but possibly more, and 152 00:12:14,986 --> 00:12:17,784 predict the probability that the fact is correct. 153 00:12:17,784 --> 00:12:22,541 To train a net to do that, we'd need examples of a whole bunch of correct 154 00:12:22,541 --> 00:12:25,442 facts, and we'd ask it to give a high output. 155 00:12:25,442 --> 00:12:29,619 We'd also need a good source of incorrect facts, and we'd ask it to give a low 156 00:12:29,619 --> 00:12:32,079 output when we told it something that was false.
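The lecture does not say where the incorrect facts would come from. One possible source, assumed here purely for illustration, is to corrupt the third term of each correct triple so that the result is not a known fact. The names `corrupt` and `labeled_examples` are hypothetical, and the targets 1.0 and 0.0 stand for the high and low outputs mentioned above.

```python
import random


def corrupt(triple, people, known, rng):
    """Swap the third term for a person that does not yield a known fact."""
    a, r, _ = triple
    candidates = [p for p in people if (a, r, p) not in known]
    return (a, r, rng.choice(candidates))


def labeled_examples(correct, people, seed=0):
    """Pair each correct fact with target 1.0 and one corrupted fact with 0.0."""
    rng = random.Random(seed)
    known = set(correct)
    positives = [(t, 1.0) for t in correct]
    negatives = [(corrupt(t, people, known, rng), 0.0) for t in correct]
    return positives + negatives


people = ["Christopher", "Penelope", "Arthur", "Victoria"]
correct = [
    ("Arthur", "father", "Christopher"),
    ("Arthur", "mother", "Penelope"),
]
examples = labeled_examples(correct, people)
print(len(examples))  # 4
```

A corrupted triple is only guaranteed to be absent from the known facts, not guaranteed to be false in the world, so in a large incomplete database this scheme would produce some noisy negatives.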