The next few videos are about using the backpropagation algorithm to learn a feature representation of the meaning of a word. I'm going to start with a very simple case from the 1980s, when computers were very slow. It's a small case, but it illustrates how you can take some relational information and use the backpropagation algorithm to turn that relational information into feature vectors that capture the meanings of words.

This diagram shows a simple family tree, in which, for example, Christopher and Penelope marry and have children Arthur and Victoria. What we'd like is to train a neural network to understand the information in this family tree. We've also given it another family tree, of Italian people, which has pretty much the same structure as the English tree, and perhaps when it tries to learn both sets of facts, the neural net will be able to take advantage of that analogy.

The information in these family trees can be expressed as a set of propositions, if we make up names for the relationships depicted by the trees. So we're going to use the relationships son, daughter, nephew, niece, father, mother, uncle, aunt, brother, sister, and husband, wife. Using those relationships, we can write down a set of triples such as: Colin has-father James, Colin has-mother Victoria, and James has-wife Victoria. In the nice simple families depicted in the diagram, the third proposition follows from the previous two. Similarly, the third proposition in the next set of triples follows from the previous two. So the relational learning task, that is, the task of learning the information in those family trees, can be viewed as figuring out the regularities in a large set of triples that express the information in those trees.

Now the obvious way to express the regularities is as symbolic rules. For example: X has-mother Y, and Y has-husband Z, implies X has-father Z. We could search for such rules, but this would involve a search through quite a large space, a combinatorially large space, of discrete possibilities. A very different way to try and capture the same information is to use a neural network that searches through a continuous space of real-valued weights. And the way it's going to do that is we're going to say it has captured the information if it can predict the third term of a triple from the first two terms. So at the bottom of this diagram, we're going to put in a person and a relationship, and the information is going to flow forward through the neural network. What we're going to try to get out of the network, after it has learned, is the person who is related to the first person by that relationship.

The architecture of this net was designed by hand; that is, I decided how many layers it should have, and I also decided where to put bottlenecks to force it to learn interesting representations. What we do is encode the information in a neutral way: there are 24 possible people, so the block at the bottom of the diagram that says local encoding of person one has 24 neurons, and exactly one of those will be turned on for each training case. Similarly, there are twelve relationships, and exactly one of the relationship units will be turned on. Then, for a relationship that has a unique answer, we would like exactly one of the 24 output people at the top to turn on to represent the answer.
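To make that architecture concrete, here is a minimal sketch of that kind of network in PyTorch. The 24 people, the 12 relationships, and the six-unit distributed encoding of person one come from the description above; the size of the relationship bottleneck, the size of the central layer, and the use of sigmoid units are assumptions added for illustration, not details from the lecture.

```python
# A minimal sketch of the family-trees architecture described above.
# Sizes 24 (people), 12 (relationships) and 6 (person bottleneck) come
# from the lecture; the relationship bottleneck (6) and central layer
# (12) sizes are assumptions.
import torch
import torch.nn as nn

N_PEOPLE, N_RELATIONS = 24, 12
PERSON_CODE, RELATION_CODE, CENTRAL = 6, 6, 12

class FamilyTreeNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Local (one-hot) person 1 -> distributed encoding of person 1.
        self.person_encode = nn.Linear(N_PEOPLE, PERSON_CODE)
        # Local (one-hot) relationship -> distributed encoding of relationship.
        self.relation_encode = nn.Linear(N_RELATIONS, RELATION_CODE)
        # Central layer combines the two distributed codes.
        self.central = nn.Linear(PERSON_CODE + RELATION_CODE, CENTRAL)
        # Distributed encoding of person 2, then back to a 24-way output.
        self.person2_decode = nn.Linear(CENTRAL, PERSON_CODE)
        self.output = nn.Linear(PERSON_CODE, N_PEOPLE)
        self.act = nn.Sigmoid()

    def forward(self, person1_onehot, relation_onehot):
        p = self.act(self.person_encode(person1_onehot))
        r = self.act(self.relation_encode(relation_onehot))
        h = self.act(self.central(torch.cat([p, r], dim=-1)))
        q = self.act(self.person2_decode(h))
        return self.output(q)   # logits over the 24 possible output people
```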
By using a representation in which exactly one of the neurons is on, we don't accidentally give the network any similarities between people. All pairs of people are equally dissimilar, so we're not cheating by giving the network information about who is like whom. As far as the network is concerned, the people are uninterpreted symbols.

But now, in the next layer of the network, we've taken the local encoding of person one and connected it to a small set of neurons, actually six neurons for this layer. Because there are 24 people, it can't possibly dedicate one neuron to each person; it has to re-represent the people as patterns of activity over those six neurons. What we're hoping is that when it learns these propositions, the way it encodes a person in that distributed pattern of activity will reveal structure in the task, or structure in the domain.

So what we're going to do is train it on 112 of these propositions. We go through the 112 propositions many times, slowly changing the weights as we go, using backpropagation. After training, we're going to look at the six units in the layer that says distributed encoding of person one, to see what they are doing.

So here are those six units, shown as the big gray blocks, and I've laid out the 24 people with the twelve English people in a row along the top and the twelve Italian people in a row underneath. Each of these blocks, you'll see, has 24 blobs in it, and the blobs show the incoming weights for one of the hidden units in that layer. Going back to the previous slide: if you look at the layer that says distributed encoding of person one, there are six neurons there, and we're looking at the incoming weights of each of those six neurons.

If you look at the big gray rectangle on the top right, you'll see an interesting structure in the weights. The weights along the top, which come from English people, are all positive, and the weights along the bottom are all negative. That means this unit tells you whether the input person is English or Italian. We never gave it that information explicitly, but obviously it's useful information to have in this very simple world, because in the family trees we're learning about, if the input person is English, the output person is always English. So just knowing that someone is English already allows you to predict one bit of information about the output, which amounts to saying it halves the number of possibilities.

If you look at the gray blob immediately below that, the second one down on the right, you'll see it has four big positive weights at the beginning. Those correspond to Christopher and Andrew, or their Italian equivalents. Then it has some smaller weights, then two big negative weights, which correspond to Colin, or his Italian equivalent. Then there are four more big positive weights, corresponding to Penelope and Christine, or their Italian equivalents, and right at the end there are two big negative weights, corresponding to Charlotte, or her Italian equivalent. By now you've probably realized that this neuron represents what generation somebody is in. It has big positive weights to the oldest generation, big negative weights to the youngest generation, and intermediate weights, which are roughly zero, to the intermediate generation. So it's really a three-valued feature, and it's telling you the generation of the person.
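Here is a hedged sketch of what that training and weight inspection might look like, continuing the FamilyTreeNet sketch above. The list of 112 triples (`triples`, holding integer ids for person one, the relationship, and person two), the learning rate, and the number of sweeps are illustrative assumptions, not the original settings.

```python
# Sketch: train on the 112 triples with backprop, then inspect the
# incoming weights of the six "distributed encoding of person 1" units.
# `triples` and the training settings are assumptions.
import torch
import torch.nn.functional as F

net = FamilyTreeNet()                        # from the sketch above
opt = torch.optim.SGD(net.parameters(), lr=0.1)

def one_hot(idx, n):
    return F.one_hot(torch.tensor(idx), n).float()

for epoch in range(1500):                    # many sweeps, slowly changing weights
    for p1, rel, p2 in triples:              # (person1, relationship, person2) ids
        logits = net(one_hot(p1, 24), one_hot(rel, 12))
        loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([p2]))
        opt.zero_grad()
        loss.backward()
        opt.step()

# Each row below is one of the six hidden units; its 24 entries are the
# incoming weights from the 24 people (English people in one half of
# the index range, Italians in the other, if they are indexed that way).
print(net.person_encode.weight)              # shape: (6, 24)
```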
Finally, if you look at the bottom gray rectangle on the left-hand side, you'll see that it has a different structure. If you look at the negative weights in the top row of that unit, it has negative weights to Andrew, James, Charles, Christine, and Jennifer, and if you now look at the English family tree, you'll see that Andrew, James, Charles, Christine, and Jennifer are all in the right-hand branch of the family tree. So that unit has learned to represent which branch of the family tree someone is in. Again, that's a very useful feature for predicting the output person, because if you know it's a close family relationship, you expect the output person to be in the same branch of the family tree as the input person.

So the network, in that bottleneck, has learned to represent features of people that are useful for predicting the answer. And notice, we didn't tell it anything about what features to use. We never mentioned things like nationality, branch of the family tree, or generation. It figured out that those are good features for expressing the regularities in this domain.

Of course, those features are only useful if the other bottlenecks, the one for the relationship and the one near the top of the network just before the output person, use similar representations, and if the central layer is able to say how the features of the input person and the features of the relationship predict the features of the output person. So, for example, if the input person is in generation three, and the relationship requires the output person to be one generation up, then the output person is in generation two. But notice that to capture that rule, you have to extract the appropriate features at the first hidden layer and the last hidden layer of the network, and you have to make the units in the middle relate those features correctly.

Another way to see that the network works is to train it on all but a few of the triples and see if it can complete those triples correctly. So, does it generalize? There are 112 triples; I trained it on 108 of them and tested it on the remaining four. I did that several times, and it got either two or three of those right. That's not so bad for a 24-way choice. It's true it makes mistakes, but it didn't have much training data; there aren't enough triples in this domain to really nail down the regularities, and it does much better than chance. If you train it on a much bigger data set, it can generalize from a much smaller fraction of the data: if you have thousands and thousands of relationships, you only need to show it a small percentage before it can start guessing the other ones correctly.

That research was done in the 1980s, and it was a way of showing that backpropagation could learn interesting features. It was a toy example. Now we have much bigger computers, and we have databases of millions of relational facts, many of which are of the form (A, R, B): A has relationship R to B. We could imagine training a net to discover feature vector representations of A and R that allow it to predict the feature vector representation of B. If we did that, it would be a very good way of cleaning a database. It wouldn't necessarily be able to make perfect predictions, but it could find things in the database that it thought were highly implausible. So if the database contained information like, for example, Bach was born in 1902, it could probably realize that was wrong, because Bach is a much older kind of person, and everything else he's related to is much older than 1902.
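A sketch of that hold-out test is below, under the same assumptions as the earlier sketches: the `triples` list, a `train` helper wrapping the earlier training loop, and the random split are all illustrative, not the original experiment.

```python
# Sketch of the generalization test: hold out four of the 112 triples,
# train on the other 108, and count how many held-out triples the net
# completes correctly (a 24-way choice).
import random
import torch
import torch.nn.functional as F

random.seed(0)
shuffled = random.sample(triples, len(triples))
train_set, test_set = shuffled[:108], shuffled[108:]

net = FamilyTreeNet()
train(net, train_set)            # assumed helper: the loop from the previous sketch

correct = 0
with torch.no_grad():
    for p1, rel, p2 in test_set:
        logits = net(F.one_hot(torch.tensor(p1), 24).float(),
                     F.one_hot(torch.tensor(rel), 12).float())
        correct += int(logits.argmax().item() == p2)
print(f"{correct} / {len(test_set)} held-out triples completed correctly")
```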
Instead of using the first two terms to predict the third term, we could use the whole set of terms, three of them in this case but possibly more, and predict the probability that the fact is correct. To train a net to do that, we'd need a whole bunch of correct facts as examples, for which we'd ask it to give a high output, and we'd also need a good source of incorrect facts, for which we'd ask it to give a low output.
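One way such a net might be set up is sketched below: each of the three terms is mapped to a feature vector, the three vectors are combined by a hidden layer, and the output is squashed into the probability that the triple is a correct fact. The use of embedding tables, and of randomly corrupted triples as the source of incorrect facts, are assumptions added for illustration, not details given here.

```python
# Sketch: score a whole (A, R, B) triple with the probability that it is
# a correct fact. True facts get target 1; the incorrect facts here are
# made by corrupting the third term at random (one possible, assumed,
# source of false examples).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleScorer(nn.Module):
    def __init__(self, n_people=24, n_relations=12, code=6, hidden=12):
        super().__init__()
        self.person = nn.Embedding(n_people, code)       # feature vectors for A and B
        self.relation = nn.Embedding(n_relations, code)  # feature vector for R
        self.combine = nn.Linear(3 * code, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, a, r, b):
        x = torch.cat([self.person(a), self.relation(r), self.person(b)], dim=-1)
        return self.out(torch.sigmoid(self.combine(x))).squeeze(-1)   # a logit

scorer = TripleScorer()
opt = torch.optim.SGD(scorer.parameters(), lr=0.1)

for epoch in range(500):
    for a, r, b in triples:                  # correct facts -> high output
        fake_b = random.randrange(24)        # corrupted fact -> low output
        a_t, r_t = torch.tensor([a]), torch.tensor([r])
        pos = scorer(a_t, r_t, torch.tensor([b]))
        neg = scorer(a_t, r_t, torch.tensor([fake_b]))
        loss = (F.binary_cross_entropy_with_logits(pos, torch.ones(1)) +
                F.binary_cross_entropy_with_logits(neg, torch.zeros(1)))
        opt.zero_grad(); loss.backward(); opt.step()
```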