In this video, we're going to look at various ways to avoid having to use 100,000 different output units in the softmax when we want to get probabilities for 100,000 different words. At the end of the video, I'll show an example of the word representations that are learned by a particular method. To see what these representations look like, we embed them in a two-dimensional space, and then we can see which words have extremely similar representations. That gives us a feel for what the neural network has been able to learn, just by trying to predict the next word, or perhaps the middle word, of a string of words.

So, one way to avoid having 100,000 different output units is to use a serial architecture. We put in the context words as before, but now we also put in a candidate for the next word, in the same way as the context words. Then we go forward through the net, and what we output is a score for how good that candidate word is in that context. Of course, we have to run forward through this net many, many times, but most of the work only needs to be done once: the inputs from the context to that big hidden layer are the same for every candidate word. The only part we need to rerun for each candidate is the input coming from the candidate word and the final output score, and that doesn't involve many weights. So, we try all the candidate words one at a time. And by putting in the word as a candidate at the bottom, we're able to reuse the learned feature vector for that word, the one we learned when it was a context word. So, we have the same representation of a word when it's part of the context and when it's a candidate for the next word we're trying to predict.

Learning in the serial architecture works in the following way. We first compute the score for each possible candidate word, and then we put all of those scores, which we computed sequentially, into a big softmax to get word probabilities. The difference between the word probabilities and their target probabilities, which is normally one for the correct word and zero for everything else, gives us the cross-entropy error derivatives, and we use those derivatives to change the weights in such a way that we raise the score for the correct candidate and lower the scores for its high-scoring rivals. We can save a lot of time in this architecture if, instead of considering all possible candidate words, we only consider a small set, perhaps candidate words suggested by some other predictor. For example, we could use the neural net to revise the probabilities of the words that the trigram model thinks are likely.
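To make this concrete, here is a small sketch in Python of the idea. The shapes, variable names, and the use of two context words with a single scoring layer are illustrative assumptions, not the exact architecture from the lecture; the point is that the context part of the computation is shared, only a small amount of work is done per candidate, and the sequentially computed scores go through one big softmax whose cross-entropy derivatives are simply the probabilities minus the targets.

```python
import numpy as np

# Minimal sketch of the serial architecture (illustrative shapes and names).
rng = np.random.default_rng(0)

vocab_size, embed_dim, hidden_dim = 1000, 50, 200
embeddings  = rng.normal(0, 0.1, (vocab_size, embed_dim))      # shared word feature vectors
W_context   = rng.normal(0, 0.1, (2 * embed_dim, hidden_dim))  # two context words -> hidden
W_candidate = rng.normal(0, 0.1, (embed_dim, hidden_dim))      # candidate word -> hidden
w_score     = rng.normal(0, 0.1, hidden_dim)                   # hidden -> scalar score

def score_candidates(context_ids, candidate_ids):
    """Score each candidate word in the given context, sharing the context work."""
    context_input = np.concatenate([embeddings[i] for i in context_ids])
    context_part = context_input @ W_context          # computed once for all candidates
    scores = []
    for c in candidate_ids:                           # only this small part runs per candidate
        hidden = np.tanh(context_part + embeddings[c] @ W_candidate)
        scores.append(hidden @ w_score)
    return np.array(scores)

# Put the sequentially computed scores through one big softmax ...
context = [12, 345]                            # toy ids for two previous words
candidates = np.arange(vocab_size)             # or a shortlist, e.g. from a trigram model
scores = score_candidates(context, candidates)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# ... and the cross-entropy derivative w.r.t. each score is just (probability - target):
correct = 345
dscore = probs.copy()
dscore[correct] -= 1.0     # raises the correct score, lowers high-scoring rivals
```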
A different way to avoid a great big softmax is to structure the words into a tree. We arrange all of the words in a binary tree with the words as its leaves, and we then use the context of previous words to generate a prediction vector v. We compare that prediction vector with a vector that we've learned for each node of the tree. The way we do the comparison is by taking the scalar product of the prediction vector and the vector we've learned for that node, and then applying the logistic function to that scalar product. That gives us the probability of taking the right branch in the tree, and one minus that gives us the probability of taking the left branch. So, the tree looks like this, and if sigma is the logistic function, you can see at the top of the tree that we take the logistic of the prediction vector times the vector u_i that we learned for the top node, and that tells us the probability of taking the right branch. Conversely, we take the left branch with one minus that probability, and so on, all the way down the tree to the word we want.

When we're learning, we use our context to get a prediction vector. In this work, we used quite a simple neural network that simply takes the stored, learned feature vector for each context word; those feature vectors directly contribute evidence in favor of a prediction vector. We simply add up the evidence contributed by those feature vectors, and that gives us the prediction vector. That prediction vector then gets compared with the vectors that have been learned for all the nodes in the tree on the path to the correct next word. So, that would be nodes i, j, and m in this tree. That red path shows you the path to the word that actually occurred next, and those are the only nodes we need to consider during learning. We'd like the probability of predicting that word to be as high as possible, so we'd like the probability of taking that path to be as high as possible. That means we'd like the product of the probabilities at the individual nodes on that path to be as high as possible, which is the same as wanting the sum of their log probabilities to be high. So, we benefit from a nice decomposition here: when we try to maximize the probability of picking the correct target word, we're really trying to maximize the sum of the log probabilities of taking all the branches on the path that leads to that target word. During learning, we only need to consider the nodes on that correct path, and that's a huge win: it's exponentially fewer nodes than considering all of them, log to the base two of N instead of N. For each of those nodes, we know the correct branch, because we know what the next word is, and we know the current probability of taking that branch by comparing the prediction vector with the learned vector for the node. And so, we can get derivatives for learning both the prediction vector v and the learned vector u for that node. This makes training hundreds of times faster. Unfortunately, it's still slow at test time. At test time, you need to know the probabilities of many words to help a speech recognizer, and so you can't just consider one path.
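As a rough sketch of the arithmetic involved (the path encoding and variable names here are assumptions for illustration, not the lecture's exact notation), the probability of a word is the product of the branch probabilities along its path, each one a logistic of a scalar product, and learning only touches the roughly log2(N) nodes on that path:

```python
import numpy as np

# Minimal sketch of the tree-based prediction idea. A word's path is assumed
# to be given as (node_id, branch) pairs, with branch 1 = right, 0 = left.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

embed_dim, num_nodes = 50, 7
node_vectors = rng.normal(0, 0.1, (num_nodes, embed_dim))   # learned vector u for each tree node

# Example path to one word: right at node 0, left at node 2, right at node 5.
path = [(0, 1), (2, 0), (5, 1)]

def log_prob_of_word(v, path):
    """Sum of log branch probabilities along the path to the word."""
    logp = 0.0
    for node, branch in path:
        p_right = sigmoid(v @ node_vectors[node])
        logp += np.log(p_right if branch == 1 else 1.0 - p_right)
    return logp

def gradients(v, path):
    """Derivatives for the prediction vector v and the node vectors on the path."""
    dv = np.zeros_like(v)
    dnodes = {}
    for node, branch in path:
        p_right = sigmoid(v @ node_vectors[node])
        err = branch - p_right            # d log p / d(v . u) for a logistic unit
        dv += err * node_vectors[node]
        dnodes[node] = err * v
    return dv, dnodes

v = rng.normal(0, 0.1, embed_dim)         # prediction vector computed from the context
print(log_prob_of_word(v, path))
dv, dnodes = gradients(v, path)           # only the log2(N) nodes on the path are touched
```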
There's a much simpler way to learn feature vectors for words. This is work by Collobert and Weston. What they did was learn feature vectors for words and then show that the feature vectors they learned were very good for a whole bunch of different natural language processing tasks. They're not trying to predict the next word; they're just trying to get good feature vectors for words, and so they use both the past context and the future context. They look at a window of eleven words: five in the past and five in the future. In the middle of that window, they put either the correct word, the one that actually occurred in the text, or a random word. Then they train a neural net to produce an output that's high if it's the correct word and low if it's a random word. The neural net works the same way as before: we map the individual words to feature vectors, the word codes, and then we use the feature vectors in the neural net, possibly with more layers, to try to predict whether this is the right or wrong word for that context. So, what they're really doing is judging whether the middle word is an appropriate word for the five-word context on either side of it. They trained this on about 600 million examples from Wikipedia, and they showed that the vectors they get are good for a variety of different natural language processing tasks.

One way of getting a sense of the vectors they learn for words is to display the vectors in a two-dimensional map. All we want to do is lay out the word vectors in such a way that very similar vectors are very close to one another. Then you'll discover what words the neural network thinks have similar meanings. We're going to use a multi-scale method called t-SNE. You can look up t-SNE on Google and discover how it works if you want. In addition to putting very similar words close to each other, t-SNE is also able to put similar clusters close to each other, so it gives you structure at many different scales. What we see is that the learned feature vectors capture lots of subtle semantic distinctions, and they do this just by looking at strings of words from Wikipedia. Nobody tells them anything other than the fact that these eleven words occurred in the string; there's no extra supervision. What's remarkable is that this contextual information, the fact that these words occurred together, tells you an awful lot about what a word means. In fact, some people think that's the main way we learn the meanings of words. So, here's an example. If I give you a sentence with a word you've never heard before, like "She scrammed him with a frying pan", then from that one sentence you already have a pretty good idea what scrammed means. It's conceivable that she was trying to impress him with her cooking skills, and so scrammed means impressed or something like that, but probably it means something like walloped. So, here's part of a two-dimensional map in which we laid out the two and a half thousand commonest words, and you'll see this part of the map is all about games. Not only that, it's got similar kinds of words together. It's got matches and games and races together. It's got players and teams and clubs together. It's got the things you win at games together, like cups and bowls and medals and prizes. It's got different kinds of games together. It's done a very good job of taking these words to do with games and finding out which ones are very similar in meaning.
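To give a concrete sense of how such a map can be produced, here is a small sketch using scikit-learn's t-SNE implementation. The word list and the random vectors standing in for learned feature vectors are purely illustrative assumptions; in practice the rows would be the vectors the network actually learned, one per word.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in for learned word feature vectors: one row per word. In practice these
# would be the vectors learned from Wikipedia (e.g. for the 2,500 commonest words).
rng = np.random.default_rng(0)
words = ["game", "match", "race", "player", "team", "club", "cup", "medal", "prize"]
word_vectors = rng.normal(size=(len(words), 50))

# t-SNE lays the high-dimensional vectors out in 2-D so that words with very
# similar vectors end up very close together, preserving structure at several scales.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(word_vectors)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.savefig("word_map.png")
```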
And because it's using very similar feature vectors for those words, it can say, in any natural language processing task, that if one word is a good word for a context, the other word is probably also a pretty good word for that context. Here's another part of the map. This part of the map is entirely about places. At the top, it has US states. Under that, it has some cities, mainly ones in North America, and under that, other cities; then it has some countries. So, if you look at the cities, this one is clearly Cambridge, and underneath Cambridge there's something else that's very similar to Cambridge. Here, we see that it's put Toronto with Detroit and Ontario and Boston; Toronto's in English-speaking Canada. And it's put Quebec, which is in French-speaking Canada, with Berlin and Paris. If we look at the bottom, we can see that it thinks Iraq is pretty similar to Vietnam. Here's another example. If you look here, these are adverbs. It's understood that likely and probably and possibly and perhaps all mean very similar things. It's also understood that entirely, completely, fully, and greatly have very similar meanings. And it's understood various other kinds of similarity. For example, which and that, or whom and what, or how and whether and why.