In this video, we're going to look at various ways to avoid having to use 100,000 different output units in the softmax when we want to get probabilities for 100,000 different words. At the end of the video, I'll show an example of the word representations that are learned by a particular method. To see what these representations look like, we embed them in a two-dimensional space, and then we can see which words have extremely similar representations. That gives us a feel for what the neural network has been able to learn just by trying to predict the next word, or perhaps the middle word, of a string of words.

So, one way to avoid having 100,000 different output units is to use a serial architecture. We put in the context words as before, but now we also put in a candidate for the next word, in the same way as the context words. Then we go forward through the net, and what we output is a score for how good that candidate word is in that context. Of course, we have to run forward through this net many, many times, but most of the work only needs to be done once: the inputs from the context to that big hidden layer are the same for every different candidate word. The only part we need to rerun for each candidate word is the inputs coming from the candidate word and the final output to the score, and that part doesn't have many weights in it. So we try all the candidate words one at a time. And by putting in the word as a candidate at the bottom, we're able to use the feature vector for that word that we learned when it was a context word.
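The forward pass just described can share the expensive context computation across all candidates. Here is a minimal Python sketch of that idea, assuming a single hidden layer with a tanh nonlinearity; the layer sizes, weight names, and nonlinearity are illustrative assumptions rather than the exact network from the lecture.

```python
import numpy as np

# Minimal sketch of the serial architecture (illustrative shapes, not the exact net).
rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim, context_len = 100_000, 50, 200, 3

E = rng.normal(scale=0.01, size=(vocab_size, embed_dim))        # shared word embeddings
W_ctx = rng.normal(scale=0.01, size=(context_len * embed_dim, hidden_dim))
W_cand = rng.normal(scale=0.01, size=(embed_dim, hidden_dim))   # candidate-word input weights
w_out = rng.normal(scale=0.01, size=hidden_dim)                 # hidden layer -> scalar score

def score_candidates(context_ids, candidate_ids):
    """Score each candidate word in the given context, sharing the context work."""
    # The context's contribution to the hidden layer is computed once ...
    ctx_input = E[context_ids].reshape(-1) @ W_ctx               # (hidden_dim,)
    scores = []
    for cand in candidate_ids:
        # ... and only the candidate's small contribution is recomputed per word.
        h = np.tanh(ctx_input + E[cand] @ W_cand)
        scores.append(h @ w_out)
    return np.array(scores)

# e.g. rescoring a short list of candidates proposed by some other predictor
print(score_candidates([12, 845, 7], candidate_ids=[33, 91, 2045]))
```

The point of the loop is that `ctx_input`, which involves the big weight matrix, is computed once, while only the small candidate-specific part is recomputed for each word.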
So we can have the same representation of the word when it's part of the context and when it's a candidate for the next word that we're trying to predict.

Learning in the serial architecture works in the following way. We first compute the score for each possible candidate word, and then we put all of those scores, which we computed sequentially, into a big softmax to get word probabilities. The difference between those word probabilities and their target probabilities, which are normally one for the correct word and zero for everything else, gives us cross-entropy error derivatives, and we use those derivatives to change the weights in such a way that we raise the score for the correct candidate and lower the scores for its high-scoring rivals. We can save a lot of time in this architecture if, instead of considering all possible candidate words, we only consider a small set, perhaps candidate words suggested by some other predictor. For example, we could use the neural net to revise the probabilities of the words that a trigram model thinks are likely.

A different way to avoid a great big softmax is to structure the words into a tree. We arrange all of the words in a binary tree with the words as its leaves. We then use the context of previous words to generate a prediction vector v, and we compare that prediction vector with a vector that we learn for each node of the tree. The way we do the comparison is by taking the scalar product of the prediction vector and the vector we've learned for the node, and then applying the logistic function to that scalar product. That gives us the probability of taking the right branch in the tree, and one minus that gives us the probability of taking the left branch. So the tree looks like this: if sigma is the logistic function, you can see at the top of the tree that we take the logistic of the prediction vector times the vector u_i that we learned for the top node, and that tells us the probability of taking the right branch. Conversely, we take the left branch with one minus that probability, and so on, all the way down the tree to the word we want.
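That branch rule, the logistic of a scalar product giving the probability of going right, multiplies along the path from the root down to a leaf. A minimal sketch, with hypothetical helper names and node labels:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(v, path_nodes, path_directions, node_vectors):
    """Probability of a word = product of branch probabilities along its path.
    path_directions[k] is 1 for a right branch and 0 for a left branch."""
    p = 1.0
    for node, go_right in zip(path_nodes, path_directions):
        p_right = sigmoid(v @ node_vectors[node])   # logistic of the scalar product
        p *= p_right if go_right else (1.0 - p_right)
    return p

# e.g. a word whose leaf is reached by going right at node "i", left at "j", right at "m"
rng = np.random.default_rng(0)
nodes = {name: rng.normal(size=8) for name in "ijm"}
v = rng.normal(size=8)
print(word_probability(v, ["i", "j", "m"], [1, 0, 1], nodes))
```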
So, when we're learning, we use the context to get a prediction vector. In this work, we used quite a simple neural network: we take the stored, learned feature vector for each context word, and those feature vectors directly contribute evidence in favor of a prediction vector. We simply add up the evidence contributed by those feature vectors, and that gives us the prediction vector.

That prediction vector then gets compared with the vectors that have been learned for all the nodes in the tree on the path to the correct next word. So that would be nodes i, j, and m in this tree. The red path shows you the path to the word that actually occurred next, and those are the only nodes we need to consider during learning. What we know is that we'd like the probability of predicting that word to be as high as possible, and so we'd like the probability of taking that path to be as high as possible. So we'd like the product of the probabilities on the individual branches of that path to be as high as possible, and that means we'd like the sum of their log probabilities to be high. So we benefit from a nice decomposition here: when we try to maximize the probability of picking the correct target word, we're really trying to maximize the sum of the log probabilities of taking all the branches on the path that leads to that target word.

So, during learning, we only need to consider the nodes on that correct path, and that's a huge win: it's exponentially fewer nodes than considering all of them, log to the base two of N instead of N. For each of those nodes, we know the correct branch, because we know what the next word is, and we know the current probability of taking that branch by comparing the prediction vector with the learned vector of the node. And so we can get derivatives for learning both the prediction vector v and the learned vector u of that node. This makes training hundreds of times faster. Unfortunately, it's still slow at test time: at test time, you need to know the probabilities of many words to help a speech recognizer, and so you can't just consider one path.
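The learning step described above, maximizing the sum of the log probabilities of the correct branches, has simple derivatives for both the prediction vector v and each node vector u. A sketch under the same assumptions as the previous snippet (the function and variable names are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_loss_and_grads(v, path_nodes, path_directions, node_vectors):
    """Negative sum of log branch probabilities on the correct path,
    plus gradients for the prediction vector v and each node vector u."""
    loss, grad_v, grad_u = 0.0, np.zeros_like(v), {}
    for node, go_right in zip(path_nodes, path_directions):
        u = node_vectors[node]
        p_right = sigmoid(v @ u)
        p_branch = p_right if go_right else 1.0 - p_right
        loss -= np.log(p_branch)
        err = (1.0 if go_right else 0.0) - p_right   # d(log p_branch) / d(v.u)
        grad_v += -err * u                           # gradient of the loss, hence the sign flip
        grad_u[node] = -err * v
    return loss, grad_v, grad_u
```

Only the roughly log2(N) node vectors on the correct path get gradients for a given training case, which is where the speedup comes from.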
There's a much simpler way to learn feature vectors for words. This is the approach taken by Collobert and Weston. What they did was learn feature vectors for words and then show that the feature vectors they learned were very good for a whole bunch of different natural language processing tasks. They're not trying to predict the next word; they're just trying to get good feature vectors for words, and so they use both the past context and the future context. They look at a window of eleven words, five in the past and five in the future, and in the middle of that window they put either the correct word, the one that actually occurred in the text, or a random word. Then they train a neural net to produce an output that's high if it's the correct word and low if it's a random word. The neural net works the same way as before: we map the individual words to feature vectors, those word codes, and then we use the feature vectors in the neural net, possibly with more layers, to try to predict whether this is the right or wrong word for that context. So what they're really doing is judging whether the middle word is an appropriate word for the five-word context on either side of it. They trained this up on about 600 million examples from Wikipedia, and they showed that the vectors they get are good for a variety of different natural language processing tasks.

One way of getting a sense of the vectors they learn for words is to display the vectors in a two-dimensional map. We want to lay out the word vectors in such a way that very similar vectors are very close to one another; then you'll discover which words the neural network thinks have similar meanings. We're going to use a multi-scale method called t-SNE. You can look up t-SNE on Google and discover how it works if you want. In addition to putting very similar words close to each other, t-SNE is also able to put similar clusters close to each other, so it gives you structure at many different scales. What we see is that the learned feature vectors capture lots of subtle semantic distinctions, and they do this just by looking at strings of words from Wikipedia. Nobody tells them anything other than the fact that these eleven words occurred in the string. There's no extra supervision.
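Going back to the training setup just described, an eleven-word window whose middle word is either the real one or a random one, here is a minimal sketch of one way to implement it. A margin ranking loss is one natural reading of "high for the correct word, low for a random word"; the layer sizes, names, and margin are placeholder assumptions, not details from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim, window = 100_000, 50, 100, 11

E = rng.normal(scale=0.01, size=(vocab_size, embed_dim))         # learned word feature vectors
W1 = rng.normal(scale=0.01, size=(window * embed_dim, hidden_dim))
w2 = rng.normal(scale=0.01, size=hidden_dim)

def window_score(word_ids):
    """Scalar score for an eleven-word window (five past words, middle word, five future words)."""
    h = np.tanh(E[word_ids].reshape(-1) @ W1)
    return h @ w2

def ranking_loss(window_ids, random_word):
    """The correct window should score higher than the corrupted one by a margin."""
    corrupted = list(window_ids)
    corrupted[window // 2] = random_word          # replace the middle word with a random word
    return max(0.0, 1.0 - window_score(window_ids) + window_score(corrupted))

# e.g. an eleven-word window of word ids, with word id 4242 as the random replacement
print(ranking_loss(list(range(100, 111)), random_word=4242))
```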
What's remarkable is that that contextual information, the fact that these words occurred together, tells you an awful lot about what a word means. In fact, some people think that's the main way we learn the meanings of words. So, here's an example. If I give you a sentence with a word you've never heard before, like "She scrammed him with a frying pan", then from that one sentence you already have a pretty good idea what scrammed means. It's conceivable that she was trying to impress him with her cooking skills, so that scrammed means impressed or something like that, but probably it means something like walloped.

So, here's part of a two-dimensional map in which we laid out the two and a half thousand commonest words, and you'll see this part of the map is all about games. Not only that, it's got similar kinds of words together. So matches and games and races are together. It's got players and teams and clubs together. It's got the things you win at games together, like cups and bowls and medals and prizes. It's got different kinds of games together. It's done a very good job of taking these words to do with games and finding out which ones are very similar in meaning. And because it's using very similar feature vectors for those, it can, in any natural language processing task, say that if one word was a good word for that context, the other word is probably also a pretty good word for that context.

Here's another part of the map. This part of the map is entirely about places. At the top it has US states, under that it has some cities, mainly ones in North America, under that other cities, and then it has some countries. So, if you look at the cities, this one is clearly Cambridge, and underneath Cambridge there's something else that's very similar to Cambridge. Here, we see that it's put Toronto with Detroit and Ontario and Boston; Toronto's in English-speaking Canada. And it's put Quebec, which is in French-speaking Canada, with Berlin and Paris. If we look at the bottom, we can see that it thinks Iraq is pretty similar to Vietnam.

Here's another example. If you look here, these are adverbs. It's understood that likely and probably and possibly and perhaps all mean very similar things. It's also understood that entirely, completely, fully, and greatly have very similar meanings.
And it's understood various other kinds of similarity. For example, which and that, or whom and what, or how and whether and why.
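A two-dimensional map like the ones described above can be produced with an off-the-shelf t-SNE implementation. A minimal sketch, assuming the learned word vectors and the matching word list have already been saved to disk; the file names and t-SNE settings here are placeholders, not details from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder inputs: a (num_words, embed_dim) array of learned word vectors
# and the matching list of word strings (both file names are hypothetical).
word_vectors = np.load("word_vectors.npy")
words = open("words.txt").read().split()

# t-SNE maps the high-dimensional vectors to 2-D, keeping similar words close together.
coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(word_vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=1, alpha=0)  # invisible points, just to set axis limits
for (x, y), w in zip(coords, words):
    plt.text(x, y, w, fontsize=6)
plt.axis("off")
plt.show()
```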