In this video, we're going to look at an important practical use for feature vectors that represent words. The use is in speech recognition systems, where having a good idea of what somebody might say next is very helpful in recognizing the sounds they make.

When we're trying to do speech recognition, it's impossible to identify phonemes perfectly in noisy speech. The acoustic input just isn't good enough; it's often ambiguous, and there may be several different words that fit the acoustic signal equally well. We don't notice this a lot of the time, because we're so good at using the meaning of the utterance to hear the right words. So if I read the next comment on the slide, I would say "we do this unconsciously when we wreck a nice beach", and you would hear "we do this unconsciously when we recognize speech". You can actually hear the slight difference between "wreck a nice beach" and "recognize speech", but if you're expecting "recognize speech", particularly if there's noise around, you wouldn't hear "wreck a nice beach". So we're very good at doing this; we do it all the time, and we do it unconsciously. That means that speech recognizers have to know which words are likely to come next and which are not. Fortunately, words can be predicted quite well without having a full understanding of what's being said.

There's a standard method for predicting the probabilities of the various words that might come next, called the trigram method. You take a huge amount of text and count the frequencies of all triples of words. Then you use these frequencies to make bets on the relative probabilities of the next word given the previous two words. So if we've heard the words a and b, we can look at the counts we have in our huge body of text for the sequence a-b-c and the sequence a-b-d, and the relative probability that the third word will be c versus d is given by the ratio of those two counts. Until very recently, this was the state-of-the-art method for getting the probability of the next word to help out the speech recognizer. We can't use much bigger contexts than the two previous words, because there are just too many possibilities to store, and if we did use bigger contexts, the counts would be mostly zero.
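As a concrete illustration of the trigram idea, here is a minimal Python sketch; the tiny corpus and the particular words are made up for illustration. It counts all consecutive word triples and compares the counts for two candidate next words given the same two-word context.

```python
from collections import Counter

# A toy corpus standing in for the "huge body of text" (made-up example data).
corpus = ("we recognize speech every day and we recognize speech in noise "
          "but we rarely recognize beach in conversation").split()

# Count the frequencies of all triples of consecutive words.
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

def compare_next_words(a, b, c, d):
    """Counts for 'a b c' versus 'a b d'; their ratio is the relative bet on c versus d."""
    return trigram_counts[(a, b, c)], trigram_counts[(a, b, d)]

n_abc, n_abd = compare_next_words("we", "recognize", "speech", "beach")
print(f"count(we recognize speech) = {n_abc}, count(we recognize beach) = {n_abd}")
# The relative probability of 'speech' versus 'beach' after 'we recognize'
# is n_abc / n_abd (when both counts are nonzero).
```

In this toy corpus the count for "we recognize beach" comes out as zero, which already hints at the sparseness problem discussed next.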
Even for two-word contexts, there are many contexts that you will never have heard. For example, if I say "dinosaur pizza", that's probably a string of two words that you've never heard before. For cases like that, we have to back off to individual words. So after "dinosaur pizza" you predict the next word by just seeing what's likely to come after the word "pizza", because you've never heard "dinosaur pizza" before. What you mustn't do is say that the probability is zero just because you haven't seen an example. That's clearly not true.

Now, the trigram model fails to use a lot of obvious information that would help you predict the next word. Suppose, for example, you have seen the sentence "the cat got squashed in the garden on Friday". That should help you predict the words in the sentence "the dog got flattened in the yard on Monday". In particular, the trigram model doesn't understand the similarities between words like cat and dog, or squashed and flattened, or garden and yard, or Friday and Monday. So it can't use past experience with one of those words to help it with the other one.

To overcome this limitation, what we need to do is convert the words into a vector of semantic and syntactic features, and use the features of previous words to predict the features of the next word. Using a feature representation allows us to use a much bigger context that contains many more words, for example the ten previous words.

Yoshua Bengio pioneered this approach for language models, and his initial network for doing this looks rather familiar. It is actually very similar to the family trees network, just applied to a real problem, and much bigger. At the bottom you can think of it as putting in the index of a word, and you can think of that as a set of neurons of which just one is on. The weights from that active neuron then determine the pattern of activity in the layer that has the distributed representation of the word, that is, its feature vector. But this is just equivalent to doing table look-up: you have a stored feature vector for each word, and with learning you modify that feature vector, which is exactly equivalent to modifying the weights coming from a single active input unit.
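To see why the weights from a single active input unit are the same thing as a stored feature vector, here is a tiny NumPy check; the vocabulary size and feature dimension are made-up numbers.

```python
import numpy as np

vocab_size, embed_dim = 10, 4                   # made-up sizes for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))    # weights from the input units to the
                                                # distributed-representation layer

word_index = 7
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0                       # a set of neurons of which just one is on

via_weights = one_hot @ W                       # activity produced by the active input neuron
via_lookup  = W[word_index]                     # simple table look-up of the stored feature vector

assert np.allclose(via_weights, via_lookup)
# Updating row `word_index` of W during learning is therefore exactly the same
# as modifying that word's feature vector.
```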
After getting distributed representations of a few previous words (I've only shown two here, but you would typically use, say, five), you can then use those distributed representations, via a hidden layer, to predict, via a huge softmax, what the probabilities are for all the various words that might come next. One extra refinement that makes it work better is to use skip-layer connections that go straight from the input words to the output words, because the individual input words are individually quite informative about what the output word might be.

Bengio's model was actually slightly worse than trigrams at predicting the next word, but it was in the same ballpark, and if you combined it with trigrams it improved things. Since then, these language models that use feature vectors for words have been improved considerably, and they're now considerably better than trigram models.

One problem with having a big softmax output layer is that you might have to deal with 100,000 different output words, because typically in these language models the plural of a word is a different word from the singular, and the various tenses of a verb are different words from the other tenses. So each unit in the last hidden layer of the net might have to have a hundred thousand outgoing weights, and that means we can only afford to have a few units there before we start over-fitting. Well, that's not necessarily true: we might have a huge number of training cases, so some organization like Google might have so much training text that it can afford a very big softmax layer. Alternatively, we could try to make the last hidden layer small so we don't need too many weights, but then we have the problem that we have to get the 100,000 probabilities of the various words that might come next fairly accurately. It's not just the big probabilities we need: a very large number of words will have small probabilities, and the small probabilities are often relevant. It might be that the speech recognizer has to decide between two different rare words, and then it's very relevant which of those is more common in the context, even though both of them are pretty unlikely.

The question is, is there a better way to deal with such a large number of outputs? We'll see several different ways of doing that in the next video.
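To make the architecture just described concrete, here is a minimal NumPy sketch of the forward pass of a Bengio-style network; it is not his exact model. All the sizes are made up, the weights are random rather than learned, and only the prediction step is shown. The feature vectors of the previous words are looked up and concatenated, passed through a hidden layer, and combined with skip-layer connections straight from the feature vectors to a softmax over the whole vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, context, hidden_dim = 1000, 50, 5, 200   # made-up sizes

# Parameters (randomly initialized here; in practice they are learned).
C   = rng.normal(0, 0.1, (vocab_size, embed_dim))               # word feature vectors
H   = rng.normal(0, 0.1, (context * embed_dim, hidden_dim))     # feature vectors -> hidden layer
U   = rng.normal(0, 0.1, (hidden_dim, vocab_size))              # hidden layer -> softmax
W   = rng.normal(0, 0.1, (context * embed_dim, vocab_size))     # skip-layer connections
b_h = np.zeros(hidden_dim)
b_o = np.zeros(vocab_size)

def next_word_probs(prev_word_indices):
    """Probabilities over the whole vocabulary for the next word."""
    x = C[prev_word_indices].reshape(-1)      # look up and concatenate feature vectors
    h = np.tanh(x @ H + b_h)                  # hidden layer
    logits = h @ U + x @ W + b_o              # softmax inputs, plus the skip-layer term
    e = np.exp(logits - logits.max())         # numerically stable softmax
    return e / e.sum()

p = next_word_probs([3, 17, 42, 7, 99])       # five previous words (arbitrary indices)
print(p.shape, p.sum())                       # (1000,) 1.0
```

Training would adjust C, H, U, W and the biases by backpropagating the error of the softmax predictions, which is exactly where the cost of a 100,000-way output layer shows up.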