In this video, we're going to look at a practical use for feature vectors that represent words. The use is in speech recognition systems, where having a good idea of what somebody might say next is very helpful in recognizing the sounds they make.

When we're trying to do speech recognition, it's impossible to identify phonemes perfectly in noisy speech. The acoustic input just isn't good enough; it's often ambiguous, and there may be several different words that fit the acoustic signal equally well. We don't notice this a lot of the time, because we're so good at using the meaning of the utterance to hear the right words. So if I read the next comment on the slide, I would say "we do this unconsciously when we wreck a nice beach," and you would hear "we do this unconsciously when we recognize speech." You can actually hear the slight difference between "wreck a nice beach" and "recognize speech," but if you're expecting "recognize speech," particularly if there's noise around, you wouldn't hear "wreck a nice beach." We're very good at doing this, and we do it all the time, unconsciously. That means speech recognizers have to know which words are likely to come next and which are not. Fortunately, words can be predicted quite well without a full understanding of what's being said.

There's a standard method for predicting the probabilities of the various words that might come next; it's called the trigram method. You take a huge amount of text and count the frequencies of all triples of words. Then you use these frequencies to bet on the relative probabilities of the next word given the previous two words. So if we've heard the words A and B, we can look at the counts in our huge body of text for the sequence ABC and the sequence ABD, and say that the relative probability that the third word will be C rather than D is given by the ratio of those two counts. Until very recently, this was the state-of-the-art method for getting the probability of the next word to help out the speech recognizer.

We can't use much bigger contexts than the two previous words, because there are just too many possibilities to store, and if we did use bigger contexts, the counts would mostly be zero. Even for two-word contexts, there are many contexts you will never have heard. For example, if I say "dinosaur pizza," that's probably a string of two words you've never heard before. For cases like that, we have to back off to individual words: after "dinosaur pizza," you predict the next word by just seeing what's likely to come after the word "pizza," because you've never seen "dinosaur pizza" before. What you mustn't do is say the probability is zero just because you haven't seen an example; that's clearly not true.

Now, the trigram model fails to use a lot of obvious information that would help you predict the next word. Suppose, for example, you have seen the sentence "the cat got squashed in the garden on Friday." That should help you predict the words in the sentence "the dog got flattened in the yard on Monday." But the trigram model doesn't understand the similarities between words like cat and dog, squashed and flattened, garden and yard, or Friday and Monday, so it can't use past experience with one of those words to help it with the other.
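To make the counting concrete, here is a minimal Python sketch of the kind of trigram counting and back-off the lecture describes. The tiny corpus, the function names, and the add-one smoothing used to avoid ever assigning probability zero are illustrative choices, not something specified in the lecture.

```python
from collections import Counter

def train_counts(tokens):
    """Count all trigrams and bigrams in a list of tokens."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def relative_odds(trigrams, bigrams, w1, w2, c, d):
    """Relative odds that the next word is c rather than d after the context (w1, w2)."""
    count_c, count_d = trigrams[(w1, w2, c)], trigrams[(w1, w2, d)]
    if count_c + count_d == 0:
        # Never seen this two-word context (e.g. "dinosaur pizza"):
        # back off to counting what follows the single word w2.
        count_c, count_d = bigrams[(w2, c)], bigrams[(w2, d)]
    # Add-one smoothing so an unseen continuation never gets a zero estimate.
    return (count_c + 1) / (count_d + 1)

# Tiny illustrative corpus (not from the lecture).
corpus = "we recognize speech better when we expect speech not beach".split()
tri, bi = train_counts(corpus)

print(relative_odds(tri, bi, "we", "recognize", "speech", "beach"))
# Unseen two-word context ("dinosaur", "we") backs off to what follows "we".
print(relative_odds(tri, bi, "dinosaur", "we", "recognize", "speech"))
```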
To overcome this limitation, what we need to do is convert the words into vectors of semantic and syntactic features, and use the features of previous words to predict the features of the next word. Using a feature representation allows us to use a much bigger context that contains many more words, for example the ten previous words.

Yoshua Bengio pioneered this approach to language models, and his initial network for doing it looks rather familiar. It's actually very similar to the family trees network, just applied to a real problem and much bigger. At the bottom, you can think of it as putting in the index of a word, represented by a set of neurons of which just one is on. The weights from that active neuron then determine the pattern of activity in the next hidden layer, so the weights from the active neuron in the bottom layer give you the pattern of activity in the layer that holds the distributed representation of the word, that is, its feature vector. This is just equivalent to doing table look-up: you have a stored feature vector for each word, and with learning you modify that feature vector, which is exactly equivalent to modifying the weights coming from a single active input unit.

After getting the distributed representations of a few previous words (I've only shown two here, but you would typically use, say, five), you can use those distributed representations, via a hidden layer, to predict, via a huge softmax, the probabilities of all the various words that might come next. One extra refinement that makes it work better is to use skip-layer connections that go straight from the input words to the output word, because the individual input words are individually quite informative about what the output word might be. Bengio's model was actually slightly worse than trigrams at predicting the next word, but it was in the same ballpark, and combining it with trigrams improved things. Since then, these language models that use feature vectors for words have been improved considerably, and they're now considerably better than trigram models.

One problem with having a big softmax output layer is that you might have to deal with 100,000 different output words, because typically in these language models the plural of a word is a different word from the singular, and the various tenses of a verb are different words from one another. So each unit in the last hidden layer of the net might need 100,000 outgoing weights, and that means we can only afford a few units there before we start over-fitting. That's not necessarily true: we might have a huge number of training cases, and an organization like Google might have so much training text that it can afford a very big softmax layer. We could try to make the last hidden layer small so we don't need too many weights, but then we have the problem of getting the 100,000 probabilities of the various words that might come next fairly accurately right. It's not just the big probabilities we need: a very large number of words will have small probabilities, and the small probabilities are often relevant. It might be that the speech recognizer has to decide between two different rare words, and then it's very relevant which of those is more common in the context, even though both of them are pretty unlikely. The question is: is there a better way to deal with such a large number of outputs?
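As a rough illustration of the architecture just described, here is a small numpy sketch of the forward pass: a table look-up of feature vectors for the previous words, a hidden layer, skip-layer connections straight from the word features to the outputs, and a big softmax over the whole vocabulary. All the sizes, weight names, and random (untrained) values are assumptions for illustration; this shows the shape of the computation, not Bengio's actual trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes -- a real system might use a vocabulary of ~100,000 words.
vocab_size, embed_dim, hidden_dim, context = 10_000, 100, 200, 5

# Table look-up: one learned feature vector per word (equivalent to the weights
# from a single active one-hot input unit).
embeddings = rng.normal(0, 0.01, (vocab_size, embed_dim))

# Hidden layer combining the feature vectors of the previous words.
W_hidden = rng.normal(0, 0.01, (context * embed_dim, hidden_dim))
# Output layer: one column of weights per word in the vocabulary.
W_out = rng.normal(0, 0.01, (hidden_dim, vocab_size))
# Skip-layer connections straight from the concatenated word features to the outputs.
W_skip = rng.normal(0, 0.01, (context * embed_dim, vocab_size))

def predict_next_word_probs(prev_word_ids):
    """Forward pass: indices of previous words -> softmax over the whole vocabulary."""
    feats = embeddings[prev_word_ids].reshape(-1)    # look up and concatenate feature vectors
    hidden = np.tanh(feats @ W_hidden)               # distributed representation of the context
    logits = hidden @ W_out + feats @ W_skip         # hidden path plus skip connections
    logits -= logits.max()                           # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()                       # big softmax over all candidate next words

probs = predict_next_word_probs(np.array([12, 7, 3041, 5, 999]))
print(probs.shape, probs.sum())                      # (10000,) 1.0
```

Note that W_out and W_skip together hold millions of weights feeding the 100,000-way softmax, which is exactly the cost the lecture is pointing at.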
And we'll see several different ways of doing that in the next video.