In this video, we're going to look at various ways to avoid having to use 100,000 different output units in the softmax when we want to get probabilities for 100,000 different words. At the end of the video, I'll show an example of the word representations that are learned by a particular method. To see what these representations look like, we embed them in a two-dimensional space, and then we can see which words have extremely similar representations. That gives us a feel for what the neural network has been able to learn, just by trying to predict the next word, or perhaps the middle word, of a string of words.

So, one way to avoid having 100,000 different output units is to use a serial architecture. We put in the context words as before, but now we also put in a candidate for the next word, in the same way as the context words. Then we go forward through the net, and what we output is a score for how good that candidate word is in that context. Of course, we have to run forward through this net many, many times, but most of the work only needs to be done once: the inputs from the context to that big hidden layer are the same for every candidate word. The only part we need to rerun for each candidate is the input coming from the candidate word and the final output score, and that doesn't involve many weights. So, we try all the candidate words one at a time. And by putting in the word as a candidate at the bottom, we're able to reuse the learned feature vector for that word, the one we learned when it was a context word. So, we have the same representation of a word when it's part of the context and when it's a candidate for the next word we're trying to predict.

Learning in the serial architecture works in the following way. We first compute the score for each possible candidate word, and then we put all of those scores, which we computed sequentially, into a big softmax to get word probabilities. The difference between the word probabilities and their target probabilities, which is normally one for the correct word and zero for everything else, gives us the cross-entropy error derivatives, and we use those derivatives to change the weights in such a way that we raise the score for the correct candidate and lower the scores for its high-scoring rivals. We can save a lot of time in this architecture if, instead of considering all possible candidate words, we only consider a small set, perhaps candidate words suggested by some other predictor. For example, we could use the neural net to revise the probabilities of the words that the trigram model thinks are likely.
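To make this concrete, here is a small sketch in Python of the idea. The shapes, variable names, and the use of two context words with a single scoring layer are illustrative assumptions, not the exact architecture from the lecture; the point is that the context part of the computation is shared, only a small amount of work is done per candidate, and the sequentially computed scores go through one big softmax whose cross-entropy derivatives are simply the probabilities minus the targets.

```python
import numpy as np

# Minimal sketch of the serial architecture (illustrative shapes and names).
rng = np.random.default_rng(0)

vocab_size, embed_dim, hidden_dim = 1000, 50, 200
embeddings  = rng.normal(0, 0.1, (vocab_size, embed_dim))      # shared word feature vectors
W_context   = rng.normal(0, 0.1, (2 * embed_dim, hidden_dim))  # two context words -> hidden
W_candidate = rng.normal(0, 0.1, (embed_dim, hidden_dim))      # candidate word -> hidden
w_score     = rng.normal(0, 0.1, hidden_dim)                   # hidden -> scalar score

def score_candidates(context_ids, candidate_ids):
    """Score each candidate word in the given context, sharing the context work."""
    context_input = np.concatenate([embeddings[i] for i in context_ids])
    context_part = context_input @ W_context          # computed once for all candidates
    scores = []
    for c in candidate_ids:                           # only this small part runs per candidate
        hidden = np.tanh(context_part + embeddings[c] @ W_candidate)
        scores.append(hidden @ w_score)
    return np.array(scores)

# Put the sequentially computed scores through one big softmax ...
context = [12, 345]                            # toy ids for two previous words
candidates = np.arange(vocab_size)             # or a shortlist, e.g. from a trigram model
scores = score_candidates(context, candidates)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# ... and the cross-entropy derivative w.r.t. each score is just (probability - target):
correct = 345
dscore = probs.copy()
dscore[correct] -= 1.0     # raises the correct score, lowers high-scoring rivals
```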
A different way to avoid a great big softmax is to structure the words into a tree. We arrange all of the words in a binary tree with the words as its leaves, and we then use the context of previous words to generate a prediction vector v. We compare that prediction vector with a vector that we've learned for each node of the tree. The way we do the comparison is by taking the scalar product of the prediction vector and the vector we've learned for that node, and then applying the logistic function to that scalar product. That gives us the probability of taking the right branch in the tree, and one minus that gives us the probability of taking the left branch. So, the tree looks like this, and if sigma is the logistic function, you can see at the top of the tree that we take the logistic of the prediction vector times the vector u_i that we learned for the top node, and that tells us the probability of taking the right branch. Conversely, we take the left branch with one minus that probability, and so on, all the way down the tree to the word we want.

When we're learning, we use our context to get a prediction vector. In this work, we used quite a simple neural network that simply takes the stored, learned feature vector for each context word; those feature vectors directly contribute evidence in favor of a prediction vector. We simply add up the evidence contributed by those feature vectors, and that gives us the prediction vector. That prediction vector then gets compared with the vectors that have been learned for all the nodes in the tree on the path to the correct next word. So, that would be nodes i, j, and m in this tree. That red path shows you the path to the word that actually occurred next, and those are the only nodes we need to consider during learning. We'd like the probability of predicting that word to be as high as possible, so we'd like the probability of taking that path to be as high as possible. That means we'd like the product of the probabilities at the individual nodes on that path to be as high as possible, which is the same as wanting the sum of their log probabilities to be high. So, we benefit from a nice decomposition here: when we try to maximize the probability of picking the correct target word, we're really trying to maximize the sum of the log probabilities of taking all the branches on the path that leads to that target word. During learning, we only need to consider the nodes on that correct path, and that's a huge win: it's exponentially fewer nodes than considering all of them, log to the base two of N instead of N. For each of those nodes, we know the correct branch, because we know what the next word is, and we know the current probability of taking that branch by comparing the prediction vector with the learned vector for the node. And so, we can get derivatives for learning both the prediction vector v and the learned vector u for that node. This makes training hundreds of times faster. Unfortunately, it's still slow at test time. At test time, you need to know the probabilities of many words to help a speech recognizer, and so you can't just consider one path.
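As a rough sketch of the arithmetic involved (the path encoding and variable names here are assumptions for illustration, not the lecture's exact notation), the probability of a word is the product of the branch probabilities along its path, each one a logistic of a scalar product, and learning only touches the roughly log2(N) nodes on that path:

```python
import numpy as np

# Minimal sketch of the tree-based prediction idea. A word's path is assumed
# to be given as (node_id, branch) pairs, with branch 1 = right, 0 = left.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

embed_dim, num_nodes = 50, 7
node_vectors = rng.normal(0, 0.1, (num_nodes, embed_dim))   # learned vector u for each tree node

# Example path to one word: right at node 0, left at node 2, right at node 5.
path = [(0, 1), (2, 0), (5, 1)]

def log_prob_of_word(v, path):
    """Sum of log branch probabilities along the path to the word."""
    logp = 0.0
    for node, branch in path:
        p_right = sigmoid(v @ node_vectors[node])
        logp += np.log(p_right if branch == 1 else 1.0 - p_right)
    return logp

def gradients(v, path):
    """Derivatives for the prediction vector v and the node vectors on the path."""
    dv = np.zeros_like(v)
    dnodes = {}
    for node, branch in path:
        p_right = sigmoid(v @ node_vectors[node])
        err = branch - p_right            # d log p / d(v . u) for a logistic unit
        dv += err * node_vectors[node]
        dnodes[node] = err * v
    return dv, dnodes

v = rng.normal(0, 0.1, embed_dim)         # prediction vector computed from the context
print(log_prob_of_word(v, path))
dv, dnodes = gradients(v, path)           # only the log2(N) nodes on the path are touched
```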
There's a much simpler way to learn feature vectors for words. This is work by Collobert and Weston. What they did was learn feature vectors for words and then show that the feature vectors they learned were very good for a whole bunch of different natural language processing tasks. They're not trying to predict the next word; they're just trying to get good feature vectors for words, and so they use both the past context and the future context. They look at a window of eleven words: five in the past and five in the future. In the middle of that window, they put either the correct word, the one that actually occurred in the text, or a random word. Then they train a neural net to produce an output that's high if it's the correct word and low if it's a random word. The neural net works the same way as before: we map the individual words to feature vectors, the word codes, and then we use the feature vectors in the neural net, possibly with more layers, to try to predict whether this is the right or wrong word for that context. So, what they're really doing is judging whether the middle word is an appropriate word for the five-word context on either side of it. They trained this on about 600 million examples from Wikipedia, and they showed that the vectors they get are good for a variety of different natural language processing tasks.

One way of getting a sense of the vectors they learn for words is to display the vectors in a two-dimensional map. All we want to do is lay out the word vectors in such a way that very similar vectors are very close to one another. Then you'll discover what words the neural network thinks have similar meanings. We're going to use a multi-scale method called t-SNE. You can look up t-SNE on Google and discover how it works if you want. In addition to putting very similar words close to each other, t-SNE is also able to put similar clusters close to each other, so it gives you structure at many different scales. What we see is that the learned feature vectors capture lots of subtle semantic distinctions, and they do this just by looking at strings of words from Wikipedia. Nobody tells them anything other than the fact that these eleven words occurred in the string; there's no extra supervision. What's remarkable is that this contextual information, the fact that these words occurred together, tells you an awful lot about what a word means. In fact, some people think that's the main way we learn the meanings of words. So, here's an example. If I give you a sentence with a word you've never heard before, like "She scrammed him with a frying pan", then from that one sentence you already have a pretty good idea what scrammed means. It's conceivable that she was trying to impress him with her cooking skills, and so scrammed means impressed or something like that, but probably it means something like walloped. So, here's part of a two-dimensional map in which we laid out the two and a half thousand commonest words, and you'll see this part of the map is all about games. Not only that, it's got similar kinds of words together. It's got matches and games and races together. It's got players and teams and clubs together. It's got the things you win at games together, like cups and bowls and medals and prizes. It's got different kinds of games together. It's done a very good job of taking these words to do with games and finding out which ones are very similar in meaning.
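To give a concrete sense of how such a map can be produced, here is a small sketch using scikit-learn's t-SNE implementation. The word list and the random vectors standing in for learned feature vectors are purely illustrative assumptions; in practice the rows would be the vectors the network actually learned, one per word.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in for learned word feature vectors: one row per word. In practice these
# would be the vectors learned from Wikipedia (e.g. for the 2,500 commonest words).
rng = np.random.default_rng(0)
words = ["game", "match", "race", "player", "team", "club", "cup", "medal", "prize"]
word_vectors = rng.normal(size=(len(words), 50))

# t-SNE lays the high-dimensional vectors out in 2-D so that words with very
# similar vectors end up very close together, preserving structure at several scales.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(word_vectors)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.savefig("word_map.png")
```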
And because it's using very similar feature vectors for those words, it can say, in any natural language processing task, that if one word is a good word for a context, the other word is probably also a pretty good word for that context. Here's another part of the map. This part of the map is entirely about places. At the top, it has US states. Under that, it has some cities, mainly ones in North America, and under that, other cities; then it has some countries. So, if you look at the cities, this one is clearly Cambridge, and underneath Cambridge there's something else that's very similar to Cambridge. Here, we see that it's put Toronto with Detroit and Ontario and Boston; Toronto's in English-speaking Canada. And it's put Quebec, which is in French-speaking Canada, with Berlin and Paris. If we look at the bottom, we can see that it thinks Iraq is pretty similar to Vietnam. Here's another example. If you look here, these are adverbs. It's understood that likely and probably and possibly and perhaps all mean very similar things. It's also understood that entirely, completely, fully, and greatly have very similar meanings. And it's understood various other kinds of similarity. For example, which and that, or whom and what, or how and whether and why.