This video is about all kinds of problems that you can have with vocabulary in machine translation.

First, the vocabulary is usually too large, so it takes too long to compute the softmax. Second, the vocabulary is usually also too small, so you get out-of-vocabulary words, and you need to deal with them somehow. There is a whole range of different tricks for this in neural language modeling and in machine translation.

Now, let us start with hierarchical softmax. It is a nice procedure that helps you compute the softmax quickly.

The idea is to build a binary tree over your words, like here. The words in this tree get codes. For example, the word zebra gets the code 01, which means that first we go to the left, which is 0, and then we go to the right, which is 1. Importantly, this is a unique mapping between words and codes in the tree.

We are going to use this property and produce the probabilities of words from their binary codes. This is how we do it. Instead of computing the probability of a word and normalizing it across the whole vocabulary, we split it into separate terms. Each term is the probability of one digit, that is, the probability of the next decision in the tree: whether to go to the left or to the right. We multiply these probabilities together, so that the product estimates the probability of the whole word.

Now, do you believe that this product of binary probabilities is normalized to one across the whole vocabulary? Well, it sounds like magic, doesn't it? But let's see that it actually happens.

So this is your binary tree, and you see those probabilities; they are written down for every edge in the tree. Now, try to sum the first two probabilities: 0.1 plus 0.9 gives 1, and this is not random. It is always 1, simply because these are the probabilities of the two options, going to the left and going to the right, so they necessarily sum to 1. When you sum them, you are left with their common prefix, which is 0.7 times 0.8. Then you sum two other children in the tree and again get their common prefix, and you sum two more children and again get their common prefix. Finally, you sum the two remaining values and get 1. So it is clear that if you go from the bottom to the top of this tree, the total probability is equal to 1.

To sum it up, we split the probability into these binary decisions, and we use some tree to do this. What kind of tree should it be? Well, it is actually not that important. Interestingly, even a random tree can help. Alternatively, you can use a Huffman tree, which knows the frequencies of the words and uses them. Or you can bring in some semantics: cluster the words in your data based on their similarities, or use some pre-built structure that says these words are similar and should end up in the same branch of the hierarchy.

Okay, now I have one last question for you about this method. How would we model the probability of the next digit, d_i, given everything else? Usually we use a softmax to model probabilities. What should it be in this case? Well, here we have only two options, left and right. So instead of a softmax, we can use a sigmoid function, which handles exactly two outcomes.
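To make this decomposition concrete, here is a minimal sketch of the scoring step, assuming each word already has a binary code and a list of vectors for the internal nodes on its path; the function name, the dot-product scoring, and the variable names are illustrative, and the tree construction and training loop are left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(context_vec, path_node_vecs, code):
    """P(word | context) as a product of binary decisions along the tree path.

    context_vec    : hidden state of the model, shape (d,)
    path_node_vecs : one vector per internal node on the path, each shape (d,)
    code           : binary code of the word, e.g. [0, 1] for "zebra"
    """
    prob = 1.0
    for node_vec, digit in zip(path_node_vecs, code):
        p_right = sigmoid(node_vec @ context_vec)          # P(go right | node, context)
        prob *= p_right if digit == 1 else (1.0 - p_right)  # 0 = left, 1 = right
    return prob
```

Because every factor is a probability over two choices, the probabilities of all leaves sum to exactly one by construction, and scoring a single word costs one sigmoid per digit of its code, which for a balanced tree is roughly log2 of the vocabulary size instead of a full pass over the vocabulary.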
Okay, now let us come back to the other problem with our vocabulary: out-of-vocabulary words. The copy mechanism is something that helps with them. Imagine you have some sentence and some of its words are UNK tokens: they are not in the vocabulary, you do not know how to translate them, and you get UNK tokens in the output as a result.

But what if you know the word alignments? If you know how the words are aligned, you can use that. You can say: this UNK in the translation is aligned to that source word, so let us do a dictionary translation, or let us just copy it as is, because it is the name of a place or some other proper name. This is why it is called the copy mechanism, and the algorithm is super simple. First, you need to make sure that you learn word alignments somehow; for example, your neural machine translation system takes these alignments as an input and tries to predict them along with the translation. Then you get your translation with UNK tokens and post-process it: you copy the source words, translate them with a dictionary, or do whatever else you want.
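As a rough illustration of that post-processing step, here is a sketch; it assumes the system also gives us, for each target position, the index of the aligned source word, and that the bilingual dictionary is just a Python dict. All of these names are hypothetical, not a fixed API.

```python
def replace_unks(target_tokens, source_tokens, alignments, dictionary):
    """Post-process a translation: substitute each UNK using its aligned source word.

    target_tokens : output of the NMT system, may contain "UNK"
    source_tokens : the source sentence
    alignments    : alignments[i] = index of the source word aligned to target position i
    dictionary    : bilingual dictionary; if the source word is missing, we copy it as is
    """
    result = []
    for i, token in enumerate(target_tokens):
        if token == "UNK":
            src_word = source_tokens[alignments[i]]
            result.append(dictionary.get(src_word, src_word))  # translate or copy
        else:
            result.append(token)
    return result
```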
The copy mechanism is a very simple and nice technique, but there are still many problems. For example, you can have multi-word alignments. What if the morphology of the language is complicated and you would rather split words into smaller parts? Or what if you have informal spelling? All these things usually end up as out-of-vocabulary words, and these examples show that sometimes it is very useful to go down to the sub-word level. For rich morphology, it would be nice to model every piece of a word separately. For informal spelling, it would definitely be good to model words by letters, because there is no chance of finding these words as a whole in the vocabulary.

So there are two big trends in sub-word modelling. One big trend is hybrid models that somehow combine word-level and character-level models. The other big trend is to keep the same architecture, say a recurrent neural network, but feed it smaller units, something in between words and characters: one architecture, but different units.

Let us start with hybrid models. You might know that character-based models are sometimes super useful. For example, take the word drinkable. If you build a convolutional neural network with some convolutions that capture the meaning of drink and other convolutions that capture the meaning of able, you are likely to build the meaning of the whole word, even if you have never seen it before. So character-level convolutional neural networks can work fine, and bidirectional LSTMs can be used on the character level as well as on the word level.

Now, for our translation model, it is very convenient to have a hybrid model. Let us first work on the word level and try to produce word translations. For example, we have "a cute cat" here: a and cat are fine words, but what if cute is out of vocabulary? We cannot model it on the word level. In that case, we have a separate unit, a separate architecture, maybe even a completely different one, that models the probabilities for this word on the character level, and the same on the decoder side. So first we decode the sequence on the word level, and at some moments the decoder says: okay, this is an UNK token, please do something about it. Then the character-level model switches on and handles it. This is a very nice and natural idea, and it is used a lot.
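For the character-level component, here is a tiny sketch in the spirit of the drinkable example: character embeddings, a few one-dimensional convolution filters over character windows, and max-pooling over positions. The sizes, the random embeddings, and the random filters are purely illustrative; a real model would learn them jointly with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)
CHARS = "abcdefghijklmnopqrstuvwxyz"
char_emb = {c: rng.normal(size=8) for c in CHARS}   # 8-dim character embeddings

def char_cnn_encode(word, filters, width=5):
    """Encode a word as a vector with a character-level CNN.

    filters : array of shape (n_filters, width * 8); each row is one convolution
              filter over a window of `width` characters (it may learn to fire
              on character n-grams such as "drink" or "able").
    """
    # embed the characters (unknown symbols are skipped) and pad short words
    mat = np.stack([char_emb[c] for c in word.lower() if c in char_emb])
    if len(mat) < width:
        mat = np.vstack([mat, np.zeros((width - len(mat), 8))])
    # slide the window over character positions
    windows = np.stack([mat[i:i + width].ravel() for i in range(len(mat) - width + 1)])
    # max-pool each filter's response over all positions
    return np.max(windows @ filters.T, axis=0)

filters = rng.normal(size=(32, 5 * 8))           # 32 filters of width 5
vector = char_cnn_encode("drinkable", filters)   # works even for words never seen in training
```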
Now, the second big trend is sub-word modeling, and one good example of it is byte-pair encoding. Let us understand what it is. Imagine you have some sentence and you want to encode it, but you are not quite sure yet what your vocabulary is; you only have a constraint on its size. You start with characters: everything is split into single characters. Then you ask: what are the most frequent bigrams of letters here? Well, I see that s-h occurs three times, so maybe I should collapse it into a single unit. You do this, and now you have a slightly different vocabulary. What is next? Next, I see that these two letters occur a lot, so let us collapse them, and these two letters should be collapsed as well. Importantly, you can then apply the same procedure to sub-word units: here you would collapse your bigrams and unigrams into trigrams. And you can stop whenever you want, so you proceed until you reach the vocabulary size that suits you.

One more thing to mention is how to apply this method to test data. You also split the test data into letters first. Then you take the exact merge rules learned during training and apply them to the test data, collapsing the sub-word units as needed.
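Here is a compact sketch of this merge procedure, together with how the learned rules are replayed on new (test) words; the function names and the toy word counts are mine, chosen only to echo the s-h example above.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Collapse every occurrence of the pair into one new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in vocab.items()}

def learn_bpe(word_counts, num_merges):
    """Learn merge rules; num_merges directly controls the final vocabulary size."""
    vocab = {" ".join(word): freq for word, freq in word_counts.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def apply_bpe(word, merges):
    """Segment a test word: split into letters, then replay the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = learn_bpe({"she": 3, "shells": 3, "sells": 2, "sea": 1}, num_merges=10)
print(apply_bpe("shell", merges))   # -> ['she', 'll'] for this toy corpus
```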
Now, this chart shows why byte-pair encoding is such a nice technique. Here is the vocabulary size, and this vertical line marks the vocabulary size we are allowed. With words, you usually have a long tail that goes beyond this allowed budget. But with byte-pair encoding, you can hit exactly this number of units, because you decide how many merge rules to apply.

And finally, I can show you that this actually works well. For different language pairs and different tasks, for example from WMT, the Workshop on Machine Translation, you can see that byte-pair encoding gets a better BLEU score than word-based techniques. And this BLEU improvement is very meaningful: one or two BLEU points is a lot. So please use this very nice technique if you someday need to build a machine translation system.