This video is about all kinds of problems that you can have with vocabulary in machine translation.

First, the vocabulary is usually too large, so it takes too long to compute the softmax. Second, the vocabulary is usually also too small, so you get out-of-vocabulary words, and you need to deal with them somehow. There is a whole range of different tricks for this in neural language modeling and in machine translation.

Now, let us start with hierarchical softmax. It is a nice procedure that helps you compute the softmax quickly.

The idea is to build a binary tree over your words, like here. The words in this tree get codes. For example, the word zebra gets the code 01, which means that first we go to the left, which is 0, and then we go to the right, which is 1. Importantly, this is a unique mapping between words and codes in the tree.

We are going to use this property and produce the probabilities of words from their binary codes. This is how we do it. Instead of computing the probability of a word and normalizing it across the whole vocabulary, we split it into separate terms. Each term is the probability of one digit, that is, the probability of the next decision in the tree: whether to go to the left or to the right. We multiply these probabilities together, so that the product estimates the probability of the whole word.

Now, do you believe that this product of binary probabilities is normalized to one across the whole vocabulary? Well, it sounds like magic, doesn't it? But let's see that it actually happens.

So this is your binary tree, and you see those probabilities; they are written down for every edge in the tree. Now, try to sum the first two probabilities: 0.1 plus 0.9 gives 1, and this is not random. It is always 1, simply because these are the probabilities of the two options, going to the left and going to the right, so they necessarily sum to 1. When you sum them, you are left with their common prefix, which is 0.7 times 0.8. Then you sum two other children in the tree and again get their common prefix, and you sum two more children and again get their common prefix. Finally, you sum the two remaining values and get 1. So it is clear that if you go from the bottom to the top of this tree, the total probability is equal to 1.

To sum it up, we split the probability into these binary decisions, and we use some tree to do this. What kind of tree should it be? Well, it is actually not that important. Interestingly, even a random tree can help. Alternatively, you can use a Huffman tree, which knows the frequencies of the words and uses them. Or you can bring in some semantics: cluster the words in your data based on their similarities, or use some pre-built structure that says these words are similar and should end up in the same branch of the hierarchy.

Okay, now I have one last question for you about this method. How would we model the probability of the next digit, d_i, given everything else? Usually we use a softmax to model probabilities. What should it be in this case? Well, here we have only two options, left and right. So instead of a softmax, we can use a sigmoid function, which handles exactly two outcomes.
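To make this decomposition concrete, here is a minimal sketch of the scoring step, assuming each word already has a binary code and a list of vectors for the internal nodes on its path; the function name, the dot-product scoring, and the variable names are illustrative, and the tree construction and training loop are left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(context_vec, path_node_vecs, code):
    """P(word | context) as a product of binary decisions along the tree path.

    context_vec    : hidden state of the model, shape (d,)
    path_node_vecs : one vector per internal node on the path, each shape (d,)
    code           : binary code of the word, e.g. [0, 1] for "zebra"
    """
    prob = 1.0
    for node_vec, digit in zip(path_node_vecs, code):
        p_right = sigmoid(node_vec @ context_vec)          # P(go right | node, context)
        prob *= p_right if digit == 1 else (1.0 - p_right)  # 0 = left, 1 = right
    return prob
```

Because every factor is a probability over two choices, the probabilities of all leaves sum to exactly one by construction, and scoring a single word costs one sigmoid per digit of its code, which for a balanced tree is roughly log2 of the vocabulary size instead of a full pass over the vocabulary.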
Okay, now let us come back to the other problem with our vocabulary: out-of-vocabulary words. The copy mechanism is something that helps with them. Imagine you have some sentence and some of its words are UNK tokens: they are not in the vocabulary, you do not know how to translate them, and you get UNK tokens in the output as a result.

But what if you know the word alignments? If you know how the words are aligned, you can use that. You can say: this UNK in the translation is aligned to that source word, so let us do a dictionary translation, or let us just copy it as is, because it is the name of a place or some other proper name. This is why it is called the copy mechanism, and the algorithm is super simple. First, you need to make sure that you learn word alignments somehow; for example, your neural machine translation system takes these alignments as an input and tries to predict them along with the translation. Then you get your translation with UNK tokens and post-process it: you copy the source words, translate them with a dictionary, or do whatever else you want.
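As a rough illustration of that post-processing step, here is a sketch; it assumes the system also gives us, for each target position, the index of the aligned source word, and that the bilingual dictionary is just a Python dict. All of these names are hypothetical, not a fixed API.

```python
def replace_unks(target_tokens, source_tokens, alignments, dictionary):
    """Post-process a translation: substitute each UNK using its aligned source word.

    target_tokens : output of the NMT system, may contain "UNK"
    source_tokens : the source sentence
    alignments    : alignments[i] = index of the source word aligned to target position i
    dictionary    : bilingual dictionary; if the source word is missing, we copy it as is
    """
    result = []
    for i, token in enumerate(target_tokens):
        if token == "UNK":
            src_word = source_tokens[alignments[i]]
            result.append(dictionary.get(src_word, src_word))  # translate or copy
        else:
            result.append(token)
    return result
```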
The copy mechanism is a very simple and nice technique, but there are still many problems. For example, you can have multi-word alignments. What if the morphology of the language is complicated and you would rather split words into smaller parts? Or what if you have informal spelling? All these things usually end up as out-of-vocabulary words, and these examples show that sometimes it is very useful to go down to the sub-word level. For rich morphology, it would be nice to model every piece of a word separately. For informal spelling, it would definitely be good to model words by letters, because there is no chance of finding these words as a whole in the vocabulary.

So there are two big trends in sub-word modelling. One big trend is hybrid models that somehow combine word-level and character-level models. The other big trend is to keep the same architecture, say a recurrent neural network, but feed it smaller units, something in between words and characters: one architecture, but different units.

Let us start with hybrid models. You might know that character-based models are sometimes super useful. For example, take the word drinkable. If you build a convolutional neural network with some convolutions that capture the meaning of drink and other convolutions that capture the meaning of able, you are likely to build the meaning of the whole word, even if you have never seen it before. So character-level convolutional neural networks can work fine, and bidirectional LSTMs can be used on the character level as well as on the word level.

Now, for our translation model, it is very convenient to have a hybrid model. Let us first work on the word level and try to produce word translations. For example, we have "a cute cat" here: a and cat are fine words, but what if cute is out of vocabulary? We cannot model it on the word level. In that case, we have a separate unit, a separate architecture, maybe even a completely different one, that models the probabilities for this word on the character level, and the same on the decoder side. So first we decode the sequence on the word level, and at some moments the decoder says: okay, this is an UNK token, please do something about it. Then the character-level model switches on and handles it. This is a very nice and natural idea, and it is used a lot.
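For the character-level component, here is a tiny sketch in the spirit of the drinkable example: character embeddings, a few one-dimensional convolution filters over character windows, and max-pooling over positions. The sizes, the random embeddings, and the random filters are purely illustrative; a real model would learn them jointly with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)
CHARS = "abcdefghijklmnopqrstuvwxyz"
char_emb = {c: rng.normal(size=8) for c in CHARS}   # 8-dim character embeddings

def char_cnn_encode(word, filters, width=5):
    """Encode a word as a vector with a character-level CNN.

    filters : array of shape (n_filters, width * 8); each row is one convolution
              filter over a window of `width` characters (it may learn to fire
              on character n-grams such as "drink" or "able").
    """
    # embed the characters (unknown symbols are skipped) and pad short words
    mat = np.stack([char_emb[c] for c in word.lower() if c in char_emb])
    if len(mat) < width:
        mat = np.vstack([mat, np.zeros((width - len(mat), 8))])
    # slide the window over character positions
    windows = np.stack([mat[i:i + width].ravel() for i in range(len(mat) - width + 1)])
    # max-pool each filter's response over all positions
    return np.max(windows @ filters.T, axis=0)

filters = rng.normal(size=(32, 5 * 8))           # 32 filters of width 5
vector = char_cnn_encode("drinkable", filters)   # works even for words never seen in training
```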
Now, the second big trend is sub-word modeling, and one good example of it is byte-pair encoding. Let us understand what it is. Imagine you have some sentence and you want to encode it, but you are not quite sure yet what your vocabulary is; you only have a constraint on its size. You start with characters: everything is split into single characters. Then you ask: what are the most frequent bigrams of letters here? Well, I see that s-h occurs three times, so maybe I should collapse it into a single unit. You do this, and now you have a slightly different vocabulary. What is next? Next, I see that these two letters occur a lot, so let us collapse them, and these two letters should be collapsed as well. Importantly, you can then apply the same procedure to sub-word units: here you would collapse your bigrams and unigrams into trigrams. And you can stop whenever you want, so you proceed until you reach the vocabulary size that suits you.

One more thing to mention is how to apply this method to test data. You also split the test data into letters first. Then you take the exact merge rules learned during training and apply them to the test data, collapsing the sub-word units as needed.
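Here is a compact sketch of this merge procedure, together with how the learned rules are replayed on new (test) words; the function names and the toy word counts are mine, chosen only to echo the s-h example above.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Collapse every occurrence of the pair into one new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in vocab.items()}

def learn_bpe(word_counts, num_merges):
    """Learn merge rules; num_merges directly controls the final vocabulary size."""
    vocab = {" ".join(word): freq for word, freq in word_counts.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def apply_bpe(word, merges):
    """Segment a test word: split into letters, then replay the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = learn_bpe({"she": 3, "shells": 3, "sells": 2, "sea": 1}, num_merges=10)
print(apply_bpe("shell", merges))   # -> ['she', 'll'] for this toy corpus
```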
Now, this chart shows why byte-pair encoding is such a nice technique. Here is the vocabulary size, and this vertical line marks the vocabulary size we are allowed. With words, you usually have a long tail that goes beyond this allowed budget. But with byte-pair encoding, you can hit exactly this number of units, because you decide how many merge rules to apply.

And finally, I can show you that this actually works well. For different language pairs and different tasks, for example from WMT, the Workshop on Machine Translation, you can see that byte-pair encoding gets a better BLEU score than word-based techniques. And this BLEU improvement is very meaningful: one or two BLEU points is a lot. So please use this very nice technique if you someday need to build a machine translation system.