1 00:00:00,000 --> 00:00:05,525 There are some similarities between the sequence-to-sequence machine translation model 2 00:00:05,525 --> 00:00:11,058 and the language models that you have worked with in the first week of this course, 3 00:00:11,058 --> 00:00:14,060 but there are some significant differences as well. 4 00:00:14,060 --> 00:00:16,335 Let's take a look. So, you can think of 5 00:00:16,335 --> 00:00:20,625 machine translation as building a conditional language model. 6 00:00:20,625 --> 00:00:23,378 Here's what I mean: in language modeling, 7 00:00:23,378 --> 00:00:27,473 this was the network we had built in the first week. 8 00:00:27,473 --> 00:00:35,350 And this model allows you to estimate the probability of a sentence. 9 00:00:35,350 --> 00:00:38,175 That's what a language model does. 10 00:00:38,175 --> 00:00:42,235 And you can also use this to generate novel sentences, 11 00:00:42,235 --> 00:00:46,450 and sometimes when you are writing x1 and x2 here, 12 00:00:46,450 --> 00:00:47,740 where in this example, 13 00:00:47,740 --> 00:00:53,210 x2 would be equal to y1; that is, the previous output is just fed back in. 14 00:00:53,210 --> 00:00:56,040 But x1, x2, and so on were not important. 15 00:00:56,040 --> 00:00:57,950 So just to clean this up for this slide, 16 00:00:57,950 --> 00:00:59,800 I'm going to just cross these off. 17 00:00:59,800 --> 00:01:02,660 x1 could be the vector of all zeros, and x2, 18 00:01:02,660 --> 00:01:06,995 x3 are just the previous outputs you are generating. 19 00:01:06,995 --> 00:01:09,243 So that was the language model. 20 00:01:09,243 --> 00:01:12,315 The machine translation model looks as follows, 21 00:01:12,315 --> 00:01:14,200 and I am going to use a couple of different colors, 22 00:01:14,200 --> 00:01:17,490 green and purple, to denote respectively 23 00:01:17,490 --> 00:01:22,369 the encoder network in green and the decoder network in purple. 24 00:01:22,369 --> 00:01:27,820 And you notice that the decoder network looks pretty much 25 00:01:27,820 --> 00:01:33,938 identical to the language model that we had up there. 26 00:01:33,938 --> 00:01:35,520 So what the machine translation model is, 27 00:01:35,520 --> 00:01:38,715 is very similar to the language model, 28 00:01:38,715 --> 00:01:43,690 except that instead of always starting off with the vector of all zeros, 29 00:01:43,690 --> 00:01:46,385 it instead has an encoder network 30 00:01:46,385 --> 00:01:49,960 that figures out some representation for the input sentence, 31 00:01:49,960 --> 00:01:54,425 and it takes that input sentence and starts off the decoder network with 32 00:01:54,425 --> 00:02:01,915 a representation of the input sentence rather than with a representation of all zeros. 33 00:02:01,915 --> 00:02:07,355 So, that's why I call this a conditional language model, 34 00:02:07,355 --> 00:02:11,345 and instead of modeling the probability of any sentence, 35 00:02:11,345 --> 00:02:14,790 it is now modeling the probability of, say, 36 00:02:14,790 --> 00:02:17,490 the output English translation, 37 00:02:17,490 --> 00:02:22,745 conditioned on some input French sentence. 38 00:02:22,745 --> 00:02:28,780 So in other words, you're trying to estimate the probability of an English translation. 39 00:02:28,780 --> 00:02:33,795 Like, what's the chance that the translation is "Jane is visiting Africa in September," 40 00:02:33,795 --> 00:02:38,870 but conditioned on the input French sentence, like, 41 00:02:38,870 --> 00:02:42,271 "Jane visite l'Afrique en septembre."
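To make the distinction concrete, here is the objective the decoder ends up modeling, written in the course's notation (y^{<t>} is the t-th word of the English output, T_y its length, x the French input); this is just the chain-rule factorization implied above, not anything new:

P(y^{<1>}, \ldots, y^{<T_y>} \mid x) = \prod_{t=1}^{T_y} P\left(y^{<t>} \mid x,\, y^{<1>}, \ldots, y^{<t-1>}\right)

Compare this with the plain language model from week one, which estimates P(y^{<1>}, \ldots, y^{<T_y>}) with no input sentence x to condition on.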
42 00:02:42,271 --> 00:02:46,830 So, this is really the probability of an English sentence conditioned on 43 00:02:46,830 --> 00:02:51,710 an input French sentence, which is why it is a conditional language model. 44 00:02:51,710 --> 00:02:54,830 Now, if you want to apply this model to actually 45 00:02:54,830 --> 00:02:58,790 translate a sentence from French into English, 46 00:02:58,790 --> 00:03:02,045 given this input French sentence, 47 00:03:02,045 --> 00:03:05,285 the model might tell you what is the probability 48 00:03:05,285 --> 00:03:08,550 of different corresponding English translations. 49 00:03:08,550 --> 00:03:11,900 So, x is the French sentence, 50 00:03:11,900 --> 00:03:13,669 "Jane visite l'Afrique en septembre." 51 00:03:13,669 --> 00:03:17,330 And, this now tells you what is the probability of 52 00:03:17,330 --> 00:03:22,058 different English translations of that French input. 53 00:03:22,058 --> 00:03:28,235 And, what you do not want is to sample outputs at random. 54 00:03:28,235 --> 00:03:31,970 If you sample words from this distribution, 55 00:03:31,970 --> 00:03:36,343 p of y given x, maybe one time you get a pretty good translation, 56 00:03:36,343 --> 00:03:38,081 "Jane is visiting Africa in September." 57 00:03:38,081 --> 00:03:40,850 But, maybe another time you get a different translation, 58 00:03:40,850 --> 00:03:42,770 "Jane is going to be visiting Africa in September." 59 00:03:42,770 --> 00:03:46,805 Which sounds a little awkward but is not a terrible translation, 60 00:03:46,805 --> 00:03:48,120 just not the best one. 61 00:03:48,120 --> 00:03:49,970 And sometimes, just by chance, 62 00:03:49,970 --> 00:03:52,432 you get, say, other translations: "In September, 63 00:03:52,432 --> 00:03:54,055 Jane will visit Africa." 64 00:03:54,055 --> 00:03:55,580 And maybe, just by chance, 65 00:03:55,580 --> 00:03:57,930 sometimes you sample a really bad translation: 66 00:03:57,930 --> 00:04:00,055 "Her African friend welcomed Jane in September." 67 00:04:00,055 --> 00:04:04,475 So, when you're using this model for machine translation, 68 00:04:04,475 --> 00:04:08,810 you're not trying to sample at random from this distribution. 69 00:04:08,810 --> 00:04:13,130 Instead, what you would like is to find the English sentence, 70 00:04:13,130 --> 00:04:16,910 y, that maximizes that conditional probability. 71 00:04:16,910 --> 00:04:20,660 So in developing a machine translation system, 72 00:04:20,660 --> 00:04:25,910 one of the things you need to do is come up with an algorithm that can actually find 73 00:04:25,910 --> 00:04:31,805 the value of y that maximizes this term over here. 74 00:04:31,805 --> 00:04:34,810 The most common algorithm for doing this is called beam search, 75 00:04:34,810 --> 00:04:37,565 and it's something you'll see in the next video. 76 00:04:37,565 --> 00:04:39,730 But, before moving on to describe beam search, 77 00:04:39,730 --> 00:04:43,700 you might wonder, why not just use greedy search? So, what is greedy search? 78 00:04:43,700 --> 00:04:49,310 Well, greedy search is an algorithm from computer science which says: to generate 79 00:04:49,310 --> 00:04:50,960 the first word, just pick whatever is 80 00:04:50,960 --> 00:04:55,390 the most likely first word according to your conditional language model.
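As a rough illustration of greedy search as just described, here is a minimal Python sketch. The function next_word_probs is a hypothetical stand-in for whatever your trained conditional language model exposes (it is assumed to return P(word | x, prefix) for every vocabulary word); it is not an API from any particular library.

import operator

# Minimal sketch of greedy decoding, assuming a hypothetical function
# next_word_probs(x, prefix) that maps each vocabulary word to
# P(word | x, y_1, ..., y_{t-1}) under the trained model.
def greedy_decode(x, next_word_probs, max_len=20, eos="<EOS>"):
    prefix = []                                    # the partial translation y_1, ..., y_{t-1}
    for _ in range(max_len):
        probs = next_word_probs(x, prefix)         # conditional distribution over the next word
        best_word = max(probs.items(), key=operator.itemgetter(1))[0]  # greedily take the single most likely word
        prefix.append(best_word)
        if best_word == eos:                       # stop once the end-of-sentence token is chosen
            break
    return prefix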
81 00:04:55,390 --> 00:05:01,160 Going into your machine translation model, and then, after having picked the first word, 82 00:05:01,160 --> 00:05:04,685 you then pick whatever is the second word that seems most likely, 83 00:05:04,685 --> 00:05:07,335 then pick the third word that seems most likely. 84 00:05:07,335 --> 00:05:10,685 This algorithm is called greedy search. 85 00:05:10,685 --> 00:05:16,759 And, what you would really like is to pick the entire sequence of words, y1, y2, 86 00:05:16,759 --> 00:05:21,973 up to yTy, 87 00:05:21,973 --> 00:05:27,705 that maximizes the joint probability of that whole thing. 88 00:05:27,705 --> 00:05:30,176 And it turns out that the greedy approach, 89 00:05:30,176 --> 00:05:31,620 where you just pick the best first word, 90 00:05:31,620 --> 00:05:34,280 and then, after having picked the best first word, 91 00:05:34,280 --> 00:05:36,104 try to pick the best second word, 92 00:05:36,104 --> 00:05:37,395 and then, after that, 93 00:05:37,395 --> 00:05:39,318 try to pick the best third word, 94 00:05:39,318 --> 00:05:41,545 that approach doesn't really work. 95 00:05:41,545 --> 00:05:44,990 To demonstrate that, let's consider the following two translations. 96 00:05:44,990 --> 00:05:46,920 The first one is a better translation, 97 00:05:46,920 --> 00:05:50,610 so hopefully, in our machine translation model, 98 00:05:50,610 --> 00:05:56,330 it will say that p of y given x is higher for the first sentence. 99 00:05:56,330 --> 00:05:59,907 It's just a better, more succinct translation of the French input. 100 00:05:59,907 --> 00:06:02,394 The second one is not a bad translation, 101 00:06:02,394 --> 00:06:03,665 it's just more verbose, 102 00:06:03,665 --> 00:06:05,970 it has more unnecessary words. 103 00:06:05,970 --> 00:06:10,485 But, if the algorithm has picked "Jane is" as the first two words, 104 00:06:10,485 --> 00:06:14,298 then, because "going" is a more common English word, 105 00:06:14,298 --> 00:06:21,930 the chance of "Jane is going," given the French input, 106 00:06:21,930 --> 00:06:26,610 might actually be higher than the chance of "Jane is 107 00:06:26,610 --> 00:06:32,710 visiting," given the French sentence. 108 00:06:32,710 --> 00:06:35,760 So, it's quite possible that if you just pick 109 00:06:35,760 --> 00:06:40,215 the third word based on whatever maximizes the probability of just the first three words, 110 00:06:40,215 --> 00:06:43,670 you end up choosing option number two. 111 00:06:43,670 --> 00:06:50,710 But, this ultimately ends up resulting in a less optimal sentence, 112 00:06:50,710 --> 00:06:55,325 a less good sentence as measured by this model for p of y given 113 00:06:55,325 --> 00:07:01,280 x. I know this was maybe a slightly hand-wavy argument, 114 00:07:01,280 --> 00:07:05,110 but, this is an example of a broader phenomenon, 115 00:07:05,110 --> 00:07:08,860 where if you want to find the sequence of words, y1, y2, 116 00:07:08,860 --> 00:07:13,822 all the way up to the final word that together maximize the probability, 117 00:07:13,822 --> 00:07:17,845 it's not always optimal to just pick one word at a time. 118 00:07:17,845 --> 00:07:21,730 And, of course, the total number of combinations of 119 00:07:21,730 --> 00:07:25,660 words in the English sentence is exponentially larger.
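To put some numbers on the hand-wavy argument above (the per-step probabilities below are entirely made up, just for illustration): greedy search prefers "going" at step three because 0.10 > 0.08, yet the "visiting" sentence wins on the joint probability p of y given x that we actually care about.

import math

# Hypothetical per-step probabilities P(y_t | x, y_1, ..., y_{t-1}); invented purely for illustration.
going_steps    = [0.20, 0.30, 0.10, 0.05, 0.05, 0.05, 0.30, 0.40, 0.50]  # "Jane is going to be visiting Africa in September"
visiting_steps = [0.20, 0.30, 0.08, 0.40, 0.45, 0.50]                    # "Jane is visiting Africa in September"

def joint_prob(step_probs):
    return math.prod(step_probs)   # P(y | x) = product of the per-step conditionals

print(joint_prob(going_steps))     # ~4.5e-08: greedy's choice, lower joint probability
print(joint_prob(visiting_steps))  # ~4.3e-04: the better translation, higher joint probability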
120 00:07:25,660 --> 00:07:30,365 So, if you have just 10,000 words in a dictionary and if you're 121 00:07:30,365 --> 00:07:35,211 contemplating translations that are up to ten words long, 122 00:07:35,211 --> 00:07:42,085 then there are 10,000 to the tenth power possible sentences that are ten words long, 123 00:07:42,085 --> 00:07:44,955 picking words from a vocabulary, 124 00:07:44,955 --> 00:07:47,970 a dictionary, of size 10,000 words. 125 00:07:47,970 --> 00:07:51,260 So, this is just a huge space of possible sentences, 126 00:07:51,260 --> 00:07:53,455 and it's impossible to rate them all, 127 00:07:53,455 --> 00:08:00,880 which is why the most common thing to do is to use an approximate search algorithm. 128 00:08:00,880 --> 00:08:02,540 And, what an approximate search algorithm does, 129 00:08:02,540 --> 00:08:03,724 is it will try, 130 00:08:03,724 --> 00:08:05,135 it won't always succeed, 131 00:08:05,135 --> 00:08:07,737 but it will try to pick the sentence, y, 132 00:08:07,737 --> 00:08:11,900 that maximizes that conditional probability. 133 00:08:11,900 --> 00:08:16,925 And, even though it's not guaranteed to find the value of y that maximizes this, 134 00:08:16,925 --> 00:08:19,025 it usually does a good enough job. 135 00:08:19,025 --> 00:08:20,570 So, to summarize, in this video, 136 00:08:20,570 --> 00:08:26,430 you saw how machine translation can be posed as a conditional language modeling problem. 137 00:08:26,430 --> 00:08:28,880 But one major difference between this and 138 00:08:28,880 --> 00:08:31,460 the earlier language modeling problems is that rather 139 00:08:31,460 --> 00:08:34,313 than wanting to generate a sentence at random, 140 00:08:34,313 --> 00:08:37,910 you may want to try to find the most likely English sentence, 141 00:08:37,910 --> 00:08:40,130 the most likely English translation. 142 00:08:40,130 --> 00:08:43,655 But the set of all English sentences of a certain length 143 00:08:43,655 --> 00:08:47,420 is too large to exhaustively enumerate. 144 00:08:47,420 --> 00:08:51,155 So, we have to resort to a search algorithm. 145 00:08:51,155 --> 00:08:53,369 So, with that, let's go on to the next video, where 146 00:08:53,369 --> 00:08:56,000 you'll learn about the beam search algorithm.
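As a quick recap of the scale involved before moving on, here is the arithmetic from the 10,000-word example above as a tiny sketch; it only restates the numbers from the video.

vocab_size = 10_000
sentence_length = 10

num_sentences = vocab_size ** sentence_length   # every possible ten-word sentence
print(f"{num_sentences:.1e}")                   # 1.0e+40 -- far too many to score exhaustively,
                                                # hence an approximate search such as beam search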