Today, we will cover one main idea of statistical machine translation. Imagine you have a sentence, let's say in French or in some other foreign language, and you want its translation into English. How do you do this? Well, you can try to compute the probability of an English sentence given your French sentence, and then take the sentence that maximizes this probability. Sounds very intuitive, right?

Now, let us apply Bayes' rule here. Instead of computing the probability of E given F directly, we will compute the probability of F given E, multiply it by the probability of the English sentence, and normalize by some denominator. Now, do you have any idea how we can further simplify this formula? Well, actually, we can: the denominator does not depend on the English sentence, which means that we can simply drop it.

Now we have this formula, and the question is: why is this easier? Why do we like it more than the original formula? This slide is going to explain why. We have decoupled our complicated problem into two simpler problems, so we now have two models. One problem is language modeling, and you actually know a lot about it already: it is how to produce a meaningful probability for a sentence of words. The other problem is the translation model. This model does not think about producing coherent sentences; it only thinks about whether E is a good translation of F, so that you do not end up with something unrelated to your source sentence. So, you have two models, one about the language and one about the adequacy of the translation. And then you have the argmax, which performs the search over your space and finds the English sentence that gives you the best probability.
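Written out as formulas, the whole derivation fits in a few lines; here E is the English sentence and F is the foreign one, as in the lecture:

```latex
\hat{E} = \arg\max_{E} P(E \mid F)
        = \arg\max_{E} \frac{P(F \mid E)\,P(E)}{P(F)}
        = \arg\max_{E} \underbrace{P(F \mid E)}_{\text{translation model}}\,
                       \underbrace{P(E)}_{\text{language model}}
```

The last step works because P(F) is constant with respect to E, so it cannot change which English sentence wins the argmax.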
Now, I have one more interpretation for you. The noisy channel is a super popular idea, so you definitely need to know about it. And it is actually super simple. You have your source sentence, with some probability of this source sentence, and then it goes through the noisy channel. The noisy channel is represented by the conditional probability of what you get as the output, given your input to the channel. As the output, you obtain your French sentence. So, let's say that your source sentence was corrupted by the channel, and now you have obtained it in French. Translation, then, is just recovering the most probable original sentence.

Now, the rest of the video is about how to model these two probabilities: the probability of the sentence, and the probability of the translation given some sentence.

Okay. First, about the language model. You know a lot about it, since we covered this in week two, so I will have just one slide as a recap. We need to compute the probability of a sentence of words. We apply the chain rule, which factorizes it into the probabilities of the next word given the previous history. You can then use the Markov assumption and end up with n-gram language models, or you can use a neural language model, such as an LSTM, which produces the next word from the previous words.
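As a quick refresher, here is a minimal sketch of what this looks like for n = 2, that is, a bigram model; the toy corpus, the start and end tokens, and the unsmoothed counting are illustrative assumptions (a real model would need smoothing):

```python
from collections import defaultdict

def train_bigram_lm(sentences):
    """Count-based bigram LM: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return unigram, bigram

def sentence_prob(words, unigram, bigram):
    """Chain rule with a first-order Markov assumption (no smoothing)."""
    prob = 1.0
    words = ["<s>"] + words + ["</s>"]
    for prev, cur in zip(words, words[1:]):
        if unigram[prev] == 0:
            return 0.0  # unseen history: probability collapses without smoothing
        prob *= bigram[(prev, cur)] / unigram[prev]
    return prob

# Toy usage on a hypothetical two-sentence corpus:
corpus = [["i", "like", "tea"], ["i", "like", "coffee"]]
uni, bi = train_bigram_lm(corpus)
print(sentence_prob(["i", "like", "tea"], uni, bi))  # 0.5 for this toy corpus
```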
Now, the translation model. Well, this is not so easy. Imagine you have a sequence of words in one language, and you need to produce the probability of a sequence of words in some other language. For example, take a foreign language, like Russian, and English, and these two sentences. How do you produce this probability? Well, it is not obvious, at least for me. So, let us start at the level of words; we can understand something at the level of the separate words in these sentences.

Okay, what can we do? We can have a translation table. Here, I have the probabilities of Russian words given English words, and they are normalized: each row of this matrix sums to one. These are just word translations that I have learned, or looked up in a dictionary, or built somehow. Okay, this is doable. Now, how do I build the probability of the whole sentence out of these separate word probabilities?

We need some word alignments. The problem is that we can have some reorderings between the languages, like here, or even worse, we can have one-to-many or many-to-one correspondences. For example, the word "appetit" here corresponds to "the appetite," and the English word "with" corresponds to two Russian words. It means that we need some model to build those alignments. Another example would be words that appear or disappear: some articles or some auxiliary words occur in one language and then just vanish in the other. This is exactly what word alignment models are for, and they are the topic of the next video.
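To make the translation table concrete before we finish, here is a toy sketch; every word and probability below is an illustrative assumption, not a value from a real model:

```python
# Toy translation table: rows are English words, entries are
# P(russian_word | english_word), with Russian transliterated for readability.
# All words and numbers here are made up for illustration.
translation_table = {
    "appetite": {"appetit": 0.9, "golod": 0.1},
    "eating":   {"eda": 0.6, "edy": 0.4},
}

# Each row is a probability distribution over Russian words,
# so it has to sum to one, exactly as on the slide.
for english_word, row in translation_table.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, english_word

print(translation_table["appetite"]["appetit"])  # 0.9
```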
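And here is one minimal way to represent a word alignment, as a set of index pairs; the sentence pair is my hypothetical reconstruction of the lecture's example (the proverb "Appetite comes with eating"), with the Russian transliterated:

```python
# Hypothetical sentence pair; only the indices matter for the alignment.
english = ["the", "appetite", "comes", "with", "eating"]
russian = ["appetit", "prikhodit", "vo", "vremya", "edy"]

# An alignment is a set of (english_index, russian_index) pairs.
# "with" linking to two Russian words is a one-to-many case,
# and "the" aligns to nothing at all: it simply vanishes in Russian.
alignment = {(1, 0), (2, 1), (3, 2), (3, 3), (4, 4)}

for e_i, r_j in sorted(alignment):
    print(f"{english[e_i]:>8} -> {russian[r_j]}")
```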