Word Alignment Models. This is a super important subtask in machine translation, because different languages have different word order, and we need to learn that from data. So we need to build some word alignment models, and this video is exactly about them.

Let us get a bit more formal and see what the notation is for every object that we have here. So e is the sentence e1, e2, and so on, and f is another sentence. The length of the e sentence is I and the length of the f sentence is J. Now, I need to learn some alignments between them, which I denote by a.

And importantly, you will see that I say e is the source and f is the target. Why do I say so? Well, usually we talked about a machine translation system from French to English, or from foreign to English. Why do I now say that it is vice versa? This is because we applied Bayes' rule. If you remember, we did this to decouple the model into a language model and a translation model. And now, to build the system that translates from f to e, we need to model the probability of f given e.

Now, what about word alignments? How do we represent them? A matrix of word alignments is one nice way to do this. You have one sentence and another sentence, and you have zeros or ones in the matrix, so you know which words correspond to each other.

Now, how many such matrices are there? Actually, a huge number. Imagine you have two options for every element of the matrix, and the size of the matrix is I multiplied by J, so the number of possible matrices is two to the power of the size of the matrix, and that is a lot.

So let us add some constraints, some simplifications, to deal with this. What we do is say that every target word is allowed to be aligned to only one source word, okay? Like here: this is a valid example.

Now, what would be the notation here? We will have a1, which represents the position of the source word that is aligned to the first target word. Here that word is "appetite", which is the second word, so a1 equals two. A tiny sketch of this alignment-vector notation is shown below.
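To make the notation concrete, here is a minimal Python sketch with a hypothetical sentence pair (the exact example words are not fully audible in the lecture, so these are placeholders). The alignment is just a vector a, where a[j] stores the position of the source word that the j-th target word is aligned to.

```python
# Hypothetical sentence pair; positions are 1-based, as in the lecture notation.
source = ["the", "appetite", "comes", "with", "eating"]  # e_1 ... e_I, I = 5
target = ["l'appétit", "vient", "en", "mangeant"]        # f_1 ... f_J, J = 4

# a[j-1] = position of the source word aligned to the j-th target word.
# Every target word points to exactly one source word; that is the constraint.
a = [2, 3, 4, 5]

for j, (f_word, i) in enumerate(zip(target, a), start=1):
    print(f"a_{j} = {i}: target '{f_word}' is aligned to source '{source[i - 1]}'")
```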
Now, what would be a2? a2 will be equal to three, because the second target word is matched to "comes", which is the third word in our source. We can proceed and do the same for a3, a4, and so on. That's it; this is our notation. Please keep it in mind so as not to get lost.

Now, let us build the probabilistic model. Actually, this slide and the next one are a sketch of the whole process: we are going to set up the probabilistic model and figure out how to learn it, and after that we will go into the deeper details of this probabilistic model. So stay with me.

We have our sentences, e and f; these are our observable variables. We also have the word alignments. We do not see them, but we need to model them somehow, so these are hidden variables. And we have the parameters of the model, and choosing them is actually the most creative step: we need to decide how to parameterize our model so that we get a meaningful generative story. If we have too many parameters, it will probably be difficult to train; if we have too few parameters, the model will probably not be general enough to describe all the data. This is the part we will discuss in more detail later. But for now, let us just say that we have some probabilistic model of f and a given e and theta.

What do we do next? Well, you should know that in all such situations we do likelihood maximization. We take our data, we write down the probability of seeing that data, and we try to maximize it. Now, one complication is that we do not observe everything we need to model. We can model the probability of f and a, but we do not see a. That is why we need to sum over all possible word alignments. On the left-hand side you then have the probability of f given all the rest, which is called the incomplete-data likelihood. Likelihood maximization for incomplete data means that there are hidden variables that you do not observe, and that is a rather bad situation: when you take the logarithm, you end up with the logarithm of a sum.
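For a single sentence pair, a minimal way to write this incomplete-data log-likelihood down is the following (the corpus log-likelihood is just the sum of such terms over all sentence pairs):

```latex
\log p(f \mid e, \theta) \;=\; \log \sum_{a} p(f, a \mid e, \theta),
\qquad a = (a_1, \dots, a_J), \quad a_j \in \{1, \dots, I\}
```

Under the one-source-word-per-target-word constraint, the sum runs over I^J possible alignment vectors, and the logarithm cannot be pushed inside it.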
You do not know how to maximize this, how to take derivatives, and how to get your maximum likelihood estimates. But actually, we have already seen this situation twice in our course: once with Hidden Markov Models and once with topic models. In both cases we had hidden variables and an incomplete-data likelihood, and in both cases we used the EM algorithm.

Just to recap, the EM algorithm is an iterative procedure with an E-step and an M-step. The E-step produces estimates for the hidden variables: given our current parameters, what are the best alignments we can produce right now? The M-step is the other way around: given our best guess for the word alignments, what parameter updates maximize our likelihood?

It is not that interesting to go into the exact formulas of the EM algorithm here. Instead, let us discuss the generative model, because that is the really creative part.

We start by generating the length of the sentence, so J is the length of the target sentence. Once we have generated it, we make an independence assumption over the target words: we have a product over j, which indexes the words in our target sentence. Each word is not modeled directly yet: first we model the alignment for its position, and then we model the exact word given that alignment. If this formula scares you, just look at the green parts; they are the most important thing. You model the alignments, and you model the words given those alignments. Everything else that you see on the right is just everything we already know and could condition on. And that is too much to condition on, because we would end up with far too many parameters. So we need to make some assumptions and say that not all of those things actually matter in these conditional distributions.

The first IBM model is the first attempt to simplify this generative story. What it says is: let us forget about a position-aware prior for word alignments and just use a uniform prior. This prior knows nothing about positions; it has just one constant to tune. So this is awesome.
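The slide's formula itself is not reproduced in the transcript, but the decomposition being described is usually written like this, with each factor conditioned on everything generated so far:

```latex
p(f, a \mid e, \theta)
  \;=\; p(J \mid e)
  \prod_{j=1}^{J}
    p(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)\,
    p(f_j \mid a_1^{j}, f_1^{j-1}, J, e)
```

IBM Model 1 keeps this skeleton but replaces the alignment factor with a position-blind uniform prior, commonly written as 1/(I+1) when a special NULL source word is included, or simply 1/I without it. The word factor is handled by the translation table described next.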
Now, the translation table will also be very simple. We say that the probability of generating a target word, given that it is aligned to some source word, is just the translation probability of that word given the source word. How does that look? This is the translation table: once we know that a word is aligned to some source word, we just read the corresponding probability out of the table.

So this is a very simple model, but it has a very big drawback: it does not take the position of a word into account when producing the alignment for that word. The second IBM model tries to do better. It says: "Okay, let us take j, the position of the target word, and use it to produce a_j", the alignment for this target word.

Now, how many parameters do we have to learn to build this position-based prior? Actually, a lot. You have I multiplied by J, which is the size of the matrix of probabilities, and that is not all. Apart from this matrix, you will also have different matrices for different I and J. So you cannot just use one and the same matrix for all kinds of sentences: you share a matrix across all sentences with the given pair of lengths, but for sentences with different lengths you have a different matrix. That is a lot of parameters.

To improve on that, we can say: "Well, the matrix is usually very close to diagonal. What if we model it as a diagonal-like matrix?" This is what Chris Dyer proposed in 2013. This model has only one parameter, which says how the probability mass is spread around the diagonal. And this is nice, because it still has some prior about positions, but not too many parameters; roughly, it looks like the sketch below.
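As a hedged sketch, in the spirit of Dyer et al.'s fast_align reparameterization of Model 2 (notation adapted to this lecture's I and J; handling of the NULL word omitted), the diagonal-biased prior looks roughly like this:

```latex
p(a_j = i \mid j, I, J) \;\propto\; \exp\!\left(-\lambda \left| \frac{i}{I} - \frac{j}{J} \right|\right)
```

A single parameter lambda controls how tightly the probability mass concentrates around the diagonal: a large lambda keeps alignments near the diagonal, while lambda close to zero makes the prior nearly uniform, as in IBM Model 1.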
Now, I have one last example of an alignment model for you. Actually, you already know how to build it, you just do not remember that. We had Hidden Markov Models in our course, and Hidden Markov Models can help us build transition probabilities. Why do we need them here? Well, imagine a pair of sentences where a phrase at the beginning of one sentence is aligned to a phrase at the end of the other. But inside that phrase, you often just go word by word, without any gaps. And this is nice: it means you want to learn and use the information that the previous word was aligned to position five, and maybe that means the next word should be aligned to position six. This is what a Hidden Markov Model can do for you: you model the probability of the next alignment given the previous alignment.

So now, let us sum up what we had in this video. We have covered IBM models, which are a nice word-based technique for building machine translation systems. Actually, there are lots of problems with them that we did not cover. There are also IBM Models 3, 4 and 5, which try to cope, for example, with the problem of fertility, saying that we need to explicitly model how many output words are produced by each source word, or that we need to explicitly deal with spurious words: the words that just appear from nowhere in the target language.
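To make this summary concrete, here is a minimal, self-contained EM training sketch for the simplest of these models, IBM Model 1 (no NULL word, no smoothing; the function and variable names are my own, not from the lecture):

```python
from collections import defaultdict

def train_ibm_model1(corpus, n_iterations=10):
    """Minimal EM sketch for IBM Model 1 (no NULL word, no smoothing).

    corpus: list of (e_sentence, f_sentence) pairs of token lists,
            where e is the source and f is the target, as in the lecture.
    Returns t with t[(f_word, e_word)] approximating p(f_word | e_word).
    """
    # Initialize the translation table uniformly over the target vocabulary.
    f_vocab = {f_word for _, f_sent in corpus for f_word in f_sent}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(n_iterations):
        # E-step: expected counts of "e_word generated f_word",
        # summing over all (hidden) alignments under the current table t.
        count = defaultdict(float)   # expected count for (f_word, e_word)
        total = defaultdict(float)   # expected count for e_word
        for e_sent, f_sent in corpus:
            for f_word in f_sent:
                normalizer = sum(t[(f_word, e_word)] for e_word in e_sent)
                for e_word in e_sent:
                    delta = t[(f_word, e_word)] / normalizer
                    count[(f_word, e_word)] += delta
                    total[e_word] += delta
        # M-step: re-estimate t(f | e) from the expected counts.
        for (f_word, e_word), c in count.items():
            t[(f_word, e_word)] = c / total[e_word]
    return t

# Toy usage on a tiny hypothetical parallel corpus.
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]
t = train_ibm_model1(corpus, n_iterations=5)
print(t[("livre", "book")], t[("le", "book")], t[("un", "book")])
# After a few iterations, t("livre" | "book") clearly dominates the other two.
```

The E-step spreads each target word's expected count over the source words it could be aligned to, and the M-step renormalizes those counts into a new translation table, which is exactly the iterative E-step/M-step story described above.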