Word Alignment Models. This is a super important subtask in machine translation, because different languages have different word order, and we need to learn that from data. So we need to build some word alignment models, and this video is exactly about them.

Let us get a bit more formal and see what the notation is for every object that we have here. So e is the sentence e1, e2, and so on, and f is another sentence. The length of the e sentence is I and the length of the f sentence is J. Now, I need to learn some alignments between them, which I denote by a.

And importantly, you will see that I say e is the source and f is the target. Why do I say so? Well, usually we talked about a machine translation system from French to English, or from foreign to English. Why do I now say that it is vice versa? This is because we applied Bayes' rule. If you remember, we did this to decouple the model into a language model and a translation model. And now, to build the system that translates from f to e, we need to model the probability of f given e.

Now, what about word alignments? How do we represent them? A matrix of word alignments is one nice way to do this. You have one sentence and another sentence, and you have zeros or ones in the matrix, so you know which words correspond to each other.

Now, how many such matrices are there? Actually, a huge number. Imagine you have two options for every element of the matrix, and the size of the matrix is I multiplied by J, so the number of possible matrices is two to the power of the size of the matrix, and that is a lot.

So let us add some constraints, some simplifications, to deal with this. What we do is say that every target word is allowed to be aligned to only one source word, okay? Like here: this is a valid example.

Now, what would be the notation here? We will have a1, which represents the position of the source word that is aligned to the first target word. Here that word is "appetite", which is the second word, so a1 equals two. A tiny sketch of this alignment-vector notation is shown below.
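To make the notation concrete, here is a minimal Python sketch with a hypothetical sentence pair (the exact example words are not fully audible in the lecture, so these are placeholders). The alignment is just a vector a, where a[j] stores the position of the source word that the j-th target word is aligned to.

```python
# Hypothetical sentence pair; positions are 1-based, as in the lecture notation.
source = ["the", "appetite", "comes", "with", "eating"]  # e_1 ... e_I, I = 5
target = ["l'appétit", "vient", "en", "mangeant"]        # f_1 ... f_J, J = 4

# a[j-1] = position of the source word aligned to the j-th target word.
# Every target word points to exactly one source word; that is the constraint.
a = [2, 3, 4, 5]

for j, (f_word, i) in enumerate(zip(target, a), start=1):
    print(f"a_{j} = {i}: target '{f_word}' is aligned to source '{source[i - 1]}'")
```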
Now, what would be a2? a2 will be equal to three, because the second target word is matched to "comes", which is the third word in our source. We can proceed and do the same for a3, a4, and so on. That's it; this is our notation. Please keep it in mind so as not to get lost.

Now, let us build the probabilistic model. Actually, this slide and the next one are a sketch of the whole process: we are going to set up the probabilistic model and figure out how to learn it, and after that we will go into the deeper details of this probabilistic model. So stay with me.

We have our sentences, e and f; these are our observable variables. We also have the word alignments. We do not see them, but we need to model them somehow, so these are hidden variables. And we have the parameters of the model, and choosing them is actually the most creative step: we need to decide how to parameterize our model so that we get a meaningful generative story. If we have too many parameters, it will probably be difficult to train; if we have too few parameters, the model will probably not be general enough to describe all the data. This is the part we will discuss in more detail later. But for now, let us just say that we have some probabilistic model of f and a given e and theta.

What do we do next? Well, you should know that in all such situations we do likelihood maximization. We take our data, we write down the probability of seeing that data, and we try to maximize it. Now, one complication is that we do not observe everything we need to model. We can model the probability of f and a, but we do not see a. That is why we need to sum over all possible word alignments. On the left-hand side you then have the probability of f given all the rest, which is called the incomplete-data likelihood. Likelihood maximization for incomplete data means that there are hidden variables that you do not observe, and that is a rather bad situation: when you take the logarithm, you end up with the logarithm of a sum.
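For a single sentence pair, a minimal way to write this incomplete-data log-likelihood down is the following (the corpus log-likelihood is just the sum of such terms over all sentence pairs):

```latex
\log p(f \mid e, \theta) \;=\; \log \sum_{a} p(f, a \mid e, \theta),
\qquad a = (a_1, \dots, a_J), \quad a_j \in \{1, \dots, I\}
```

Under the one-source-word-per-target-word constraint, the sum runs over I^J possible alignment vectors, and the logarithm cannot be pushed inside it.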
You do not know how to maximize this, how to take derivatives, and how to get your maximum likelihood estimates. But actually, we have already seen this situation twice in our course: once with Hidden Markov Models and once with topic models. In both cases we had hidden variables and an incomplete-data likelihood, and in both cases we used the EM algorithm.

Just to recap, the EM algorithm is an iterative procedure with an E-step and an M-step. The E-step produces estimates for the hidden variables: given our current parameters, what are the best alignments we can produce right now? The M-step is the other way around: given our best guess for the word alignments, what parameter updates maximize our likelihood?

It is not that interesting to go into the exact formulas of the EM algorithm here. Instead, let us discuss the generative model, because that is the really creative part.

We start by generating the length of the sentence, so J is the length of the target sentence. Once we have generated it, we make an independence assumption over the target words: we have a product over j, which indexes the words in our target sentence. Each word is not modeled directly yet: first we model the alignment for its position, and then we model the exact word given that alignment. If this formula scares you, just look at the green parts; they are the most important thing. You model the alignments, and you model the words given those alignments. Everything else that you see on the right is just everything we already know and could condition on. And that is too much to condition on, because we would end up with far too many parameters. So we need to make some assumptions and say that not all of those things actually matter in these conditional distributions.

The first IBM model is the first attempt to simplify this generative story. What it says is: let us forget about a position-aware prior for word alignments and just use a uniform prior. This prior knows nothing about positions; it has just one constant to tune. So this is awesome.
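The slide's formula itself is not reproduced in the transcript, but the decomposition being described is usually written like this, with each factor conditioned on everything generated so far:

```latex
p(f, a \mid e, \theta)
  \;=\; p(J \mid e)
  \prod_{j=1}^{J}
    p(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)\,
    p(f_j \mid a_1^{j}, f_1^{j-1}, J, e)
```

IBM Model 1 keeps this skeleton but replaces the alignment factor with a position-blind uniform prior, commonly written as 1/(I+1) when a special NULL source word is included, or simply 1/I without it. The word factor is handled by the translation table described next.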
Now, the translation table will also be very simple. We say that the probability of generating a target word, given that it is aligned to some source word, is just the translation probability of that word given the source word. How does that look? This is the translation table: once we know that a word is aligned to some source word, we just read the corresponding probability out of the table.

So this is a very simple model, but it has a very big drawback: it does not take the position of a word into account when producing the alignment for that word. The second IBM model tries to do better. It says: "Okay, let us take j, the position of the target word, and use it to produce a_j", the alignment for this target word.

Now, how many parameters do we have to learn to build this position-based prior? Actually, a lot. You have I multiplied by J, which is the size of the matrix of probabilities, and that is not all. Apart from this matrix, you will also have different matrices for different I and J. So you cannot just use one and the same matrix for all kinds of sentences: you share a matrix across all sentences with the given pair of lengths, but for sentences with different lengths you have a different matrix. That is a lot of parameters.

To improve on that, we can say: "Well, the matrix is usually very close to diagonal. What if we model it as a diagonal-like matrix?" This is what Chris Dyer proposed in 2013. This model has only one parameter, which says how the probability mass is spread around the diagonal. And this is nice, because it still has some prior about positions, but not too many parameters; roughly, it looks like the sketch below.
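As a hedged sketch, in the spirit of Dyer et al.'s fast_align reparameterization of Model 2 (notation adapted to this lecture's I and J; handling of the NULL word omitted), the diagonal-biased prior looks roughly like this:

```latex
p(a_j = i \mid j, I, J) \;\propto\; \exp\!\left(-\lambda \left| \frac{i}{I} - \frac{j}{J} \right|\right)
```

A single parameter lambda controls how tightly the probability mass concentrates around the diagonal: a large lambda keeps alignments near the diagonal, while lambda close to zero makes the prior nearly uniform, as in IBM Model 1.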
Now, I have one last example of an alignment model for you. Actually, you already know how to build it, you just do not remember that. We had Hidden Markov Models in our course, and Hidden Markov Models can help us build transition probabilities. Why do we need them here? Well, imagine a pair of sentences where a phrase at the beginning of one sentence is aligned to a phrase at the end of the other. But inside that phrase, you often just go word by word, without any gaps. And this is nice: it means you want to learn and use the information that the previous word was aligned to position five, and maybe that means the next word should be aligned to position six. This is what a Hidden Markov Model can do for you: you model the probability of the next alignment given the previous alignment.

So now, let us sum up what we had in this video. We have covered IBM models, which are a nice word-based technique for building machine translation systems. Actually, there are lots of problems with them that we did not cover. There are also IBM Models 3, 4 and 5, which try to cope, for example, with the problem of fertility, saying that we need to explicitly model how many output words are produced by each source word, or that we need to explicitly deal with spurious words: the words that just appear from nowhere in the target language.
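To make this summary concrete, here is a minimal, self-contained EM training sketch for the simplest of these models, IBM Model 1 (no NULL word, no smoothing; the function and variable names are my own, not from the lecture):

```python
from collections import defaultdict

def train_ibm_model1(corpus, n_iterations=10):
    """Minimal EM sketch for IBM Model 1 (no NULL word, no smoothing).

    corpus: list of (e_sentence, f_sentence) pairs of token lists,
            where e is the source and f is the target, as in the lecture.
    Returns t with t[(f_word, e_word)] approximating p(f_word | e_word).
    """
    # Initialize the translation table uniformly over the target vocabulary.
    f_vocab = {f_word for _, f_sent in corpus for f_word in f_sent}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(n_iterations):
        # E-step: expected counts of "e_word generated f_word",
        # summing over all (hidden) alignments under the current table t.
        count = defaultdict(float)   # expected count for (f_word, e_word)
        total = defaultdict(float)   # expected count for e_word
        for e_sent, f_sent in corpus:
            for f_word in f_sent:
                normalizer = sum(t[(f_word, e_word)] for e_word in e_sent)
                for e_word in e_sent:
                    delta = t[(f_word, e_word)] / normalizer
                    count[(f_word, e_word)] += delta
                    total[e_word] += delta
        # M-step: re-estimate t(f | e) from the expected counts.
        for (f_word, e_word), c in count.items():
            t[(f_word, e_word)] = c / total[e_word]
    return t

# Toy usage on a tiny hypothetical parallel corpus.
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]
t = train_ibm_model1(corpus, n_iterations=5)
print(t[("livre", "book")], t[("le", "book")], t[("un", "book")])
# After a few iterations, t("livre" | "book") clearly dominates the other two.
```

The E-step spreads each target word's expected count over the source words it could be aligned to, and the M-step renormalizes those counts into a new translation table, which is exactly the iterative E-step/M-step story described above.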