1 00:00:00,000 --> 00:00:03,649 [MUSIC] 2 00:00:03,649 --> 00:00:07,703 Hi everyone, this week is about sequence to sequence tasks. 3 00:00:07,703 --> 00:00:09,627 We have a lot of them in NLP, but 4 00:00:09,627 --> 00:00:13,440 one obvious example would be machine translation. 5 00:00:13,440 --> 00:00:16,870 So you have a sequence of words in one language as an input, and 6 00:00:16,870 --> 00:00:21,650 you want to produce a sequence of words in some other language as an output. 7 00:00:21,650 --> 00:00:24,277 Now, you can think of some other examples. 8 00:00:24,277 --> 00:00:28,392 For example, summarization is also a sequence to sequence task, and 9 00:00:28,392 --> 00:00:31,567 you can think of it as machine translation, but within 10 00:00:31,567 --> 00:00:35,730 one language: monolingual machine translation. 11 00:00:35,730 --> 00:00:40,750 Well, we will cover these examples at the end of the week, but now let us start with 12 00:00:40,750 --> 00:00:45,030 statistical machine translation and neural machine translation. 13 00:00:45,030 --> 00:00:47,160 We will see that actually there are some techniques 14 00:00:47,160 --> 00:00:50,340 that are very similar in both these approaches. 15 00:00:50,340 --> 00:00:52,655 For example, we will see alignments, 16 00:00:52,655 --> 00:00:56,540 word alignments, that we need in statistical machine translation. 17 00:00:56,540 --> 00:01:00,850 And then we will see that we have the attention mechanism in neural networks, 18 00:01:00,850 --> 00:01:03,920 which plays a similar role in these tasks. 19 00:01:05,180 --> 00:01:09,330 Okay, so let us begin, and I think there is no need to tell you that 20 00:01:09,330 --> 00:01:12,770 machine translation is important; we just know that. 21 00:01:12,770 --> 00:01:16,496 So I would rather start with two other questions.
22 00:01:16,496 --> 00:01:20,061 Two questions that we actually skip a lot in our course and 23 00:01:20,061 --> 00:01:25,600 in some other courses, but these are two very important questions to speak about. 24 00:01:25,600 --> 00:01:31,030 So one question is data and another question is evaluation. 25 00:01:31,030 --> 00:01:36,210 When you get some real task in your life, some NLP task, usually 26 00:01:36,210 --> 00:01:41,300 it is not the model that is the problem; it is usually the data and the evaluation. 27 00:01:41,300 --> 00:01:44,800 So you can have a fancy neural architecture, but 28 00:01:44,800 --> 00:01:49,220 if you do not have good data, and if you have not settled 29 00:01:49,220 --> 00:01:54,760 on an evaluation procedure, you are not going to have good results. 30 00:01:54,760 --> 00:02:01,517 So first, data: what kind of data do we need for machine translation? 31 00:02:01,517 --> 00:02:05,794 We need some parallel corpora, so we need some text in one language and 32 00:02:05,794 --> 00:02:09,610 we need its translation into another language. 33 00:02:09,610 --> 00:02:14,600 Where does that come from, so what sources can you think of? 34 00:02:14,600 --> 00:02:18,278 Well, one maybe not so obvious but 35 00:02:18,278 --> 00:02:22,545 very good source is the European Parliament proceedings. 36 00:02:22,545 --> 00:02:28,321 So there you have texts in several languages, maybe 20 languages, and 37 00:02:28,321 --> 00:02:32,572 very exact translations of one and the same statements. 38 00:02:32,572 --> 00:02:37,920 And this is nice, so you can use that; another domain would be movies. 39 00:02:37,920 --> 00:02:41,940 So you have subtitles that are translated into many languages, which is nice. 40 00:02:42,980 --> 00:02:46,621 Something which is not that useful, but still useful, 41 00:02:46,621 --> 00:02:50,117 would be book translations or Wikipedia articles.
42 00:02:50,117 --> 00:02:54,926 So for example, for Wikipedia you cannot guarantee that you 43 00:02:54,926 --> 00:02:57,953 have the same text in two languages. 44 00:02:57,953 --> 00:03:02,856 But you can have something similar, for example some loose translations, or 45 00:03:02,856 --> 00:03:05,810 texts that are at least on the same topic. 46 00:03:05,810 --> 00:03:09,020 So we call such corpora comparable, but not parallel. 47 00:03:10,350 --> 00:03:16,270 The OPUS website has a nice overview of many sources, so please check it out. 48 00:03:16,270 --> 00:03:21,070 But I want to discuss something which is not nice: some problems with the data. 49 00:03:22,130 --> 00:03:27,060 Actually, we have lots of problems with any data that we have, and 50 00:03:27,060 --> 00:03:30,670 what kind of problems happen in machine translation? 51 00:03:30,670 --> 00:03:35,637 Well, first, usually the data comes from some specific domain. 52 00:03:35,637 --> 00:03:39,984 So imagine you have movie subtitles and you want to train a system for 53 00:03:39,984 --> 00:03:42,760 scientific paper translations. 54 00:03:42,760 --> 00:03:46,950 It's not going to work, right? So you need to have some close domain, 55 00:03:46,950 --> 00:03:51,201 or you need to know how to transfer your knowledge from one domain 56 00:03:51,201 --> 00:03:54,835 to another domain; this is something to think about. 57 00:03:54,835 --> 00:04:00,139 Now, you can have a decent amount of data for some language pairs, like English 58 00:04:00,139 --> 00:04:05,443 and French or English and German, but for some rare language pairs 59 00:04:05,443 --> 00:04:10,160 you have really not a lot of data, and that's a huge problem. 60 00:04:10,160 --> 00:04:15,077 Also, your data can be noisy and insufficient, and it can be poorly aligned.
61 00:04:15,077 --> 00:04:20,029 By alignment I mean, you need to know the correspondence between the sentences, 62 00:04:20,029 --> 00:04:24,427 or even better, the correspondence between the words in the sentences. 63 00:04:24,427 --> 00:04:28,855 And this is a luxury, so usually you do not have that, at least for 64 00:04:28,855 --> 00:04:30,415 a huge amount of data. 65 00:04:30,415 --> 00:04:36,954 Okay, now I think it's clear about the data, so the second thing: evaluation. 66 00:04:36,954 --> 00:04:40,020 Well, you can say that we have some parallel data. 67 00:04:40,020 --> 00:04:45,300 So why don't we just split it into train and test, and use our test set 68 00:04:45,300 --> 00:04:50,030 to compare correct translations with those that are produced by our system? 69 00:04:51,250 --> 00:04:54,530 But well, how do we know that a translation 70 00:04:54,530 --> 00:04:58,300 is wrong just because it doesn't occur in your reference? 71 00:04:59,330 --> 00:05:01,445 You know that language is so 72 00:05:01,445 --> 00:05:05,860 flexible that every translator would produce somewhat different translations. 73 00:05:05,860 --> 00:05:10,982 It means that if your system produces something different, it doesn't yet mean 74 00:05:10,982 --> 00:05:12,177 that it is wrong. 75 00:05:12,177 --> 00:05:19,180 So, well, there is no nice answer to this question; I mean, this is a problem, yes. 76 00:05:19,180 --> 00:05:24,310 One thing that you can do is to have multiple references, so you can have, 77 00:05:24,310 --> 00:05:29,680 let's say, five references and compare your system output to all of them. 78 00:05:29,680 --> 00:05:34,329 And the other thing is you should be very careful about how you compare them. 79 00:05:34,329 --> 00:05:37,574 So definitely you shouldn't do just exact match; 80 00:05:37,574 --> 00:05:41,400 you should do something more intelligent.
81 00:05:41,400 --> 00:05:46,392 And I'm going to show you the BLEU score, which is known to be a very popular measure 82 00:05:46,392 --> 00:05:51,073 in machine translation that tries to softly measure whether your 83 00:05:51,073 --> 00:05:55,381 system output is somehow similar to the reference translation. 84 00:05:55,381 --> 00:05:58,180 Okay, let me show you an example. 85 00:05:58,180 --> 00:06:03,036 So you have some reference translation and you have the output of your system, and 86 00:06:03,036 --> 00:06:04,567 you try to compare them. 87 00:06:04,567 --> 00:06:08,920 Well, you remember that we have this nice tool which is called n-grams. 88 00:06:08,920 --> 00:06:12,910 So you can compute some unigrams and bigrams and trigrams. 89 00:06:14,010 --> 00:06:16,010 Do you have any idea how to use that here? 90 00:06:17,050 --> 00:06:22,990 Well, first we can try to compute some precision; what does it mean? 91 00:06:22,990 --> 00:06:27,630 You look into your system output, and here you have six words, 92 00:06:27,630 --> 00:06:33,920 six unigrams, and compute how many of them actually occur in the reference. 93 00:06:33,920 --> 00:06:39,187 So the unigram precision score will be 4 out of 6. 94 00:06:39,187 --> 00:06:43,827 Now, tell me, what would be the bigram score here? 95 00:06:43,827 --> 00:06:48,854 Well, the bigram score will be 3 out of 5, because you have 96 00:06:48,854 --> 00:06:54,804 5 bigrams in your system output and only 3 of them, "was sent", "sent on", and 97 00:06:54,804 --> 00:06:58,920 "on Tuesday", occurred in the reference. 98 00:06:58,920 --> 00:07:03,419 Now you can proceed and you can compute the 3-gram score and 99 00:07:03,419 --> 00:07:06,290 the 4-gram score, so that's good. 100 00:07:06,290 --> 00:07:10,137 Maybe we can just average them and have some measure. 101 00:07:10,137 --> 00:07:14,240 Well, we could, but there is one problem here; 102 00:07:14,240 --> 00:07:20,070 well, imagine that the system tries to be super precise.
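The clipped n-gram precision described here can be sketched in a few lines of Python. The reference and candidate sentences below are made up to reproduce the 4/6 unigram and 3/5 bigram scores from the example (including the matching bigrams "was sent", "sent on", "on Tuesday"); they are not the exact sentences from the slide.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is counted at most
    as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

# Illustrative sentences (assumed, not from the slide):
reference = "this letter was sent on Tuesday".split()
candidate = "a letter was sent on Friday".split()

print(ngram_precision(candidate, reference, 1))  # 4 of 6 unigrams match
print(ngram_precision(candidate, reference, 2))  # 3 of 5 bigrams match
```

Counting is clipped so that repeating a matching word many times cannot inflate the score.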
103 00:07:20,070 --> 00:07:24,167 Then it is good for the system to output super short sentences, right? 104 00:07:25,320 --> 00:07:29,352 So if I'm sure that this unigram should occur, 105 00:07:29,352 --> 00:07:33,385 I will just output it and I will not output more. 106 00:07:33,385 --> 00:07:39,686 So just to penalize the model, we can have some brevity penalty. 107 00:07:39,686 --> 00:07:43,880 This brevity penalty says that we 108 00:07:43,880 --> 00:07:49,330 divide the length of the output by the length of the reference. 109 00:07:49,330 --> 00:07:55,733 And when the system outputs too short sentences, we will get to know that. 110 00:07:55,733 --> 00:07:59,359 Now, how do we compute the BLEU score out of these values? 111 00:08:00,720 --> 00:08:06,050 Like this: this root is the geometric mean 112 00:08:06,050 --> 00:08:11,250 of our unigram, bigram, 3-gram, and 4-gram scores. 113 00:08:11,250 --> 00:08:15,092 And then we multiply this average by the brevity penalty. 114 00:08:15,092 --> 00:08:19,487 Okay, now let us speak about how the system actually works. 115 00:08:19,487 --> 00:08:22,984 So this is kind of a mandatory slide on machine translation, 116 00:08:22,984 --> 00:08:27,190 because nearly any tutorial on machine translation has it. 117 00:08:27,190 --> 00:08:30,790 So I decided not to be an exception and show you that. 118 00:08:31,950 --> 00:08:35,850 So the idea is like that: we have some source sentence and 119 00:08:35,850 --> 00:08:39,445 we want to translate it to get some target sentence. 120 00:08:39,445 --> 00:08:44,280 Now, the first thing that we can do is just direct transfer. 121 00:08:44,280 --> 00:08:49,860 So we can translate this source sentence word by word and get the target sentence. 122 00:08:50,970 --> 00:08:53,890 But well, maybe it's not super good, right?
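Putting the pieces together, a minimal sentence-level BLEU can be sketched as below. One assumption to flag: the lecture describes the brevity penalty as simply dividing the output length by the reference length, while the standard BLEU definition uses exp(1 - r/c) when the candidate of length c is shorter than the reference of length r; this sketch follows the standard definition. Real implementations (e.g. in NLTK or sacreBLEU) also add smoothing for zero n-gram counts, which is skipped here.

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped 1..max_n-gram
    precisions, multiplied by the brevity penalty (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(_ngrams(candidate, n))
        ref = Counter(_ngrams(reference, n))
        total = sum(cand.values())
        matched = sum(min(c, ref[g]) for g, c in cand.items())
        if matched == 0:  # unsmoothed: any zero precision gives BLEU = 0
            return 0.0
        precisions.append(matched / total)
    c, r = len(candidate), len(reference)
    # Brevity penalty: 1 for long-enough output, exp(1 - r/c) for short output.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bp * geo_mean
```

For a candidate identical to the reference this returns 1.0; a candidate that is a correct but truncated prefix keeps perfect precisions and is pulled below 1.0 only by the brevity penalty, which is exactly the behavior the lecture motivates.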
123 00:08:53,890 --> 00:09:00,160 So if you have ever studied some foreign language, you know that just by dictionary 124 00:09:00,160 --> 00:09:06,110 translations of every word, you usually do not get a nice, coherent translation. 125 00:09:06,110 --> 00:09:10,800 So probably we would better go up to the syntactic level. 126 00:09:10,800 --> 00:09:15,428 So we do syntax analysis, and then we do the transfer, and 127 00:09:15,428 --> 00:09:19,859 then we generate the target sentence by knowing how it 128 00:09:19,859 --> 00:09:23,515 should look at the syntactic level. 129 00:09:23,515 --> 00:09:28,430 Even better, we could try to go to the semantic level, so that first we analyze 130 00:09:28,430 --> 00:09:34,110 the source sentence and understand the meanings of parts of the sentence. 131 00:09:34,110 --> 00:09:38,060 We somehow transfer these meanings into the other language, and 132 00:09:38,060 --> 00:09:42,960 then we generate some good syntactic structures with good meaning. 133 00:09:42,960 --> 00:09:47,450 And our dream, like the best thing we could ever think of, 134 00:09:47,450 --> 00:09:50,320 would be having some interlingua. 135 00:09:50,320 --> 00:09:55,350 So by interlingua, we mean some nice representation of the whole 136 00:09:55,350 --> 00:10:00,360 source sentence that is enough to generate the whole target sentence. 137 00:10:01,360 --> 00:10:06,589 Actually, it is still a dream, so it is still a dream of the translators 138 00:10:06,589 --> 00:10:11,115 to have that kind of system, because it sounds so appealing. 139 00:10:11,115 --> 00:10:15,928 But neural translation systems somehow have mechanisms that 140 00:10:15,928 --> 00:10:20,658 resemble that, and I will show you that in a couple of slides. 141 00:10:22,237 --> 00:10:27,870 Okay, so for now I want to show you some brief history of the area.
142 00:10:29,075 --> 00:10:31,187 And like any other area, 143 00:10:31,187 --> 00:10:35,925 machine translation has some bright and dark periods. 144 00:10:35,925 --> 00:10:43,048 So in 1954 there were great expectations: there was an IBM experiment where 145 00:10:43,048 --> 00:10:48,120 they translated 60 sentences from Russian to English. 146 00:10:48,120 --> 00:10:52,930 And they said, that's easy, we can solve the machine translation 147 00:10:52,930 --> 00:10:56,262 task completely in just three to five years. 148 00:10:56,262 --> 00:11:00,143 So they tried to work on that and they worked a lot, and 149 00:11:00,143 --> 00:11:05,083 after many years they concluded that actually it's not that easy. 150 00:11:05,083 --> 00:11:08,871 And they said, well, machine translation is too expensive, and 151 00:11:08,871 --> 00:11:12,740 we should not build fully automatic machine translation systems. 152 00:11:12,740 --> 00:11:17,360 We should rather focus on just some tools that help human 153 00:11:17,360 --> 00:11:21,797 translators to provide good-quality translations. 154 00:11:21,797 --> 00:11:25,632 So you know, these great expectations and 155 00:11:25,632 --> 00:11:30,257 then the disappointment made the area silent for 156 00:11:30,257 --> 00:11:34,655 a while, but then in 1988 IBM researchers 157 00:11:34,655 --> 00:11:39,871 proposed word-based machine translation systems. 158 00:11:39,871 --> 00:11:43,954 These machine translation systems were rather simple, so 159 00:11:43,954 --> 00:11:48,521 we will cover them, kind of, in this video and in the next video, but 160 00:11:48,521 --> 00:11:54,540 these systems were kind of the first working systems for machine translation. 161 00:11:54,540 --> 00:11:59,390 So this was nice, and then the next important step was phrase-based machine 162 00:11:59,390 --> 00:12:05,040 translation systems that were proposed by Philipp Koehn in 2003.
163 00:12:05,040 --> 00:12:10,373 And this is probably what people mean by statistical machine translation now. 164 00:12:10,373 --> 00:12:13,807 You definitely know Google Translate, right? 165 00:12:13,807 --> 00:12:16,901 But maybe you haven't heard about Moses. 166 00:12:16,901 --> 00:12:21,893 So Moses is the system that allows researchers to build their own 167 00:12:21,893 --> 00:12:24,357 machine translation systems. 168 00:12:24,357 --> 00:12:28,449 It allows you to train your models and to compare them, so 169 00:12:28,449 --> 00:12:34,209 this is a very nice tool for researchers, and it was made available in 2007. 170 00:12:35,860 --> 00:12:37,456 Now, the next 171 00:12:37,456 --> 00:12:42,334 obviously very important step here is neural machine translation. 172 00:12:42,334 --> 00:12:47,122 It is amazing how fast neural machine translation systems 173 00:12:47,122 --> 00:12:51,700 could go from research papers to production. 174 00:12:51,700 --> 00:12:54,980 Usually we have such a big gap between these two things. 175 00:12:54,980 --> 00:12:58,905 But in this case it took just two or three years, so 176 00:12:58,905 --> 00:13:04,568 it is amazing that the ideas that were proposed could be implemented and 177 00:13:04,568 --> 00:13:11,537 launched in many companies in 2016, so we have neural machine translation now. 178 00:13:11,537 --> 00:13:17,883 You might be wondering what WMT is there; it is the Workshop on Machine Translation, 179 00:13:17,883 --> 00:13:24,580 which is kind of an annual competition, an annual event with shared tasks. 180 00:13:24,580 --> 00:13:29,420 Which means that you can compare your systems there, and it is a very nice 181 00:13:29,420 --> 00:13:34,890 venue to compare different systems by different researchers and companies, 182 00:13:34,890 --> 00:13:38,800 and to see what the trends of machine translation are.
183 00:13:38,800 --> 00:13:43,431 And it happens every year, so usually people who do research in this 184 00:13:43,431 --> 00:13:46,777 area keep an eye on it, and this is a very nice thing. 185 00:13:46,777 --> 00:13:51,740 This is the slide about the interlingua that I promised to show you. 186 00:13:51,740 --> 00:13:56,140 So this is how Google neural machine translation works, and 187 00:13:56,140 --> 00:14:00,500 there was actually lots of hype around it, maybe even too much. 188 00:14:00,500 --> 00:14:07,650 But still, the idea is that you train some system on some pairs of languages. 189 00:14:07,650 --> 00:14:12,630 For example, on English to Japanese and Japanese to English and 190 00:14:12,630 --> 00:14:19,240 English to Korean and some other pairs, you train some encoder-decoder architecture. 191 00:14:19,240 --> 00:14:24,076 It means that you have some encoder that encodes your sentence to 192 00:14:24,076 --> 00:14:26,415 some hidden representation. 193 00:14:26,415 --> 00:14:31,363 And then you have a decoder that takes that hidden representation and 194 00:14:31,363 --> 00:14:33,978 decodes it to the target sentence. 195 00:14:33,978 --> 00:14:40,459 Now, the nice thing is that you can just take your encoder, let's say for 196 00:14:40,459 --> 00:14:45,465 Japanese, and your decoder for Korean, and stack them. 197 00:14:45,465 --> 00:14:49,967 Somehow it works nicely, even though the system has never seen 198 00:14:49,967 --> 00:14:53,390 Japanese to Korean translations. 199 00:14:53,390 --> 00:14:58,340 You see, so this is zero-shot translation: you have never seen Japanese to Korean, 200 00:14:58,340 --> 00:15:00,750 but just by building a nice encoder and 201 00:15:00,750 --> 00:15:05,330 a nice decoder, you can stack them and get this path. 202 00:15:05,330 --> 00:15:09,270 So it seems like this hidden representation that you have 203 00:15:09,270 --> 00:15:12,770 is kind of universal for any language pair.
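To make the stacking idea concrete, here is a deliberately oversimplified toy, not the real GNMT system: the tiny word lists and the shared concept-ID "interlingua" are invented for illustration, whereas a real system learns continuous hidden representations jointly from data. The point is only the structure: every encoder maps into one shared representation, every decoder maps out of it, so an encoder and a decoder that were never trained as a pair can still be composed.

```python
# Toy vocabularies mapping words to shared concept IDs (all invented).
EN = {"hello": 0, "world": 1}
JA = {"こんにちは": 0, "世界": 1}
KO = {"안녕하세요": 0, "세계": 1}

def make_encoder(vocab):
    """Encoder: words in one language -> shared concept IDs."""
    return lambda words: [vocab[w] for w in words]

def make_decoder(vocab):
    """Decoder: shared concept IDs -> words in one language."""
    inv = {i: w for w, i in vocab.items()}
    return lambda concepts: [inv[c] for c in concepts]

encode_ja = make_encoder(JA)
decode_ko = make_decoder(KO)

# Japanese -> Korean path, even though no ja-ko pair was ever defined:
print(decode_ko(encode_ja(["こんにちは", "世界"])))  # ['안녕하세요', '세계']
```

In the real system the shared representation is only approximately language-independent, which is why zero-shot quality lags behind directly trained pairs.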
204 00:15:12,770 --> 00:15:18,100 Well, it is not completely true, but at least it is a very promising result. 205 00:15:18,100 --> 00:15:28,100 [MUSIC]