1 00:00:00,000 --> 00:00:03,649 [MUSIC] 2 00:00:03,649 --> 00:00:07,703 Hi everyone, this week is about sequence to sequence tasks. 3 00:00:07,703 --> 00:00:09,627 We have a lot of them in NLP, but 4 00:00:09,627 --> 00:00:13,440 one obvious example would be machine translation. 5 00:00:13,440 --> 00:00:16,870 So you have a sequence of words in one language as an input, and 6 00:00:16,870 --> 00:00:21,650 you want to produce a sequence of words in some other language as an output. 7 00:00:21,650 --> 00:00:24,277 Now, you can think of some other examples. 8 00:00:24,277 --> 00:00:28,392 For example, summarization is also a sequence to sequence task, and 9 00:00:28,392 --> 00:00:31,567 you can think of it as machine translation, but within 10 00:00:31,567 --> 00:00:35,730 one language: monolingual machine translation. 11 00:00:35,730 --> 00:00:40,750 Well, we will cover these examples at the end of the week, but now let us start with 12 00:00:40,750 --> 00:00:45,030 statistical machine translation and neural machine translation. 13 00:00:45,030 --> 00:00:47,160 We will see that actually there are some techniques 14 00:00:47,160 --> 00:00:50,340 that are very similar in both these approaches. 15 00:00:50,340 --> 00:00:52,655 For example, we will see alignments, 16 00:00:52,655 --> 00:00:56,540 word alignments, that we need in statistical machine translation. 17 00:00:56,540 --> 00:01:00,850 And then we will see that we have the attention mechanism in neural networks, 18 00:01:00,850 --> 00:01:03,920 which plays a similar role in these tasks. 19 00:01:05,180 --> 00:01:09,330 Okay, so let us begin, and I think there is no need to tell you that 20 00:01:09,330 --> 00:01:12,770 machine translation is important; we just know that. 21 00:01:12,770 --> 00:01:16,496 So I would rather start with two other questions.
22 00:01:16,496 --> 00:01:20,061 Two questions that we actually skip a lot in our course and 23 00:01:20,061 --> 00:01:25,600 in some other courses, but these are two very important questions to speak about. 24 00:01:25,600 --> 00:01:31,030 So one question is data and another question is evaluation. 25 00:01:31,030 --> 00:01:36,210 When you get some real task in your life, some NLP task, usually 26 00:01:36,210 --> 00:01:41,300 it is not the model that is the problem; it is usually the data and the evaluation. 27 00:01:41,300 --> 00:01:44,800 So you can have a fancy neural architecture, but 28 00:01:44,800 --> 00:01:49,220 if you do not have good data, and if you have not settled 29 00:01:49,220 --> 00:01:54,760 on an evaluation procedure, you are not going to have good results. 30 00:01:54,760 --> 00:02:01,517 So first, data: what kind of data do we need for machine translation? 31 00:02:01,517 --> 00:02:05,794 We need some parallel corpora, so we need some text in one language and 32 00:02:05,794 --> 00:02:09,610 we need its translation into another language. 33 00:02:09,610 --> 00:02:14,600 Where does that come from, so what sources can you think of? 34 00:02:14,600 --> 00:02:18,278 Well, one maybe not so obvious but 35 00:02:18,278 --> 00:02:22,545 very good source is the European Parliament proceedings. 36 00:02:22,545 --> 00:02:28,321 So there you have texts in several languages, maybe 20 languages, and 37 00:02:28,321 --> 00:02:32,572 very exact translations of one and the same statements. 38 00:02:32,572 --> 00:02:37,920 And this is nice, so you can use that; another domain would be movies. 39 00:02:37,920 --> 00:02:41,940 So you have subtitles that are translated into many languages, which is nice. 40 00:02:42,980 --> 00:02:46,621 Something which is not that useful, but still useful, 41 00:02:46,621 --> 00:02:50,117 would be book translations or Wikipedia articles.
42 00:02:50,117 --> 00:02:54,926 So for example, for Wikipedia you cannot guarantee that you 43 00:02:54,926 --> 00:02:57,953 have the same text in two languages. 44 00:02:57,953 --> 00:03:02,856 But you can have something similar, for example some loose translations, or 45 00:03:02,856 --> 00:03:05,810 texts that are at least on the same topic. 46 00:03:05,810 --> 00:03:09,020 So we call such corpora comparable, but not parallel. 47 00:03:10,350 --> 00:03:16,270 The OPUS website has a nice overview of many sources, so please check it out. 48 00:03:16,270 --> 00:03:21,070 But I want to discuss something which is not nice: some problems with the data. 49 00:03:22,130 --> 00:03:27,060 Actually, we have lots of problems with any data that we have, and 50 00:03:27,060 --> 00:03:30,670 what kind of problems happen in machine translation? 51 00:03:30,670 --> 00:03:35,637 Well, first, usually the data comes from some specific domain. 52 00:03:35,637 --> 00:03:39,984 So imagine you have movie subtitles and you want to train a system for 53 00:03:39,984 --> 00:03:42,760 scientific paper translations. 54 00:03:42,760 --> 00:03:46,950 It's not going to work, right? So you need to have some close domain, 55 00:03:46,950 --> 00:03:51,201 or you need to know how to transfer your knowledge from one domain 56 00:03:51,201 --> 00:03:54,835 to another domain; this is something to think about. 57 00:03:54,835 --> 00:04:00,139 Now, you can have a decent amount of data for some language pairs, like English 58 00:04:00,139 --> 00:04:05,443 and French or English and German, but for some rare language pairs 59 00:04:05,443 --> 00:04:10,160 you have really not a lot of data, and that's a huge problem. 60 00:04:10,160 --> 00:04:15,077 Also, your data can be noisy and insufficient, and it can be poorly aligned.
61 00:04:15,077 --> 00:04:20,029 By alignment I mean, you need to know the correspondence between the sentences, 62 00:04:20,029 --> 00:04:24,427 or even better, the correspondence between the words in the sentences. 63 00:04:24,427 --> 00:04:28,855 And this is a luxury, so usually you do not have that, at least for 64 00:04:28,855 --> 00:04:30,415 a huge amount of data. 65 00:04:30,415 --> 00:04:36,954 Okay, now I think it's clear about the data, so the second thing: evaluation. 66 00:04:36,954 --> 00:04:40,020 Well, you can say that we have some parallel data. 67 00:04:40,020 --> 00:04:45,300 So why don't we just split it into train and test, and use our test set 68 00:04:45,300 --> 00:04:50,030 to compare correct translations with those that are produced by our system? 69 00:04:51,250 --> 00:04:54,530 But well, how do we know that a translation 70 00:04:54,530 --> 00:04:58,300 is wrong just because it doesn't occur in your reference? 71 00:04:59,330 --> 00:05:01,445 You know that language is so 72 00:05:01,445 --> 00:05:05,860 flexible that every translator would produce somewhat different translations. 73 00:05:05,860 --> 00:05:10,982 It means that if your system produces something different, it doesn't yet mean 74 00:05:10,982 --> 00:05:12,177 that it is wrong. 75 00:05:12,177 --> 00:05:19,180 So, well, there is no nice answer to this question; I mean, this is a problem, yes. 76 00:05:19,180 --> 00:05:24,310 One thing that you can do is to have multiple references, so you can have, 77 00:05:24,310 --> 00:05:29,680 let's say, five references and compare your system output to all of them. 78 00:05:29,680 --> 00:05:34,329 And the other thing is you should be very careful about how you compare them. 79 00:05:34,329 --> 00:05:37,574 So definitely you shouldn't do just exact match; 80 00:05:37,574 --> 00:05:41,400 you should do something more intelligent.
81 00:05:41,400 --> 00:05:46,392 And I'm going to show you the BLEU score, which is known to be a very popular measure 82 00:05:46,392 --> 00:05:51,073 in machine translation that tries to softly measure whether your 83 00:05:51,073 --> 00:05:55,381 system output is somehow similar to the reference translation. 84 00:05:55,381 --> 00:05:58,180 Okay, let me show you an example. 85 00:05:58,180 --> 00:06:03,036 So you have some reference translation and you have the output of your system, and 86 00:06:03,036 --> 00:06:04,567 you try to compare them. 87 00:06:04,567 --> 00:06:08,920 Well, you remember that we have this nice tool which is called n-grams. 88 00:06:08,920 --> 00:06:12,910 So you can compute some unigrams and bigrams and trigrams. 89 00:06:14,010 --> 00:06:16,010 Do you have any idea how to use that here? 90 00:06:17,050 --> 00:06:22,990 Well, first we can try to compute some precision; what does it mean? 91 00:06:22,990 --> 00:06:27,630 You look into your system output, and here you have six words, 92 00:06:27,630 --> 00:06:33,920 six unigrams, and compute how many of them actually occur in the reference. 93 00:06:33,920 --> 00:06:39,187 So the unigram precision score will be 4 out of 6. 94 00:06:39,187 --> 00:06:43,827 Now, tell me, what would be the bigram score here? 95 00:06:43,827 --> 00:06:48,854 Well, the bigram score will be 3 out of 5, because you have 96 00:06:48,854 --> 00:06:54,804 5 bigrams in your system output and only 3 of them, "was sent", "sent on", and 97 00:06:54,804 --> 00:06:58,920 "on Tuesday", occurred in the reference. 98 00:06:58,920 --> 00:07:03,419 Now you can proceed and you can compute the 3-gram score and 99 00:07:03,419 --> 00:07:06,290 the 4-gram score, so that's good. 100 00:07:06,290 --> 00:07:10,137 Maybe we can just average them and have some measure. 101 00:07:10,137 --> 00:07:14,240 Well, we could, but there is one problem here; 102 00:07:14,240 --> 00:07:20,070 well, imagine that the system tries to be super precise.
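The clipped n-gram precision described here can be sketched in a few lines of Python. The reference and candidate sentences below are made up to reproduce the 4/6 unigram and 3/5 bigram scores from the example (including the matching bigrams "was sent", "sent on", "on Tuesday"); they are not the exact sentences from the slide.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is counted at most
    as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

# Illustrative sentences (assumed, not from the slide):
reference = "this letter was sent on Tuesday".split()
candidate = "a letter was sent on Friday".split()

print(ngram_precision(candidate, reference, 1))  # 4 of 6 unigrams match
print(ngram_precision(candidate, reference, 2))  # 3 of 5 bigrams match
```

Counting is clipped so that repeating a matching word many times cannot inflate the score.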
103 00:07:20,070 --> 00:07:24,167 Then it is good for the system to output super short sentences, right? 104 00:07:25,320 --> 00:07:29,352 So if I'm sure that this unigram should occur, 105 00:07:29,352 --> 00:07:33,385 I will just output it and I will not output more. 106 00:07:33,385 --> 00:07:39,686 So just to penalize the model, we can have some brevity penalty. 107 00:07:39,686 --> 00:07:43,880 This brevity penalty says that we 108 00:07:43,880 --> 00:07:49,330 divide the length of the output by the length of the reference. 109 00:07:49,330 --> 00:07:55,733 And when the system outputs too short sentences, we will get to know that. 110 00:07:55,733 --> 00:07:59,359 Now, how do we compute the BLEU score out of these values? 111 00:08:00,720 --> 00:08:06,050 Like this: this root is the geometric mean 112 00:08:06,050 --> 00:08:11,250 of our unigram, bigram, 3-gram, and 4-gram scores. 113 00:08:11,250 --> 00:08:15,092 And then we multiply this average by the brevity penalty. 114 00:08:15,092 --> 00:08:19,487 Okay, now let us speak about how the system actually works. 115 00:08:19,487 --> 00:08:22,984 So this is kind of a mandatory slide on machine translation, 116 00:08:22,984 --> 00:08:27,190 because nearly any tutorial on machine translation has it. 117 00:08:27,190 --> 00:08:30,790 So I decided not to be an exception and show you that. 118 00:08:31,950 --> 00:08:35,850 So the idea is like that: we have some source sentence and 119 00:08:35,850 --> 00:08:39,445 we want to translate it to get some target sentence. 120 00:08:39,445 --> 00:08:44,280 Now, the first thing that we can do is just direct transfer. 121 00:08:44,280 --> 00:08:49,860 So we can translate this source sentence word by word and get the target sentence. 122 00:08:50,970 --> 00:08:53,890 But well, maybe it's not super good, right?
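Putting the pieces together, a minimal sentence-level BLEU can be sketched as below. One assumption to flag: the lecture describes the brevity penalty as simply dividing the output length by the reference length, while the standard BLEU definition uses exp(1 - r/c) when the candidate of length c is shorter than the reference of length r; this sketch follows the standard definition. Real implementations (e.g. in NLTK or sacreBLEU) also add smoothing for zero n-gram counts, which is skipped here.

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped 1..max_n-gram
    precisions, multiplied by the brevity penalty (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(_ngrams(candidate, n))
        ref = Counter(_ngrams(reference, n))
        total = sum(cand.values())
        matched = sum(min(c, ref[g]) for g, c in cand.items())
        if matched == 0:  # unsmoothed: any zero precision gives BLEU = 0
            return 0.0
        precisions.append(matched / total)
    c, r = len(candidate), len(reference)
    # Brevity penalty: 1 for long-enough output, exp(1 - r/c) for short output.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bp * geo_mean
```

For a candidate identical to the reference this returns 1.0; a candidate that is a correct but truncated prefix keeps perfect precisions and is pulled below 1.0 only by the brevity penalty, which is exactly the behavior the lecture motivates.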
123 00:08:53,890 --> 00:09:00,160 So if you have ever studied some foreign language, you know that just by dictionary 124 00:09:00,160 --> 00:09:06,110 translations of every word, you usually do not get a nice, coherent translation. 125 00:09:06,110 --> 00:09:10,800 So probably we would better go up to the syntactic level. 126 00:09:10,800 --> 00:09:15,428 So we do syntax analysis, and then we do the transfer, and 127 00:09:15,428 --> 00:09:19,859 then we generate the target sentence by knowing how it 128 00:09:19,859 --> 00:09:23,515 should look at the syntactic level. 129 00:09:23,515 --> 00:09:28,430 Even better, we could try to go to the semantic level, so that first we analyze 130 00:09:28,430 --> 00:09:34,110 the source sentence and understand the meanings of parts of the sentence. 131 00:09:34,110 --> 00:09:38,060 We somehow transfer these meanings into the other language, and 132 00:09:38,060 --> 00:09:42,960 then we generate some good syntactic structures with good meaning. 133 00:09:42,960 --> 00:09:47,450 And our dream, like the best thing we could ever think of, 134 00:09:47,450 --> 00:09:50,320 would be having some interlingua. 135 00:09:50,320 --> 00:09:55,350 So by interlingua, we mean some nice representation of the whole 136 00:09:55,350 --> 00:10:00,360 source sentence that is enough to generate the whole target sentence. 137 00:10:01,360 --> 00:10:06,589 Actually, it is still a dream, so it is still a dream of the translators 138 00:10:06,589 --> 00:10:11,115 to have that kind of system, because it sounds so appealing. 139 00:10:11,115 --> 00:10:15,928 But neural translation systems somehow have mechanisms that 140 00:10:15,928 --> 00:10:20,658 resemble that, and I will show you that in a couple of slides. 141 00:10:22,237 --> 00:10:27,870 Okay, so for now I want to show you some brief history of the area.
142 00:10:29,075 --> 00:10:31,187 And like any other area, 143 00:10:31,187 --> 00:10:35,925 machine translation has some bright and dark periods. 144 00:10:35,925 --> 00:10:43,048 So in 1954 there were great expectations: there was an IBM experiment where 145 00:10:43,048 --> 00:10:48,120 they translated 60 sentences from Russian to English. 146 00:10:48,120 --> 00:10:52,930 And they said, that's easy, we can solve the machine translation 147 00:10:52,930 --> 00:10:56,262 task completely in just three to five years. 148 00:10:56,262 --> 00:11:00,143 So they tried to work on that and they worked a lot, and 149 00:11:00,143 --> 00:11:05,083 after many years they concluded that actually it's not that easy. 150 00:11:05,083 --> 00:11:08,871 And they said, well, machine translation is too expensive, and 151 00:11:08,871 --> 00:11:12,740 we should not build fully automatic machine translation systems. 152 00:11:12,740 --> 00:11:17,360 We should rather focus on just some tools that help human 153 00:11:17,360 --> 00:11:21,797 translators to provide good-quality translations. 154 00:11:21,797 --> 00:11:25,632 So you know, these great expectations and 155 00:11:25,632 --> 00:11:30,257 then the disappointment made the area silent for 156 00:11:30,257 --> 00:11:34,655 a while, but then in 1988 IBM researchers 157 00:11:34,655 --> 00:11:39,871 proposed word-based machine translation systems. 158 00:11:39,871 --> 00:11:43,954 These machine translation systems were rather simple, so 159 00:11:43,954 --> 00:11:48,521 we will cover them, kind of, in this video and in the next video, but 160 00:11:48,521 --> 00:11:54,540 these systems were kind of the first working systems for machine translation. 161 00:11:54,540 --> 00:11:59,390 So this was nice, and then the next important step was phrase-based machine 162 00:11:59,390 --> 00:12:05,040 translation systems that were proposed by Philipp Koehn in 2003.
163 00:12:05,040 --> 00:12:10,373 And this is probably what people mean by statistical machine translation now. 164 00:12:10,373 --> 00:12:13,807 You definitely know Google Translate, right? 165 00:12:13,807 --> 00:12:16,901 But maybe you haven't heard about Moses. 166 00:12:16,901 --> 00:12:21,893 So Moses is the system that allows researchers to build their own 167 00:12:21,893 --> 00:12:24,357 machine translation systems. 168 00:12:24,357 --> 00:12:28,449 It allows you to train your models and to compare them, so 169 00:12:28,449 --> 00:12:34,209 this is a very nice tool for researchers, and it was made available in 2007. 170 00:12:35,860 --> 00:12:37,456 Now, the next 171 00:12:37,456 --> 00:12:42,334 obviously very important step here is neural machine translation. 172 00:12:42,334 --> 00:12:47,122 It is amazing how fast neural machine translation systems 173 00:12:47,122 --> 00:12:51,700 could go from research papers to production. 174 00:12:51,700 --> 00:12:54,980 Usually we have such a big gap between these two things. 175 00:12:54,980 --> 00:12:58,905 But in this case it took just two or three years, so 176 00:12:58,905 --> 00:13:04,568 it is amazing that the ideas that were proposed could be implemented and 177 00:13:04,568 --> 00:13:11,537 launched in many companies in 2016, so we have neural machine translation now. 178 00:13:11,537 --> 00:13:17,883 You might be wondering what WMT is there; it is the Workshop on Machine Translation, 179 00:13:17,883 --> 00:13:24,580 which is kind of an annual competition, an annual event with shared tasks. 180 00:13:24,580 --> 00:13:29,420 Which means that you can compare your systems there, and it is a very nice 181 00:13:29,420 --> 00:13:34,890 venue to compare different systems by different researchers and companies, 182 00:13:34,890 --> 00:13:38,800 and to see what the trends of machine translation are.
183 00:13:38,800 --> 00:13:43,431 And it happens every year, so usually people who do research in this 184 00:13:43,431 --> 00:13:46,777 area keep an eye on it, and this is a very nice thing. 185 00:13:46,777 --> 00:13:51,740 This is the slide about the interlingua that I promised to show you. 186 00:13:51,740 --> 00:13:56,140 So this is how Google neural machine translation works, and 187 00:13:56,140 --> 00:14:00,500 there was actually lots of hype around it, maybe even too much. 188 00:14:00,500 --> 00:14:07,650 But still, the idea is that you train some system on some pairs of languages. 189 00:14:07,650 --> 00:14:12,630 For example, on English to Japanese and Japanese to English and 190 00:14:12,630 --> 00:14:19,240 English to Korean and some other pairs, you train some encoder-decoder architecture. 191 00:14:19,240 --> 00:14:24,076 It means that you have some encoder that encodes your sentence to 192 00:14:24,076 --> 00:14:26,415 some hidden representation. 193 00:14:26,415 --> 00:14:31,363 And then you have a decoder that takes that hidden representation and 194 00:14:31,363 --> 00:14:33,978 decodes it to the target sentence. 195 00:14:33,978 --> 00:14:40,459 Now, the nice thing is that you can just take your encoder, let's say for 196 00:14:40,459 --> 00:14:45,465 Japanese, and your decoder for Korean, and stack them. 197 00:14:45,465 --> 00:14:49,967 Somehow it works nicely, even though the system has never seen 198 00:14:49,967 --> 00:14:53,390 Japanese to Korean translations. 199 00:14:53,390 --> 00:14:58,340 You see, so this is zero-shot translation: you have never seen Japanese to Korean, 200 00:14:58,340 --> 00:15:00,750 but just by building a nice encoder and 201 00:15:00,750 --> 00:15:05,330 a nice decoder, you can stack them and get this path. 202 00:15:05,330 --> 00:15:09,270 So it seems like this hidden representation that you have 203 00:15:09,270 --> 00:15:12,770 is kind of universal for any language pair.
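To make the stacking idea concrete, here is a deliberately oversimplified toy, not the real GNMT system: the tiny word lists and the shared concept-ID "interlingua" are invented for illustration, whereas a real system learns continuous hidden representations jointly from data. The point is only the structure: every encoder maps into one shared representation, every decoder maps out of it, so an encoder and a decoder that were never trained as a pair can still be composed.

```python
# Toy vocabularies mapping words to shared concept IDs (all invented).
EN = {"hello": 0, "world": 1}
JA = {"こんにちは": 0, "世界": 1}
KO = {"안녕하세요": 0, "세계": 1}

def make_encoder(vocab):
    """Encoder: words in one language -> shared concept IDs."""
    return lambda words: [vocab[w] for w in words]

def make_decoder(vocab):
    """Decoder: shared concept IDs -> words in one language."""
    inv = {i: w for w, i in vocab.items()}
    return lambda concepts: [inv[c] for c in concepts]

encode_ja = make_encoder(JA)
decode_ko = make_decoder(KO)

# Japanese -> Korean path, even though no ja-ko pair was ever defined:
print(decode_ko(encode_ja(["こんにちは", "世界"])))  # ['안녕하세요', '세계']
```

In the real system the shared representation is only approximately language-independent, which is why zero-shot quality lags behind directly trained pairs.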
204 00:15:12,770 --> 00:15:18,100 Well, it is not completely true, but at least it is a very promising result. 205 00:15:18,100 --> 00:15:28,100 [MUSIC]