1 00:00:00,000 --> 00:00:04,636 [SOUND] Hey everyone, we're going to discuss 2 00:00:04,636 --> 00:00:09,638 a very important technique in neural networks. 3 00:00:09,638 --> 00:00:12,726 We are going to speak about the encoder-decoder architecture and 4 00:00:12,726 --> 00:00:14,190 about the attention mechanism. 5 00:00:14,190 --> 00:00:18,740 We will cover them using the example of neural machine translation, 6 00:00:18,740 --> 00:00:23,070 because they were originally proposed for machine translation. 7 00:00:23,070 --> 00:00:26,080 But now they are applied to many, many other tasks. 8 00:00:26,080 --> 00:00:30,100 For example, you can think about summarization or simplification of 9 00:00:30,100 --> 00:00:34,930 texts, or sequence-to-sequence chatbots, and many, many others. 10 00:00:34,930 --> 00:00:39,410 Now let us start with the general idea of the architecture. 11 00:00:39,410 --> 00:00:41,940 We have some sequence as the input, and 12 00:00:41,940 --> 00:00:44,950 we want to get some sequence as the output. 13 00:00:44,950 --> 00:00:50,000 For example, these could be two sequences in different languages, right? 14 00:00:50,000 --> 00:00:54,510 We have our encoder, and the task of the encoder is to build some hidden 15 00:00:54,510 --> 00:00:59,440 representation of the input sentence. 16 00:00:59,440 --> 00:01:02,210 So we get this green hidden vector 17 00:01:02,210 --> 00:01:06,280 that tries to encode the whole meaning of the input sentence. 18 00:01:06,280 --> 00:01:10,150 Sometimes this vector is also called the thought vector, 19 00:01:10,150 --> 00:01:14,280 because it encodes the thought of the sentence. 20 00:01:14,280 --> 00:01:18,220 The decoder's task is to decode this thought vector, or 21 00:01:18,220 --> 00:01:22,010 context vector, into some output representation. 22 00:01:22,010 --> 00:01:25,790 For example, the sequence of words from the other language. 23 00:01:26,840 --> 00:01:30,430 Now, what types of encoders could we have here? 24 00:01:30,430 --> 00:01:34,520 Well, the most obvious type would be recurrent neural networks, but 25 00:01:34,520 --> 00:01:36,890 actually this is not the only option. 26 00:01:36,890 --> 00:01:41,562 Be aware that we also have convolutional neural networks, which can be 27 00:01:41,562 --> 00:01:46,482 very fast and nice, and they can also encode the meaning of the sentence. 28 00:01:46,482 --> 00:01:49,074 We could also have some hierarchical structures. 29 00:01:49,074 --> 00:01:55,594 For example, recursive neural networks try to use the syntax of the language and 30 00:01:55,594 --> 00:02:01,613 build the representation hierarchically from the bottom to the top, 31 00:02:01,613 --> 00:02:04,941 and understand the sentence that way. 32 00:02:04,941 --> 00:02:11,210 Okay, now what is the first example of a sequence-to-sequence architecture? 33 00:02:11,210 --> 00:02:16,439 This is the model that was proposed in 2014, and it is rather simple. 34 00:02:16,439 --> 00:02:23,512 So it says, we have some LSTM module or RNN module that encodes our input sentence, 35 00:02:23,512 --> 00:02:28,910 and then we have an end-of-sentence token at some point. 36 00:02:28,910 --> 00:02:34,904 At this point, we understand that our state is our thought vector, or 37 00:02:34,904 --> 00:02:40,490 context vector, and we need to decode starting from this moment. 38 00:02:40,490 --> 00:02:44,300 The decoding is conditional language modelling.
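To make the encoder side concrete, here is a minimal PyTorch-style sketch, assuming an LSTM encoder; the class and parameter names (Encoder, vocab_size, hidden_size) are illustrative, not the exact model from this video. The LSTM reads the source tokens up to the end-of-sentence token, and its last hidden state is taken as the thought vector v that the decoder will condition on.

import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and returns the last hidden state
    as the thought (context) vector v."""
    def __init__(self, vocab_size, emb_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) token ids, ending with an <eos> token
        embedded = self.embed(src_tokens)      # (batch, src_len, emb_size)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (num_layers, batch, hidden_size)
        return h_n[-1]                         # thought vector v: (batch, hidden_size)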
39 00:02:44,300 --> 00:02:48,720 So you're already familiar with language modelling with neural networks, but 40 00:02:48,720 --> 00:02:52,920 now it is conditioned on this context vector, the green vector. 41 00:02:54,050 --> 00:02:59,430 Okay, as in any other language model, you usually feed the output of the previous 42 00:02:59,430 --> 00:03:05,280 state as the input to the next state, and generate the next words just one by one. 43 00:03:06,370 --> 00:03:11,920 Now, let us go deeper and stack several layers of our LSTM model. 44 00:03:11,920 --> 00:03:13,940 You can do this straightforwardly, like this. 45 00:03:15,640 --> 00:03:18,560 So let us move forward, and 46 00:03:18,560 --> 00:03:23,370 speak about a slightly different variant of the same architecture. 47 00:03:23,370 --> 00:03:25,870 One problem with the previous architectures 48 00:03:25,870 --> 00:03:29,570 is that the green context vector can be forgotten. 49 00:03:29,570 --> 00:03:35,027 So if you only feed it as the input to the first state of the decoder, then 50 00:03:35,027 --> 00:03:41,660 you are likely to forget about it by the time you come to the end of your output sentence. 51 00:03:41,660 --> 00:03:45,350 So it would be better to feed it at every moment. 52 00:03:45,350 --> 00:03:49,780 And this architecture does exactly that: it says that every state of 53 00:03:49,780 --> 00:03:54,940 the decoder should have three kinds of arrows going into it. 54 00:03:54,940 --> 00:04:01,960 First, the arrow from the previous state, then the arrow from this context vector, 55 00:04:01,960 --> 00:04:06,870 and then the current input, which is the output of the previous state. 56 00:04:06,870 --> 00:04:12,500 Okay, now let us go into more detail with the formulas. 57 00:04:12,500 --> 00:04:17,450 So your sequence modeling task is conditional, because 58 00:04:17,450 --> 00:04:22,199 you need to produce the probability of one sequence given another sequence, and 59 00:04:22,199 --> 00:04:25,850 you factorize it using the chain rule. 60 00:04:25,850 --> 00:04:30,893 Also, importantly, you see that the x variables are not needed 61 00:04:30,893 --> 00:04:35,832 anymore, because you have encoded them into the v vector. 62 00:04:35,832 --> 00:04:40,751 The v vector is obtained as the last hidden state of the encoder, and 63 00:04:40,751 --> 00:04:44,132 the encoder is just a recurrent neural network. 64 00:04:44,132 --> 00:04:47,643 The decoder is also a recurrent neural network. 65 00:04:47,643 --> 00:04:51,039 However, it has more inputs, right? 66 00:04:51,039 --> 00:04:59,141 So you see that now I concatenate the current input y with the v vector. 67 00:04:59,141 --> 00:05:03,062 And this means that I will use all kinds of information, 68 00:05:03,062 --> 00:05:06,140 all those three arrows, in my transitions. 69 00:05:07,400 --> 00:05:10,730 Now, how do we get predictions out of this model? 70 00:05:10,730 --> 00:05:14,200 Well, the easiest way is just to do softmax, right? 71 00:05:14,200 --> 00:05:17,440 So when you have your decoder RNN, 72 00:05:17,440 --> 00:05:22,400 you have the hidden states of your RNN, and they are called s_j. 73 00:05:23,440 --> 00:05:28,330 You can just apply some linear layer, and then softmax, 74 00:05:28,330 --> 00:05:32,475 to get the probability of the current word, given everything that we have. 75 00:05:32,475 --> 00:05:35,110 Awesome. 76 00:05:35,110 --> 00:05:40,810 Now let us try to see whether those v vectors are somehow meaningful.
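As a rough sketch of these formulas (again PyTorch-style with illustrative names, not the exact implementation discussed in the video), one decoder step could look like this: the previous output word is embedded, concatenated with the context vector v, passed through the recurrent cell to obtain the state s_j, and a linear layer followed by softmax gives the probability of the current word.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Conditional language model: at every step it sees the previous
    output word AND the context vector v, so v cannot be forgotten."""
    def __init__(self, vocab_size, emb_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        # input = previous word embedding concatenated with v
        self.rnn_cell = nn.LSTMCell(emb_size + hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def step(self, prev_token, v, state):
        # prev_token: (batch,) id of the previously generated word
        # v:          (batch, hidden_size) thought vector from the encoder
        # state:      (s, c) hidden and cell state of the decoder
        x = torch.cat([self.embed(prev_token), v], dim=-1)
        s_j, c_j = self.rnn_cell(x, state)
        logits = self.out(s_j)                    # linear layer
        probs = torch.softmax(logits, dim=-1)     # p(y_j | y_<j, v)
        return probs, (s_j, c_j)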
77 00:05:40,810 --> 00:05:42,730 One way to do this is to say, 78 00:05:42,730 --> 00:05:46,560 okay, they are some high-dimensional hidden vectors. 79 00:05:46,560 --> 00:05:52,989 Let us do some dimensionality reduction, for example with t-SNE or PCA, and 80 00:05:52,989 --> 00:05:59,940 let us plot them in just two dimensions to see what those vectors look like. 81 00:05:59,940 --> 00:06:05,130 So you see that the representations of some sentences are close here, and 82 00:06:05,130 --> 00:06:08,936 it's nice that the model can capture that active and 83 00:06:08,936 --> 00:06:14,325 passive voice doesn't actually matter for the meaning of the sentence. 84 00:06:14,325 --> 00:06:19,032 For example, you see that the sentences "I gave her a card" and 85 00:06:19,032 --> 00:06:22,908 "She was given a card" are very close in this space. 86 00:06:22,908 --> 00:06:28,929 Okay, even though these representations are so nice, this single vector is still a bottleneck. 87 00:06:28,929 --> 00:06:32,222 So you should think about how to avoid that. 88 00:06:32,222 --> 00:06:35,687 And to avoid that, we will go into attention mechanisms, and 89 00:06:35,687 --> 00:06:38,058 this will be the topic of our next video. 90 00:06:38,058 --> 00:06:48,058 [SOUND]
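For reference, here is a minimal sketch of the projection described above, assuming scikit-learn and matplotlib are available; the vectors below are random placeholders, and in a real setting you would use the thought vectors collected from the trained encoder for each sentence.

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Placeholder data: replace with real thought vectors from the encoder.
sentences = ["I gave her a card", "She was given a card", "He ate an apple"]
vectors = np.random.randn(len(sentences), 512)   # (num_sentences, hidden_size)

# Project to two dimensions (sklearn.manifold.TSNE works similarly).
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), sentence in zip(coords, sentences):
    plt.annotate(sentence, (x, y), fontsize=8)
plt.show()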