1 00:00:00,000 --> 00:00:04,636 [SOUND] Hey everyone, we're going to discuss 2 00:00:04,636 --> 00:00:09,638 a very important technique in neural networks. 3 00:00:09,638 --> 00:00:12,726 We are going to speak about the encoder-decoder architecture and 4 00:00:12,726 --> 00:00:14,190 about the attention mechanism. 5 00:00:14,190 --> 00:00:18,740 We will cover them using the example of neural machine translation, 6 00:00:18,740 --> 00:00:23,070 because they were originally proposed for machine translation. 7 00:00:23,070 --> 00:00:26,080 But now they are applied to many, many other tasks. 8 00:00:26,080 --> 00:00:30,100 For example, you can think about summarization or simplification of 9 00:00:30,100 --> 00:00:34,930 texts, or sequence-to-sequence chatbots, and many, many others. 10 00:00:34,930 --> 00:00:39,410 Now let us start with the general idea of the architecture. 11 00:00:39,410 --> 00:00:41,940 We have some sequence as the input, and 12 00:00:41,940 --> 00:00:44,950 we want to get some sequence as the output. 13 00:00:44,950 --> 00:00:50,000 For example, these could be two sequences in different languages, right? 14 00:00:50,000 --> 00:00:54,510 We have our encoder, and the task of the encoder is to build some hidden 15 00:00:54,510 --> 00:00:59,440 representation of the input sentence. 16 00:00:59,440 --> 00:01:02,210 So we get this green hidden vector 17 00:01:02,210 --> 00:01:06,280 that tries to encode the whole meaning of the input sentence. 18 00:01:06,280 --> 00:01:10,150 Sometimes this vector is also called the thought vector, 19 00:01:10,150 --> 00:01:14,280 because it encodes the thought of the sentence. 20 00:01:14,280 --> 00:01:18,220 The decoder's task is to decode this thought vector, or 21 00:01:18,220 --> 00:01:22,010 context vector, into some output representation. 22 00:01:22,010 --> 00:01:25,790 For example, the sequence of words from the other language. 23 00:01:26,840 --> 00:01:30,430 Now, what types of encoders could we have here? 24 00:01:30,430 --> 00:01:34,520 Well, the most obvious type would be recurrent neural networks, but 25 00:01:34,520 --> 00:01:36,890 actually this is not the only option. 26 00:01:36,890 --> 00:01:41,562 Be aware that we also have convolutional neural networks, which can be 27 00:01:41,562 --> 00:01:46,482 very fast and nice, and they can also encode the meaning of the sentence. 28 00:01:46,482 --> 00:01:49,074 We could also have some hierarchical structures. 29 00:01:49,074 --> 00:01:55,594 For example, recursive neural networks try to use the syntax of the language and 30 00:01:55,594 --> 00:02:01,613 build the representation hierarchically from the bottom to the top, 31 00:02:01,613 --> 00:02:04,941 and understand the sentence that way. 32 00:02:04,941 --> 00:02:11,210 Okay, now what is the first example of a sequence-to-sequence architecture? 33 00:02:11,210 --> 00:02:16,439 This is the model that was proposed in 2014, and it is rather simple. 34 00:02:16,439 --> 00:02:23,512 So it says, we have some LSTM module or RNN module that encodes our input sentence, 35 00:02:23,512 --> 00:02:28,910 and then we have an end-of-sentence token at some point. 36 00:02:28,910 --> 00:02:34,904 At this point, we understand that our state is our thought vector, or 37 00:02:34,904 --> 00:02:40,490 context vector, and we need to decode starting from this moment. 38 00:02:40,490 --> 00:02:44,300 The decoding is conditional language modelling.
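To make the encoder side concrete, here is a minimal PyTorch-style sketch, assuming an LSTM encoder; the class and parameter names (Encoder, vocab_size, hidden_size) are illustrative, not the exact model from this video. The LSTM reads the source tokens up to the end-of-sentence token, and its last hidden state is taken as the thought vector v that the decoder will condition on.

import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and returns the last hidden state
    as the thought (context) vector v."""
    def __init__(self, vocab_size, emb_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) token ids, ending with an <eos> token
        embedded = self.embed(src_tokens)      # (batch, src_len, emb_size)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (num_layers, batch, hidden_size)
        return h_n[-1]                         # thought vector v: (batch, hidden_size)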
39 00:02:44,300 --> 00:02:48,720 So you're already familiar with language modelling with neural networks, but 40 00:02:48,720 --> 00:02:52,920 now it is conditioned on this context vector, the green vector. 41 00:02:54,050 --> 00:02:59,430 Okay, as in any other language model, you usually feed the output of the previous 42 00:02:59,430 --> 00:03:05,280 state as the input to the next state, and generate the next words just one by one. 43 00:03:06,370 --> 00:03:11,920 Now, let us go deeper and stack several layers of our LSTM model. 44 00:03:11,920 --> 00:03:13,940 You can do this straightforwardly, like this. 45 00:03:15,640 --> 00:03:18,560 So let us move forward, and 46 00:03:18,560 --> 00:03:23,370 speak about a slightly different variant of the same architecture. 47 00:03:23,370 --> 00:03:25,870 One problem with the previous architectures 48 00:03:25,870 --> 00:03:29,570 is that the green context vector can be forgotten. 49 00:03:29,570 --> 00:03:35,027 So if you only feed it as the input to the first state of the decoder, then 50 00:03:35,027 --> 00:03:41,660 you are likely to forget about it by the time you come to the end of your output sentence. 51 00:03:41,660 --> 00:03:45,350 So it would be better to feed it at every moment. 52 00:03:45,350 --> 00:03:49,780 And this architecture does exactly that: it says that every state of 53 00:03:49,780 --> 00:03:54,940 the decoder should have three kinds of arrows going into it. 54 00:03:54,940 --> 00:04:01,960 First, the arrow from the previous state, then the arrow from this context vector, 55 00:04:01,960 --> 00:04:06,870 and then the current input, which is the output of the previous state. 56 00:04:06,870 --> 00:04:12,500 Okay, now let us go into more detail with the formulas. 57 00:04:12,500 --> 00:04:17,450 So your sequence modeling task is conditional, because 58 00:04:17,450 --> 00:04:22,199 you need to produce the probability of one sequence given another sequence, and 59 00:04:22,199 --> 00:04:25,850 you factorize it using the chain rule. 60 00:04:25,850 --> 00:04:30,893 Also, importantly, you see that the x variables are not needed 61 00:04:30,893 --> 00:04:35,832 anymore, because you have encoded them into the v vector. 62 00:04:35,832 --> 00:04:40,751 The v vector is obtained as the last hidden state of the encoder, and 63 00:04:40,751 --> 00:04:44,132 the encoder is just a recurrent neural network. 64 00:04:44,132 --> 00:04:47,643 The decoder is also a recurrent neural network. 65 00:04:47,643 --> 00:04:51,039 However, it has more inputs, right? 66 00:04:51,039 --> 00:04:59,141 So you see that now I concatenate the current input y with the v vector. 67 00:04:59,141 --> 00:05:03,062 And this means that I will use all kinds of information, 68 00:05:03,062 --> 00:05:06,140 all those three arrows, in my transitions. 69 00:05:07,400 --> 00:05:10,730 Now, how do we get predictions out of this model? 70 00:05:10,730 --> 00:05:14,200 Well, the easiest way is just to do softmax, right? 71 00:05:14,200 --> 00:05:17,440 So when you have your decoder RNN, 72 00:05:17,440 --> 00:05:22,400 you have the hidden states of your RNN, and they are called s_j. 73 00:05:23,440 --> 00:05:28,330 You can just apply some linear layer, and then softmax, 74 00:05:28,330 --> 00:05:32,475 to get the probability of the current word, given everything that we have. 75 00:05:32,475 --> 00:05:35,110 Awesome. 76 00:05:35,110 --> 00:05:40,810 Now let us try to see whether those v vectors are somehow meaningful.
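As a rough sketch of these formulas (again PyTorch-style with illustrative names, not the exact implementation discussed in the video), one decoder step could look like this: the previous output word is embedded, concatenated with the context vector v, passed through the recurrent cell to obtain the state s_j, and a linear layer followed by softmax gives the probability of the current word.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Conditional language model: at every step it sees the previous
    output word AND the context vector v, so v cannot be forgotten."""
    def __init__(self, vocab_size, emb_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        # input = previous word embedding concatenated with v
        self.rnn_cell = nn.LSTMCell(emb_size + hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def step(self, prev_token, v, state):
        # prev_token: (batch,) id of the previously generated word
        # v:          (batch, hidden_size) thought vector from the encoder
        # state:      (s, c) hidden and cell state of the decoder
        x = torch.cat([self.embed(prev_token), v], dim=-1)
        s_j, c_j = self.rnn_cell(x, state)
        logits = self.out(s_j)                    # linear layer
        probs = torch.softmax(logits, dim=-1)     # p(y_j | y_<j, v)
        return probs, (s_j, c_j)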
77 00:05:40,810 --> 00:05:42,730 One way to do this is to say, 78 00:05:42,730 --> 00:05:46,560 okay, they are some high-dimensional hidden vectors. 79 00:05:46,560 --> 00:05:52,989 Let us do some dimensionality reduction, for example with t-SNE or PCA, and 80 00:05:52,989 --> 00:05:59,940 let us plot them in just two dimensions to see what those vectors look like. 81 00:05:59,940 --> 00:06:05,130 So you see that the representations of some sentences are close here, and 82 00:06:05,130 --> 00:06:08,936 it's nice that the model can capture that active and 83 00:06:08,936 --> 00:06:14,325 passive voice doesn't actually matter for the meaning of the sentence. 84 00:06:14,325 --> 00:06:19,032 For example, you see that the sentences "I gave her a card" and 85 00:06:19,032 --> 00:06:22,908 "She was given a card" are very close in this space. 86 00:06:22,908 --> 00:06:28,929 Okay, even though these representations are so nice, this single vector is still a bottleneck. 87 00:06:28,929 --> 00:06:32,222 So you should think about how to avoid that. 88 00:06:32,222 --> 00:06:35,687 And to avoid that, we will go into attention mechanisms, and 89 00:06:35,687 --> 00:06:38,058 this will be the topic of our next video. 90 00:06:38,058 --> 00:06:48,058 [SOUND]
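For reference, here is a minimal sketch of the projection described above, assuming scikit-learn and matplotlib are available; the vectors below are random placeholders, and in a real setting you would use the thought vectors collected from the trained encoder for each sentence.

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Placeholder data: replace with real thought vectors from the encoder.
sentences = ["I gave her a card", "She was given a card", "He ate an apple"]
vectors = np.random.randn(len(sentences), 512)   # (num_sentences, hidden_size)

# Project to two dimensions (sklearn.manifold.TSNE works similarly).
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), sentence in zip(coords, sentences):
    plt.annotate(sentence, (x, y), fontsize=8)
plt.show()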