Hi. In this video, we'll talk about context utilization in our NLU. Let me remind you why we need context. We can have a dialogue like this. The user says, "Give me directions from LA," and we understand that we have a missing slot, so we ask, "Where do you want to go?" Then the user says, "San Francisco." When that next utterance comes, it would be very nice if the intent classifier and slot tagger could use the previous context and understand that "San Francisco" is actually the @To slot we are waiting for, that the intent didn't change, and that we have the context for that.

A proper way to do this is called memory networks. Let's see how it might work. We have a history of utterances; let's call them x's. We pass them through a special RNN that encodes them into memory vectors. So, once the utterances have passed through this RNN, we have a set of memory vectors, and these are dense vectors, just the kind neural networks like. Okay, so we can encode all the utterances we had before into the memory.

Let's see how we can use that memory. When a new utterance comes, and this is utterance c in the lower left corner, we encode it into a vector of the same size as our memory vectors, using a special RNN for that, called the RNN for input. That orange u vector is the representation of our current utterance, and what we need to do is match this current utterance against all the utterances we had before in that memory. For that, we take a dot product with the representations of the previous utterances, and after applying a softmax, we get a knowledge attention distribution. So we know which previous knowledge is relevant to our current utterance and which is not. Then we can take all the memory vectors, weight them by this attention distribution, and get a final vector which is a weighted sum.
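To make this retrieval step concrete, here is a minimal sketch of the memory encoding and knowledge attention described above, written in PyTorch. The GRU encoders, the variable names, and all dimensions are illustrative assumptions; the lecture does not specify an exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the lecture does not give any dimensions.
VOCAB_SIZE, EMB_DIM, MEM_DIM = 1000, 64, 128

embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
rnn_memory = nn.GRU(EMB_DIM, MEM_DIM, batch_first=True)  # the "special RNN" for memory
rnn_input = nn.GRU(EMB_DIM, MEM_DIM, batch_first=True)   # the "RNN for input"

def encode(utterance_ids, rnn):
    """Encode one utterance (a 1 x seq_len tensor of token ids) into a single vector."""
    _, last_hidden = rnn(embed(utterance_ids))
    return last_hidden.squeeze(0).squeeze(0)              # shape: (MEM_DIM,)

# History x_1..x_n -> memory vectors m_1..m_n (toy random token ids here)
history = [torch.randint(0, VOCAB_SIZE, (1, 5)) for _ in range(3)]
memory = torch.stack([encode(x, rnn_memory) for x in history])  # (n, MEM_DIM)

# Current utterance c -> representation u (the orange vector)
current = torch.randint(0, VOCAB_SIZE, (1, 6))
u = encode(current, rnn_input)                                   # (MEM_DIM,)

# Dot product with each memory vector + softmax = knowledge attention distribution
attention = F.softmax(memory @ u, dim=0)                         # (n,)

# Weighted sum of the memory vectors = retrieved context
h = attention @ memory                                           # (MEM_DIM,)
```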
We can add this weighted sum to our representation of the current utterance, which is the orange vector, pass that through some fully connected layers, and get a final vector o, which is the knowledge encoding of our current utterance together with the knowledge that we had before.

What do we do with that vector? It accumulates all the context of the dialogue that we had before, so we can use it in our RNN for tagging, let's say. Now, let's see how we can feed that knowledge vector into the tagging RNN. We can add it as an input at every step of our RNN tagger; it is a memory vector that doesn't change across steps. If we train the whole thing end to end, we might get better quality because we use context here.

Okay, so this is an overview of the whole architecture. We have historical utterances, and we use a special RNN to turn them into memory vectors. Then, when a new utterance comes, we use an attention mechanism, so we know which prior knowledge is relevant to us at the current stage and which is not. And we use that information in the RNN tagger that gives us the slot tagging sequence.

Let's see how it actually works. We evaluate the slot tagger on a multi-turn dataset, where the dialogues are long, and we measure the F1 score. Let's compare an RNN tagger without context with this memory network architecture. We can see that this model performs better, not only on the first turn but on the consecutive turns as well. Overall, it gives a significant improvement in F1, something like 47 compared with 6 to 7.

So, let me summarize. You can make your NLU context-aware with memory networks. In the previous videos, we looked at how you can do that in a simpler manner, but memory networks seem to be the right approach to this. In the next video, we will take a look at lexicon utilization in our NLU. You can think of a lexicon as, let's say, a list of all music artists. We already know that this is a knowledge base, so let's try to use it in our intent classifier and slot tagger.
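As a continuation of the sketch above (reusing embed, current, u, h, EMB_DIM, and MEM_DIM from it), here is one possible way to form the knowledge vector o and feed it, unchanged, as an extra input at every step of the tagger RNN. The single linear layer, the tanh combination, and all sizes are assumptions for illustration, not the exact architecture from the lecture.

```python
# Hypothetical tag-set size and tagger hidden size.
NUM_TAGS, TAG_HIDDEN = 20, 128

knowledge_layer = nn.Linear(MEM_DIM, MEM_DIM)           # the "fully connected layers"
tagger_rnn = nn.GRU(EMB_DIM + MEM_DIM, TAG_HIDDEN, batch_first=True)
tag_proj = nn.Linear(TAG_HIDDEN, NUM_TAGS)

# Combine the current-utterance vector u and the retrieved context h into o.
o = torch.tanh(knowledge_layer(u + h))                   # knowledge encoding, (MEM_DIM,)

# The same o is concatenated to the token embedding at every timestep of the tagger.
tokens = embed(current)                                  # (1, seq_len, EMB_DIM)
o_repeated = o.expand(1, tokens.size(1), MEM_DIM)        # (1, seq_len, MEM_DIM)
tagger_out, _ = tagger_rnn(torch.cat([tokens, o_repeated], dim=-1))
tag_scores = tag_proj(tagger_out)                        # (1, seq_len, NUM_TAGS) slot scores
```

Trained end to end with a per-token cross-entropy loss over tag_scores, this gives the tagger access to the dialogue context, which is the point the lecture makes.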