Hi. In this video, we'll talk about context utilization in our NLU. Let me remind you why we need context. We can have a dialogue like this. The user says, "Give me directions from LA," and we understand that we have a missing slot, so we ask, "Where do you want to go?" Then the user says, "San Francisco." When that next utterance comes, it would be very nice if the intent classifier and slot tagger could use the previous context and understand that "San Francisco" is actually the @To slot we are waiting for, that the intent didn't change, and that we have the context for that.

A proper way to do this is called memory networks. Let's see how it might work. We have a history of utterances; let's call them x's. We pass them through a special RNN that encodes them into memory vectors. So, once the utterances have passed through this RNN, we have a set of memory vectors, and these are dense vectors, just the kind neural networks like. Okay, so we can encode all the utterances we had before into the memory.

Let's see how we can use that memory. When a new utterance comes, and this is utterance c in the lower left corner, we encode it into a vector of the same size as our memory vectors, using a special RNN for that, called the RNN for input. That orange u vector is the representation of our current utterance, and what we need to do is match this current utterance against all the utterances we had before in that memory. For that, we take a dot product with the representations of the previous utterances, and after applying a softmax, we get a knowledge attention distribution. So we know which previous knowledge is relevant to our current utterance and which is not. Then we can take all the memory vectors, weight them by this attention distribution, and get a final vector which is a weighted sum.
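To make this retrieval step concrete, here is a minimal sketch of the memory encoding and knowledge attention described above, written in PyTorch. The GRU encoders, the variable names, and all dimensions are illustrative assumptions; the lecture does not specify an exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the lecture does not give any dimensions.
VOCAB_SIZE, EMB_DIM, MEM_DIM = 1000, 64, 128

embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
rnn_memory = nn.GRU(EMB_DIM, MEM_DIM, batch_first=True)  # the "special RNN" for memory
rnn_input = nn.GRU(EMB_DIM, MEM_DIM, batch_first=True)   # the "RNN for input"

def encode(utterance_ids, rnn):
    """Encode one utterance (a 1 x seq_len tensor of token ids) into a single vector."""
    _, last_hidden = rnn(embed(utterance_ids))
    return last_hidden.squeeze(0).squeeze(0)              # shape: (MEM_DIM,)

# History x_1..x_n -> memory vectors m_1..m_n (toy random token ids here)
history = [torch.randint(0, VOCAB_SIZE, (1, 5)) for _ in range(3)]
memory = torch.stack([encode(x, rnn_memory) for x in history])  # (n, MEM_DIM)

# Current utterance c -> representation u (the orange vector)
current = torch.randint(0, VOCAB_SIZE, (1, 6))
u = encode(current, rnn_input)                                   # (MEM_DIM,)

# Dot product with each memory vector + softmax = knowledge attention distribution
attention = F.softmax(memory @ u, dim=0)                         # (n,)

# Weighted sum of the memory vectors = retrieved context
h = attention @ memory                                           # (MEM_DIM,)
```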
We can add this weighted sum to our representation of the current utterance, which is the orange vector, pass that through some fully connected layers, and get a final vector o, which is the knowledge encoding of our current utterance together with the knowledge that we had before.

What do we do with that vector? It accumulates all the context of the dialogue that we had before, so we can use it in our RNN for tagging, let's say. Now, let's see how we can feed that knowledge vector into the tagging RNN. We can add it as an input at every step of our RNN tagger; it is a memory vector that doesn't change across steps. If we train the whole thing end to end, we might get better quality because we use context here.

Okay, so this is an overview of the whole architecture. We have historical utterances, and we use a special RNN to turn them into memory vectors. Then, when a new utterance comes, we use an attention mechanism, so we know which prior knowledge is relevant to us at the current stage and which is not. And we use that information in the RNN tagger that gives us the slot tagging sequence.

Let's see how it actually works. We evaluate the slot tagger on a multi-turn dataset, where the dialogues are long, and we measure the F1 score. Let's compare an RNN tagger without context with this memory network architecture. We can see that this model performs better, not only on the first turn but on the consecutive turns as well. Overall, it gives a significant improvement in F1, something like 47 compared with 6 to 7.

So, let me summarize. You can make your NLU context-aware with memory networks. In the previous videos, we looked at how you can do that in a simpler manner, but memory networks seem to be the right approach to this. In the next video, we will take a look at lexicon utilization in our NLU. You can think of a lexicon as, let's say, a list of all music artists. We already know that this is a knowledge base, so let's try to use it in our intent classifier and slot tagger.
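As a continuation of the sketch above (reusing embed, current, u, h, EMB_DIM, and MEM_DIM from it), here is one possible way to form the knowledge vector o and feed it, unchanged, as an extra input at every step of the tagger RNN. The single linear layer, the tanh combination, and all sizes are assumptions for illustration, not the exact architecture from the lecture.

```python
# Hypothetical tag-set size and tagger hidden size.
NUM_TAGS, TAG_HIDDEN = 20, 128

knowledge_layer = nn.Linear(MEM_DIM, MEM_DIM)           # the "fully connected layers"
tagger_rnn = nn.GRU(EMB_DIM + MEM_DIM, TAG_HIDDEN, batch_first=True)
tag_proj = nn.Linear(TAG_HIDDEN, NUM_TAGS)

# Combine the current-utterance vector u and the retrieved context h into o.
o = torch.tanh(knowledge_layer(u + h))                   # knowledge encoding, (MEM_DIM,)

# The same o is concatenated to the token embedding at every timestep of the tagger.
tokens = embed(current)                                  # (1, seq_len, EMB_DIM)
o_repeated = o.expand(1, tokens.size(1), MEM_DIM)        # (1, seq_len, MEM_DIM)
tagger_out, _ = tagger_rnn(torch.cat([tokens, o_repeated], dim=-1))
tag_scores = tag_proj(tagger_out)                        # (1, seq_len, NUM_TAGS) slot scores
```

Trained end to end with a per-token cross-entropy loss over tag_scores, this gives the tagger access to the dialogue context, which is the point the lecture makes.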