Hi. In this video, we will talk about the policy learner in the dialogue manager. Let me remind you what policy learning is. We have a dialogue that progresses with time, and after every turn, after every observation from the user, we somehow update our state of the dialogue; the state tracker is responsible for that. Then, once we have a certain state, we actually have to take some action, and we need to figure out the policy that tells us: if you are in this particular state, then this is the action you must take, and this is what we then say to the user.

So let's look at what a dialog policy actually is. It is a mapping from dialog state to agent act. Imagine that we have a conversation with the user. We collect some information from him or her, and we have an internal state that tells us what the user essentially wants, and we need to take some action to continue the dialog. We need that mapping from dialog state to agent act, and this is what a dialog policy essentially is.

Let's look at some policy execution examples. The system might inform the user that the location is 780 Market Street. The user will hear it as follows: "The nearest one is at 780 Market Street." Another example is that the system might request the location of the user, and the user will see it as: "What is the delivery address?" We can train a model to give us an act from a dialog state, or we can do that with hand-crafted rules, which is my favorite.

Okay, so let's look at the simple approach: hand-crafted rules. You have an NLU and a state tracker, and you can come up with hand-crafted rules for the policy. Because if you have a state tracker, you have a state, and if you remember the Dialog State Tracking Challenge dataset, it actually contains a part of the state that holds the requested slots, and we can use that information to understand what to do next: whether we need to tell the user the value of a particular slot, or we should search the database, or something else. So it should be pretty easy to come up with hand-crafted rules for the policy. But it turns out that you can do better with machine learning.
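To make the hand-crafted approach concrete, here is a minimal sketch of such a rule-based policy. The slot names, the state layout, and the act format are illustrative assumptions, not taken from the slides:

```python
# Minimal sketch of a hand-crafted dialog policy (hypothetical slot names and act format).
# The state is assumed to come from an NLU + state tracker, DSTC-style:
# filled slots, plus the slots the user has requested.

REQUIRED_SLOTS = ["cuisine", "area"]  # slots we need before querying the database

def rule_based_policy(state):
    """Map a dialog state to a system act (act_type, slot, value)."""
    # 1. If the user asked for the value of a slot, inform them.
    for slot in state.get("requested_slots", []):
        value = state.get("db_result", {}).get(slot)
        if value is not None:
            return ("inform", slot, value)      # e.g. inform(address=780 Market Street)

    # 2. If some required slot is still missing, ask for it.
    for slot in REQUIRED_SLOTS:
        if slot not in state.get("slots", {}):
            return ("request", slot, None)      # e.g. request(area)

    # 3. Otherwise all constraints are known: query the database and offer a result.
    return ("query_database", None, None)

# Example usage:
state = {"slots": {"cuisine": "pizza"}, "requested_slots": [], "db_result": {}}
print(rule_based_policy(state))                 # -> ("request", "area", None)
```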
There are two ways to optimize dialog policies with machine learning. The first one is supervised learning: in this setting, you train the model to imitate the observed actions of an expert. We have some human-human interactions, one of the humans is an expert, and you just use those observations and try to imitate the actions of the expert. It often requires a large amount of expert-labeled data, and as you know, it is pretty expensive to collect that data, because you cannot use crowdsourcing platforms like Amazon Mechanical Turk. But even with a large amount of training data, parts of the dialog state space may not be well covered, and our system will be blind there.

There is a different approach to this called reinforcement learning. It is a huge field and it is out of our scope, but it deserves an honorable mention. Given only a reward signal, the agent can optimize a dialog policy through interaction with users. Reinforcement learning can require many samples from the environment, which makes learning from scratch with real users impractical; we would just waste the time of our experts. That's why we need simulated users, built from the supervised data, for reinforcement learning. This approach is gaining popularity in dialog policy optimization.

Let's look at how the supervised approach might work. Here is an example of another model that does joint NLU and dialog management policy optimization, and you can see what it does. We have four utterances, which are all the utterances we got from the user so far. We pass each of them through the NLU, which gives us intents and slot tagging, and we can also take the hidden vector, the hidden representation of that phrase, from the NLU and feed it into a subsequent, dialog-level LSTM that comes up with the system action we should execute. So we've got several utterances and the NLU results, and then the LSTM reads those utterances in the latent space from the NLU and decides what to do next. This is pretty cool because here we don't need dialog state tracking; we don't have an explicit state.
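Here is a minimal sketch, in PyTorch, of the supervised idea just described: each utterance is encoded by an NLU into a hidden vector, a dialog-level LSTM reads those vectors in order, and a classifier on top predicts the next system action. The dimensions and the stand-in NLU encoder are assumptions for illustration, not the exact model from the paper:

```python
import torch
import torch.nn as nn

class SupervisedPolicy(nn.Module):
    """Sketch: dialog-level LSTM over per-utterance NLU representations,
    with a softmax classifier over system actions on top."""
    def __init__(self, utt_dim=128, hidden_dim=300, num_actions=20):
        super().__init__()
        # Stand-in for the NLU encoder: in the actual model this representation
        # comes from the joint intent / slot-tagging network's hidden layer.
        self.nlu_encoder = nn.Sequential(nn.Linear(utt_dim, utt_dim), nn.Tanh())
        # Dialog-level LSTM whose hidden state plays the role of the dialog state.
        self.dialog_lstm = nn.LSTM(utt_dim, hidden_dim, batch_first=True)
        # Classifier over the next system action.
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, utterance_features):
        # utterance_features: (batch, num_turns, utt_dim) — one vector per user utterance.
        encoded = self.nlu_encoder(utterance_features)
        _, (h_n, _) = self.dialog_lstm(encoded)   # h_n: (1, batch, hidden_dim)
        return self.action_head(h_n[-1])          # logits over system actions

# Example: 4 user utterances so far, predict the next system action.
model = SupervisedPolicy()
turns = torch.randn(1, 4, 128)                    # dummy utterance representations
action_logits = model(turns)
print(action_logits.softmax(dim=-1).shape)        # torch.Size([1, 20])
```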
As mentioned, the state here is replaced with the hidden state of the LSTM, a vector of latent variables, say 300 of them. So our state is no longer hand-crafted; it becomes a real-valued vector, which is pretty cool. Then we can learn a classifier on top of that LSTM, and it will output the probability of the next system action.

Let's see how it actually works. If we look at the results, there are three models that we compare. The first one is the baseline, which is a classical approach to this problem: a conditional random field for slot tagging and an SVM for action classification. The table reports frame-level accuracies, which means we need to get everything in the current frame right after every utterance, and you can see that the accuracy for the dialog manager is pretty bad here, but for NLU it's okay. The second model is Pipeline-BLSTM: it trains the NLU separately and then trains the bidirectional LSTM for dialog policy optimization on top of that model, but the two parts are trained separately. The third option is to train these two models, the NLU and the bidirectional LSTM which was in blue on the previous slides, end to end, jointly. This increases the dialog manager accuracy by a huge margin, and it actually improves the NLU as well. We have seen that effect of joint training before, and it continues to show up here.

Okay, so what have we looked at? Dialog policy can be done with hand-crafted rules if you have a good NLU and a good state tracker. Or it can be done in a supervised way, where you learn it from data, and you can learn it jointly with the NLU; this way you will not need a state tracker, for example. Or you can do it the reinforcement learning way, but that is a story for a different course.
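As a closing side note on the joint, end-to-end training discussed above, here is a minimal sketch of how the NLU losses (intent and slot tagging) and the policy loss (next system action) might be combined into one objective so that gradients flow through both parts. The loss weights and tensor shapes are assumptions for illustration:

```python
import torch.nn.functional as F

def joint_loss(intent_logits, intent_labels,
               slot_logits, slot_labels,
               action_logits, action_labels,
               w_intent=1.0, w_slots=1.0, w_action=1.0):
    """Sketch of a joint NLU + dialog policy objective.
    Summing the three cross-entropy terms lets the shared NLU encoder receive
    gradients from the action classifier, i.e. end-to-end training."""
    loss_intent = F.cross_entropy(intent_logits, intent_labels)
    # slot_logits: (batch, seq_len, num_slot_tags) -> flatten for token-level CE.
    loss_slots = F.cross_entropy(slot_logits.flatten(0, 1), slot_labels.flatten())
    loss_action = F.cross_entropy(action_logits, action_labels)
    return w_intent * loss_intent + w_slots * loss_slots + w_action * loss_action
```

In the pipeline variant, the NLU would be trained with only the first two terms and then frozen before the policy LSTM is trained; in the joint variant, all three terms are optimized together, which is what gives the improvement in both dialog manager and NLU accuracy described above.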