Hi. In this video, we will talk about the policy learner in the dialogue manager. Let me remind you what policy learning is. We have a dialog that progresses with time, and after every turn, after every observation from the user, we somehow update our state of the dialog; the state tracker is responsible for that. Then, once we have a certain state, we actually have to take some action, and we need to figure out the policy that tells us: if you have this state, then this is the action you must take, and this is what we then say to the user.

So let's look at what a dialog policy actually is. It is a mapping from dialog state to agent act. Imagine that we have a conversation with the user. We collect some information from him or her, we have an internal state that tells us what the user essentially wants, and we need to take some action to continue the dialog. We need that mapping from dialog state to agent act, and this is what a dialog policy essentially is.

Let's look at some policy execution examples. A system might inform the user that the location is 780 Market Street. The user will hear it as the following: "The nearest one is at 780 Market Street." Another example: the system might request the location of the user, and the user will see it as, "What is the delivery address?" So we have to train a model that gives us an act from a dialog state, or we can do that with hand-crafted rules, which is my favorite.

Okay, so let's look at the simple approach: hand-crafted rules. You have an NLU and a state tracker, and you can come up with hand-crafted rules for the policy. Because if you have a state tracker, you have a state, and if you remember, the Dialog State Tracking Challenge dataset actually contains a part of the state with requested slots, and we can use that information to understand what to do next: whether we need to tell the user the value of a particular slot, or we should search the database, or something else. So it should be pretty easy to come up with hand-crafted rules for the policy; I will show a small sketch of such rules in a moment.

But it turns out that you can do better with machine learning, and there are two ways to optimize dialog policies with machine learning. The first one is supervised learning: in this setting, you train the policy to imitate the observed actions of an expert. We have some human-human interactions, one side is an expert, and you just use those observations and try to imitate the actions of the expert. This often requires a large amount of expert-labeled data, and as you know, it is pretty expensive to collect that data, because you cannot use crowdsourcing platforms like Amazon Mechanical Turk. And even with a large amount of training data, parts of the dialog state space may not be well covered, and our system will be blind there.

So there is a different approach, called reinforcement learning. This is a huge field and it is out of our scope, but it deserves an honorable mention. Given only a reward signal, the agent can optimize the dialog policy through interaction with users. Reinforcement learning can require many samples from the environment, which makes learning from scratch with real users impractical: we would just waste the time of our experts. That is why we need simulated users, built from the supervised data, for reinforcement learning. This is a huge field, and it is gaining popularity in dialog policy optimization. Let's look at how the supervised approach might work.
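Before we do, here is the small rule-based policy sketch I promised. It is a minimal illustration in Python, assuming a hypothetical dialog state dictionary with `filled_slots` and `requested_slots` fields; the field names, slot names, and act names are my own, chosen only to mirror the inform and request examples above, not taken from a particular dataset or library.

```python
# Minimal sketch of a hand-crafted dialog policy: a mapping from a dialog
# state (here, a plain dictionary) to an agent act. All names are illustrative.

def rule_based_policy(state):
    """Map a dialog state to an agent act using hand-crafted rules."""
    # If the user requested a slot whose value we already know, inform them.
    for slot in state.get("requested_slots", []):
        if slot in state.get("filled_slots", {}):
            return ("inform", slot, state["filled_slots"][slot])

    # If a slot needed for the task is still missing, request it from the user.
    for slot in ("location", "cuisine", "price_range"):  # hypothetical task slots
        if slot not in state.get("filled_slots", {}):
            return ("request", slot, None)

    # Otherwise all constraints are known, so search the database.
    return ("query_database", None, None)


# Example: the user asked for the address, which the tracker already filled in.
state = {"filled_slots": {"cuisine": "thai", "location": "780 Market Street"},
         "requested_slots": ["location"]}
print(rule_based_policy(state))  # ('inform', 'location', '780 Market Street')
```

Rules like these are easy to write when the NLU and state tracker are reliable, but, as discussed above, machine learning can push the policy further. So, back to the supervised approach.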
Here is an example of a model that does joint NLU and dialog management policy optimization, and you can see what it does. We have four utterances, which are all the utterances we have got from the user so far. We pass each of them through the NLU, which gives us intents and slot tags, and we can also take the hidden vector, the hidden representation of that phrase from the NLU, and feed it into a subsequent LSTM that decides which system action to execute. So we take several utterances, get the NLU results, and then the LSTM reads those utterances in the latent space of the NLU and decides what to do next. This is pretty cool because here we don't need dialog state tracking; we don't have an explicit state. The state here is replaced with the state of the LSTM, a vector of latent variables, say 300 of them. So our state is no longer hand-crafted; it becomes a real-valued vector. Then we can learn a classifier on top of that LSTM, and it will output the probability of the next system action. I will give a rough code sketch of this architecture at the end of the video.

Let's see how it actually works. If we look at the results, there are three models compared here. The first one is the baseline, the classical approach to this problem: a conditional random field for slot tagging and an SVM for action classification. We measure frame-level accuracy, which means we need to be accurate about everything in the current frame after every utterance, and you can see that the accuracy for the dialog manager is pretty bad here, while for NLU it is okay. The second model is Pipeline-BLSTM: it trains the NLU separately and then trains the bidirectional LSTM for dialog policy optimization on top of that model, but the two models are trained separately. The third option is to train these two models, the NLU and the bidirectional LSTM which was shown in blue on the previous slides, end to end, jointly. This increases the dialog manager accuracy by a huge margin and actually improves the NLU as well. We have seen that effect of joint training before, and it continues to hold here.

Okay, so what have we looked at? Dialog policy can be done with hand-crafted rules if you have a good NLU and a good state tracker. Or it can be done in a supervised way, where you learn it from data, and you can learn it jointly with the NLU, in which case you do not need a state tracker, for example. Or you can do it the reinforcement learning way, but that is a story for a different course.
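Finally, as promised, here is a rough sketch of the joint NLU plus policy idea in Python with PyTorch: a bidirectional LSTM encodes each user utterance for intent and slot predictions, a dialog-level LSTM reads the per-utterance latent vectors, and a classifier on its hidden state scores the next system action. The layer sizes, pooling choice, and class names are my own simplifications for illustration, not the exact architecture behind the results above.

```python
# Rough sketch of joint NLU + dialog policy, assuming made-up sizes and names.

import torch
import torch.nn as nn

class JointNLUPolicy(nn.Module):
    def __init__(self, vocab_size, num_intents, num_slot_tags, num_actions,
                 emb_dim=100, utt_hidden=150, dlg_hidden=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Bidirectional LSTM over the words of one utterance: its outputs
        # feed the NLU heads (intent and slot tagging).
        self.utt_lstm = nn.LSTM(emb_dim, utt_hidden, bidirectional=True,
                                batch_first=True)
        self.intent_head = nn.Linear(2 * utt_hidden, num_intents)
        self.slot_head = nn.Linear(2 * utt_hidden, num_slot_tags)
        # Dialog-level LSTM over per-utterance latent vectors: its hidden
        # state replaces an explicit hand-crafted dialog state.
        self.dlg_lstm = nn.LSTM(2 * utt_hidden, dlg_hidden, batch_first=True)
        self.action_head = nn.Linear(dlg_hidden, num_actions)

    def forward(self, dialog):  # dialog: (num_turns, max_len) word ids
        outputs, _ = self.utt_lstm(self.embed(dialog))       # (turns, len, 2H)
        utt_vectors = outputs.mean(dim=1)                     # (turns, 2H)
        intent_logits = self.intent_head(utt_vectors)         # per-turn intent
        slot_logits = self.slot_head(outputs)                 # per-token tags
        dlg_out, _ = self.dlg_lstm(utt_vectors.unsqueeze(0))  # (1, turns, D)
        action_logits = self.action_head(dlg_out[0, -1])      # next system act
        return intent_logits, slot_logits, action_logits

# Example: four user turns of up to six tokens each; the model scores the
# next system action given everything said so far.
model = JointNLUPolicy(vocab_size=1000, num_intents=10, num_slot_tags=20,
                       num_actions=15)
dialog = torch.randint(1, 1000, (4, 6))
_, _, action_logits = model(dialog)
print(action_logits.shape)  # torch.Size([15])
```

The thing to notice is that the dialog-level LSTM's hidden state plays the role of the dialog state: it is a real-valued vector learned end to end rather than a hand-crafted frame, which is exactly why no separate state tracker is needed in this setup.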