Hi. In this video, we will talk about state tracking in a dialog manager. Let me remind you that a dialog manager is responsible for two tasks. The first one is state tracking, and it requires some hand-crafted states. The state tracker can query an external database or knowledge base for additional information. It tracks the evolving state of the dialog and constructs a state estimate after every utterance from the user. The second part is the policy learner, which takes the state estimate as input and chooses the next best action for the dialog system, the agent. You can think of a dialog as follows. We have dialog turns: the system and the user take turns providing input, and when we get input from the user, we get some observations. When we hear something from the user, we update the state of the dialog, and the dialog manager is responsible for tracking that state, because with every new utterance the user can specify more details or change their intent, and all of that affects our state. You can think of the state as something describing what the user ultimately wants. Then, when we have the state, we have to react, so we need to learn a policy, and that is a part of the dialog manager as well. A policy is essentially a rule: what do we need to say when we are in a certain state? Next, we will overview the state tracking part of the dialog manager, the one in the red border. For that, we need to introduce the DSTC2 dataset. DSTC stands for Dialog State Tracking Challenge. The dataset was collected in 2013. It consists of human-computer dialogs about finding a restaurant in Cambridge. It contains 3,000 telephone-based dialogs, and people were recruited for this using Amazon Mechanical Turk. So this collection didn't require experts in the field. These are regular users using the system.
On the computer side, they used several dialog systems: a Markov decision process or a partially observable Markov decision process for tracking the dialog state, and a hand-crafted policy or a policy learned with reinforcement learning. So that is the computer part of the dialog collection. The labeling procedure followed these principles. First, the utterances they got from the users, which were audio, were transcribed using Amazon Mechanical Turk as well. Then these transcriptions were annotated by heuristics, some regular expressions. And then they were checked by experts and corrected by hand. That's how this dataset came into being. So, how do they define the dialog state in this dataset? The dialog state consists of three things. The first one is goals, that is, a distribution over the values of each informable slot in our task. A slot is informable if the user can give it in the utterance as a constraint for our search. The second part of the state is the method, that is, a distribution over methods, namely by name, by constraints, by alternatives, or finished. So these are the methods that we need to track. The user can also request some slots from the system, and this is a part of the dialog state as well. The requested slots are the slots that the user needs, and they are represented as a probability for each requestable slot that it has been requested by the user and the system should inform it. The dataset was marked up in terms of user dialog acts and slots. So an utterance like "What part of town is it?" becomes a request act with the area slot. The request is an act that the user makes, and you can think of it as an intent, and the area slot tells us that the user wants to get the area. Then, we can infer the method from the act and the goals. So if we have an inform act with food equal to Chinese, then it is clear that we need to search by constraints. Let's look at a dialog example. The user says, "I'm looking for an expensive restaurant with Venetian food."
What we need to understand from this is that our state now becomes food = Venetian, price range = expensive, the method is by constraints, and no slots were requested. Then, as the dialog progresses, the user says, "Is there one with Thai food?" And we need to change our state: all the rest is the same, but food is now Thai. And then, when the user is okay with the options that we have provided, they ask, "Can I have the address?" That means that the state of our dialog is the same, but this time the requested slot is the address. So these three components, goals, method, and requested slots, are our context that we need to track after every utterance from the user. Let's look at the results of the competition that was held after the collection of this dataset. If we take the goals, then the best solution had 65 percent correct combinations, which means that every slot and every value was guessed correctly in 65 percent of cases. As for the method, it has 97 percent accuracy, and requested slots have 95 percent accuracy. So it looks like slot tagging is still the most difficult part. How can you do that state tracking? When you looked at that example, it was pretty clear how to change the state of our dialog after those utterances. So maybe, if you train a good NLU, which gives you intents and slots, then you can come up with some hand-crafted rules for the dialog state change. If the user mentions a new slot, you just add it to the state; a mention can also override an existing slot, or it can start filling a new form. So you can come up with some rules to track that state, but you can do better if you use neural networks. This is an example of an architecture that does the following. It uses the previous system output, which says, "Would you like some Indian food?"
Then, it takes the current utterance from the user, like "No, how about Farsi food?" And then we need to parse that system output and user utterance and come up with the current state of our dialog. This is done in the following way. First, we embed the context, that is, the system output from the previous turn. Then, we embed the user utterance, and we also embed candidate pairs of slots and values, like food-Indian, food-Persian, or any others. Then, we do the following. We have a context modelling network that takes the information about the system output and the candidate pairs, uses some type of gating, and uses the user utterance to decide whether this user utterance affects the context or not. There is also a second part which does semantic decoding: it takes the user utterance and the candidate slot-value pairs and decides whether they match or not. And finally, we make a final binary decision on whether these candidate pairs match the user utterance given the previous system output. So in this way, we solve NLU and dialog state tracking simultaneously in a joint model. This is pretty cool. Let's see, for example, how one part of that model can work, and let's look at the utterance representation. We can take our utterance, split it into tokens, and take Word2Vec embeddings or any other embeddings you like. Then, we apply the 1D convolutions that we investigated in week one. You can take convolution windows over bigrams, trigrams, and so forth. And then you can just sum up those vectors, and that's how we get the representation for the utterance. So that is a small part of our architecture, and we don't have time to overview all of those parts. Let's go to the results.
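Before the results, here is a minimal sketch of the utterance representation step just described. It is an illustration under stated assumptions, not the actual implementation of the model: the embeddings are random stand-ins for Word2Vec vectors, the filter weights are random rather than learned, the names `utterance_representation`, `embeddings`, and `filters` are hypothetical, and max-pooling over positions before summing across n-gram sizes is an assumed choice.

```python
import numpy as np


def utterance_representation(tokens, embeddings, filters):
    """Sum of pooled n-gram convolution features for one utterance.

    tokens:     list of words in the utterance
    embeddings: dict word -> vector of size dim (stand-in for Word2Vec)
    filters:    dict n -> weight matrix of shape (n * dim, out_dim)
    """
    dim = next(iter(embeddings.values())).shape[0]
    out_dim = next(iter(filters.values())).shape[1]
    # One vector per token; unknown words map to zeros (an assumption).
    x = np.stack([embeddings.get(t, np.zeros(dim)) for t in tokens])
    result = np.zeros(out_dim)
    for n, weights in filters.items():
        if len(tokens) < n:
            continue  # utterance too short for this window size
        # A 1D convolution with window n: concatenate n consecutive
        # word vectors and apply the same linear map at every position.
        windows = np.stack(
            [x[i:i + n].reshape(-1) for i in range(len(tokens) - n + 1)]
        )
        features = np.maximum(windows @ weights, 0.0)  # ReLU
        # Pool over positions (assumed max-pooling), then sum over n.
        result += features.max(axis=0)
    return result


rng = np.random.default_rng(0)
tokens = "no how about farsi food".split()
dim, out_dim = 8, 16
embeddings = {t: rng.normal(size=dim) for t in tokens}
filters = {n: rng.normal(size=(n * dim, out_dim)) for n in (1, 2, 3)}
vec = utterance_representation(tokens, embeddings, filters)
```

The result is a single fixed-size vector regardless of utterance length, which is what lets the rest of the network compare it with the embedded candidate slot-value pairs.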
If we look at how good that network is, you can see that using the neural belief tracker architecture with convolutional neural networks, you can get 73 percent accuracy for goals, and this is a pretty big improvement. It also improves the request accuracy on our dialog state tracking challenge dataset. So we can see that when we solve the tasks of NLU and dialog state tracking simultaneously, we can get better results. Another dataset worth mentioning is the Frames dataset. It is a pretty recent dataset. It was collected in 2016. It is a human-human goal-oriented dataset, and it is all about booking flights and hotels. It was collected with 12 participants over 20 days, and they collected 1,400 dialogs. The dialogs were collected in human-human interaction, which means that two humans talked to each other via a Slack chat. One of them was the user, who was given a task by the system, something like: find a vacation between certain dates, with a certain destination and a certain place of departure; the dates are not flexible, and if nothing is available, then end the conversation. So the user had this task. The other participant was the wizard, who had access to a searchable database with packages of hotels and round-trip flights, and the wizard's task was to help, via the chat interface, the user who was searching for something. This dataset introduces a new task called frame tracking, which extends state tracking to a setting where several states are tracked simultaneously and users can go back and forth between them and compare results. For example, I want to compare the flight from Atlanta to Caprica with, let's say, the flight from Chicago to New York. I investigate these two options, these are different frames, and I can compare them. So this is a pretty difficult task. How is it annotated? It is annotated with dialog acts, slot types, slot values, and one more thing, references to other frames for each utterance.
And also, we know the current active frame for each utterance. Let's see how it might work. The user says, "2.5 stars will do." What the user does here is inform the system that a category equal to 2.5 is okay. Then, the system might make an offer to the user, like offer a business-class seat for the price of $1,000 in frame six, and it will be converted into the following utterance from the system: "What about a $1,000 business class ticket to San Francisco?" And we know that it is to San Francisco because we have the ID of the frame, so we have all the information for that frame. Let's summarize. We have overviewed the state tracker of a dialog manager. We have discussed the datasets for dialog manager training, and those are the Dialog State Tracking Challenge dataset and the Frames dataset. State tracking can be done with hand-crafted rules on top of a good NLU, or you can do better with neural network approaches, like a joint model for NLU and dialog state tracking. In the next video, we will talk about dialog policies in dialog managers.
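As a companion to the summary, the hand-crafted approach to state tracking can be sketched in a few lines, replaying the restaurant example from the DSTC2 part of this video. This is only an illustrative sketch: it assumes a hypothetical NLU that already returns `(act, slot, value)` tuples, it keeps single values instead of the probability distributions that the dataset defines, it resets requested slots on every turn, and the names `DialogState` and `update_state` as well as the act labels `reqalts` and `bye` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    goals: dict = field(default_factory=dict)    # informable slot -> value
    method: str = "none"                         # how the user is searching
    requested: set = field(default_factory=set)  # slots the user asked for


def update_state(state, acts):
    """Apply hand-crafted rules to the acts parsed from one user utterance.

    acts is a list of (act, slot, value) tuples from a hypothetical NLU.
    """
    goals = dict(state.goals)
    method = state.method
    requested = set()  # simplification: requests only live for one turn
    for act, slot, value in acts:
        if act == "inform":
            goals[slot] = value  # add a new slot or override an old one
            method = "by name" if slot == "name" else "by constraints"
        elif act == "request":
            requested.add(slot)
        elif act == "reqalts":
            method = "by alternatives"
        elif act == "bye":
            method = "finished"
    return DialogState(goals, method, requested)


state = DialogState()
# "I'm looking for an expensive restaurant with Venetian food."
state = update_state(
    state,
    [("inform", "food", "venetian"), ("inform", "pricerange", "expensive")],
)
first_turn = state
# "Is there one with Thai food?"
state = update_state(state, [("inform", "food", "thai")])
second_turn = state
# "Can I have the address?"
state = update_state(state, [("request", "address", None)])
```

After the three turns, the goals hold Thai food and an expensive price range, the method is by constraints, and the only requested slot is the address, which matches the walk-through in the example dialog.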