Hi. This week, we will talk about task-oriented dialog systems. Where can you see task-oriented dialog systems? You can talk to a personal assistant like Apple Siri, Google Assistant, Microsoft Cortana, or Amazon Alexa, and have it solve tasks like setting up a reminder, finding photos of your pet, finding a good restaurant, or anything else. People are really familiar with these personal assistants, and this week we will overview how you can make your own.

You can also write to a chat bot for different reasons: to book tickets, to order food, or to contest a parking ticket, for example. This time you don't use your voice. You type your question to the bot, and you expect the result to come up instantaneously.

What we get from the user when they use our system is either speech or text. If it is speech, we can run it through automatic speech recognition and get text as the result. Either way, what we get is an utterance, and from here on we will assume that the utterance is text. We won't deal with speech, because it is out of scope for this week.

The first thing you need to do when you get an utterance from the user is understand what the user wants. This is the intent classification problem. You should think of it as the following question: which predefined scenario is the user trying to execute? Let's look at a Siri example. I ask Siri, "How long to drive to the nearest Starbucks?", and Siri tells me the result: the traffic to Starbucks is about average, so it should take approximately ten minutes. I had a certain intent here: I wanted to know how long it takes to drive to the nearest Starbucks, and we can mark it up as the intent navigation.time.closest. That means I am interested in the navigation time to the closest thing. I could ask this in many other ways, because natural language gives us a lot of options, but the system still needs to understand that it is the same intent.

I can also ask Siri a different question: "Give me directions to nearest Starbucks." This time I don't care how long it takes, I just need the directions, so Siri shows me the directions on a map. Let's say this is a different intent, navigation.directions.closest. You need to distinguish between different intents, so this is a classification task, and you can measure accuracy here.

One more example: "Give me directions to Starbucks." This time I don't say that I need the time or the nearest Starbucks, so the system doesn't know which Starbucks I want. That's when the system initiates a dialog with me, because it needs additional information: which Starbucks? This is the intent navigation.directions.

How should we think about this dialog, and how does our chat bot or personal assistant track what we are saying to it? You should think of an intent as a form that the user needs to fill in. Each intent has a set of fields, or so-called slots, that must be filled in to execute the user's request. Let's look at an example intent, navigation.directions. To build directions for us, the system needs to know where we want to go and where we are starting from. So let's say we have two slots here, FROM and TO. The FROM slot is optional, because it can default to the user's current geolocation. The TO slot is required: we cannot build directions for you if you don't say where you want to go.

We need a slot tagger to extract slots from the user's utterance. Whenever we get an utterance from the user, we need to know which slots are there and which intent is there. Let's look at a slot filling example. The user says, "Show me the way to History Museum." What we expect from our slot tagger is to highlight the "History Museum" part and tell us that it is the value of the TO slot in our form. You should think of this as sequence tagging, and let me remind you that we solve sequence tagging tasks using BIO scheme coding. Here B marks the word at the beginning of a slot, I marks a word inside a slot, and O marks all other words, which are outside of any slot. In our example, the tags we want to produce for each token are the following: "Show me the way to" are outside of any slot, so those tokens get O; "History" is the beginning of the TO slot, so it gets a B tag; and "Museum" is the inside token of the TO slot, so it gets an I tag. You train this as a sequence tagging task in the BIO scheme, which we overviewed in a previous week.

Let's say that a slot is considered correct if its range and type are both correct. Then we can calculate the following metrics. For recall, we take all the true slots and find the share of them that our system found correctly. For precision, we take all the slots our system found and find the share of them that are correct. You can then evaluate your slot tagger with the F1 measure, which is the harmonic mean of the precision and recall we have just defined.

Okay. So let's see how a form-filling dialog manager can work in a single-turn scenario. That means we give a single utterance to the system and it outputs the result right away. The user says, "Give me directions to San Francisco." We run our intent classifier and it says this is the navigation.directions intent. Then we run the slot tagger and it says that "San Francisco" seems to be the value of the TO slot.
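To make the slot tagger's output and its evaluation concrete, here is a minimal Python sketch. It assumes the common tag naming convention B-TO / I-TO for the TO slot. The function names bio_to_spans and slot_f1, and the second example utterance, are made up for illustration and are not from the lecture. A slot counts as correct only if both its type and its token range match, as defined above.

```python
from typing import List, Set, Tuple

# A slot span: (slot type, index of first token, index one past the last token).
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect slot spans from a BIO tag sequence such as ["O", "B-TO", "I-TO"]."""
    spans: Set[Span] = set()
    slot_type, start = None, 0
    # The extra "O" at the end closes a slot that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == slot_type
        if slot_type is not None and not continues:
            spans.add((slot_type, start, i))
            slot_type = None
        if tag.startswith("B-"):
            slot_type, start = tag[2:], i
        # Assumption: an "I-" tag with no open slot of that type is ignored.
    return spans


def slot_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); a slot is correct if type and range match."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# The single-turn utterance from the lecture, with the tags we expect.
tokens = "Give me directions to San Francisco".split()
tags = ["O", "O", "O", "O", "B-TO", "I-TO"]
slots = {t: " ".join(tokens[s:e]) for t, s, e in bio_to_spans(tags)}
print(slots)  # {'TO': 'San Francisco'}

# Made-up evaluation case: "Give me directions from L.A. to San Francisco",
# where the hypothetical tagger cuts the TO slot short after "San".
gold = bio_to_spans(["O", "O", "O", "O", "B-FROM", "O", "B-TO", "I-TO"])
pred = bio_to_spans(["O", "O", "O", "O", "B-FROM", "O", "B-TO", "O"])
print(slot_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In the last case the tagger finds FROM correctly but gets the range of TO wrong, so only one of the two slots counts as correct, and precision, recall, and F1 all come out at 0.5.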
Then our dialog manager needs to decide what to do with that information. It seems that all the required slots are filled, so we can ask for the route. We can query Google Maps, or any other service that will give us the route, and output it to the user: "Here is your route." That was the simple case, a single-turn dialog.

Let's look at a more difficult example. This time the user starts the conversation like this: "Give me directions from L.A." We run the intent classifier and it says navigation.directions. We run the slot tagger and it says that Los Angeles is the value of the FROM slot. This time the dialog manager looks at this and says, "Okay, a required slot is missing, I don't know where to go. Please ask the user where to go." So the system asks the user, "Where do you want to go?" This is where the second turn of the dialog happens, and the user says, "San Francisco." We run our intent classifier and slot tagger again and, hopefully, they give us the values shown on the slide: the slot tagger fills in "San Francisco" as the TO slot. Now the dialog manager knows: "Okay, I have all the information I need. I can query Google Maps and give you the route." And the assistant outputs, "Here is your route."

The problem is that during the second turn, if we don't know the history of the conversation and only see the utterance "San Francisco", it's really hard to guess that this is the navigation.directions intent and that "San Francisco" fills the TO slot. So we need to add context to our intent classifier and slot tagger, and that context is some information about what happened previously.

Let's see how you can track context in an easy way. We already understand that both the intent classifier and the slot tagger need context, so let's add simple features to both of them. The first feature is the previous utterance's intent, as a categorical feature. So we know what the user wanted in the previous turn, and that information can be valuable for deciding what intent the user has now. We also add the slots filled in so far, with a binary feature for each possible slot, so that during slot tagging the system already knows which slots the user has filled and which they have not, and that helps it decide which slot is correct for the utterance it sees. This simple procedure improves slot tagger F1 by 0.5% and reduces intent classifier error by 6.7%. So this is pretty cool: these are pretty easy features, and they reduce your error. We will review a better way to do this, memory networks, but that will happen later.

Okay. But how do we track a form switch? Imagine that at first the user says, "Give me directions from L.A.", and we ask, "Where do you want to go?" This time the user says, "Forget about it, let's eat some sushi first." This is where we need to understand that the intent has changed, and that we should forget all the previous slots and all the previous information, because we don't need them anymore. The intent classifier gives us navigation.find, and we get a category slot with the value "sushi". The dialog manager understands, "Okay, let's start a new form and find some sushi." We make a query to a database or knowledge base like Yelp, and the assistant outputs, "Okay, here are nearby sushi places." So we can track the form switch by noticing when the intent switches, let's say from navigation.directions to navigation.find.

If we overview the whole system, it looks like this. We have a user, and we get speech or text from them. Then we have a natural language understanding module that outputs the intent and slots for the utterance. Then we have that magic box called the dialog manager, which is responsible for two tasks. The first is dialog state tracking: we need to understand what the user has wanted throughout the conversation and track that state. The second is dialog policy management: there is a certain policy that says, given the current state, whether we need to request some information from the user or inform the user about something. The dialog manager can also query backend services like Google Maps or Yelp. When we are ready to give the user some information, we use a natural language generation box that outputs the speech for the user, so that this is a conversation.

Okay, so let's summarize. We have overviewed a task-oriented dialog system with form filling. How do we evaluate form filling? We evaluate accuracy for the intent classifier and the F1 measure for the slot tagger. In the next video, we will take a closer look at the intent classifier and the slot tagger.
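The form-filling loop summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the FORMS table, the DialogState and FormFillingManager names, and the CATEGORY slot name are hypothetical, the NLU output (intent and slots) is passed in by hand rather than produced by real models, and the backend query is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple, Union

# Hypothetical form definitions: required slots, plus defaults for optional ones.
FORMS = {
    "navigation.directions": {
        "required": ["TO"],
        "defaults": {"FROM": "current location"},
    },
    "navigation.find": {"required": ["CATEGORY"], "defaults": {}},
}

# An action is ("request", slot name) or ("inform", completed form).
Action = Tuple[str, Union[str, Dict[str, str]]]


@dataclass
class DialogState:
    intent: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


class FormFillingManager:
    def __init__(self) -> None:
        self.state = DialogState()

    def track(self, intent: str, slots: Dict[str, str]) -> None:
        """Dialog state tracking: a changed intent starts a new, empty form."""
        if intent != self.state.intent:
            self.state = DialogState(intent=intent)
        self.state.slots.update(slots)

    def act(self) -> Action:
        """Dialog policy: request a missing required slot, otherwise inform."""
        form = FORMS[self.state.intent]  # assumes the intent is a known form
        for name in form["required"]:
            if name not in self.state.slots:
                return ("request", name)
        # A real system would query a backend (maps, search) at this point.
        return ("inform", {**form["defaults"], **self.state.slots})

    def step(self, intent: str, slots: Dict[str, str]) -> Action:
        self.track(intent, slots)
        return self.act()


# Two-turn dialog: "Give me directions from L.A." then "San Francisco".
dm = FormFillingManager()
print(dm.step("navigation.directions", {"FROM": "L.A."}))
# ('request', 'TO')
print(dm.step("navigation.directions", {"TO": "San Francisco"}))
# ('inform', {'FROM': 'L.A.', 'TO': 'San Francisco'})

# Form switch: "Forget about it, let's eat some sushi first."
dm2 = FormFillingManager()
dm2.step("navigation.directions", {"FROM": "L.A."})
print(dm2.step("navigation.find", {"CATEGORY": "sushi"}))
# ('inform', {'CATEGORY': 'sushi'})
```

Keeping track and act as separate methods mirrors the two responsibilities of the dialog manager: state tracking decides what the form currently contains, and the policy decides whether to request more information or inform the user.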