Hi. In this video, we will talk about state tracking in the dialog manager. Let me remind you that dialog managers are responsible for two tasks. The first one is state tracking, and it usually requires some hand-crafted states. It can query an external database or knowledge base for additional information, it tracks the evolving state of the dialog, and it constructs a state estimate after every utterance from the user. The other part is the policy learner: the part that takes the state estimate as input and chooses the next best action for the agent.

You can think of a dialog as follows. We have dialog turns; the system and the user provide some input, and when we get input from the user, we get some observations. We hear something from the user, and when we do, we update the state of the dialog; the dialog manager is responsible for tracking that state. With every new utterance, the user can specify more details or change their intent, and all of that affects our state. You can think of the state as something describing what the user ultimately wants.
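As a toy illustration (my own sketch, not from the lecture), you can picture the evolving state as a small structure that is updated after every user utterance; the slot names and the merge rule here are invented:

```python
# Toy sketch of dialog state tracking: the state accumulates what the
# user ultimately wants, and each new utterance can add new details or
# override old ones. Slot names and NLU outputs are made up.

def update_state(state, nlu_result):
    """Merge the NLU reading of one utterance into the dialog state."""
    new_state = dict(state)
    new_state.update(nlu_result)  # new slots added, existing ones overridden
    return new_state

state = {}
state = update_state(state, {"food": "venetian", "price": "expensive"})
state = update_state(state, {"food": "thai"})  # the user changed their mind
```

A real tracker keeps distributions over slot values rather than hard assignments, but the turn-by-turn update is the same idea.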
Then, when we have the state, we have to do something, we have to react, and we need to learn a policy; that is part of the dialog manager as well. A policy is essentially a rule: what do we need to say when we are in a certain state?

Next, we will review the state-tracking part of the dialog manager, the part highlighted with the red border on the slide. For that, we will need to introduce the DSTC 2 dataset, from the Dialog State Tracking Challenge. It was collected in 2013 and consists of human-computer dialogs about finding a restaurant in Cambridge. It contains 3,000 telephone-based dialogs, and people were recruited for this using Amazon Mechanical Turk. So this collection didn't assume that we need experts in the field; these are regular users using our system. Several dialog systems were used on the computer side: a Markov decision process or a partially observable Markov decision process for tracking the dialog state, and a hand-crafted policy or a policy learned with reinforcement learning.

The labeling procedure then followed these principles. First, the utterances from the user, which were audio, were transcribed using Amazon Mechanical Turk as well.
Then these transcriptions were annotated by heuristics, some regular expressions, and then they were checked by experts and corrected by hand. That's how this dataset came into being.

So, how do they define the dialog state in this dataset? The dialog state consists of three things. The first one is goals: a distribution over the values of each informable slot in our task. A slot is informable if the user can give it in an utterance as a constraint for our search. The second part of the state is the method: a distribution over methods, namely by name, by constraints, by alternatives, or finished. These are the methods that we need to track. The user can also request some slots from the system, and this is part of the dialog state as well: for each requestable slot, we track the probability that it has been requested by the user and that the system should inform it.

The dataset was marked up in terms of user dialog acts and slots. So an utterance like "What part of town is it?" can become the act request(area). This is an act that the user makes.
You can think of it as an intent, and the area slot there tells us that the user actually wants to know the area. Then we can infer the method from the act and the goals: if we have inform(food=chinese), it is clear that we need to search by constraints.

Let's look at a dialog example. The user says, "I'm looking for an expensive restaurant with Venetian food." What we need to understand from this is that our state now becomes food=Venetian, price range=expensive, the method is by constraints, and no slots were requested. Then, as the dialog progresses, the user says, "Is there one with Thai food?" We need to change our state: everything else stays the same, but food is now Thai. And then, when the user is okay with the options we have provided, they ask, "Can I have the address?" That means the state of our dialog stays the same, but this time the requested slot is the address. So these three components, goals, method, and requested slots, are the context that we need to track after every utterance from the user.

Now let's look at the results of the competition that was held after the collection of this dataset. The results are the following.
If we take the goals, then the best solution had 65 percent correct combinations; that means every slot and every value was guessed correctly, and that happened 65 percent of the time. As for the method, the best solution reached 97 percent accuracy, and requested slots were at 95 percent accuracy. So it looks like slot tagging is still the most difficult part.

How can you do that state tracking? When you looked at the example, it was pretty clear that after those utterances it is easy to change the state of our dialog. So maybe, if you train a good NLU, which gives you intents and slots, you can come up with some hand-crafted rules for dialog state changes: if the user mentions a new slot, you just add it to the state; the user can also override a slot or start to fill a new form. You can come up with rules to track that state, but you can actually do better with neural networks.

Here is an example of an architecture that does the following. It takes the previous system output, which says, "Would you like some Indian food?" Then it takes the current utterance from the user, like, "No, how about Farsi food?"
Then we need to parse that system output and user utterance and come up with the current state of our dialog. This is done in the following way. First, we embed the context, which is the system output from the previous turn. Then we embed the user utterance, and we also embed candidate slot-value pairs, like food=Indian, food=Persian, or any other. Then we do the following. We have a context modelling network that takes the information about the system output and the candidate pairs, uses a type of gating together with the user utterance, and decides whether this user utterance affects the context or not. There is also a second part, which does semantic decoding: it takes the user utterance and the candidate slot-value pairs and decides whether they match or not. Finally, we make a binary decision: do these candidate pairs match the user utterance, given that previous system output? In this way, we solve NLU and dialog state tracking simultaneously, in a joint model. So this is pretty cool.
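The real model makes this decision with trained neural networks; purely to make the decision structure concrete, here is a drastically simplified, non-neural sketch of my own, where token overlap stands in for the learned semantic decoding and context gating (the helper names and the affirmation word list are invented):

```python
def tokens(text):
    """Crude tokenizer standing in for learned utterance representations."""
    return set(text.lower().replace("?", "").replace(",", "").split())

def tracks_candidate(system_output, user_utterance, slot, value):
    """Binary decision for one candidate slot-value pair.
    Semantic decoding: the value is mentioned directly in the utterance.
    Context gating: the system asked about this value and the user affirmed.
    """
    utt = tokens(user_utterance)
    mentioned = value.lower() in utt                       # semantic decoding
    asked = value.lower() in tokens(system_output)         # context
    affirmed = bool(utt & {"yes", "sure", "right", "ok"})  # gating signal
    return mentioned or (asked and affirmed)

# System: "Would you like some Indian food?"  User: "No, how about Farsi food?"
tracks_candidate("Would you like some Indian food?",
                 "No, how about Farsi food?", "food", "farsi")   # True
tracks_candidate("Would you like some Indian food?",
                 "No, how about Farsi food?", "food", "indian")  # False
```

The neural belief tracker replaces each of these hand-written signals with a learned one, which is exactly why it generalizes where rules break down.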
Let's see, for example, how one part of that model can work; let's look at the utterance representation. We take our utterance, split it into tokens, and take Word2vec embeddings, or any other embeddings you like. Then we apply the 1D convolutions that we investigated in week one; you can take bigrams, trigrams, and so forth. Then you can just sum up those vectors, and that's how we get the representation for the utterance. That is one small part of the architecture, and we don't have time to review all of the parts.

Let's go to the results. If we look at how good that network is, you can see that using the neural belief tracker architecture with convolutional neural networks, you can get 73 percent accuracy for goals, which is a pretty huge improvement. It also improves request accuracy on our Dialog State Tracking Challenge dataset. We can see that when we solve the tasks of NLU and dialog state tracking simultaneously, we can actually get better results.

Another dataset worth mentioning is the Frames dataset.
It is a pretty recent dataset, collected in 2016. It is a human-human, goal-oriented dataset, all about booking flights and hotels. It has 12 participants over 20 days, and they collected 1,400 dialogs. The dialogs were collected in human-human interaction, which means that two humans talked to each other via a Slack chat. One of them was the user, who was given a task by the system: find a vacation between certain dates, with a given destination and origin, and the dates are not flexible; if nothing is available, end the conversation. So the user had this task. The wizard, the other participant, had access to a searchable database with packages, hotels, and round-trip flights, and their task was to provide help, via the chat interface, to the user who was searching for something.

This dataset actually introduces a new task called frame tracking, which extends state tracking to a setting where several states are tracked simultaneously and users can go back and forth between them and compare results.
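Again as a toy sketch of my own (the class and method names are invented), the bookkeeping behind frame tracking can be pictured as a set of numbered frames plus an active-frame pointer, where a new set of constraints opens a new frame instead of overwriting the old one:

```python
# Toy sketch of frame tracking: several dialog states (frames) live
# side by side, and the user can switch between them. Names invented.

class FrameTracker:
    def __init__(self):
        self.frames = {}      # frame id -> slot dict
        self.active = None
        self._next_id = 1

    def new_frame(self, slots):
        """The user starts exploring a new option: open a fresh frame."""
        fid = self._next_id
        self._next_id += 1
        self.frames[fid] = dict(slots)
        self.active = fid
        return fid

    def switch(self, fid):
        """The user refers back to an earlier option."""
        self.active = fid

    def inform(self, slots):
        """New constraints refine the currently active frame only."""
        self.frames[self.active].update(slots)

tracker = FrameTracker()
a = tracker.new_frame({"from": "Atlanta", "to": "Caprica"})
b = tracker.new_frame({"from": "Chicago", "to": "New York"})
tracker.switch(a)                    # comparing: back to the first trip
tracker.inform({"category": "2.5"})  # "2.5 stars will do" refines frame a
```

The hard part the dataset annotates, of course, is deciding from the utterance alone which frame is active and when to open a new one.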
For example, I might simultaneously want to compare a flight from Atlanta to Caprica with, let's say, one from Chicago to New York; I investigate these two options, they are different frames, and I can compare them. So this is a pretty difficult task.

How is it annotated? It is annotated with dialog acts, slot types, slot values, and one more thing: references to other frames, for each utterance. We also have the ID of the currently active frame for each utterance. Let's see how it might work. The user says, "2.5 stars will do." What they do is inform the system that a category equal to 2.5 is okay. Then the system might make an offer to the user, like offering, within frame six, a business suite for the price of $1,000, and this gets converted into the following utterance from the system: "What about a $1,000 business class ticket to San Francisco?" We know that it is to San Francisco because we have the ID of the frame, so we have all the information from that frame.

Let's summarize: we have reviewed the state tracker of a dialog manager.
We have discussed the datasets for dialog manager training: the Dialog State Tracking Challenge and the Frames dataset. State tracking can be done by hand, given a good NLU, or you can do better with neural network approaches, like a joint NLU and dialog manager. In the next video, we will talk about dialog policies in dialog managers.