Hi. In this video, we will talk about state tracking in the dialog manager. Let me remind you that dialog managers are responsible for two tasks. The first one is state tracking, and it usually requires some hand-crafted states. It can query an external database or knowledge base for additional information, it tracks the evolving state of the dialog, and it constructs a state estimate after every utterance from the user. The other part is the policy learner: the part that takes the state estimate as input and chooses the next best action for the agent.

You can think of a dialog as follows. We have dialog turns; the system and the user provide some input, and when we get input from the user, we get some observations. We hear something from the user, and when we do, we update the state of the dialog; the dialog manager is responsible for tracking that state. With every new utterance, the user can specify more details or change their intent, and all of that affects our state. You can think of the state as something describing what the user ultimately wants.
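As a toy illustration (my own sketch, not from the lecture), you can picture the evolving state as a small structure that is updated after every user utterance; the slot names and the merge rule here are invented:

```python
# Toy sketch of dialog state tracking: the state accumulates what the
# user ultimately wants, and each new utterance can add new details or
# override old ones. Slot names and NLU outputs are made up.

def update_state(state, nlu_result):
    """Merge the NLU reading of one utterance into the dialog state."""
    new_state = dict(state)
    new_state.update(nlu_result)  # new slots added, existing ones overridden
    return new_state

state = {}
state = update_state(state, {"food": "venetian", "price": "expensive"})
state = update_state(state, {"food": "thai"})  # the user changed their mind
```

A real tracker keeps distributions over slot values rather than hard assignments, but the turn-by-turn update is the same idea.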
Then, when we have the state, we have to do something, we have to react, and we need to learn a policy; that is part of the dialog manager as well. A policy is essentially a rule: what do we need to say when we are in a certain state?

Next, we will review the state-tracking part of the dialog manager, the part highlighted with the red border on the slide. For that, we will need to introduce the DSTC 2 dataset, from the Dialog State Tracking Challenge. It was collected in 2013 and consists of human-computer dialogs about finding a restaurant in Cambridge. It contains 3,000 telephone-based dialogs, and people were recruited for this using Amazon Mechanical Turk. So this collection didn't assume that we need experts in the field; these are regular users using our system. Several dialog systems were used on the computer side: a Markov decision process or a partially observable Markov decision process for tracking the dialog state, and a hand-crafted policy or a policy learned with reinforcement learning.

The labeling procedure then followed these principles. First, the utterances from the user, which were audio, were transcribed using Amazon Mechanical Turk as well.
Then these transcriptions were annotated by heuristics, some regular expressions, and then they were checked by experts and corrected by hand. That's how this dataset came into being.

So, how do they define the dialog state in this dataset? The dialog state consists of three things. The first one is goals: a distribution over the values of each informable slot in our task. A slot is informable if the user can give it in an utterance as a constraint for our search. The second part of the state is the method: a distribution over methods, namely by name, by constraints, by alternatives, or finished. These are the methods that we need to track. The user can also request some slots from the system, and this is part of the dialog state as well: for each requestable slot, we track the probability that it has been requested by the user and that the system should inform it.

The dataset was marked up in terms of user dialog acts and slots. So an utterance like "What part of town is it?" can become the act request(area). This is an act that the user makes.
You can think of it as an intent, and the area slot there tells us that the user actually wants to know the area. Then we can infer the method from the act and the goals: if we have inform(food=chinese), it is clear that we need to search by constraints.

Let's look at a dialog example. The user says, "I'm looking for an expensive restaurant with Venetian food." What we need to understand from this is that our state now becomes food=Venetian, price range=expensive, the method is by constraints, and no slots were requested. Then, as the dialog progresses, the user says, "Is there one with Thai food?" We need to change our state: everything else stays the same, but food is now Thai. And then, when the user is okay with the options we have provided, they ask, "Can I have the address?" That means the state of our dialog stays the same, but this time the requested slot is the address. So these three components, goals, method, and requested slots, are the context that we need to track after every utterance from the user.

Now let's look at the results of the competition that was held after the collection of this dataset. The results are the following.
If we take the goals, then the best solution had 65 percent correct combinations; that means every slot and every value was guessed correctly, and that happened 65 percent of the time. As for the method, the best solution reached 97 percent accuracy, and requested slots were at 95 percent accuracy. So it looks like slot tagging is still the most difficult part.

How can you do that state tracking? When you looked at the example, it was pretty clear that after those utterances it is easy to change the state of our dialog. So maybe, if you train a good NLU, which gives you intents and slots, you can come up with some hand-crafted rules for dialog state changes: if the user mentions a new slot, you just add it to the state; the user can also override a slot or start to fill a new form. You can come up with rules to track that state, but you can actually do better with neural networks.

Here is an example of an architecture that does the following. It takes the previous system output, which says, "Would you like some Indian food?" Then it takes the current utterance from the user, like, "No, how about Farsi food?"
Then we need to parse that system output and user utterance and come up with the current state of our dialog. This is done in the following way. First, we embed the context, which is the system output from the previous turn. Then we embed the user utterance, and we also embed candidate slot-value pairs, like food=Indian, food=Persian, or any other. Then we do the following. We have a context modelling network that takes the information about the system output and the candidate pairs, uses a type of gating together with the user utterance, and decides whether this user utterance affects the context or not. There is also a second part, which does semantic decoding: it takes the user utterance and the candidate slot-value pairs and decides whether they match or not. Finally, we make a binary decision: do these candidate pairs match the user utterance, given that previous system output? In this way, we solve NLU and dialog state tracking simultaneously, in a joint model. So this is pretty cool.
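The real model makes this decision with trained neural networks; purely to make the decision structure concrete, here is a drastically simplified, non-neural sketch of my own, where token overlap stands in for the learned semantic decoding and context gating (the helper names and the affirmation word list are invented):

```python
def tokens(text):
    """Crude tokenizer standing in for learned utterance representations."""
    return set(text.lower().replace("?", "").replace(",", "").split())

def tracks_candidate(system_output, user_utterance, slot, value):
    """Binary decision for one candidate slot-value pair.
    Semantic decoding: the value is mentioned directly in the utterance.
    Context gating: the system asked about this value and the user affirmed.
    """
    utt = tokens(user_utterance)
    mentioned = value.lower() in utt                       # semantic decoding
    asked = value.lower() in tokens(system_output)         # context
    affirmed = bool(utt & {"yes", "sure", "right", "ok"})  # gating signal
    return mentioned or (asked and affirmed)

# System: "Would you like some Indian food?"  User: "No, how about Farsi food?"
tracks_candidate("Would you like some Indian food?",
                 "No, how about Farsi food?", "food", "farsi")   # True
tracks_candidate("Would you like some Indian food?",
                 "No, how about Farsi food?", "food", "indian")  # False
```

The neural belief tracker replaces each of these hand-written signals with a learned one, which is exactly why it generalizes where rules break down.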
Let's see, for example, how one part of that model can work; let's look at the utterance representation. We take our utterance, split it into tokens, and take Word2vec embeddings, or any other embeddings you like. Then we apply the 1D convolutions that we investigated in week one; you can take bigrams, trigrams, and so forth. Then you can just sum up those vectors, and that's how we get the representation for the utterance. That is one small part of the architecture, and we don't have time to review all of the parts.

Let's go to the results. If we look at how good that network is, you can see that using the neural belief tracker architecture with convolutional neural networks, you can get 73 percent accuracy for goals, which is a pretty huge improvement. It also improves request accuracy on our Dialog State Tracking Challenge dataset. We can see that when we solve the tasks of NLU and dialog state tracking simultaneously, we can actually get better results.

Another dataset worth mentioning is the Frames dataset.
It is a pretty recent dataset, collected in 2016. It is a human-human, goal-oriented dataset, all about booking flights and hotels. It has 12 participants over 20 days, and they collected 1,400 dialogs. The dialogs were collected in human-human interaction, which means that two humans talked to each other via a Slack chat. One of them was the user, who was given a task by the system: find a vacation between certain dates, with a given destination and origin, and the dates are not flexible; if nothing is available, end the conversation. So the user had this task. The wizard, the other participant, had access to a searchable database with packages, hotels, and round-trip flights, and their task was to provide help, via the chat interface, to the user who was searching for something.

This dataset actually introduces a new task called frame tracking, which extends state tracking to a setting where several states are tracked simultaneously and users can go back and forth between them and compare results.
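Again as a toy sketch of my own (the class and method names are invented), the bookkeeping behind frame tracking can be pictured as a set of numbered frames plus an active-frame pointer, where a new set of constraints opens a new frame instead of overwriting the old one:

```python
# Toy sketch of frame tracking: several dialog states (frames) live
# side by side, and the user can switch between them. Names invented.

class FrameTracker:
    def __init__(self):
        self.frames = {}      # frame id -> slot dict
        self.active = None
        self._next_id = 1

    def new_frame(self, slots):
        """The user starts exploring a new option: open a fresh frame."""
        fid = self._next_id
        self._next_id += 1
        self.frames[fid] = dict(slots)
        self.active = fid
        return fid

    def switch(self, fid):
        """The user refers back to an earlier option."""
        self.active = fid

    def inform(self, slots):
        """New constraints refine the currently active frame only."""
        self.frames[self.active].update(slots)

tracker = FrameTracker()
a = tracker.new_frame({"from": "Atlanta", "to": "Caprica"})
b = tracker.new_frame({"from": "Chicago", "to": "New York"})
tracker.switch(a)                    # comparing: back to the first trip
tracker.inform({"category": "2.5"})  # "2.5 stars will do" refines frame a
```

The hard part the dataset annotates, of course, is deciding from the utterance alone which frame is active and when to open a new one.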
For example, I might simultaneously want to compare a flight from Atlanta to Caprica with, let's say, one from Chicago to New York; I investigate these two options, they are different frames, and I can compare them. So this is a pretty difficult task.

How is it annotated? It is annotated with dialog acts, slot types, slot values, and one more thing: references to other frames, for each utterance. We also have the ID of the currently active frame for each utterance. Let's see how it might work. The user says, "2.5 stars will do." What they do is inform the system that a category equal to 2.5 is okay. Then the system might make an offer to the user, like offering, within frame six, a business suite for the price of $1,000, and this gets converted into the following utterance from the system: "What about a $1,000 business class ticket to San Francisco?" We know that it is to San Francisco because we have the ID of the frame, so we have all the information from that frame.

Let's summarize: we have reviewed the state tracker of a dialog manager.
We have discussed the datasets for dialog manager training: the Dialog State Tracking Challenge and the Frames dataset. State tracking can be done by hand, given a good NLU, or you can do better with neural network approaches, like a joint NLU and dialog manager. In the next video, we will talk about dialog policies in dialog managers.