Hi. In this video, we will talk about the policy learner in the dialogue manager. Let me remind you what policy learning is. We have a dialogue that progresses with time, and after every turn, after every observation from the user, we somehow update our state of the dialogue; the state tracker is responsible for that. Then, once we have a certain state, we actually have to take some action, and we need to figure out the policy that tells us: if you are in this particular state, then this is the action you must take, and this is what we then say to the user.

So let's look at what a dialog policy actually is. It is a mapping from dialog state to agent act. Imagine that we have a conversation with the user. We collect some information from him or her, and we have an internal state that tells us what the user essentially wants, and we need to take some action to continue the dialog. We need that mapping from dialog state to agent act, and this is what a dialog policy essentially is.

Let's look at some policy execution examples. The system might inform the user that the location is 780 Market Street. The user will hear it as follows: "The nearest one is at 780 Market Street." Another example is that the system might request the location of the user, and the user will see it as: "What is the delivery address?" We can train a model to give us an act from a dialog state, or we can do that with hand-crafted rules, which is my favorite.

Okay, so let's look at the simple approach: hand-crafted rules. You have an NLU and a state tracker, and you can come up with hand-crafted rules for the policy. Because if you have a state tracker, you have a state, and if you remember the Dialog State Tracking Challenge dataset, it actually contains a part of the state that holds the requested slots, and we can use that information to understand what to do next: whether we need to tell the user the value of a particular slot, or we should search the database, or something else. So it should be pretty easy to come up with hand-crafted rules for the policy. But it turns out that you can do better with machine learning.
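To make the hand-crafted approach concrete, here is a minimal sketch of such a rule-based policy. The slot names, the state layout, and the act format are illustrative assumptions, not taken from the slides:

```python
# Minimal sketch of a hand-crafted dialog policy (hypothetical slot names and act format).
# The state is assumed to come from an NLU + state tracker, DSTC-style:
# filled slots, plus the slots the user has requested.

REQUIRED_SLOTS = ["cuisine", "area"]  # slots we need before querying the database

def rule_based_policy(state):
    """Map a dialog state to a system act (act_type, slot, value)."""
    # 1. If the user asked for the value of a slot, inform them.
    for slot in state.get("requested_slots", []):
        value = state.get("db_result", {}).get(slot)
        if value is not None:
            return ("inform", slot, value)      # e.g. inform(address=780 Market Street)

    # 2. If some required slot is still missing, ask for it.
    for slot in REQUIRED_SLOTS:
        if slot not in state.get("slots", {}):
            return ("request", slot, None)      # e.g. request(area)

    # 3. Otherwise all constraints are known: query the database and offer a result.
    return ("query_database", None, None)

# Example usage:
state = {"slots": {"cuisine": "pizza"}, "requested_slots": [], "db_result": {}}
print(rule_based_policy(state))                 # -> ("request", "area", None)
```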
There are two ways to optimize dialog policies with machine learning. The first one is supervised learning: in this setting, you train the model to imitate the observed actions of an expert. We have some human-human interactions, one of the humans is an expert, and you just use those observations and try to imitate the actions of the expert. It often requires a large amount of expert-labeled data, and as you know, it is pretty expensive to collect that data, because you cannot use crowdsourcing platforms like Amazon Mechanical Turk. But even with a large amount of training data, parts of the dialog state space may not be well covered, and our system will be blind there.

There is a different approach to this called reinforcement learning. It is a huge field and it is out of our scope, but it deserves an honorable mention. Given only a reward signal, the agent can optimize a dialog policy through interaction with users. Reinforcement learning can require many samples from the environment, which makes learning from scratch with real users impractical; we would just waste the time of our experts. That's why we need simulated users, built from the supervised data, for reinforcement learning. This approach is gaining popularity in dialog policy optimization.

Let's look at how the supervised approach might work. Here is an example of another model that does joint NLU and dialog management policy optimization, and you can see what it does. We have four utterances, which are all the utterances we got from the user so far. We pass each of them through the NLU, which gives us intents and slot tagging, and we can also take the hidden vector, the hidden representation of that phrase, from the NLU and feed it into a subsequent, dialog-level LSTM that comes up with the system action we should execute. So we've got several utterances and the NLU results, and then the LSTM reads those utterances in the latent space from the NLU and decides what to do next. This is pretty cool because here we don't need dialog state tracking; we don't have an explicit state.
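Here is a minimal sketch, in PyTorch, of the supervised idea just described: each utterance is encoded by an NLU into a hidden vector, a dialog-level LSTM reads those vectors in order, and a classifier on top predicts the next system action. The dimensions and the stand-in NLU encoder are assumptions for illustration, not the exact model from the paper:

```python
import torch
import torch.nn as nn

class SupervisedPolicy(nn.Module):
    """Sketch: dialog-level LSTM over per-utterance NLU representations,
    with a softmax classifier over system actions on top."""
    def __init__(self, utt_dim=128, hidden_dim=300, num_actions=20):
        super().__init__()
        # Stand-in for the NLU encoder: in the actual model this representation
        # comes from the joint intent / slot-tagging network's hidden layer.
        self.nlu_encoder = nn.Sequential(nn.Linear(utt_dim, utt_dim), nn.Tanh())
        # Dialog-level LSTM whose hidden state plays the role of the dialog state.
        self.dialog_lstm = nn.LSTM(utt_dim, hidden_dim, batch_first=True)
        # Classifier over the next system action.
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, utterance_features):
        # utterance_features: (batch, num_turns, utt_dim) — one vector per user utterance.
        encoded = self.nlu_encoder(utterance_features)
        _, (h_n, _) = self.dialog_lstm(encoded)   # h_n: (1, batch, hidden_dim)
        return self.action_head(h_n[-1])          # logits over system actions

# Example: 4 user utterances so far, predict the next system action.
model = SupervisedPolicy()
turns = torch.randn(1, 4, 128)                    # dummy utterance representations
action_logits = model(turns)
print(action_logits.softmax(dim=-1).shape)        # torch.Size([1, 20])
```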
As mentioned, the state here is replaced with the hidden state of the LSTM, a vector of latent variables, say 300 of them. So our state is no longer hand-crafted; it becomes a real-valued vector, which is pretty cool. Then we can learn a classifier on top of that LSTM, and it will output the probability of the next system action.

Let's see how it actually works. If we look at the results, there are three models that we compare. The first one is the baseline, which is a classical approach to this problem: a conditional random field for slot tagging and an SVM for action classification. The table reports frame-level accuracies, which means we need to get everything in the current frame right after every utterance, and you can see that the accuracy for the dialog manager is pretty bad here, but for NLU it's okay. The second model is Pipeline-BLSTM: it trains the NLU separately and then trains the bidirectional LSTM for dialog policy optimization on top of that model, but the two parts are trained separately. The third option is to train these two models, the NLU and the bidirectional LSTM which was in blue on the previous slides, end to end, jointly. This increases the dialog manager accuracy by a huge margin, and it actually improves the NLU as well. We have seen that effect of joint training before, and it continues to show up here.

Okay, so what have we looked at? Dialog policy can be done with hand-crafted rules if you have a good NLU and a good state tracker. Or it can be done in a supervised way, where you learn it from data, and you can learn it jointly with the NLU; this way you will not need a state tracker, for example. Or you can do it the reinforcement learning way, but that is a story for a different course.
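As a closing side note on the joint, end-to-end training discussed above, here is a minimal sketch of how the NLU losses (intent and slot tagging) and the policy loss (next system action) might be combined into one objective so that gradients flow through both parts. The loss weights and tensor shapes are assumptions for illustration:

```python
import torch.nn.functional as F

def joint_loss(intent_logits, intent_labels,
               slot_logits, slot_labels,
               action_logits, action_labels,
               w_intent=1.0, w_slots=1.0, w_action=1.0):
    """Sketch of a joint NLU + dialog policy objective.
    Summing the three cross-entropy terms lets the shared NLU encoder receive
    gradients from the action classifier, i.e. end-to-end training."""
    loss_intent = F.cross_entropy(intent_logits, intent_labels)
    # slot_logits: (batch, seq_len, num_slot_tags) -> flatten for token-level CE.
    loss_slots = F.cross_entropy(slot_logits.flatten(0, 1), slot_labels.flatten())
    loss_action = F.cross_entropy(action_logits, action_labels)
    return w_intent * loss_intent + w_slots * loss_slots + w_action * loss_action
```

In the pipeline variant, the NLU would be trained with only the first two terms and then frozen before the policy LSTM is trained; in the joint variant, all three terms are optimized together, which is what gives the improvement in both dialog manager and NLU accuracy described above.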