In this video, we will talk about lexicon utilization in our NLU. Why do we want to utilize a lexicon? Let's take the ATIS dataset as an example. The problem with this dataset is that it has a finite set of cities in training, and we don't know whether the model will work for a new city during testing. The good news is that we can get a list of all cities, from Wikipedia or some other source, and we can use it to help our model detect new cities. Another example: imagine you need to fill a slot like "music artist", and all music artists are in a database like musicbrainz.org; you can download it, parse it, and use it for your NLU.

But how can we use it? Let's add lexicon features to our input words. We will overview an approach from the paper you can see in the lower left corner. Let's match every n-gram of the input text against the entries in our lexicon. We take n-grams like "Take me", "me to", "San", and "San Francisco", and all the other possible ones, and match them against the lexicon, the dictionary that we have for, let's say, cities. We will say that a match is successful when the n-gram matches either a prefix or a postfix of an entry from the dictionary, and it is at least half the length of that entry, so that we don't get a lot of spurious matches.

Let's see the matches we might have. "San" might match San Antonio and San Francisco, and the "San Francisco" n-gram can match the San Francisco entry. So we get these matches, and we need to decide which of them is best. When we have overlapping matches, meaning that one word is used in different n-grams, we need to decide which one is better, and we prefer them in the following order. First, we prefer exact matches over partial ones: if the word "San" is part of the exact match "San Francisco", that is preferable to, say, the partial match of "San" with San Antonio. Second, we prefer longer matches over shorter ones. Third, we prefer earlier matches in the sequence over later ones. These three rules give us a unique assignment of our words to non-overlapping matches with our lexicon.
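Here is a minimal Python sketch of this matching and overlap-resolution procedure as just described. The function names, the case-insensitive comparison, and the greedy tie-breaking are illustrative assumptions; the paper's exact implementation may differ.

```python
# Sketch of n-gram-to-lexicon matching with the three preference rules.

def find_matches(tokens, lexicon):
    """Return candidate matches as (start, end, entry, exact) tuples.

    An n-gram tokens[start:end] matches an entry if it equals the
    entry's prefix or postfix and is at least half the entry's length.
    """
    matches = []
    entries = [e.split() for e in lexicon]
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            ngram = [t.lower() for t in tokens[start:end]]
            for entry in entries:
                ent = [t.lower() for t in entry]
                if 2 * len(ngram) < len(ent):
                    continue  # under half the entry length: too spurious
                exact = ngram == ent
                prefix = ent[:len(ngram)] == ngram
                postfix = ent[-len(ngram):] == ngram
                if exact or prefix or postfix:
                    matches.append((start, end, " ".join(entry), exact))
    return matches

def resolve_overlaps(matches):
    """Keep non-overlapping matches, preferring exact over partial,
    then longer over shorter, then earlier over later."""
    ranked = sorted(matches, key=lambda m: (not m[3], -(m[1] - m[0]), m[0]))
    chosen, used = [], set()
    for start, end, entry, exact in ranked:
        span = set(range(start, end))
        if span & used:
            continue  # overlaps a more preferred match already kept
        chosen.append((start, end, entry, exact))
        used |= span
    return sorted(chosen)

tokens = "Take me to San Francisco".split()
cities = ["San Antonio", "San Francisco"]
print(resolve_overlaps(find_matches(tokens, cities)))
# -> [(3, 5, 'San Francisco', True)]
```

Sorting the candidates by (exact, length, position) and then greedily claiming tokens is one simple way to realize the three preference rules.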
So, let's see how we can use that lexicon matching information in our model. We will use a so-called BIOES coding, which stands for Begin, Inside, Outside, End, and Single. We mark a token with B if it matches the beginning of some entry. We use B and I if tokens match as a prefix. We use I and E if tokens match as a postfix, that is, some token in the middle and some token at the end of the entry. And we use S when a single token matches an entire entry.

Let's see an example of such coding for four lexicon dictionaries: location, miscellaneous, organization, and person. We have an utterance like "Hayao Tada, commander of the Japanese North China Area Army." You can see that we have a match in the person lexicon, which gives us B and E, so we know that it is an entity. We also have a full match of "North China Area Army" with the organization lexicon, encoded as B, I, I, E. And we can actually get that full-match encoding even if the entity itself is not in our lexicon. Let's say the lexicon contains "North China History Museum" and, say, some "<country> Area Army" entries. With those two entries, we can take the prefix match from the first one and the postfix match from the second, and together they still give us the same BIOES encoding. So, this is pretty cool: we can detect new entities that we have never seen before.

Okay, so what we do next is encode these letters as one-hot vectors. Let's see how we can add that lexicon information to our model. Say we have an utterance, "We saw paintings of Picasso," and we have a word embedding for every token. To that word embedding we can actually add some lexicon information, and we do it in the following way. Remember the table from the previous slide?
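To make the encoding concrete, here is a small Python sketch that turns non-overlapping matches for one dictionary into per-token BIOES tags, following the rules above (B and I for prefix matches, I and E for postfix matches, S for single-token exact matches). The match-kind labels and the "Kwantung Area Army" lexicon entry are hypothetical.

```python
def bioes_tags(num_tokens, matches):
    """matches: non-overlapping (start, end, entry, kind) spans,
    where kind is 'exact', 'prefix', or 'postfix'."""
    tags = ["O"] * num_tokens
    for start, end, entry, kind in matches:
        if end - start == 1 and kind == "exact":
            tags[start] = "S"            # a single token matches a whole entry
            continue
        for i in range(start, end):
            tags[i] = "I"                # default: inside a match
        if kind in ("exact", "prefix"):
            tags[start] = "B"            # the match begins an entry
        if kind in ("exact", "postfix"):
            tags[end - 1] = "E"          # the match ends an entry
    return tags

tokens = "Hayao Tada commander of the Japanese North China Area Army".split()
# A prefix match against "North China History Museum" plus a postfix match
# against a hypothetical "<country> Area Army" entry reproduce the same
# encoding as an exact "North China Area Army" match:
org = [(6, 8, "North China History Museum", "prefix"),
       (8, 10, "Kwantung Area Army", "postfix")]
print(bioes_tags(len(tokens), org))
# -> ['O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'E']
```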
Let's take the first two words, take the column of that table corresponding to each word, and use one-hot encoding to turn the BIOES letters into numbers. We concatenate that vector with the embedding vector for the word and use the result as input to, let's say, our bidirectional LSTM, and this network will predict the tags for our slot tagger. So, this is a pretty easy way to embed lexicon information in your model.

Let's see how it works. It was benchmarked on a dataset for named entity recognition, and you can see that adding the lexicon improves precision, recall, and F1 measure a little bit, by about one percent. So it seems to work, and it seems that it would be helpful to implement these lexicon features for your real-world dialogue system.

Let's look into some training details. You can sample your lexicon dictionaries so that your model learns not only the lexicon features but also the context of the words. When I say, "Take me to San Francisco," the word that comes after the phrase "take me to" is most likely a destination ("to") slot. We want the model to learn those contextual cues as well, because in the real world we will see entities that were not in our vocabulary before, and our lexicon features will not fire. So, this sampling procedure gives the model the ability to detect unknown entities during testing, which is a pretty cool property.

When you have lexicon dictionaries, you can also augment your dataset, because you can replace slot values with other values from the same lexicon. For example, "Take me to San Francisco" becomes "Take me to Washington," because you can easily replace the San Francisco slot value with Washington when you have the lexicon dictionary.

So, let me summarize. You can add lexicon features to further improve your NLU, because they help you detect the entities that the user mentions, including unknown and long entities like "North China Area Army." In the next video, we will take a look at the dialogue manager.
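Below is a minimal PyTorch sketch of this architecture, assuming four lexicon dictionaries and the BIOES one-hot scheme above. The class name, dimensions, and feature layout are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

BIOES = ["B", "I", "O", "E", "S"]
NUM_DICTS = 4          # location, miscellaneous, organization, person

class LexiconSlotTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        lex_dim = NUM_DICTS * len(BIOES)   # one one-hot block per dictionary
        self.lstm = nn.LSTM(embed_dim + lex_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids, lex_feats):
        # token_ids: (batch, seq); lex_feats: (batch, seq, lex_dim)
        x = torch.cat([self.embed(token_ids), lex_feats], dim=-1)
        h, _ = self.lstm(x)                # BiLSTM over the concatenated input
        return self.out(h)                 # (batch, seq, num_tags) tag logits

# One-hot lexicon features for a 5-token utterance, "We saw paintings of Picasso":
batch, seq = 1, 5
lex = torch.zeros(batch, seq, NUM_DICTS * len(BIOES))
lex[0, 4, 3 * len(BIOES) + BIOES.index("S")] = 1.0  # "Picasso" = S in person dict
model = LexiconSlotTagger(vocab_size=1000, embed_dim=100, hidden_dim=64, num_tags=10)
logits = model(torch.randint(0, 1000, (batch, seq)), lex)
print(logits.shape)   # torch.Size([1, 5, 10])
```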