I would say that there are three main groups of methods in NLP. One group would be rule-based approaches; for example, regular expressions would go into this group. Another one would be traditional machine learning. And the last one would be deep learning, which has recently gained a lot of popularity in NLP.

In this video, I want to go through all three approaches using the example of one particular task, so that you get some flavor of all of them. The task could be called semantic slot filling. You can see the query at the bottom of the slide, which says "Show me flights from Boston to San Francisco on Tuesday." So you have some sequence of words, and you want to find some slots. The slots would be the destination, or the origin, or some date, and so on. And to fill those slots you can use different approaches.

This slide is about context-free grammars, so it is a rule-based approach. A context-free grammar shows you the rules for producing the words. For example, you can see that the non-terminal SHOW can produce the words "show me", or "can I see", or something like that. Some other non-terminals, for example ORIGIN, can produce "from" followed by CITY, and the CITY non-terminal can then produce specific cities from a list. Once you have this context-free grammar, you can use it to parse your data: you take the sequence and say which non-terminals produced which words.

So what would be the advantages and disadvantages of this approach? Well, this approach is usually built manually. You have to write all those rules yourself, or some linguist has to come and write them for you, so obviously this is very time-consuming. Also, the recall of this approach would not be very good, because you cannot write down all the possible cities: there are so many of them, and language is so varied. The positive thing, though, is the precision of this approach. Usually, rule-based approaches have high precision but low recall.
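For illustration, a toy version of such a grammar could be sketched in Python with NLTK. The productions below are made up for this one example query and are not a grammar from the lecture; a real system would need far more rules.

```python
import nltk  # assumes NLTK is installed

# A tiny, made-up context-free grammar covering only the example query.
grammar = nltk.CFG.fromstring("""
  S           -> SHOW FLIGHTS ORIGIN DESTINATION DATE
  SHOW        -> 'show' 'me' | 'can' 'i' 'see'
  FLIGHTS     -> 'flights'
  ORIGIN      -> 'from' CITY
  DESTINATION -> 'to' CITY
  DATE        -> 'on' DAY
  CITY        -> 'boston' | 'san' 'francisco'
  DAY         -> 'tuesday' | 'wednesday'
""")

parser = nltk.ChartParser(grammar)
tokens = "show me flights from boston to san francisco on tuesday".split()
for tree in parser.parse(tokens):
    tree.pretty_print()  # shows which non-terminals produced which words
```

The parse tree directly gives the slots: whatever CITY under ORIGIN covers is the departure city, whatever falls under DATE is the date, and so on.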
Now, another approach would be to build a machine learning system. To do that, first of all you need some training data: you need a corpus with some markup. Here, you have a sequence of words, and you know that certain phrases carry certain tags, like origin, destination, and date.

After you have your training data, you need to do some feature engineering. You need to create features such as: is the word capitalized? Does this word occur in some list of cities? And so on.

Then you need to define your model. A probabilistic model would, for example, produce the probabilities of your tags given your words. These can be different kinds of models, and we will explore a lot of them in our course. But generally, these models have some parameters, and they depend on the features that you have just generated. The parameters of the model should be trained: you take your training data and fit your model to it, so you maximize the probability of what you see with respect to the parameters. This way you fix the parameters of the model, and you can apply the model to the test data. For inference, you apply the model and find the most probable tags for your words given the fixed parameters. This stage is called inference, or test, or deployment, or something like that. So this is just the general framework: you have some parameters, you train them, and then you apply your model.

A similar thing happens in the deep learning approach. There you also have these stages, but usually you do not have the feature generation stage. What you do instead is feed your sequence of words as is into some neural network. I will not go into the details of the neural network now; we will have time for those details later. I just want to show you the idea: you feed in your words as one-hot vectors, that is, vectors with a single non-zero element at the position corresponding to the index of the word in the vocabulary, and zeros everywhere else. You feed these vectors into some neural network with a complicated architecture and lots of parameters, you fit those parameters, and then you apply the network to your test data to get the tags out of the model.
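To make the two pipelines a bit more concrete, here is a small sketch that is not from the lecture: a hand-crafted feature function of the kind a traditional sequence model would use, and the one-hot word vectors a neural network would consume instead. The vocabulary, city list, and feature names are made up for the example.

```python
import numpy as np

# Made-up toy vocabulary; in practice it is built from the training corpus.
vocab = {"show": 0, "me": 1, "flights": 2, "from": 3, "boston": 4,
         "to": 5, "san": 6, "francisco": 7, "on": 8, "tuesday": 9}
city_list = {"boston", "san francisco", "denver"}

def handcrafted_features(words, i):
    """Traditional ML: features for the i-th word, engineered by hand."""
    word = words[i]
    return {
        "is_capitalized": word[0].isupper(),
        "in_city_list": word.lower() in city_list,
        "prev_is_from": i > 0 and words[i - 1].lower() == "from",
        "prev_is_to": i > 0 and words[i - 1].lower() == "to",
    }

def one_hot(word):
    """Deep learning input: all zeros except a 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word.lower()]] = 1.0
    return vec

query = "Show me flights from Boston to San Francisco on Tuesday".split()
print(handcrafted_features(query, 4))           # features for "Boston"
inputs = np.stack([one_hot(w) for w in query])  # matrix fed to the network
print(inputs.shape)                             # (10 words, 10 vocabulary entries)
```

In the traditional pipeline, the model's parameters weight features like these; in the deep learning pipeline, the network learns its own representations directly from the one-hot inputs.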
Deep learning methods perform really well for many tasks in NLP. Sometimes it feels like we forget about traditional approaches, but there are some reasons not to forget about them.

Well, the first reason would be that traditional methods perform really well for some applications. For example, for sequence labeling we can do probabilistic modeling, which we will discuss during week two, and get really good performance.

Another reason would be that some ideas in deep learning methods are really similar to things that were happening in the area before them. For example, the word2vec method, which is actually not even deep learning but is inspired by neural networks, has ideas very similar to those of distributional semantic methods, and in week three of our course we will discuss both of them.

Another reason is that we can sometimes use the knowledge we had in traditional approaches to improve models based on deep learning. For example, word alignments in machine translation and attention mechanisms in neural networks are very similar, and we will see this during week four.

Deep learning methods are indeed fancy, and we have lots of research publications about them at the current conferences, so it looks like this is where the area will go in the future. Obviously, we need to have them in our course as well. So what do we do? Well, I think we will have both of them in parallel: for every task, we will study traditional and deep learning approaches one by one.

And this is all for this video. In the next video, we will see the plan for the next weeks.