I would say that there are three main groups of methods in NLP. One group would be rule-based approaches; for example, regular expressions would go into this group. Another one would be traditional machine learning. And the last one would be deep learning, which has recently gained a lot of popularity in NLP.

In this video, I want to go through all three approaches using the example of one particular task, so that you get some flavor of all of them. The task could be called semantic slot filling. You can see the query at the bottom of the slide, which says "Show me flights from Boston to San Francisco on Tuesday." So you have some sequence of words, and you want to find some slots. The slots would be the destination, or the origin, or some date, and so on. And to fill those slots you can use different approaches.

This slide is about context-free grammars, so it is a rule-based approach. A context-free grammar shows you the rules for producing the words. For example, you can see that the non-terminal SHOW can produce the words "show me", or "can I see", or something like that. Some other non-terminals, for example ORIGIN, can produce "from" followed by CITY, and the CITY non-terminal can then produce specific cities from a list. Once you have this context-free grammar, you can use it to parse your data: you take the sequence and say which non-terminals produced which words.

So what would be the advantages and disadvantages of this approach? Well, this approach is usually built manually. You have to write all those rules yourself, or some linguist has to come and write them for you, so obviously this is very time-consuming. Also, the recall of this approach would not be very good, because you cannot write down all the possible cities: there are so many of them, and language is so varied. The positive thing, though, is the precision of this approach. Usually, rule-based approaches have high precision but low recall.
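For illustration, a toy version of such a grammar could be sketched in Python with NLTK. The productions below are made up for this one example query and are not a grammar from the lecture; a real system would need far more rules.

```python
import nltk  # assumes NLTK is installed

# A tiny, made-up context-free grammar covering only the example query.
grammar = nltk.CFG.fromstring("""
  S           -> SHOW FLIGHTS ORIGIN DESTINATION DATE
  SHOW        -> 'show' 'me' | 'can' 'i' 'see'
  FLIGHTS     -> 'flights'
  ORIGIN      -> 'from' CITY
  DESTINATION -> 'to' CITY
  DATE        -> 'on' DAY
  CITY        -> 'boston' | 'san' 'francisco'
  DAY         -> 'tuesday' | 'wednesday'
""")

parser = nltk.ChartParser(grammar)
tokens = "show me flights from boston to san francisco on tuesday".split()
for tree in parser.parse(tokens):
    tree.pretty_print()  # shows which non-terminals produced which words
```

The parse tree directly gives the slots: whatever CITY under ORIGIN covers is the departure city, whatever falls under DATE is the date, and so on.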
Now, another approach would be to build a machine learning system. To do that, first of all you need some training data: you need a corpus with some markup. Here, you have a sequence of words, and you know that certain phrases carry certain tags, like origin, destination, and date.

After you have your training data, you need to do some feature engineering. You need to create features such as: is the word capitalized? Does this word occur in some list of cities? And so on.

Then you need to define your model. A probabilistic model would, for example, produce the probabilities of your tags given your words. These can be different kinds of models, and we will explore a lot of them in our course. But generally, these models have some parameters, and they depend on the features that you have just generated. The parameters of the model should be trained: you take your training data and fit your model to it, so you maximize the probability of what you see with respect to the parameters. This way you fix the parameters of the model, and you can apply the model to the test data. For inference, you apply the model and find the most probable tags for your words given the fixed parameters. This stage is called inference, or test, or deployment, or something like that. So this is just the general framework: you have some parameters, you train them, and then you apply your model.

A similar thing happens in the deep learning approach. There you also have these stages, but usually you do not have the feature generation stage. What you do instead is feed your sequence of words as is into some neural network. I will not go into the details of the neural network now; we will have time for those details later. I just want to show you the idea: you feed in your words as one-hot vectors, that is, vectors with a single non-zero element at the position corresponding to the index of the word in the vocabulary, and zeros everywhere else. You feed these vectors into some neural network with a complicated architecture and lots of parameters, you fit those parameters, and then you apply the network to your test data to get the tags out of the model.
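To make the two pipelines a bit more concrete, here is a small sketch that is not from the lecture: a hand-crafted feature function of the kind a traditional sequence model would use, and the one-hot word vectors a neural network would consume instead. The vocabulary, city list, and feature names are made up for the example.

```python
import numpy as np

# Made-up toy vocabulary; in practice it is built from the training corpus.
vocab = {"show": 0, "me": 1, "flights": 2, "from": 3, "boston": 4,
         "to": 5, "san": 6, "francisco": 7, "on": 8, "tuesday": 9}
city_list = {"boston", "san francisco", "denver"}

def handcrafted_features(words, i):
    """Traditional ML: features for the i-th word, engineered by hand."""
    word = words[i]
    return {
        "is_capitalized": word[0].isupper(),
        "in_city_list": word.lower() in city_list,
        "prev_is_from": i > 0 and words[i - 1].lower() == "from",
        "prev_is_to": i > 0 and words[i - 1].lower() == "to",
    }

def one_hot(word):
    """Deep learning input: all zeros except a 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word.lower()]] = 1.0
    return vec

query = "Show me flights from Boston to San Francisco on Tuesday".split()
print(handcrafted_features(query, 4))           # features for "Boston"
inputs = np.stack([one_hot(w) for w in query])  # matrix fed to the network
print(inputs.shape)                             # (10 words, 10 vocabulary entries)
```

In the traditional pipeline, the model's parameters weight features like these; in the deep learning pipeline, the network learns its own representations directly from the one-hot inputs.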
Deep learning methods perform really well for many tasks in NLP. Sometimes it feels like we forget about traditional approaches, but there are some reasons not to forget about them.

Well, the first reason would be that traditional methods perform really well for some applications. For example, for sequence labeling we can do probabilistic modeling, which we will discuss during week two, and get really good performance.

Another reason would be that some ideas in deep learning methods are really similar to things that were happening in the area before them. For example, the word2vec method, which is actually not even deep learning but is inspired by neural networks, has ideas very similar to those of distributional semantic methods, and in week three of our course we will discuss both of them.

Another reason is that we can sometimes use the knowledge we had in traditional approaches to improve models based on deep learning. For example, word alignments in machine translation and attention mechanisms in neural networks are very similar, and we will see this during week four.

Deep learning methods are indeed fancy, and we have lots of research publications about them at the current conferences, so it looks like this is where the area will go in the future. Obviously, we need to have them in our course as well. So what do we do? Well, I think we will have both of them in parallel: for every task, we will study traditional and deep learning approaches one by one.

And this is all for this video. In the next video, we will see the plan for the next weeks.