Sentiment classification is the task of looking at a piece of text and telling whether someone likes or dislikes the thing they're talking about. It is one of the most important building blocks in NLP and is used in many applications. One of the challenges of sentiment classification is that you might not have a huge labeled training set for it. But with word embeddings, you're able to build good sentiment classifiers even with only modest-sized labeled training sets. Let's see how you can do that.

So here's an example of a sentiment classification problem. The input X is a piece of text, and the output Y that you want to predict is the sentiment, such as the star rating of, let's say, a restaurant review. So if someone says, "The dessert is excellent," they give it a four-star review; "Service was quite slow," a two-star review; "Good for a quick meal, but nothing special," a three-star review. And this is a pretty harsh review, "Completely lacking in good taste, good service, and good ambiance" — that's a one-star review.

So if you can train a system to map from X to Y based on a labeled data set like this, then you could use it to monitor comments that people are making about, say, a restaurant that you run. People might also post messages about your restaurant on social media — on Twitter, or Facebook, or Instagram, or other platforms. And if you have a sentiment classifier, it can look at just a piece of text and figure out how positive or negative the sentiment of the poster is toward your restaurant. Then you can keep track of whether there are any problems, or whether your restaurant is getting better or worse over time.

So one of the challenges of sentiment classification is that you might not have a huge labeled data set. For a sentiment classification task, training sets with anywhere from maybe 10,000 to 100,000 words would not be uncommon, and sometimes even smaller than 10,000 words. Word embeddings can help you do much better, especially when you have a small training set. So here's what you can do. We'll go over a couple of different algorithms in this video.
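To make the setup concrete, here is what such a labeled data set might look like in Python, using the reviews quoted above. The exact data format (a list of text/star-rating pairs) is just an illustrative assumption, not something specified in the lecture:

```python
# Labeled examples: (review text X, star rating Y)
train_set = [
    ("The dessert is excellent.", 4),
    ("Service was quite slow.", 2),
    ("Good for a quick meal, but nothing special.", 3),
    ("Completely lacking in good taste, good service, and good ambiance.", 1),
]
```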
Here's a simple sentiment classification model. You can take a sentence like "The dessert is excellent" and look up those words in your dictionary — we'll use a 10,000-word dictionary as usual — and build a classifier to map it to the output Y, that this was a four-star review. So given these four words, as usual, you can look up the one-hot vector for each word. So there's o_8928, a one-hot vector, which you multiply by the embedding matrix E, which can be learned from a much larger text corpus. You can learn the embeddings from, say, a billion words or a hundred billion words, and use that to extract the embedding vector for the word "the", and then do the same for "dessert", for "is", and for "excellent". And if E was trained on a very large data set, like a hundred billion words, then this allows you to take a lot of knowledge, even about infrequent words, and apply it to your problem — even for words that weren't in your labeled training set.

Now here's one way to build a classifier: you can take these vectors — let's say they are 300-dimensional vectors — and just sum or average them. I'm going to put a bigger average operator here; you could use sum or average. This gives you a 300-dimensional feature vector that you then pass to a softmax classifier, which outputs Y-hat. So the softmax outputs the probabilities of the five possible outcomes, from one star up to five stars; it's a softmax over the five possible outcomes used to predict Y.

Notice that by using the average operation here, this particular algorithm works for reviews that are short or long, because even for a review that is 100 words long, you can just sum or average the feature vectors for all 100 words, and that gives you a 300-dimensional feature representation that you can then pass into your sentiment classifier. So this average will work decently well. What it really does is average, or sum, the meanings of all the words in your example, and this will work reasonably well.
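Here is a minimal NumPy sketch of this averaging model, under stated assumptions: the embedding matrix E and softmax weights are random placeholders (in practice E would be pretrained on a large corpus), and all dictionary indices except 8928 for "the" are made up:

```python
import numpy as np

vocab_size, emb_dim, n_classes = 10000, 300, 5
E = np.random.randn(emb_dim, vocab_size) * 0.01   # stand-in for a pretrained embedding matrix
W = np.random.randn(n_classes, emb_dim) * 0.01    # softmax weights (to be trained)
b = np.zeros(n_classes)                           # softmax bias

def softmax(z):
    z = z - z.max()                               # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_stars(word_indices):
    # E[:, i] is the embedding of word i, i.e. E @ one_hot(i)
    avg = np.mean([E[:, i] for i in word_indices], axis=0)  # average the word embeddings
    return softmax(W @ avg + b)                   # probabilities over 1..5 stars

# "the dessert is excellent" -> dictionary indices (8928 for "the"; the rest are hypothetical)
probs = predict_stars([8928, 2468, 4539, 3162])
```

Because the sentence is reduced to a single averaged 300-dimensional vector before the softmax, the same model handles a 4-word review and a 100-word review identically.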
So one of the problems with this algorithm is that it ignores word order. In particular, this is a very negative review, "Completely lacking in good taste, good service, and good ambiance," but the word "good" appears a lot — good, good, good. So if you use an algorithm like this, which ignores word order and just sums or averages all of the embeddings for the different words, then you end up with a lot of the representation of "good" in your final feature vector, and your classifier will probably think this is a good review, even though it is actually very harsh — a one-star review.

So here's a more sophisticated model: instead of just summing all of your word embeddings, you can use an RNN for sentiment classification. Here's what you can do. You can take that review, "Completely lacking in good taste, good service, and good ambiance," and find the one-hot vector for each word. I'm going to skip drawing the one-hot vector representation, but you take the one-hot vectors, multiply them by the embedding matrix E as usual, and this gives you the embedding vectors, which you can then feed into an RNN. The job of the RNN is to compute the representation at the last time step that allows you to predict Y-hat. So this is an example of a many-to-one RNN architecture, which we saw in the previous week. An algorithm like this will be much better at taking the word sequence into account: it can realize that "lacking in good taste" is a negative review and that "not good" is negative, unlike the previous algorithm, which just sums everything together into a big word-vector mush and doesn't realize that "not good" has a very different meaning from "good", or from "lacking in good taste", and so on.
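Here is a rough sketch of the many-to-one architecture using Keras. The specific choices below — an LSTM (rather than a basic RNN cell) with 128 units, and a trainable Embedding layer — are illustrative assumptions, not details given in the lecture:

```python
import tensorflow as tf

vocab_size, emb_dim = 10000, 300  # dictionary size and embedding dimension from the lecture

# Many-to-one RNN: word indices -> embeddings -> recurrent layer -> softmax over 1-5 stars.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, emb_dim),
    tf.keras.layers.LSTM(128),                       # returns only the last time step's state
    tf.keras.layers.Dense(5, activation="softmax"),  # probabilities for one to five stars
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The LSTM's final hidden state plays the role of the last-time-step representation described above, so the softmax sees a summary of the whole word sequence rather than an order-blind average.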
And so if you train this algorithm, you end up with a pretty decent sentiment classification algorithm. And because your word embeddings can be trained on a much larger data set, this will do a better job generalizing, even to new words that don't appear in your training set. For example, if someone else writes, "Completely absent of good taste, good service, and good ambiance," then even if the word "absent" is not in your labeled training set, if it was in the one billion or hundred billion word corpus used to train the word embeddings, the algorithm might still get this right. It generalizes much better, even to words that were in the corpus used to train the word embeddings but not necessarily in the labeled training set you had specifically for the sentiment classification problem.

So that's it for sentiment classification, and I hope this gives you a sense of how, once you've learned a word embedding or downloaded one from online, it allows you to quite quickly build pretty effective NLP systems.
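One common way to get this benefit in practice is to initialize the Embedding layer with vectors pretrained on the large corpus and keep them fixed while training the classifier. This is a sketch under assumptions: the file name and the shape of the saved array are hypothetical, and the rest of the model mirrors the RNN sketch above:

```python
import numpy as np
import tensorflow as tf

vocab_size, emb_dim = 10000, 300

# Hypothetical file holding pretrained vectors with shape (vocab_size, emb_dim),
# e.g. extracted from embeddings trained on a billion-word corpus.
pretrained_vectors = np.load("embeddings.npy")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, emb_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained_vectors),
        trainable=False),                        # freeze the pretrained embeddings
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(5, activation="softmax"),
])
```

Because the embedding for a word like "absent" comes from the large pretraining corpus, the classifier can handle it sensibly even if it never appeared in the small labeled review data.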