1 00:00:00,000 --> 00:00:04,426 [MUSIC] 2 00:00:04,426 --> 00:00:11,159 Build a sentiment classifier. 3 00:00:11,159 --> 00:00:15,390 And I'm gonna put a # here to make it a nice title. 4 00:00:16,470 --> 00:00:19,110 So this is our next task here. 5 00:00:19,110 --> 00:00:22,810 And so, when you build a sentiment classifier, 6 00:00:22,810 --> 00:00:26,610 you're talking about positives and negatives. 7 00:00:26,610 --> 00:00:28,110 So, thumbs down, and thumbs up. 8 00:00:28,110 --> 00:00:32,840 But if you remember, our product ratings, we're not about positive and 9 00:00:32,840 --> 00:00:35,250 negative, they're numerical things. 10 00:00:35,250 --> 00:00:39,780 So, for example, if I take all the products, and 11 00:00:39,780 --> 00:00:45,350 I'll take the rating column and I do a .show, and we did above just for 12 00:00:45,350 --> 00:00:51,664 the giraffe, but do a .show for everything with the view equal 13 00:00:51,664 --> 00:00:59,150 to Categorical, we're now getting a histogram for 14 00:00:59,150 --> 00:01:04,020 all of the views, and if you take a quick look at it, you'll see that most reviews 15 00:01:04,020 --> 00:01:09,290 are positive across the board 107,000 reviews are five stars. 16 00:01:09,290 --> 00:01:14,970 So most people review positively, and just write reviews about products they like. 17 00:01:14,970 --> 00:01:18,310 They don't typically write reviews about products they don't like. 18 00:01:18,310 --> 00:01:21,755 Then the next set of reviews, 33,000 four stars. 19 00:01:21,755 --> 00:01:26,320 Then 3 stars and again, a lot of people write really bad reviews 1 star, 2 star, 20 00:01:26,320 --> 00:01:28,815 why would you give a review product 2 stars? 21 00:01:28,815 --> 00:01:31,405 You might as well just give them 1 star if you really hated it. 22 00:01:31,405 --> 00:01:34,515 And this is what we observe in the histogram. 23 00:01:34,515 --> 00:01:38,365 But again, for sentiment analysis, we have to define what's thumbs up and 24 00:01:38,365 --> 00:01:39,605 what's thumbs down. 25 00:01:39,605 --> 00:01:42,145 And so I'm gonna make an arbitrary choice here. 26 00:01:42,145 --> 00:01:47,020 Let's say that things that 4, 5 stars are things that people liked. 27 00:01:47,020 --> 00:01:48,510 So those are positives. 28 00:01:48,510 --> 00:01:51,990 Things that 1 and 2 stars are negative. 29 00:01:51,990 --> 00:01:54,840 But the things that are 3 stars, those are kind of in the middle. 30 00:01:54,840 --> 00:01:56,028 So, let's just throw those out. 31 00:01:56,028 --> 00:02:01,665 So we're gonna do a little bit of what we'll call data engineering, 32 00:02:01,665 --> 00:02:06,150 just defining what is a positive and negative sentiment. 33 00:02:06,150 --> 00:02:08,050 So let's do that right now. 34 00:02:08,050 --> 00:02:13,678 So in the subsection we're gonna define 35 00:02:13,678 --> 00:02:20,147 what's a positive and a negative sentiment. 36 00:02:20,147 --> 00:02:25,126 And what I'm gonna 37 00:02:25,126 --> 00:02:30,726 do first is ignore all 38 00:02:30,726 --> 00:02:36,030 city star reviews. 39 00:02:36,030 --> 00:02:37,840 The way to do it is by saying, 40 00:02:37,840 --> 00:02:41,930 okay I'm gonna take the product stable the variable products. 41 00:02:41,930 --> 00:02:47,169 And I'm gonna just select everything out 42 00:02:47,169 --> 00:02:53,016 of the products table whose rating was not 3. 43 00:02:53,016 --> 00:02:58,800 So, products[products['rating'] |= 3]. 44 00:02:59,850 --> 00:03:04,340 So that was our first step in our little data engineering task. 45 00:03:04,340 --> 00:03:07,520 And now, the next step in this task 46 00:03:07,520 --> 00:03:10,430 is to actually find what stems up and stems down. 47 00:03:10,430 --> 00:03:15,661 So I'm gonna say a positive 48 00:03:15,661 --> 00:03:20,893 sentiment equals 4 star or 49 00:03:20,893 --> 00:03:24,310 5 star reviews. 50 00:03:25,870 --> 00:03:27,654 So let's go ahead and 51 00:03:27,654 --> 00:03:33,018 add a new column to our table that defines the actual sentiment. 52 00:03:33,018 --> 00:03:37,871 So products new column called sentiment. 53 00:03:37,871 --> 00:03:41,280 It's gonna be a binary column, zero one. 54 00:03:41,280 --> 00:03:43,680 And, the way we're gonna define the column, is we're gonna say, 55 00:03:43,680 --> 00:03:50,230 of the products rating, is it greater equal to 4? 56 00:03:50,230 --> 00:03:53,430 If it's greater or equal to 4, it's gonna get a 1, and 57 00:03:53,430 --> 00:03:59,080 if it's less than 4, it's gonna get a 0, and 58 00:03:59,080 --> 00:04:04,670 so now if I take another look at my product stable, so head 59 00:04:04,670 --> 00:04:10,500 You should see that I've added a new column on the right called Sentiment. 60 00:04:10,500 --> 00:04:14,840 And most of the sentiments are positive, as we saw earlier, but 61 00:04:14,840 --> 00:04:17,020 many are negative, zero. 62 00:04:19,090 --> 00:04:23,574 Okay, now we're ready, we're finally 63 00:04:23,574 --> 00:04:28,194 ready to train our sentiment classifier. 64 00:04:28,194 --> 00:04:32,279 [MUSIC]