1 00:00:00,000 --> 00:00:04,143 [MUSIC] 2 00:00:04,143 --> 00:00:09,360 So we've seen the logistic regression model and explored it quite a bit. 3 00:00:09,360 --> 00:00:12,570 And we hinted at what learning means: finding the best parameters for 4 00:00:12,570 --> 00:00:14,090 those models. 5 00:00:14,090 --> 00:00:18,190 However, we have talked about features in a kind of abstract way. 6 00:00:18,190 --> 00:00:22,400 We said we have the number of awesomes, the number of awfuls, and so on. 7 00:00:22,400 --> 00:00:26,310 But we have to think a little bit harder when our inputs are what are called 8 00:00:26,310 --> 00:00:28,500 categorical variables. 9 00:00:28,500 --> 00:00:30,290 So let's take a little example. 10 00:00:30,290 --> 00:00:35,600 If our inputs x were numeric values like the number of awesomes, somebody's age, 11 00:00:35,600 --> 00:00:41,106 somebody's salary, it's kind of natural to multiply that by particular coefficients. 12 00:00:41,106 --> 00:00:45,180 So 1.5 times the number of awesomes makes sense, or 13 00:00:45,180 --> 00:00:50,520 17 times your salary kind of makes sense as a numeric value in that score function. 14 00:00:51,630 --> 00:00:55,100 However, things are different if you use categorical inputs like gender (male, female), 15 00:00:56,110 --> 00:01:01,268 the country of birth, or the postal code, which in the U.S. is called a zip code. 16 00:01:01,268 --> 00:01:01,964 In the U.S., 17 00:01:01,964 --> 00:01:06,027 the postal code, or zip code, is defined by five digits. 18 00:01:06,027 --> 00:01:12,335 So for example, 10005 or 98195. 19 00:01:12,335 --> 00:01:16,817 These are numbers that you could imagine multiplying by a coefficient. 20 00:01:16,817 --> 00:01:20,431 However, they don't really behave like numeric values; 21 00:01:20,431 --> 00:01:23,119 they behave more like categorical values. 22 00:01:23,119 --> 00:01:30,153 So for example, 98195 is not nine times bigger than 10005.
23 00:01:30,153 --> 00:01:32,830 It's just a different part of the country. 24 00:01:32,830 --> 00:01:38,560 So even with numbers, if they don't behave like a continuous scale but 25 00:01:38,560 --> 00:01:41,390 behave more like an indicator of location, like in this example, 26 00:01:41,390 --> 00:01:45,340 an indicator of category, then we still have to encode them 27 00:01:45,340 --> 00:01:48,780 in interesting ways if we're going to multiply them by some coefficient. 28 00:01:48,780 --> 00:01:53,750 So the question is, how do we multiply a coefficient like 1.5 29 00:01:53,750 --> 00:01:58,100 or -2.7 with these categorical variables? 30 00:01:58,100 --> 00:02:01,400 And to do this we need to use what's called an encoding. 31 00:02:01,400 --> 00:02:04,240 An encoding takes an input which is categorical, for 32 00:02:04,240 --> 00:02:08,640 example country of birth, and tries to encode it using some kind of 33 00:02:08,640 --> 00:02:11,865 numerical values that can naturally be multiplied by some coefficients. 34 00:02:11,865 --> 00:02:14,630 So for example, country of birth, 35 00:02:14,630 --> 00:02:20,015 there might be 196 possible countries or categories that that value comes from. 36 00:02:20,015 --> 00:02:25,565 And so one way to encode this is using what's called 1-hot encoding, 37 00:02:25,565 --> 00:02:28,755 where you create one feature for every possible country. 38 00:02:28,755 --> 00:02:34,985 So for example, there might be a feature for Argentina, a feature for Brazil, and 39 00:02:36,845 --> 00:02:42,380 so on, all the way to a feature for Zimbabwe. 40 00:02:43,860 --> 00:02:48,960 And so if somebody's born in Brazil, then the feature for Argentina has value 0, 41 00:02:48,960 --> 00:02:53,500 the feature for Brazil has value 1, and all the other features have value 0. 42 00:02:53,500 --> 00:02:57,610 So only one of these features has value 1 at a time, everything else is 0, 43 00:02:57,610 --> 00:02:59,300 that's why it's called 1-hot.
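As a rough illustration of the 1-hot encoding just described, here is a minimal Python sketch. The three-country list, the function names `one_hot` and `score`, and the coefficient values are all made up for the example; the lecture's setting would have one feature per country, 196 in total.

```python
# Hypothetical, shortened category list; the lecture mentions 196 possible countries.
COUNTRIES = ["Argentina", "Brazil", "Zimbabwe"]


def one_hot(country, categories=COUNTRIES):
    """Return a list with a 1 at the position of `country` and 0 everywhere else."""
    if country not in categories:
        raise ValueError(f"unknown category: {country!r}")
    return [1 if c == country else 0 for c in categories]


def score(features, weights):
    """Multiply each feature by its coefficient and add the results."""
    return sum(w * h for w, h in zip(weights, features))


# Made-up coefficients, one per country feature (not from the lecture).
coefficients = [1.5, -2.7, 0.3]

brazil = one_hot("Brazil")
print(brazil)                       # [0, 1, 0]
print(score(brazil, coefficients))  # -2.7
```

Because exactly one feature is 1, the score simply picks out the coefficient of that person's country, which is what makes multiplying a category by a coefficient meaningful.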
44 00:02:59,300 --> 00:03:04,980 The term comes from electrical engineering, where it means a one-on or one-active encoding. 45 00:03:04,980 --> 00:03:09,930 Similarly, if somebody's born in Zimbabwe, we're going to get 0, 0, 0, 0, and 46 00:03:09,930 --> 00:03:15,680 just 1 in the feature h196, which corresponds to Zimbabwe birth. 47 00:03:15,680 --> 00:03:17,600 So that's one kind of encoding. 48 00:03:17,600 --> 00:03:20,061 And implicitly in this module, 49 00:03:20,061 --> 00:03:25,450 we've actually explored a different kind of encoding for text data. 50 00:03:25,450 --> 00:03:29,583 And we discussed that in the first course, what's called the Bag of Words encoding. 51 00:03:29,583 --> 00:03:33,473 So a review is defined by text, and text can have say 10,000 52 00:03:33,473 --> 00:03:38,320 different words that come from it, or more, many more, millions. 53 00:03:38,320 --> 00:03:43,960 And so what Bag of Words does is take that text and encode it as counts. 54 00:03:43,960 --> 00:03:48,850 So, for example, I might associate 55 00:03:48,850 --> 00:03:53,267 h1 with the number of awesomes, 56 00:03:53,267 --> 00:03:57,223 h2 with the number of awfuls. 57 00:03:57,223 --> 00:04:04,862 And so on, all the way to say h10,000, which might be the number of sushis. 58 00:04:04,862 --> 00:04:06,741 So the number of times the word sushi appears. 59 00:04:06,741 --> 00:04:12,643 And a particular data point might have 2 awesomes, 0 awfuls, 60 00:04:12,643 --> 00:04:17,464 0 for a bunch of different things, and maybe 3 sushis. 61 00:04:17,464 --> 00:04:21,980 And so it becomes a really, really sparse 10,000-dimensional vector. 62 00:04:21,980 --> 00:04:26,728 In both of these cases, we've taken a categorical input, and 63 00:04:26,728 --> 00:04:31,298 defined a set of features, one for each possible category, 64 00:04:31,298 --> 00:04:35,260 to contain either a single value that is on, or a count.
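The Bag of Words encoding above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-word vocabulary, the sample review, the function name `bag_of_words`, and the naive tokenization are invented for the example, whereas the lecture's vocabulary would have 10,000 words or more.

```python
from collections import Counter

# Hypothetical tiny vocabulary; the lecture's example has 10,000 words or more.
VOCABULARY = ["awesome", "awful", "sushi"]


def bag_of_words(text, vocabulary=VOCABULARY):
    """Return one count per vocabulary word, in vocabulary order."""
    # Naive tokenization (an assumption): lowercase, split on whitespace,
    # and strip basic punctuation from the ends of each token.
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Made-up review text, chosen to match the counts mentioned in the lecture.
review = "Awesome sushi, awesome service! The sushi was fresh, best sushi in town."
features = bag_of_words(review)
print(features)  # [2, 0, 3]

# With a large vocabulary most counts are 0, so it can be convenient to
# store only the nonzero entries.
sparse = {w: c for w, c in zip(VOCABULARY, features) if c > 0}
print(sparse)    # {'awesome': 2, 'sushi': 3}
```

The dense list corresponds to the features h1, h2, and so on from the lecture; the dictionary is one simple way to exploit the sparsity of that vector.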
65 00:04:35,260 --> 00:04:38,455 And we can feed this directly into the logistic regression model that 66 00:04:38,455 --> 00:04:40,070 we've discussed so far. 67 00:04:40,070 --> 00:04:42,962 These types of encodings are really fundamental in practice, and 68 00:04:42,962 --> 00:04:45,420 you should really familiarize yourself with them. 69 00:04:45,420 --> 00:04:50,249 [MUSIC]