1
00:00:00,000 --> 00:00:04,143
[MUSIC]

2
00:00:04,143 --> 00:00:09,360
So we've seen the logistic regression
model and explored it quite a bit.

3
00:00:09,360 --> 00:00:12,570
And we hinted at what learning means,
finding the best parameters for

4
00:00:12,570 --> 00:00:14,090
those models.

5
00:00:14,090 --> 00:00:18,190
However, we have talked about
features in a kind of abstract way.

6
00:00:18,190 --> 00:00:22,400
We said we have the number of awesomes,
the number of awfuls, and so on.

7
00:00:22,400 --> 00:00:26,310
But we have to think a little bit
harder when our inputs are so-called

8
00:00:26,310 --> 00:00:28,500
categorical variables.

9
00:00:28,500 --> 00:00:30,290
So let's take a little example.

10
00:00:30,290 --> 00:00:35,600
If our inputs x were numeric values like
the number of awesomes, somebody's age,

11
00:00:35,600 --> 00:00:41,106
somebody's salary, it's kind of natural to
multiply that by particular coefficients.

12
00:00:41,106 --> 00:00:45,180
So 1.5 times the number of
awesomes makes sense, or

13
00:00:45,180 --> 00:00:50,520
17 times your salary kind of makes sense
as a numeric value in that score function.

14
00:00:51,630 --> 00:00:55,100
However, you might use categorical
inputs like male or female,

15
00:00:56,110 --> 00:01:01,268
the country of birth, or the postal code,
which in the U.S. is called a zipcode.

16
00:01:01,268 --> 00:01:01,964
In the U.S.

17
00:01:01,964 --> 00:01:06,027
the postal code or zipcode is defined
by five digits.

18
00:01:06,027 --> 00:01:12,335
So for example, 10005 or 98195.

19
00:01:12,335 --> 00:01:16,817
These are numbers that you can
imagine multiplying by a coefficient,

20
00:01:16,817 --> 00:01:20,431
however, they don't really
behave like numeric values,

21
00:01:20,431 --> 00:01:23,119
they behave more like categorical values.

22
00:01:23,119 --> 00:01:30,153
So for example,
98195 is not nine times bigger than 10005.

23
00:01:30,153 --> 00:01:32,830
It's just a different
part of the country.

24
00:01:32,830 --> 00:01:38,560
So even numbers, if they don't
behave like a continuous scale but

25
00:01:38,560 --> 00:01:41,390
behave more like an indicator of
location like in this example,

26
00:01:41,390 --> 00:01:45,340
the indicator of category,
then we still have to encode them

27
00:01:45,340 --> 00:01:48,780
in interesting ways if we're going to
multiply them by some coefficient.

28
00:01:48,780 --> 00:01:53,750
So the question is,
how do we multiply a coefficient like 1.5

29
00:01:53,750 --> 00:01:58,100
or minus 2.7 with these
categorical variables?

30
00:01:58,100 --> 00:02:01,400
And to do this we need to use
what's called an encoding.

31
00:02:01,400 --> 00:02:04,240
An encoding takes an input
which is categorical, for

32
00:02:04,240 --> 00:02:08,640
example country of birth and
tries to encode it using some kind of

33
00:02:08,640 --> 00:02:11,865
numerical values that are naturally
multiplied by some coefficients.

34
00:02:11,865 --> 00:02:14,630
So for example, country of birth,

35
00:02:14,630 --> 00:02:20,015
there might be 196 possible countries or
categories that that value comes from.

36
00:02:20,015 --> 00:02:25,565
And so one way to encode this is
using what's called 1-hot encoding,

37
00:02:25,565 --> 00:02:28,755
where you create one feature for
every possible country.

38
00:02:28,755 --> 00:02:34,985
So for example there might be a feature
for Argentina, a feature for Brazil, and

39
00:02:36,845 --> 00:02:42,380
so on, all the way to a feature for
Zimbabwe.

40
00:02:43,860 --> 00:02:48,960
And so if somebody's born in Brazil then
the feature for Argentina has value 0,

41
00:02:48,960 --> 00:02:53,500
the feature for Brazil has value 1, and
all the other features have value 0.

42
00:02:53,500 --> 00:02:57,610
So only one of these features has value
1 at a time, everything else is 0,

43
00:02:57,610 --> 00:02:59,300
that's why it's called 1-hot.

44
00:02:59,300 --> 00:03:04,980
It's from electrical engineering,
where it means one-on or one-active encoding.

45
00:03:04,980 --> 00:03:09,930
Similarly if somebody's born in Zimbabwe,
we're going to get 0, 0, 0, 0, and

46
00:03:09,930 --> 00:03:15,680
just 1 in the feature h196 which
corresponds to Zimbabwe birth.

47
00:03:15,680 --> 00:03:17,600
So that's one kind of encoding.
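The 1-hot encoding just described can be sketched in a few lines of Python. This is only an illustration: the three-country list is a hypothetical stand-in for the roughly 196 countries mentioned, and the names `COUNTRIES` and `one_hot` are assumptions, not something from the course.

```python
# Hypothetical, shortened category list; the lecture assumes about 196 countries.
COUNTRIES = ["Argentina", "Brazil", "Zimbabwe"]

def one_hot(country, categories=COUNTRIES):
    """Return a list with 1 at the position of `country` and 0 everywhere else."""
    if country not in categories:
        raise ValueError(f"unknown category: {country!r}")
    return [1 if c == country else 0 for c in categories]

print(one_hot("Brazil"))    # [0, 1, 0]
print(one_hot("Zimbabwe"))  # [0, 0, 1]
```

Because exactly one feature is 1, multiplying by the coefficients simply picks out the coefficient of that person's country.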

48
00:03:17,600 --> 00:03:20,061
And implicitly in this module,

49
00:03:20,061 --> 00:03:25,450
we've actually explored a different
kind of encoding for text data.

50
00:03:25,450 --> 00:03:29,583
And we discussed that in the first course,
what's called the Bag of Words encoding.

51
00:03:29,583 --> 00:03:33,473
So a review is defined by text,
and text can have say 10,000

52
00:03:33,473 --> 00:03:38,320
different words that come from it,
or more, many more, millions.

53
00:03:38,320 --> 00:03:43,960
And so what Bag of Words does is take
that text and encode it as counts.

54
00:03:43,960 --> 00:03:48,850
So, for example, I might associate

55
00:03:48,850 --> 00:03:53,267
h1 with the number of awesomes,

56
00:03:53,267 --> 00:03:57,223
h2 with the number of awfuls.

57
00:03:57,223 --> 00:04:04,862
And so on all the way to say h10,000
which might be the number of sushis.

58
00:04:04,862 --> 00:04:06,741
So the number of times
the word sushi appears.

59
00:04:06,741 --> 00:04:12,643
And a particular data point
might have 2 awesomes, 0 awfuls,

60
00:04:12,643 --> 00:04:17,464
0 for a bunch of different things,
and maybe 3 sushis.

61
00:04:17,464 --> 00:04:21,980
And so it becomes a really,
really sparse 10,000-dimensional vector.
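A minimal sketch of the Bag of Words counts described above might look like the following. The three-word vocabulary is a hypothetical stand-in for the 10,000 words, the tokenization is deliberately naive (lowercase, split on whitespace, strip punctuation), and the names `VOCABULARY` and `bag_of_words` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical vocabulary; stands in for h1 ... h10,000 in the lecture.
VOCABULARY = ["awesome", "awful", "sushi"]

def bag_of_words(text, vocabulary=VOCABULARY):
    """Count how many times each vocabulary word appears in `text`."""
    # Naive tokenization: lowercase, split on whitespace, strip punctuation.
    tokens = [t.strip(".,!?;:\"'()") for t in text.lower().split()]
    counts = Counter(tokens)
    # A Counter returns 0 for words that never appear.
    return [counts[word] for word in vocabulary]

review = "Awesome sushi, awesome service. The sushi was fresh, best sushi in town!"
print(bag_of_words(review))  # [2, 0, 3]
```

With a realistic vocabulary, almost every entry would be 0, which is why a sparse representation is usually preferred over a plain list.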

62
00:04:21,980 --> 00:04:26,728
In both of these cases,
we've taken a categorical input, and

63
00:04:26,728 --> 00:04:31,298
defined a set of features,
one for each possible category,

64
00:04:31,298 --> 00:04:35,260
to contain either a single value that is on or
a count.

65
00:04:35,260 --> 00:04:38,455
And we can feed this directly into
the logistic regression model that

66
00:04:38,455 --> 00:04:40,070
we discussed so far.
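To make that concrete, here is a rough sketch of how an encoded feature vector could be fed into a logistic regression score. The coefficient values 1.5 and -2.7 echo the examples above but are otherwise made up, the function name `predict_probability` is hypothetical, and a zero intercept is assumed.

```python
import math

def predict_probability(features, coefficients, intercept=0.0):
    """Sigmoid of the weighted sum of the features."""
    score = intercept + sum(w * h for w, h in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-score))

# Made-up coefficients for the features [#awesome, #awful, #sushi].
coefficients = [1.5, -2.7, 0.0]
features = [2, 0, 3]  # 2 awesomes, 0 awfuls, 3 sushis

# The score is 1.5 * 2 + (-2.7) * 0 + 0.0 * 3 = 3.0.
p = predict_probability(features, coefficients)
print(round(p, 2))  # 0.95
```

The same function works unchanged on a 1-hot vector, since it is just another list of numbers.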

67
00:04:40,070 --> 00:04:42,962
This type of encoding is really
fundamental in practice, and

68
00:04:42,962 --> 00:04:45,420
you should really familiarize
yourself with it.

69
00:04:45,420 --> 00:04:50,249
[MUSIC]