[MUSIC]

So this leads us straight into a discussion of how we're going to compute the distance between two given articles. Well, in 1D one really simple measure we can use is just Euclidean distance. Hopefully this is fairly familiar to you, but it isn't really what interests us here, because it would assume that, in our example, we have just one word in our vocabulary. In almost all the scenarios we're going to think about in this specialization, we assume we have multiple features, or multiple different dimensions, that we want to consider. And in this case things get really interesting, because there are lots of different distance functions we can think about using.

One example of the interesting things we can do in multiple dimensions is to weight the different dimensions differently. So we could put different weights on different words in the vocabulary, or on whatever other features we might have. For example, if you go back to Course 2, when we were talking about regression and predicting the price of a house, we looked at using nearest neighbor regression to predict that house value, and we said we can put different weights on different attributes of the house. If we think about which features are really important for predicting house value, it's things like the number of bedrooms, the number of bathrooms, and the square footage of the house. Those are really important, but maybe other things, like the number of floors or the year it was renovated, are less important for assessing that value.

Well, in our document example there's a very similar analogy. When we go to compute the similarity between two different articles, maybe we want to weight the title more heavily, because maybe it's really, really informative, and much more so than the body of the article, which can have a lot of noise in words that are hard to account for. Likewise, if an article has an abstract, like a scientific article does, that might also be more informative than the main body of the article. These are both examples where you might want to specify weights that differ across the features that you have.
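As a concrete starting point, here is a minimal Python sketch of plain, unweighted Euclidean distance between two bag-of-words vectors. The vocabulary and word counts are made up purely for illustration and are not from the course.

```python
import numpy as np

# Hypothetical word-count vectors for two articles over a shared vocabulary.
# The vocabulary and the counts are invented purely for illustration.
vocabulary = ["election", "soccer", "economy", "coach", "vote"]
article_1 = np.array([12, 0, 5, 0, 8], dtype=float)   # politics-heavy article
article_2 = np.array([1, 10, 0, 6, 0], dtype=float)   # sports-heavy article

# Plain (unweighted) Euclidean distance: every dimension counts equally.
euclidean = np.sqrt(np.sum((article_1 - article_2) ** 2))
print(euclidean)  # same value as np.linalg.norm(article_1 - article_2)
```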
Another case where you might want to weight different features differently is when one of your features varies just a little bit across the observations you have, but another feature varies widely. This can be because one of the features is in a different unit than the other, or it could just be that there's a lot of variance in that dimension. In these cases, if you go and compute something like Euclidean distance, weighing both of these features equally, the feature with the big changes can dominate the one with the little changes. But in practice, it might be that the little changes in Feature 1 are just as important as a larger change in Feature 2, the feature that varies more widely across the different observations.

So in this case, there are a couple of things that people tend to do, and they both amount to scaling each feature by some measure of the spread of its observations. One way you can account for the spread is, for feature j, to take every observation in that column (remember, each row is a different observation and each column is a different feature), so you take an entire column of your data matrix and divide it by the maximum over all values in that column minus the minimum over all values in that column. And you do that for every observation in that column. An alternative is to scale by one over the variance of all observations of that feature.

These are all cases where we introduce weights across our different features when we go to compute the distance. So formally, we can think about computing what's called scaled Euclidean distance. It looks very much like standard Euclidean distance in multiple dimensions, but now each of our different dimensions has its own weight, which I'm denoting a_j. So we have a_1 all the way to a_d, and these are weights on the different features; what they represent is the relative importance of those different features. And one example of how you could think about setting the weights is just as binary weights, 0s and 1s. That would be a special case of the scaled Euclidean distance computation.
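Written out, the scaled Euclidean distance between two vectors x and y with per-feature weights a_1 through a_d is distance(x, y) = sqrt(a_1 (x_1 - y_1)^2 + ... + a_d (x_d - y_d)^2). Below is a minimal Python sketch of this, with the weights set from the spread of each column as described above; the data matrix is invented just to show the effect.

```python
import numpy as np

# Hypothetical data matrix: rows are observations, columns are features.
# Feature 1 varies only a little; feature 2 varies widely.
X = np.array([[0.9, 120.0],
              [1.1, 430.0],
              [1.0, 250.0],
              [0.8, 610.0]])

# Two common choices of per-feature weights a_j, both based on spread:
a_range = 1.0 / (X.max(axis=0) - X.min(axis=0))  # 1 / (max - min) of each column
a_var   = 1.0 / X.var(axis=0)                    # 1 / variance of each column

def scaled_euclidean(x, y, a):
    """Scaled Euclidean distance: sqrt(sum_j a_j * (x_j - y_j)^2)."""
    return np.sqrt(np.sum(a * (x - y) ** 2))

# With unit weights, the widely varying second feature dominates the distance;
# with spread-based weights, small changes in the first feature count comparably.
print(scaled_euclidean(X[0], X[1], np.ones(2)))
print(scaled_euclidean(X[0], X[1], a_range))
print(scaled_euclidean(X[0], X[1], a_var))
```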
And what that's equivalent to is feature selection, because if you set a weight equal to 0, you're knocking out that feature altogether. It doesn't get incorporated into the computation of the distance, so you're saying that feature just doesn't matter for the sake of assessing similarity or distance between two different articles. But remember, in contrast to when we talked about things like lasso or other notions of feature selection, here we're pre-specifying what these weights are, or in this binary case, which features are included and which are excluded.

But overall, the thing I really want to emphasize here is that how we specify our data representation and compute this distance is really, really, really important. And it's a very challenging thing to do. So this idea of feature engineering, or feature selection, is very important, but it's also a fundamentally hard task, and it's a task for which there is literature on how to go about this feature engineering. But it really is an area in machine learning where a lot of tweaking comes in. This is one of the places where there's a knob to turn, and a lot of domain knowledge often comes in when thinking about how to set these weights, or how to define these distances.

So I just want to emphasize that it really matters. There is no one solution for how to go about this. But think about it; don't just compute some distance and assume that it represents a distance that's of importance in the application, without thinking about what the data is and what's happening when you go to compute that distance.

[MUSIC]
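As a small addendum to the binary-weights point above, here is a minimal sketch of how setting some weights to 0 acts as feature selection inside the scaled Euclidean distance. The vocabulary and counts are hypothetical, not from the course.

```python
import numpy as np

# Hypothetical word-count vectors for two articles over a 4-word vocabulary;
# the vocabulary and counts are invented purely for illustration.
vocabulary = ["election", "soccer", "the", "and"]
x = np.array([12.0, 0.0, 80.0, 60.0])
y = np.array([1.0, 10.0, 75.0, 66.0])

# Binary weights: keep the informative words, knock out the common stop words.
a_binary = np.array([1.0, 1.0, 0.0, 0.0])

def scaled_euclidean(x, y, a):
    # sqrt(sum_j a_j * (x_j - y_j)^2)
    return np.sqrt(np.sum(a * (x - y) ** 2))

# Dimensions with weight 0 contribute nothing, so this acts as feature selection:
# the distance is the same as if only the selected columns existed at all.
print(scaled_euclidean(x, y, a_binary))
print(scaled_euclidean(x[:2], y[:2], np.ones(2)))  # identical result
```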