[MUSIC] How are we defining distance? Well, in 1-d it's really straightforward, because our distance on a continuous space is just going to be Euclidean distance, where we take our input x_i and our query x_q and look at the absolute value of the difference between these numbers. So these might represent the square feet of two houses, and we just look at the absolute value of their difference. But when we get to higher dimensions, there are lots of interesting distance metrics that we can think about. Let's just go through one that tends to be pretty useful in practice, where we're going to simply weight the different dimensions differently, but use standard Euclidean distance otherwise. So it looks just like Euclidean distance, but we're going to have different weightings on our different dimensions. Just to motivate this, going back to our housing application, you could imagine that you have some set of different inputs, which are attributes of the house, like how many bedrooms it has, how many bathrooms, and how many square feet: all our standard inputs that we've talked about before. But when we think about saying which house is most similar to my house, well, some of these inputs might matter more than others when I think about this notion of similarity.
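The 1-d case described above is just the absolute difference between the two numbers. A minimal sketch (the square-footage values are made up for illustration):

```python
def distance_1d(x_i, x_q):
    # In 1-d, Euclidean distance is just the absolute difference.
    return abs(x_i - x_q)

# Square feet of a training house vs. the query house (made-up numbers).
print(distance_1d(1430, 2330))  # 900
```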
So, for example, the number of bedrooms, number of bathrooms, and square feet of the house might be very relevant, much more so than what year the house was renovated, when I'm going to assess similarity. To account for this, what we can do is define what's called a scaled Euclidean distance, where we take the distance between this vector of inputs, let's call it x_j, and the vector of inputs associated with our query house, x_q, and we look component-wise at their difference squared, but then we scale it by some number, and then we sum this over all our different dimensions, okay? In particular, I'm using the letter a to denote the scaling, so a_d is the scaling on our dth input, and what this is capturing is the relative importance of these different inputs in computing this similarity. And after we take the sum of all these squares, we take the square root, and if all these a values were exactly equal to 1, meaning that all our inputs had the same importance, then this just reduces to standard Euclidean distance.
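The scaled Euclidean distance described here can be sketched in a few lines of Python. The house features and weights below are made-up numbers chosen only to illustrate down-weighting the renovation year:

```python
import math

def scaled_euclidean(x_j, x_q, a):
    """Scaled Euclidean distance: sqrt(sum over d of a_d * (x_j[d] - x_q[d])^2)."""
    return math.sqrt(sum(a_d * (xj_d - xq_d) ** 2
                         for a_d, xj_d, xq_d in zip(a, x_j, x_q)))

# Hypothetical feature vectors: [bedrooms, bathrooms, sqft, year renovated]
house = [3, 2.0, 1500, 1990]
query = [4, 2.5, 1600, 2005]

# Bedrooms, bathrooms, and sqft matter; renovation year barely does.
weights = [1.0, 1.0, 1.0, 0.01]
print(scaled_euclidean(house, query, weights))

# With every a_d equal to 1, this reduces to standard Euclidean distance.
uniform = [1.0, 1.0, 1.0, 1.0]
print(scaled_euclidean(house, query, uniform))
```

Note how the renovation-year gap of 15 years contributes almost nothing under the first weighting but dominates nothing either way once sqft differences are on the same scale; in practice you would also normalize features before weighting.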
So, this is just one example of a distance metric we can define in multiple dimensions; there are lots and lots of other interesting choices we might look at as well. But let's visualize what impact different distance metrics have on our resulting nearest neighbor fit. If we just use standard Euclidean distance on the data shown here, we might get the image shown on the right, where the different colors indicate what the predicted value is in each one of these regions. Remember, for any point in a given region, the predicted value is exactly the same, because every point in that region has the same nearest neighbor. So that's why we get these different regions of constant color. But if we look at the plot on the left-hand side, where we're using a different distance metric, what we see is that we're defining different regions, where again those regions mean that any point within that region is closer to the one data point lying in that region than to any of the other data points in our training data set. But the way this distance is defined is different, so the regions look different. For example, with this Manhattan distance, just think of New York and driving along the streets of New York.
It's measuring distance along axis-aligned directions, so it's the distance along the x direction plus the distance along the y direction, which is a different distance than our standard Euclidean distance. [MUSIC]
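To see concretely why the regions change shape, here is a minimal sketch with made-up 2-d points: the same query gets a different nearest neighbor under Euclidean versus Manhattan distance, because Manhattan charges extra for diagonal moves.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # Distance along axis-aligned directions: |dx| + |dy|.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def nearest_neighbor(query, training, dist):
    # 1-NN: return the training point with the smallest distance to the query.
    return min(training, key=lambda p: dist(p, query))

training = [(3, 0), (2, 2)]  # one point on the x-axis, one on the diagonal
query = (0, 0)

print(nearest_neighbor(query, training, euclidean))  # (2, 2): sqrt(8) < 3
print(nearest_neighbor(query, training, manhattan))  # (3, 0): 3 < 2 + 2
```

Since a 1-NN prediction is constant wherever the nearest neighbor is the same, swapping the metric redraws the region boundaries in exactly this way.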