[MUSIC] How are we defining distance? Well, in 1-d it's really straightforward, because our distance on a continuous space is just going to be Euclidean distance, where we take our input x_i and our query x_q and look at the absolute value of the difference between these numbers. So these might represent the square feet of two houses, and we just look at the absolute value of their difference. But when we get to higher dimensions, there are lots of interesting distance metrics that we can think about. Let's just go through one that tends to be pretty useful in practice, where we're going to simply weight the different dimensions differently, but use standard Euclidean distance otherwise. So it looks just like Euclidean distance, but we're going to have different weightings on our different dimensions. Just to motivate this, going back to our housing application, you could imagine that you have some set of different inputs, which are attributes of the house, like how many bedrooms it has, how many bathrooms, and how many square feet: all our standard inputs that we've talked about before. But when we think about saying which house is most similar to my house, well, some of these inputs might matter more than others when I think about this notion of similarity.
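The 1-d case described above is just the absolute difference between the two numbers. A minimal sketch (the square-footage values are made up for illustration):

```python
def distance_1d(x_i, x_q):
    # In 1-d, Euclidean distance is just the absolute difference.
    return abs(x_i - x_q)

# Square feet of a training house vs. the query house (made-up numbers).
print(distance_1d(1430, 2330))  # 900
```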
So, for example, the number of bedrooms, number of bathrooms, and square feet of the house might be very relevant, much more so than what year the house was renovated, when I'm going to assess similarity. To account for this, what we can do is define what's called a scaled Euclidean distance, where we take the distance between this vector of inputs, let's call it x_j, and the vector of inputs associated with our query house, x_q, and we look component-wise at their difference squared, but then we scale it by some number, and then we sum this over all our different dimensions, okay? In particular, I'm using the letter a to denote the scaling, so a_d is the scaling on our dth input, and what this is capturing is the relative importance of these different inputs in computing this similarity. And after we take the sum of all these squares, we take the square root, and if all these a values were exactly equal to 1, meaning that all our inputs had the same importance, then this just reduces to standard Euclidean distance.
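The scaled Euclidean distance described here can be sketched in a few lines of Python. The house features and weights below are made-up numbers chosen only to illustrate down-weighting the renovation year:

```python
import math

def scaled_euclidean(x_j, x_q, a):
    """Scaled Euclidean distance: sqrt(sum over d of a_d * (x_j[d] - x_q[d])^2)."""
    return math.sqrt(sum(a_d * (xj_d - xq_d) ** 2
                         for a_d, xj_d, xq_d in zip(a, x_j, x_q)))

# Hypothetical feature vectors: [bedrooms, bathrooms, sqft, year renovated]
house = [3, 2.0, 1500, 1990]
query = [4, 2.5, 1600, 2005]

# Bedrooms, bathrooms, and sqft matter; renovation year barely does.
weights = [1.0, 1.0, 1.0, 0.01]
print(scaled_euclidean(house, query, weights))

# With every a_d equal to 1, this reduces to standard Euclidean distance.
uniform = [1.0, 1.0, 1.0, 1.0]
print(scaled_euclidean(house, query, uniform))
```

Note how the renovation-year gap of 15 years contributes almost nothing under the first weighting but dominates nothing either way once sqft differences are on the same scale; in practice you would also normalize features before weighting.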
So, this is just one example of a distance metric we can define in multiple dimensions; there are lots and lots of other interesting choices we might look at as well. But let's visualize what impact different distance metrics have on our resulting nearest neighbor fit. If we just use standard Euclidean distance on the data shown here, we might get the image shown on the right, where the different colors indicate what the predicted value is in each one of these regions. Remember, for any point in a given region, the predicted value is exactly the same, because every point in that region has the same nearest neighbor. So that's why we get these different regions of constant color. But if we look at the plot on the left-hand side, where we're using a different distance metric, what we see is that we're defining different regions, where again those regions mean that any point within that region is closer to the one data point lying in that region than to any of the other data points in our training data set. But the way this distance is defined is different, so the regions look different. For example, with this Manhattan distance, just think of New York and driving along the streets of New York.
It's measuring distance along axis-aligned directions, so it's the distance along the x direction plus the distance along the y direction, which is a different distance than our standard Euclidean distance. [MUSIC]
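To see concretely why the regions change shape, here is a minimal sketch with made-up 2-d points: the same query gets a different nearest neighbor under Euclidean versus Manhattan distance, because Manhattan charges extra for diagonal moves.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # Distance along axis-aligned directions: |dx| + |dy|.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def nearest_neighbor(query, training, dist):
    # 1-NN: return the training point with the smallest distance to the query.
    return min(training, key=lambda p: dist(p, query))

training = [(3, 0), (2, 2)]  # one point on the x-axis, one on the diagonal
query = (0, 0)

print(nearest_neighbor(query, training, euclidean))  # (2, 2): sqrt(8) < 3
print(nearest_neighbor(query, training, manhattan))  # (3, 0): 3 < 2 + 2
```

Since a 1-NN prediction is constant wherever the nearest neighbor is the same, swapping the metric redraws the region boundaries in exactly this way.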