[MUSIC]

So this leads us straight into a discussion of how we're going to compute the distance between two given articles. Well, in 1D one really simple measure we can use is just Euclidean distance. Hopefully this is fairly familiar to you, but it isn't really what interests us here, because it would assume that, in our example, we have just one word in our vocabulary. In almost all the scenarios we're going to think about in this specialization, we assume we have multiple features, or multiple different dimensions, that we want to consider. And in this case things get really interesting, because there are lots of different distance functions we can think about using.

One example of the interesting things we can do in multiple dimensions is to weight the different dimensions differently. So we could put different weights on different words in the vocabulary, or on whatever other features we might have. For example, if you go back to Course 2, when we were talking about regression and predicting the price of a house, we looked at using nearest neighbor regression to predict that house value, and we said we can put different weights on different attributes of the house. If we think about which features are really important for predicting house value, it's things like the number of bedrooms, the number of bathrooms, and the square footage of the house. Those are really important, but maybe other things, like the number of floors or the year it was renovated, are less important for assessing that value.

Well, in our document example there's a very similar analogy. When we go to compute the similarity between two different articles, maybe we want to weight the title more heavily, because maybe it's really, really informative, and much more so than the body of the article, which can have a lot of noise in words that are hard to account for. Likewise, if an article has an abstract, like a scientific article does, that might also be more informative than the main body of the article. These are both examples where you might want to specify weights that differ across the features that you have.
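As a concrete starting point, here is a minimal Python sketch of plain, unweighted Euclidean distance between two bag-of-words vectors. The vocabulary and word counts are made up purely for illustration and are not from the course.

```python
import numpy as np

# Hypothetical word-count vectors for two articles over a shared vocabulary.
# The vocabulary and the counts are invented purely for illustration.
vocabulary = ["election", "soccer", "economy", "coach", "vote"]
article_1 = np.array([12, 0, 5, 0, 8], dtype=float)   # politics-heavy article
article_2 = np.array([1, 10, 0, 6, 0], dtype=float)   # sports-heavy article

# Plain (unweighted) Euclidean distance: every dimension counts equally.
euclidean = np.sqrt(np.sum((article_1 - article_2) ** 2))
print(euclidean)  # same value as np.linalg.norm(article_1 - article_2)
```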
Another case where you might want to weight different features differently is when one of your features varies just a little bit across the observations you have, but another feature varies widely. This can be because one of the features is in a different unit than the other, or it could just be that there's a lot of variance in that dimension. In these cases, if you go and compute something like Euclidean distance, weighing both of these features equally, the feature with the big changes can dominate the one with the little changes. But in practice, it might be that the little changes in Feature 1 are just as important as a larger change in Feature 2, the feature that varies more widely across the different observations.

So in this case, there are a couple of things that people tend to do, and they both amount to scaling each feature by some measure of the spread of its observations. One way you can account for the spread is, for feature j, to take every observation in that column (remember, each row is a different observation and each column is a different feature), so you take an entire column of your data matrix and divide it by the maximum over all values in that column minus the minimum over all values in that column. And you do that for every observation in that column. An alternative is to scale by one over the variance of all observations of that feature.

These are all cases where we introduce weights across our different features when we go to compute the distance. So formally, we can think about computing what's called scaled Euclidean distance. It looks very much like standard Euclidean distance in multiple dimensions, but now each of our different dimensions has its own weight, which I'm denoting a_j. So we have a_1 all the way to a_d, and these are weights on the different features; what they represent is the relative importance of those different features. And one example of how you could think about setting the weights is just as binary weights, 0s and 1s. That would be a special case of the scaled Euclidean distance computation.
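Written out, the scaled Euclidean distance between two vectors x and y with per-feature weights a_1 through a_d is distance(x, y) = sqrt(a_1 (x_1 - y_1)^2 + ... + a_d (x_d - y_d)^2). Below is a minimal Python sketch of this, with the weights set from the spread of each column as described above; the data matrix is invented just to show the effect.

```python
import numpy as np

# Hypothetical data matrix: rows are observations, columns are features.
# Feature 1 varies only a little; feature 2 varies widely.
X = np.array([[0.9, 120.0],
              [1.1, 430.0],
              [1.0, 250.0],
              [0.8, 610.0]])

# Two common choices of per-feature weights a_j, both based on spread:
a_range = 1.0 / (X.max(axis=0) - X.min(axis=0))  # 1 / (max - min) of each column
a_var   = 1.0 / X.var(axis=0)                    # 1 / variance of each column

def scaled_euclidean(x, y, a):
    """Scaled Euclidean distance: sqrt(sum_j a_j * (x_j - y_j)^2)."""
    return np.sqrt(np.sum(a * (x - y) ** 2))

# With unit weights, the widely varying second feature dominates the distance;
# with spread-based weights, small changes in the first feature count comparably.
print(scaled_euclidean(X[0], X[1], np.ones(2)))
print(scaled_euclidean(X[0], X[1], a_range))
print(scaled_euclidean(X[0], X[1], a_var))
```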
And what that's equivalent to is feature selection, because if you set a weight equal to 0, you're knocking out that feature altogether. It doesn't get incorporated into the computation of the distance, so you're saying that feature just doesn't matter for the sake of assessing similarity or distance between two different articles. But remember, in contrast to when we talked about things like lasso or other notions of feature selection, here we're pre-specifying what these weights are, or in this binary case, which features are included and which are excluded.

But overall, the thing I really want to emphasize here is that how we specify our data representation and compute this distance is really, really, really important. And it's a very challenging thing to do. So this idea of feature engineering, or feature selection, is very important, but it's also a fundamentally hard task, and it's a task for which there is literature on how to go about this feature engineering. But it really is an area in machine learning where a lot of tweaking comes in. This is one of the places where there's a knob to turn, and a lot of domain knowledge often comes in when thinking about how to set these weights, or how to define these distances.

So I just want to emphasize that it really matters. There is no one solution for how to go about this. But think about it; don't just compute some distance and assume that it represents a distance that's of importance in the application, without thinking about what the data is and what's happening when you go to compute that distance.

[MUSIC]
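As a small addendum to the binary-weights point above, here is a minimal sketch of how setting some weights to 0 acts as feature selection inside the scaled Euclidean distance. The vocabulary and counts are hypothetical, not from the course.

```python
import numpy as np

# Hypothetical word-count vectors for two articles over a 4-word vocabulary;
# the vocabulary and counts are invented purely for illustration.
vocabulary = ["election", "soccer", "the", "and"]
x = np.array([12.0, 0.0, 80.0, 60.0])
y = np.array([1.0, 10.0, 75.0, 66.0])

# Binary weights: keep the informative words, knock out the common stop words.
a_binary = np.array([1.0, 1.0, 0.0, 0.0])

def scaled_euclidean(x, y, a):
    # sqrt(sum_j a_j * (x_j - y_j)^2)
    return np.sqrt(np.sum(a * (x - y) ** 2))

# Dimensions with weight 0 contribute nothing, so this acts as feature selection:
# the distance is the same as if only the selected columns existed at all.
print(scaled_euclidean(x, y, a_binary))
print(scaled_euclidean(x[:2], y[:2], np.ones(2)))  # identical result
```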