So, we've talked about the performance of nearest neighbor regression as you get tons and tons of observations, and things look pretty good in that case. Actually, amazingly good. But the picture isn't quite as nice when we start increasing the dimension of our input space, or if we just have a limited set of observations, because the performance of nearest neighbor regression, as we talked about before, is very sensitive to how well the data covers the input space.

If you remember, we had a plot early on where we removed observations from some region of our input space, and we got a pretty bad fit in that region. Well, now imagine you have a really, really high-dimensional space that you're looking at, and you need your observations to cover all points within this space. You're going to need a number of observations that is exponential in the dimension of the space in order to cover it and have good nearest neighbor performance. And that's typically really hard to do, to collect as many observations as you would need in these high-dimensional spaces. So, the cases where you have limited data relative to the dimension of the input space are where the parametric models we've talked about throughout the rest of this course become so important.

Another thing that's important to talk about is the complexity of our nearest neighbor search. The naive algorithm we've discussed so far, which just scans through the entire data set, can be quite computationally intensive. For example, if you have just one query point, x_q, and you're doing a 1-nearest-neighbor search, you have to scan through every single observation in your data set and compute the distance to each one. So, going from our query house to each of the different houses in our data set, computing the similarity between the houses, in order to find the nearest neighbor. That has complexity that's linear in the number of observations we have. Or, if we want to do our k-nearest-neighbors approach, we have to maintain a list of the k nearest neighbors, and if you do this with the sorted queue we've talked about, then you can do the search in O(N log k).
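Here's a minimal Python sketch of the naive search just described: a linear scan for 1-NN, and a bounded priority queue (Python's heapq) standing in for the sorted queue for k-NN. The function names and the choice of Euclidean distance are illustrative assumptions, not the course's implementation.

```python
import heapq
import numpy as np

def nearest_neighbor(X, query):
    """Naive 1-NN: scan all N observations, one distance each -- O(N)."""
    best_idx, best_dist = -1, np.inf
    for i, x in enumerate(X):
        dist = np.linalg.norm(x - query)   # Euclidean distance (assumed metric)
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

def k_nearest_neighbors(X, query, k):
    """Naive k-NN with a size-k priority queue -- O(N log k).

    heapq is a min-heap, so distances are negated: the heap root is
    always the *farthest* of the current k nearest neighbors.
    """
    heap = []                              # entries are (-distance, index)
    for i, x in enumerate(X):
        dist = np.linalg.norm(x - query)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i))
        elif dist < -heap[0][0]:           # closer than the current k-th nearest
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-d, i) for d, i in heap)  # (distance, index), nearest first
```

Either function touches every one of the N observations exactly once per query, which is the linear cost discussed next.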
But both of these, whether 1-NN or k-NN, have complexity linear in the number of observations we have. And what if N is huge? Because that's exactly the situation where we've said these k-NN approaches work so well. Well, the search can be really, really intensive, especially if you have to do many queries. If you want to predict the values at many points in your input space, this becomes a very expensive procedure. So, instead, we're going to talk about more efficient methods for doing this type of nearest neighbor search in our Clustering and Retrieval course later in this specialization.
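To make those two scaling problems concrete, here's a back-of-envelope sketch; all of the numbers are hypothetical, chosen only for illustration.

```python
# 1. Curse of dimensionality: covering the unit hypercube [0, 1]^d at a
#    grid spacing of 0.1 along each axis takes 10^d observations --
#    exponential in the dimension d.
for d in (1, 2, 10, 100):
    print(f"d = {d:>3}: ~{10.0 ** d:.0e} observations to cover the space")

# 2. Cost of many queries: brute-force search rescans all N observations
#    for every query, so M queries cost about M * N distance computations.
N = 1_000_000   # observations (e.g., houses) in the data set
M = 10_000      # query points where we want predictions
print(f"brute-force 1-NN over all queries: ~{M * N:.0e} distance computations")
```

Those roughly 10^10 distance computations per batch of queries are what the more efficient search methods in the Clustering and Retrieval course are designed to avoid.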