1 00:00:00,209 --> 00:00:04,352 [MUSIC] 2 00:00:04,352 --> 00:00:11,154 So that's our discussion on high leverage points and influential observations, 3 00:00:11,154 --> 00:00:15,830 but I wanna think about one more question, so let's go back to our data. 4 00:00:15,830 --> 00:00:18,669 I think it's easier to discuss this looking at our observations. 5 00:00:19,840 --> 00:00:23,390 Here, what we see on the top part, 6 00:00:23,390 --> 00:00:26,900 there's a collection of five different observations. 7 00:00:26,900 --> 00:00:30,880 So these are five different towns that have very high values 8 00:00:30,880 --> 00:00:34,520 compared to what you see for all of the other towns. 9 00:00:34,520 --> 00:00:38,730 So the question is, even though these points aren't high leverage points, 10 00:00:38,730 --> 00:00:44,240 because they are in this typical x range, are they influential observations? 11 00:00:44,240 --> 00:00:49,360 Meaning, if we remove these observations, will the fit change very much? 12 00:00:49,360 --> 00:00:53,120 So, now let's just see what happens in this data set. 13 00:00:55,450 --> 00:01:00,055 Okay, so what we're saying here is we're gonna remove these 14 00:01:00,055 --> 00:01:04,240 high value outlier neighborhoods and redo our analysis. 15 00:01:04,240 --> 00:01:07,550 So what we're doing here is we're creating a data set, 16 00:01:07,550 --> 00:01:13,080 which I'm gonna call sales underscore no high end for no high end towns, 17 00:01:13,080 --> 00:01:16,900 which takes our data set, still with Center City removed, and 18 00:01:16,900 --> 00:01:23,600 just filters out all the towns that have average values greater than $350,000. 19 00:01:23,600 --> 00:01:26,650 Okay, so let's fit this new data set. 20 00:01:26,650 --> 00:01:28,580 And again, let's compare coefficients. 
21 00:01:28,580 --> 00:01:34,190 So I'm gonna compare the coefficients of our fit with Center City removed 22 00:01:34,190 --> 00:01:39,320 to the fit that further removes these high end houses, 23 00:01:39,320 --> 00:01:41,430 or sorry, these high end towns. 24 00:01:41,430 --> 00:01:47,860 And what you see is, yeah, there is some influence on the estimated coefficient, 25 00:01:47,860 --> 00:01:52,650 but not nearly as significant as what we saw by simply removing Center City. 26 00:01:52,650 --> 00:01:54,910 So in this case, we've removed five observations 27 00:01:56,100 --> 00:02:00,150 out of a total of 97 observations. 28 00:02:00,150 --> 00:02:06,130 And we see that the impact of crime rate on the predicted decrease in 29 00:02:06,130 --> 00:02:10,500 house value changes by a couple hundred dollars, but not by the amount that we saw 30 00:02:10,500 --> 00:02:14,460 by just removing that one Center City observation earlier on. 31 00:02:14,460 --> 00:02:19,610 So this shows that high leverage points can be much more 32 00:02:19,610 --> 00:02:24,190 likely to be influential observations, even for just small deviations from the data set, 33 00:02:24,190 --> 00:02:28,540 than outlier observations that are within our typical x range. 34 00:02:29,730 --> 00:02:32,290 Okay, so the summary of all of this analysis and 35 00:02:32,290 --> 00:02:37,110 discussion is the fact that when you have your data, and you're making some fit and 36 00:02:37,110 --> 00:02:39,960 making predictions or interpreting the coefficients, 37 00:02:39,960 --> 00:02:43,475 it's really, really important to do some data analysis, to do 38 00:02:43,475 --> 00:02:46,451 visualizations of your data or different checks for 39 00:02:46,451 --> 00:02:50,915 whether you have these high leverage points or these outlier observations, and 40 00:02:50,915 --> 00:02:55,472 to check whether they might potentially be these influential observations. 
41 00:02:55,472 --> 00:02:58,354 Because that can dramatically change how you're interpreting or 42 00:02:58,354 --> 00:03:00,730 what you're predicting based on your estimated fit. 43 00:03:00,730 --> 00:03:05,409 [MUSIC]
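The comparison described in this segment can be sketched in plain Python. This is only an illustration under stated assumptions: the data below is made-up toy data, not the lecture's 97-town data set, and `fit_line`, the variable names, and the crime-rate cutoff used to drop the Center City-like point are all hypothetical. Only the $350,000 filter comes from the lecture.

```python
def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy (crime rate, house value) pairs, invented for illustration only.
typical = [(float(x), 250000.0 - 1000.0 * x) for x in range(10, 50, 2)]
# High-value outliers that sit inside the typical x range.
high_end = [(20.0, 400000.0), (30.0, 400000.0), (40.0, 400000.0)]
# One high leverage point, far outside the typical x range.
center_city = [(300.0, 200000.0)]

sales = typical + high_end + center_city

# Hypothetical cutoff that drops only the high leverage point.
sales_no_cc = [p for p in sales if p[0] < 100.0]
# As in the lecture: filter out towns with value greater than $350,000.
sales_no_high_end = [p for p in sales_no_cc if p[1] <= 350000.0]

_, slope_all = fit_line(sales)
_, slope_no_cc = fit_line(sales_no_cc)
_, slope_no_high_end = fit_line(sales_no_high_end)

# How much the crime-rate coefficient moves at each removal step.
change_from_leverage = abs(slope_no_cc - slope_all)
change_from_outliers = abs(slope_no_high_end - slope_no_cc)

print("slope, all data:         ", round(slope_all, 1))
print("slope, no Center City:   ", round(slope_no_cc, 1))
print("slope, no high end towns:", round(slope_no_high_end, 1))
```

In this toy setup, removing the single high leverage point moves the slope more than removing all three high-value towns, which mirrors the pattern described in the lecture. With real data the sizes of these shifts depend on the data set, so the same check is worth rerunning there.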